From vlad at lists.openfabrics.org Sun Feb 1 03:11:43 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 1 Feb 2009 03:11:43 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090201-0200 daily build status
Message-ID: <20090201111144.1E366E60EFD@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From Jie.Cai at cs.anu.edu.au Sun Feb 1 23:14:35 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Mon, 02 Feb 2009 18:14:35 +1100
Subject: [ofa-general] Multiports single HCA uDAPL program problem
In-Reply-To:
References: <20090129200005.20863E61234@openfabrics.org> <4982A3D8.5030503@cs.anu.edu.au>
Message-ID: <49869D5B.9020004@cs.anu.edu.au>

One more problem happened when trying to establish one connection per rail, as illustrated in the graph.

           node0                    node1
rail0:     psp0 <----------------> ep0 (port 0 on hca)
rail1:     psp1 <----------------> ep1 (port 1 on hca)

rail0 gets connected first, and its connection is always stable and correct.
However, rail1 sometimes connects properly and sometimes doesn't.
Following is the error message:

11836 Waiting for connect response
11836 Error unexpected conn event : DAT_CONNECTION_EVENT_NON_PEER_REJECTED
11836 Error connect_ep: DAT_ABORT

The program establishes the connections for both rails in exactly the same way.
What may have caused this?

Regards,
--
Jie Cai

Davis, Arlin R wrote:
> This looks like an ARP issue across your IPoIB interfaces.
>
> Please see section 6 of the uDAPL OFED BKM.
>
> http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_ofed_testing_bkm.pdf
>
> 6. Multi IB port configuration, IPoIB arp reply issues
>
> When two interfaces are running, one interface may reply to an ARP
> directed to the other interface on the system. The following
> configuration will cause the interfaces to ignore ARP requests if
> not specifically for their IP address.
>
> Add the following lines to /etc/sysctl.conf
> net.ipv4.conf.all.arp_ignore=1
> net.ipv4.conf.ib0.arp_ignore=1
> net.ipv4.conf.ib1.arp_ignore=1
>
> or use sysctl:
> sysctl -w net.ipv4.conf.all.arp_ignore=1
> sysctl -w net.ipv4.conf.ib0.arp_ignore=1
> sysctl -w net.ipv4.conf.ib1.arp_ignore=1
>
> -arlin
>
>
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org
>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jie Cai
>> Sent: Thursday, January 29, 2009 10:53 PM
>> To: general at lists.openfabrics.org
>> Subject: [ofa-general] Multiports single HCA uDAPL program problem
>>
>> Hi All,
>>
>> I am kind of a noob on IB and uDAPL programming. Currently, I am trying to
>> write a multirail program that utilizes 2 ports on a
>> single Mellanox ConnectX HCA on both nodes.
>>
>> OFED1.3 has been installed on a SUSE 10.3 Linux system.
>>
>> The current problem is that IB connections via uDAPL are very unstable,
>> and sometimes the connection can't be established.
>> The error message is usually like:
>>
>> 20350 Server waiting for connect request on port 45248
>> accept: ERR dev(0x61d0e0!=0x61d0e0) or port mismatch(1!=2)
>> 20350 Error dat_cr_accept: DAT_INTERNAL_ERROR
>> 20350 Error connect_ep: DAT_INTERNAL_ERROR
>>
>> Both ports are active:
>> hca_id: mlx4_0
>>         fw_ver:                 2.3.000
>>         node_guid:              0003:ba00:0100:702c
>>         sys_image_guid:         0003:ba00:0100:702f
>>         vendor_id:              0x02c9
>>         vendor_part_id:         25418
>>         hw_ver:                 0xA0
>>         board_id:               SUN0070000001
>>         phys_port_cnt:          2
>>                 port:   1
>>                         state:          PORT_ACTIVE (4)
>>                         max_mtu:        2048 (4)
>>                         active_mtu:     2048 (4)
>>                         sm_lid:         10
>>                         port_lid:       8
>>                         port_lmc:       0x00
>>
>>                 port:   2
>>                         state:          PORT_ACTIVE (4)
>>                         max_mtu:        2048 (4)
>>                         active_mtu:     2048 (4)
>>                         sm_lid:         10
>>                         port_lid:       9
>>                         port_lmc:       0x00
>>
>>
>> I haven't done any specific configuration for multi-port. I assume that
>> OFED1.3 can do it automatically.
>>
>> Could anyone please help me with this?
>>
>> Regards,
>> Jie
>>
>> --
>> Jie Cai
>>
>>
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>
>
>

From vlad at lists.openfabrics.org Mon Feb 2 03:11:52 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 2 Feb 2009 03:11:52 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090202-0200 daily build status
Message-ID: <20090202111152.CADA8E60F0E@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From halr at obsidianresearch.com Mon Feb 2 08:18:03 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 09:18:03 -0700
Subject: [ofa-general] [PATCH] libibmad/(mad.h fields.c): Add support for PerfMgt ClassPortInfo
Message-ID: <1233591483.8992.368.camel@bertha1.edm.orcorp.ca>

Sasha,

Attached is a patch to add support for PerfMgt ClassPortInfo attribute
into libibmad.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-libibmad-mad.h-fields.c-Add-support-for-PerfMgt-C.patch
Type: application/mbox
Size: 5498 bytes
Desc: not available
URL:

From halr at obsidianresearch.com Mon Feb 2 08:18:41 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 09:18:41 -0700
Subject: [ofa-general] [PATCH] ibsim/sim_mad.c: Add sim support for PerfMgt ClassPortInfo
Message-ID: <1233591521.8992.369.camel@bertha1.edm.orcorp.ca>

Sasha,

Attached is a patch to add simulator support for PerfMgt ClassPortInfo
(subsequent to previous libibmad patch).

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-ibsim-sim_mad.c-Add-sim-support-for-PerfMgt-ClassPo.patch
Type: application/mbox
Size: 2157 bytes
Desc: not available
URL:

From swise at opengridcomputing.com Mon Feb 2 08:25:14 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 02 Feb 2009 10:25:14 -0600
Subject: [ofa-general] dapl attribute bug
Message-ID: <49871E6A.9000901@opengridcomputing.com>

Hey Arlin,

We've uncovered a problem with the DAPL attribute mappings to the linux
rdma device attributes.

The DAPL dat_ia_attr->max_lmr_block_size is a u32, yet the dapl code maps
this to the linux ib_device_attr->max_mr_size which is u64. This causes
dapltest to fail in some cases when running over chelsio which sets
max_mr_size to 0x100000000 (4GB). The dapl code truncates the value to 0.

See dapl/openib_cma/dapl_ib_util.c.

I'm not sure what the fix should be, but maybe the dapl code should set
anything over 32 bits to 0xffffffff?

Steve.

From rdreier at cisco.com Mon Feb 2 09:00:53 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Feb 2009 09:00:53 -0800
Subject: [ofa-general] Re: [PATCH v2] RDMA/nes: Account for freed pbl after hw operation
In-Reply-To: <20090123212445.GA6248@ctung-MOBL> (Chien Tung's message of "Fri, 23 Jan 2009 15:24:45 -0600")
References: <20090123212445.GA6248@ctung-MOBL>
Message-ID:

> @@ -572,6 +573,8 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
>  	nesmr->ibmw.rkey = ibfmr->rkey;
>  	nesmr->ibmw.uobject = NULL;
>  
> +	rc = nes_dealloc_mw(&nesmr->ibmw);
> +
>  	if (nesfmr->nesmr.pbls_used != 0) {
>  		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
>  		if (nesfmr->nesmr.pbl_4k) {

Can this be right? nes_dealloc_mw() fails, so the HW still thinks it
owns the resources, and then the function just continues and releases
the PBLs before returning?

[And same issue seems to be there for the change to nes_dereg_mr]

 - R.
From arlin.r.davis at intel.com Mon Feb 2 10:01:32 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Mon, 2 Feb 2009 10:01:32 -0800
Subject: [ofa-general] Multiports single HCA uDAPL program problem
In-Reply-To: <49869D5B.9020004@cs.anu.edu.au>
References: <20090129200005.20863E61234@openfabrics.org> <4982A3D8.5030503@cs.anu.edu.au> <49869D5B.9020004@cs.anu.edu.au>
Message-ID:

>One more problem happened when trying to establish one connection per
>rail, as illustrated in the graph.
>
>           node0                    node1
>rail0:     psp0 <----------------> ep0 (port 0 on hca)
>rail1:     psp1 <----------------> ep1 (port 1 on hca)
>
>rail0 gets connected first, and its connection is always stable and correct.
>However, rail1 sometimes connects properly and sometimes doesn't.
>Following is the error message:
>
>11836 Waiting for connect response
>11836 Error unexpected conn event :
>DAT_CONNECTION_EVENT_NON_PEER_REJECTED
>11836 Error connect_ep: DAT_ABORT
>
>The program establishes the connections for both rails in exactly the same way.
>What may have caused this?

rdma_cm is rejecting the connect request. Turn on warnings for more information:

export DAPL_DBG_TYPE=0x0003

-arlin

From halr at obsidianresearch.com Mon Feb 2 10:58:35 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 11:58:35 -0700
Subject: [ofa-general] [PATCHv2] libibmad/(mad.h fields.c): Add support for PerfMgt ClassPortInfo
Message-ID: <1233601115.8992.380.camel@bertha1.edm.orcorp.ca>

Sasha,

Attached is v2 of a patch to add support for PerfMgt ClassPortInfo
attribute into libibmad.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-libibmad-mad.h-fields.c-Add-support-for-PerfMgt-C.patch
Type: application/mbox
Size: 5505 bytes
Desc: not available
URL:

From halr at obsidianresearch.com Mon Feb 2 10:58:46 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 11:58:46 -0700
Subject: [ofa-general] [PATCHv2] ibsim/sim_mad.c: Add sim support for PerfMgt ClassPortInfo
Message-ID: <1233601126.8992.381.camel@bertha1.edm.orcorp.ca>

Sasha,

Attached is v2 of a patch to add simulator support for PerfMgt
ClassPortInfo (subsequent to previous libibmad patch).

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-ibsim-sim_mad.c-Add-sim-support-for-PerfMgt-ClassPo.patch
Type: application/mbox
Size: 2164 bytes
Desc: not available
URL:

From halr at obsidianresearch.com Mon Feb 2 11:06:50 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 12:06:50 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_perfmgr_db.h: Remove unused typedef
Message-ID: <1233601610.8992.389.camel@bertha1.edm.orcorp.ca>

Sasha,

Trivial patch to remove an unused typedef in perfmgr.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-opensm-osm_perfmgr_db.h-Remove-unused-typedef.patch
Type: application/mbox
Size: 1006 bytes
Desc: not available
URL:

From halr at obsidianresearch.com Mon Feb 2 11:07:01 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 02 Feb 2009 12:07:01 -0700
Subject: [ofa-general] [PATCH][MINOR] opensm/osm_perfmgr.c: Eliminate memory leak on error
Message-ID: <1233601621.8992.390.camel@bertha1.edm.orcorp.ca>

Sasha,

Minor patch to osm_perfmgr.c to eliminate a memory leak on error in
osm_perfmgr_init.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-opensm-osm_perfmgr.c-In-osm_perfmgr_init-eliminate.patch
Type: application/mbox
Size: 1378 bytes
Desc: not available
URL:

From sashak at voltaire.com Mon Feb 2 12:29:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 2 Feb 2009 22:29:04 +0200
Subject: [ofa-general] [PATCH 3/4] opensm/osm_log.c save log_max_size in subnet opt in MB
In-Reply-To: <497DC9B6.5010200@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC9B6.5010200@gmail.com>
Message-ID: <20090202202904.GD5910@sashak.voltaire.com>

Hi Eli,

On 16:33 Mon 26 Jan , Eli Dorfman (Voltaire) wrote:
> save log_max_size in subnet opt in MB
> the max_size in the log object is converted to bytes.
>
> Signed-off-by: Eli Dorfman
> ---
>  opensm/opensm/main.c    |    5 ++---
>  opensm/opensm/osm_log.c |    2 +-
>  2 files changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> index 0f7b822..de38056 100644
> --- a/opensm/opensm/main.c
> +++ b/opensm/opensm/main.c
> @@ -778,9 +778,8 @@ int main(int argc, char *argv[])
>  			break;
>  
>  		case 'L':
> -			opt.log_max_size =
> -			    strtoul(optarg, NULL, 0) * (1024 * 1024);
> -			printf(" Log file max size is %lu bytes\n",
> +			opt.log_max_size = strtoul(optarg, NULL, 0);
> +			printf(" Log file max size is %lu MBytes\n",
>  			       opt.log_max_size);
>  			break;
>  
> diff --git a/opensm/opensm/osm_log.c b/opensm/opensm/osm_log.c
> index 88633ab..d5e1af6 100644
> --- a/opensm/opensm/osm_log.c
> +++ b/opensm/opensm/osm_log.c
> @@ -306,7 +306,7 @@ ib_api_status_t osm_log_init_v2(IN osm_log_t * const p_log,
>  	p_log->level = log_flags;
>  	p_log->flush = flush;
>  	p_log->count = 0;
> -	p_log->max_size = max_size;
> +	p_log->max_size = max_size << 20; /* convert size in MB to bytes */
>  	p_log->accum_log_file = accum_log_file;
>  	p_log->log_file_name = (char *)log_file;

This is obviously not a sufficient change. If you decided to store the max
log file size value in MB in the options structure, then all places where
it is parsed/dumped should be changed.
Something like this:

diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index f786192..6f0d85e 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -777,9 +777,8 @@ int main(int argc, char *argv[])
 			break;
 
 		case 'L':
-			opt.log_max_size =
-			    strtoul(optarg, NULL, 0) * (1024 * 1024);
-			printf(" Log file max size is %lu bytes\n",
+			opt.log_max_size = strtoul(optarg, NULL, 0);
+			printf(" Log file max size is %lu MBytes\n",
 			       opt.log_max_size);
 			break;
 
diff --git a/opensm/opensm/osm_log.c b/opensm/opensm/osm_log.c
index 88633ab..d5e1af6 100644
--- a/opensm/opensm/osm_log.c
+++ b/opensm/opensm/osm_log.c
@@ -306,7 +306,7 @@ ib_api_status_t osm_log_init_v2(IN osm_log_t * const p_log,
 	p_log->level = log_flags;
 	p_log->flush = flush;
 	p_log->count = 0;
-	p_log->max_size = max_size;
+	p_log->max_size = max_size << 20; /* convert size in MB to bytes */
 	p_log->accum_log_file = accum_log_file;
 	p_log->log_file_name = (char *)log_file;
 
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 94b6332..2141899 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1141,7 +1141,6 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
 
 			opts_unpack_uint32("log_max_size",
 					   p_key, p_val, (void *) & p_opts->log_max_size);
-			p_opts->log_max_size *= 1024 * 1024; /* convert to MB */
 
 			opts_unpack_charp("partition_config_file",
 					  p_key, p_val, &p_opts->partition_config_file);
@@ -1620,7 +1619,7 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
 		p_opts->log_flags,
 		p_opts->force_log_flush ? "TRUE" : "FALSE",
 		p_opts->log_file,
-		p_opts->log_max_size/1024/1024,
+		p_opts->log_max_size,
 		p_opts->accum_log_file ? "TRUE" : "FALSE",
 		p_opts->dump_files_dir,
 		p_opts->enable_quirks ? "TRUE" : "FALSE",

I'm committing this with the change above.
Sasha

From sashak at voltaire.com Mon Feb 2 12:59:31 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 2 Feb 2009 22:59:31 +0200
Subject: [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet configuration after SIGHUP
In-Reply-To: <497DC96F.3000902@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com>
Message-ID: <20090202205924.GF5910@sashak.voltaire.com>

On 16:32 Mon 26 Jan , Eli Dorfman (Voltaire) wrote:
> rescan subnet configuration after SIGHUP
> call osm_subn_rescan_conf_files() after SIGHUP.
> this is important when priority is changed and SM is in standby.
> in that case it will not send capability mask trap and will not become master.
>
> Signed-off-by: Eli Dorfman
> ---
>  opensm/opensm/main.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> index f786192..0f7b822 100644
> --- a/opensm/opensm/main.c
> +++ b/opensm/opensm/main.c
> @@ -507,6 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm)
>  			osm_hup_flag = 0;
>  			/* a HUP signal should only start a new heavy sweep */
>  			p_osm->subn.force_heavy_sweep = TRUE;
> +			osm_subn_rescan_conf_files(&p_osm->subn);

Is it synchronized with the sweep? What if a regular sweep (scheduled by
timer) starts in the middle of osm_subn_rescan_conf_files() (while QoS
parameters are being freed, etc.)? I think it is not.

Sasha

>  			osm_opensm_sweep(p_osm);
>  		}
>  	}
> --
> 1.5.5
>

From sean.hefty at intel.com Mon Feb 2 14:11:45 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Feb 2009 14:11:45 -0800
Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names missing from osm_vendor_t ?
In-Reply-To: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
Message-ID: <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>

Forwarding to general list and copying Sasha.
>Hello,
>   The Windows OpenSM vendor AL struct 'osm_vendor_t' (defined in
>opensm\user\include\vendor\osm_vendor_al.h) is missing
>the entry 'ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN]'.
>saquery.c expects to find ca_names in osm_vendor_t.
>
>A couple of observations:
>1) Windows currently supports a much older version of opensm than what OFED 1.4
>tools expect.
>
>2) saquery uses OpenSM mad interfaces while it 'could' be using libibmad
>interfaces.
>   If saquery used libibmad, then OpenSM would not need libibmad references for
>Windows.
>
>3) How bad is it to create libibmad dependencies for OpenSM?
>
>4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces;
>the rest use
>   libibmad.
>
>Most of the OFED diagnostic tools support the cmd-line option '-C ca_name'.
>This cmd-line input is resolved thru
>libibmad interfaces.
>Saquery is no exception in that it expects to match the '-C ca_name' against
>osm_vendor_t.ca_names[]. 'ibstat -l' lists
>CA names.
>
>The question becomes how best to resolve the missing ca_names?
>
>1) modify saquery to call libibmad interface to get CA names; leaves
>osm_vendor_t unmodified.
>   So far, saquery is the only diag pgm which uses OSM mad interfaces;
>expecting ca_names
>   in osm_vendor_t.
>
>2) Modify OpenSM vendor AL osm_vendor_t struct to include CA names and populate
>ca_names
>   from OpenSM code? Creates libibmad dependencies for opensm.
>
>Comments?
>
>Thanks,
>
>Stan.

From sean.hefty at intel.com Mon Feb 2 14:51:56 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Feb 2009 14:51:56 -0800
Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names missing from osm_vendor_t ?
In-Reply-To: <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
Message-ID:

>>4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces;
>>the rest use libibmad.

Looking briefly at the saquery code, I don't understand the benefit to using the
opensm vendor interfaces, versus using libibmad or even libibumad directly, and
switching to libibumad looks doable. (It's not clear to me that there are
benefits to using libibmad over libibumad for saquery.)

- osm_bind_handle_t looks like it could map to a libibumad port_id (int).
- osmv_query_sa() could map to umad_send(), followed by umad_recv() to
  obtain the result. (Replace osmv_query_sa with a new function.)
- There are a couple other calls that are used to loop through all returned
  attributes in a response MAD. We could use the MAD attribute offset
  directly. (Update loops where osmv_get_query_* is called.)

Are there technical reasons why the opensm vendor library was chosen for
saquery? Would there be any objection to changing saquery to use libibumad
directly?

- Sean

From weiny2 at llnl.gov Mon Feb 2 15:06:58 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 2 Feb 2009 15:06:58 -0800
Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names missing from osm_vendor_t ?
In-Reply-To:
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
Message-ID: <20090202150658.0af72134.weiny2@llnl.gov>

On Mon, 2 Feb 2009 14:51:56 -0800
"Sean Hefty" wrote:

> >>4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces;
> >>the rest use libibmad.
>
> Looking briefly at the saquery code, I don't understand the benefit to using the
> opensm vendor interfaces, versus using libibmad or even libibumad directly, and
> switching to libibumad looks doable. (It's not clear to me that there are
> benefits to using libibmad over libibumad for saquery.)
>
> - osm_bind_handle_t looks like it could map to a libibumad port_id (int).
> - osmv_query_sa() could map to umad_send(), followed by umad_recv() to
>   obtain the result. (Replace osmv_query_sa with a new function.)
> - There are a couple other calls that are used to loop through all returned
>   attributes in a response MAD. We could use the MAD attribute offset
>   directly. (Update loops where osmv_get_query_* is called.)
>
> Are there technical reasons why the opensm vendor library was chosen for
> saquery? Would there be any objection to changing saquery to use libibumad
> directly?

I don't remember the exact details but at the time saquery was first written,
ibmad/ibumad did not have all the functionality I needed and the OpenSM vendor
library did. That may no longer be the case and if not then I would support
converting to using those other libraries.

Ira

>
> - Sean
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From donald.e.wood at intel.com Mon Feb 2 15:07:56 2009
From: donald.e.wood at intel.com (Wood, Donald E)
Date: Mon, 2 Feb 2009 16:07:56 -0700
Subject: [ofa-general] RE: [PATCH v2] RDMA/nes: Account for freed pbl after hw operation
In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA383032085A8F4@azsmsx501.amr.corp.intel.com>
References: <60BEFF3FBD4C6047B0F13F205CAFA383032085A8F4@azsmsx501.amr.corp.intel.com>
Message-ID: <588992150B702C48B3312184F1B810AD03A516FC3D@azsmsx501.amr.corp.intel.com>

> > @@ -572,6 +573,8 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
> >  	nesmr->ibmw.rkey = ibfmr->rkey;
> >  	nesmr->ibmw.uobject = NULL;
> >  
> > +	rc = nes_dealloc_mw(&nesmr->ibmw);
> > +
> >  	if (nesfmr->nesmr.pbls_used != 0) {
> >  		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
> >  		if (nesfmr->nesmr.pbl_4k) {
>
> Can this be right? nes_dealloc_mw() fails, so the HW still thinks it
> owns the resources, and then the function just continues and releases
> the PBLs before returning?

You are right, the code in nes_dealloc_fmr is missing a check of the return code.
This will be updated in a patch to follow.

> [And same issue seems to be there for the change to nes_dereg_mr]

I believe that nes_dereg_mr is correctly checking return codes and does not
need to be changed. Please let me know if you still see a problem here.

Don Wood

From chien.tin.tung at intel.com Mon Feb 2 15:15:21 2009
From: chien.tin.tung at intel.com (Chien Tung)
Date: Mon, 2 Feb 2009 17:15:21 -0600
Subject: [ofa-general] [PATCH v3] RDMA/nes: Account for freed pbl after hw operation
Message-ID: <20090202231521.GA6220@ctung-MOBL>

From: Don Wood

Fix occurrences where the software pbl counts were changed before the
hardware was updated. This bug allowed another thread to overallocate
the hardware resources.

Add proper pbl accounting in case nes_reg_mr failed.

Signed-off-by: Don Wood
---

V3 change:
In nes_dealloc_fmr(), check return code from nes_dealloc_mw before pbl accounting.

diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 4cfb4d9..b42b17a 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -551,6 +551,7 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
 	struct nes_device *nesdev = nesvnic->nesdev;
 	struct nes_adapter *nesadapter = nesdev->nesadapter;
 	int i = 0;
+	int rc;
 
 	/* free the resources */
 	if (nesfmr->leaf_pbl_cnt == 0) {
@@ -572,7 +573,9 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
 	nesmr->ibmw.rkey = ibfmr->rkey;
 	nesmr->ibmw.uobject = NULL;
 
-	if (nesfmr->nesmr.pbls_used != 0) {
+	rc = nes_dealloc_mw(&nesmr->ibmw);
+
+	if ((rc == 0) && (nesfmr->nesmr.pbls_used != 0)) {
 		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
 		if (nesfmr->nesmr.pbl_4k) {
 			nesadapter->free_4kpbl += nesfmr->nesmr.pbls_used;
@@ -584,7 +587,7 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
 		spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
 	}
 
-	return nes_dealloc_mw(&nesmr->ibmw);
+	return rc;
 }
 
 
@@ -1993,7 +1996,16 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd,
 			stag, ret, cqp_request->major_code, cqp_request->minor_code);
 	major_code = cqp_request->major_code;
 	nes_put_cqp_request(nesdev, cqp_request);
-
+	if ((!ret || major_code) && pbl_count != 0) {
+		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
+		if (pbl_count > 1)
+			nesadapter->free_4kpbl += pbl_count+1;
+		else if (residual_page_count > 32)
+			nesadapter->free_4kpbl += pbl_count;
+		else
+			nesadapter->free_256pbl += pbl_count;
+		spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+	}
 	if (!ret)
 		return -ETIME;
 	else if (major_code)
@@ -2607,24 +2619,6 @@ static int nes_dereg_mr(struct ib_mr *ib_mr)
 	cqp_request->waiting = 1;
 	cqp_wqe = &cqp_request->cqp_wqe;
 
-	spin_lock_irqsave(&nesadapter->pbl_lock, flags);
-	if (nesmr->pbls_used != 0) {
-		if (nesmr->pbl_4k) {
-			nesadapter->free_4kpbl += nesmr->pbls_used;
-			if (nesadapter->free_4kpbl > nesadapter->max_4kpbl) {
-				printk(KERN_ERR PFX "free 4KB PBLs(%u) has exceeded the max(%u)\n",
-						nesadapter->free_4kpbl, nesadapter->max_4kpbl);
-			}
-		} else {
-			nesadapter->free_256pbl += nesmr->pbls_used;
-			if (nesadapter->free_256pbl > nesadapter->max_256pbl) {
-				printk(KERN_ERR PFX "free 256B PBLs(%u) has exceeded the max(%u)\n",
-						nesadapter->free_256pbl, nesadapter->max_256pbl);
-			}
-		}
-	}
-
-	spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
 	nes_fill_init_cqp_wqe(cqp_wqe, nesdev);
 	set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_WQE_OPCODE_IDX,
 			NES_CQP_DEALLOCATE_STAG | NES_CQP_STAG_VA_TO |
@@ -2642,11 +2636,6 @@ static int nes_dereg_mr(struct ib_mr *ib_mr)
 			" CQP Major:Minor codes = 0x%04X:0x%04X\n",
 			ib_mr->rkey, ret, cqp_request->major_code, cqp_request->minor_code);
 
-	nes_free_resource(nesadapter, nesadapter->allocated_mrs,
-			(ib_mr->rkey & 0x0fffff00) >> 8);
-
-	kfree(nesmr);
-
 	major_code = cqp_request->major_code;
 	minor_code = cqp_request->minor_code;
 
@@ -2662,8 +2651,33 @@ static int nes_dereg_mr(struct ib_mr *ib_mr)
 				" to destroy STag, ib_mr=%p, rkey = 0x%08X\n",
 				major_code, minor_code, ib_mr, ib_mr->rkey);
 		return -EIO;
-	} else
-		return 0;
+	}
+
+	if (nesmr->pbls_used != 0) {
+		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
+		if (nesmr->pbl_4k) {
+			nesadapter->free_4kpbl += nesmr->pbls_used;
+			if (nesadapter->free_4kpbl > nesadapter->max_4kpbl)
+				printk(KERN_ERR PFX "free 4KB PBLs(%u) has "
+					"exceeded the max(%u)\n",
+					nesadapter->free_4kpbl,
+					nesadapter->max_4kpbl);
+		} else {
+			nesadapter->free_256pbl += nesmr->pbls_used;
+			if (nesadapter->free_256pbl > nesadapter->max_256pbl)
+				printk(KERN_ERR PFX "free 256B PBLs(%u) has "
+					"exceeded the max(%u)\n",
+					nesadapter->free_256pbl,
+					nesadapter->max_256pbl);
+		}
+		spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+	}
+	nes_free_resource(nesadapter, nesadapter->allocated_mrs,
+			(ib_mr->rkey & 0x0fffff00) >> 8);
+
+	kfree(nesmr);
+
+	return 0;
 }
 
 
--
1.5.3.3

From sean.hefty at intel.com Mon Feb 2 15:19:31 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Feb 2009 15:19:31 -0800
Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names missing from osm_vendor_t ?
In-Reply-To: <20090202150658.0af72134.weiny2@llnl.gov>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com> <20090202150658.0af72134.weiny2@llnl.gov>
Message-ID: <9632920386E943489C39D8637052F404@amr.corp.intel.com>

>I don't remember the exact details but at the time saquery was first written,
>ibmad/ibumad did not have all the functionality I needed and the OpenSM vendor
>library did. That may no longer be the case and if not then I would support
>converting to using those other libraries.

libibumad does require the user to provide the address to the SA. Providing a
libibumad helper function to fill out ib_mad_addr_t for the local SA seems
reasonable.

I guess we can look at what it would take to convert it in detail to see if
anything is still missing from the lower libraries.
- Sean From weiny2 at llnl.gov Mon Feb 2 18:54:25 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 2 Feb 2009 18:54:25 -0800 Subject: [ofa-general] [PATCH] libibmad: Declare some enums as typedefs for cleaner function interfaces Message-ID: <20090202185425.729a80b3.weiny2@llnl.gov> Begining to clean up the libibmad interface. Ira >From 7e2f639905af92a6d4466d42af2e3e65bd717ffb Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Mon, 2 Feb 2009 10:21:18 -0800 Subject: [PATCH] Declare some enums as typedefs for cleaner function interfaces Signed-off-by: weiny2 at llnl.gov --- libibmad/include/infiniband/mad.h | 38 ++++++++++++++++++------------------ libibmad/src/fields.c | 22 ++++++++++---------- libibmad/src/resolve.c | 10 ++++---- 3 files changed, 35 insertions(+), 35 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 9ff4a3e..f235ab0 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -203,7 +203,7 @@ typedef struct ib_field { ib_mad_dump_fn *def_dump_fn; } ib_field_t; -enum MAD_FIELDS { +typedef enum MAD_FIELDS { IB_NO_FIELD, IB_GID_PREFIX_F, @@ -525,7 +525,7 @@ enum MAD_FIELDS { IB_GUID_GUID0_F, IB_FIELD_LAST_ /* must be last */ -}; +} mad_field_t; /* * SA RMPP section @@ -595,21 +595,21 @@ typedef struct ib_vendor_call { #define MAD_DEF_RETRIES 3 #define MAD_DEF_TIMEOUT_MS 1000 -enum { +typedef enum { IB_DEST_LID, IB_DEST_DRPATH, IB_DEST_GUID, IB_DEST_DRSLID, -}; +} mad_dest_t; -enum { +typedef enum { IB_NODE_CA = 1, IB_NODE_SWITCH, IB_NODE_ROUTER, NODE_RNIC, IB_NODE_MAX = NODE_RNIC -}; +} mad_node_type_t; /******************************************************************************/ @@ -631,20 +631,20 @@ static inline int ib_portid_set(ib_portid_t * portid, int lid, int qp, int qkey) } /* fields.c */ -MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field); -MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field, +MAD_EXPORT uint32_t 
mad_get_field(void *buf, int base_offs, mad_field_t field); +MAD_EXPORT void mad_set_field(void *buf, int base_offs, mad_field_t field, uint32_t val); /* field must be byte aligned */ -MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field); -MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field, +MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field); +MAD_EXPORT void mad_set_field64(void *buf, int base_offs, mad_field_t field, uint64_t val); -MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val); -MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val); -MAD_EXPORT void mad_decode_field(uint8_t * buf, int field, void *val); -MAD_EXPORT void mad_encode_field(uint8_t * buf, int field, void *val); -MAD_EXPORT int mad_print_field(int field, const char *name, void *val); -MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val); -MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val); +MAD_EXPORT void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val); +MAD_EXPORT void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val); +MAD_EXPORT void mad_decode_field(uint8_t * buf, mad_field_t field, void *val); +MAD_EXPORT void mad_encode_field(uint8_t * buf, mad_field_t field, void *val); +MAD_EXPORT int mad_print_field(mad_field_t field, const char *name, void *val); +MAD_EXPORT char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val); +MAD_EXPORT char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val); /* mad.c */ MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, @@ -729,7 +729,7 @@ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout); MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid, ib_portid_t * sm_id, int timeout); MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, - int dest_type, 
ib_portid_t * sm_id); + mad_dest_t dest, ib_portid_t * sm_id); MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid); @@ -737,7 +737,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, ib_portid_t * sm_id, int timeout, const void *srcport); int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, - int dest_type, ib_portid_t * sm_id, + mad_dest_t dest, ib_portid_t * sm_id, const void *srcport); int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, const void *srcport); diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index d5a1eb4..d435a2f 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -479,37 +479,37 @@ static void _get_array(void *buf, int base_offs, const ib_field_t * f, memcpy(val, (uint8_t *) buf + base_offs + bitoffs / 8, f->bitlen / 8); } -uint32_t mad_get_field(void *buf, int base_offs, int field) +uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field) { return _get_field(buf, base_offs, ib_mad_f + field); } -void mad_set_field(void *buf, int base_offs, int field, uint32_t val) +void mad_set_field(void *buf, int base_offs, mad_field_t field, uint32_t val) { _set_field(buf, base_offs, ib_mad_f + field, val); } -uint64_t mad_get_field64(void *buf, int base_offs, int field) +uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field) { return _get_field64(buf, base_offs, ib_mad_f + field); } -void mad_set_field64(void *buf, int base_offs, int field, uint64_t val) +void mad_set_field64(void *buf, int base_offs, mad_field_t field, uint64_t val) { _set_field64(buf, base_offs, ib_mad_f + field, val); } -void mad_set_array(void *buf, int base_offs, int field, void *val) +void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val) { _set_array(buf, base_offs, ib_mad_f + field, val); } -void mad_get_array(void *buf, int base_offs, int 
field, void *val) +void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val) { _get_array(buf, base_offs, ib_mad_f + field, val); } -void mad_decode_field(uint8_t * buf, int field, void *val) +void mad_decode_field(uint8_t * buf, mad_field_t field, void *val) { const ib_field_t *f = ib_mad_f + field; @@ -528,7 +528,7 @@ void mad_decode_field(uint8_t * buf, int field, void *val) _get_array(buf, 0, f, val); } -void mad_encode_field(uint8_t * buf, int field, void *val) +void mad_encode_field(uint8_t * buf, mad_field_t field, void *val) { const ib_field_t *f = ib_mad_f + field; @@ -602,21 +602,21 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val, valsz ? valsz : ALIGN(f->bitlen, 8) / 8); } -int mad_print_field(int field, const char *name, void *val) +int mad_print_field(mad_field_t field, const char *name, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return -1; return _mad_print_field(ib_mad_f + field, name, val, 0); } -char *mad_dump_field(int field, char *buf, int bufsz, void *val) +char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return 0; return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val); } -char *mad_dump_val(int field, char *buf, int bufsz, void *val) +char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return 0; diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index b62360b..faac1f9 100644 --- a/libibmad/src/resolve.c +++ b/libibmad/src/resolve.c @@ -92,7 +92,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, } int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, - int dest_type, ib_portid_t * sm_id, + mad_dest_t dest, ib_portid_t * sm_id, const void *srcport) { uint64_t guid; @@ -101,7 +101,7 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, ib_portid_t selfportid = { 
0 }; int selfport = 0; - switch (dest_type) { + switch (dest) { case IB_DEST_LID: lid = strtol(addr_str, 0, 0); if (!IB_LID_VALID(lid)) @@ -136,16 +136,16 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, return 0; default: - IBWARN("bad dest_type %d", dest_type); + IBWARN("bad dest %d", dest); } return -1; } -int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, int dest_type, +int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, mad_dest_t dest, ib_portid_t * sm_id) { - return ib_resolve_portid_str_via(portid, addr_str, dest_type, + return ib_resolve_portid_str_via(portid, addr_str, dest, sm_id, NULL); } -- 1.5.4.5 From krkumar2 at in.ibm.com Mon Feb 2 19:25:14 2009 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 3 Feb 2009 08:55:14 +0530 Subject: [ofa-general] Support for CXGB3 RNIC on P6 Message-ID: Hi, My colleague (at a different site) is trying to get a couple of Chelsio RNIC adapters working on p6 systems, but for some reason the cards aren't recognized on bootup. The same cards work on my xseries systems, and the following are the messages I get (there are no messages on his p6 systems): Feb 1 11:42:49 localhost kernel: Chelsio T3 Network Driver - version 1.1.1-ko Feb 1 11:42:49 localhost kernel: cxgb3 0000:22:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 Feb 1 11:42:49 localhost kernel: input: Power Button (FF) as /class/input/input1 Feb 1 11:42:49 localhost kernel: ACPI: Power Button (FF) [PWRF] Feb 1 11:42:49 localhost kernel: cxgb3 0000:22:00.0: Port 0 using 4 queue sets. Feb 1 11:42:49 localhost kernel: eth2: Chelsio T310 10GBASE-R RNIC (rev 4) PCI Express x8 MSI-X Feb 1 11:42:49 localhost kernel: eth2: 128MB CM, 256MB PMTX, 256MB PMRX, S/N: PT49070050 Is this revision of cxgb3 (rev4) not supported on p6? Or are we missing something to get it to work?
thanks, - KK From sean.hefty at intel.com Mon Feb 2 21:29:16 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 2 Feb 2009 21:29:16 -0800 Subject: [ofa-general] [PATCH] libibmad: Declare some enums as typedefs for cleaner function interfaces In-Reply-To: <20090202185425.729a80b3.weiny2@llnl.gov> References: <20090202185425.729a80b3.weiny2@llnl.gov> Message-ID: <475BCB11F74B45BB8D8794BAEEC380C2@amr.corp.intel.com> >@@ -595,21 +595,21 @@ typedef struct ib_vendor_call { > #define MAD_DEF_RETRIES 3 > #define MAD_DEF_TIMEOUT_MS 1000 > >-enum { >+typedef enum { > IB_DEST_LID, > IB_DEST_DRPATH, > IB_DEST_GUID, > IB_DEST_DRSLID, >-}; >+} mad_dest_t; > >-enum { >+typedef enum { > IB_NODE_CA = 1, > IB_NODE_SWITCH, > IB_NODE_ROUTER, > NODE_RNIC, > > IB_NODE_MAX = NODE_RNIC >-}; >+} mad_node_type_t; For consistency, should these be named enums? (MAD_DEST and MAD_NODE_TYPE) - Sean From devesh28 at gmail.com Mon Feb 2 23:49:26 2009 From: devesh28 at gmail.com (Devesh Sharma) Date: Tue, 3 Feb 2009 13:19:26 +0530 Subject: ***SPAM*** Re: ***SPAM*** [ofa-general] compiling OFED-1.2 with RHEL5.1 In-Reply-To: <309a667c0812292108w162e747ayfa132a60df729e01@mail.gmail.com> References: <309a667c0812290320m54efd47fr27affb1d5cc6dcec@mail.gmail.com> <4958CB6A.3090306@mellanox.co.il> <309a667c0812292108w162e747ayfa132a60df729e01@mail.gmail.com> Message-ID: <309a667c0902022349je89e655u279457e7585ad7ac@mail.gmail.com> Hello list, I have successfully ported ofed-1.2 for RHEL5.1. should I post the patch? On Tue, Dec 30, 2008 at 10:38 AM, Devesh Sharma wrote: > hello Tziporet, thanks for replying, I will try to do this, how many > changes do you think I will have to made, are they many? 
> If there are some problems I will contact to you for further help > > -Devesh > > On Mon, Dec 29, 2008 at 6:36 PM, Tziporet Koren < > tziporet at dev.mellanox.co.il> wrote: > >> Devesh Sharma wrote: >> >>> Hello all, >>> I am trying to compile OFED-1.2 with RHEL5.1 I know that this OS is not >>> supported by this >>> distribution, is there any work around other than switing to OFED-1.2.5 >>> or OFED-1.3? >>> >>> >> I don't think there is a workaround >> You can try to take RHEL 5.1 backports from 1.2.5 and use them on 1.2 but >> I guess you will have to change them >> >> Tziporet >> >> > From tziporet at dev.mellanox.co.il Tue Feb 3 00:22:37 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 03 Feb 2009 10:22:37 +0200 Subject: ***SPAM*** [ofa-general] compiling OFED-1.2 with RHEL5.1 In-Reply-To: <309a667c0902022349je89e655u279457e7585ad7ac@mail.gmail.com> References: <309a667c0812290320m54efd47fr27affb1d5cc6dcec@mail.gmail.com> <4958CB6A.3090306@mellanox.co.il> <309a667c0812292108w162e747ayfa132a60df729e01@mail.gmail.com> <309a667c0902022349je89e655u279457e7585ad7ac@mail.gmail.com> Message-ID: <4987FECD.6000409@mellanox.co.il> Devesh Sharma wrote: > Hello list, > > I have successfully ported ofed-1.2 for RHEL5.1. should I post the patch? > > On Tue, Dec 30, 2008 at 10:38 AM, Devesh Sharma > wrote: > > hello Tziporet, thanks for replying, I will try to do this, how > many changes do you think I will have to made, are they many?
> If there are some problems I will contact to you for further help > Why not - maybe someone will make use of it too Tziporet From devesh28 at gmail.com Tue Feb 3 00:35:36 2009 From: devesh28 at gmail.com (Devesh Sharma) Date: Tue, 3 Feb 2009 14:05:36 +0530 Subject: ***SPAM*** Re: ***SPAM*** [ofa-general] compiling OFED-1.2 with RHEL5.1 In-Reply-To: <4987FECD.6000409@mellanox.co.il> References: <309a667c0812290320m54efd47fr27affb1d5cc6dcec@mail.gmail.com> <4958CB6A.3090306@mellanox.co.il> <309a667c0812292108w162e747ayfa132a60df729e01@mail.gmail.com> <309a667c0902022349je89e655u279457e7585ad7ac@mail.gmail.com> <4987FECD.6000409@mellanox.co.il> Message-ID: <309a667c0902030035i1873124au367a05b35fc8eed9@mail.gmail.com> I am in the process of developing a script to add the backport kernel_addons taken from OFED-1.3 to OFED-1.2; once that is complete I will post the patch and script to the list...:) On Tue, Feb 3, 2009 at 1:52 PM, Tziporet Koren wrote: > Devesh Sharma wrote: > >> Hello list, >> >> I have successfully ported ofed-1.2 for RHEL5.1. should I post the patch? >> >> On Tue, Dec 30, 2008 at 10:38 AM, Devesh Sharma > devesh28 at gmail.com>> wrote: >> >> hello Tziporet, thanks for replying, I will try to do this, how >> many changes do you think I will have to made, are they many? >> If there are some problems I will contact to you for further help >> >> > Why not - maybe someone will make use of it too > > Tziporet > From ogerlitz at voltaire.com Tue Feb 3 00:36:21 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 03 Feb 2009 10:36:21 +0200 Subject: [ofa-general] Re: [PATCH v2] opensm/include/iba/ib_types.h: Definition of Congestion Control MADs In-Reply-To: <4868B928.4070307@dev.mellanox.co.il> References: <4868B928.4070307@dev.mellanox.co.il> Message-ID: <49880205.7070605@voltaire.com> Yevgeny Kliteynik wrote: > Adding definition of all the Congestion Control (CC) MADs to ib_types.h. > V2 - fixed comment typo > Hi Yevgeny, Sasha I wonder where this patch stands; is there any reason not to merge it? Or. From kliteyn at dev.mellanox.co.il Tue Feb 3 01:09:54 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 03 Feb 2009 11:09:54 +0200 Subject: [ofa-general] Re: [PATCH v2] opensm/include/iba/ib_types.h: Definition of Congestion Control MADs In-Reply-To: <49880205.7070605@voltaire.com> References: <4868B928.4070307@dev.mellanox.co.il> <49880205.7070605@voltaire.com> Message-ID: <498809E2.1050306@dev.mellanox.co.il> Hi Or, Or Gerlitz wrote: > Yevgeny Kliteynik wrote: >> Adding definition of all the Congestion Control (CC) MADs to ib_types.h. >> V2 - fixed comment typo >> > Hi Yevgeny, Sasha > > I wonder where this patch stands, any reason not to merge it? The updated CC Annex, which will contain many packet format changes, hasn't been published yet. -- Yevgeny > Or. > > > From ogerlitz at voltaire.com Tue Feb 3 01:21:23 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 3 Feb 2009 11:21:23 +0200 (IST) Subject: [ofa-general] impossibility to bind a device/port with the rdma-cm when the port is down Message-ID: Hi Sean, It turns out that with the IPoIB port space, it's impossible to bind a device/port through the rdma-cm when the port is down.
This is because of the following call sequence: cma_acquire_dev --> cma_set_qkey/ps=IPOIB --> ib_sa_get_mcmember_rec, where the latter returns EADDRNOTAVAIL since, when the port went down, the core multicast code flushed its database. I see that the qkey is actually used by the rdma-cm when the user attempts to join a multicast group, when "connecting" a UD QP, and when creating a UD QP. I assume there must be a way to defer this resolution to a later stage such that binding would be possible when the port is down; thoughts? Or. $ udaddy -b 10.10.5.157 -p 2 udaddy: starting server udaddy: bind address failed: -1 test complete return status -1 $ strace udaddy -b 10.10.5.157 -p 2 [...] write(5, "\2\0\0\0(\0\0\0\0\0\0\0\0\0\0\0\2\0\6\34\n\n\5\235\0\0\0\0\0\0\0\0"..., 48) = -1 EADDRNOTAVAIL (Cannot assign requested address) write(1, "udaddy: bind address failed: -1\n", 32udaddy: bind address failed: -1) = 32 From dorfman.eli at gmail.com Tue Feb 3 01:28:45 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 03 Feb 2009 11:28:45 +0200 Subject: [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet configuration after SIGHUP In-Reply-To: <20090202205924.GF5910@sashak.voltaire.com> References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com> <20090202205924.GF5910@sashak.voltaire.com> Message-ID: <49880E4D.2090107@gmail.com> Sasha Khapyorsky wrote: > On 16:32 Mon 26 Jan , Eli Dorfman (Voltaire) wrote: >> rescan subnet configuration after SIGHUP >> call osm_subn_rescan_conf_files() after SIGHUP. >> this is important when priority is changed and SM is in standby. >> in that case it will not send capability mask trap and will not become master.
>> >> Signed-off-by: Eli Dorfman >> --- >> opensm/opensm/main.c | 1 + >> 1 files changed, 1 insertions(+), 0 deletions(-) >> >> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c >> index f786192..0f7b822 100644 >> --- a/opensm/opensm/main.c >> +++ b/opensm/opensm/main.c >> @@ -507,6 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm) >> osm_hup_flag = 0; >> /* a HUP signal should only start a new heavy sweep */ >> p_osm->subn.force_heavy_sweep = TRUE; >> + osm_subn_rescan_conf_files(&p_osm->subn); > > Is it synchronized with sweep? If regular (scheduled by timer) sweep > starts in a middle of osm_subn_rescan_conf_files() (when QoS parameters > are freed..., etc.). I think it is not. > I assume it is not. What about the following (though it uses yet another flag...) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 8863e47..88c977d 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -169,6 +169,7 @@ typedef struct osm_subn_opt { uint32_t polling_retry_number; uint32_t max_msg_fifo_timeout; boolean_t force_heavy_sweep; + boolean_t rescan_conf_file; uint8_t log_flags; char *dump_files_dir; char *log_file; diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index de38056..f2d7846 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -507,7 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm) osm_hup_flag = 0; /* a HUP signal should only start a new heavy sweep */ p_osm->subn.force_heavy_sweep = TRUE; - osm_subn_rescan_conf_files(&p_osm->subn); + p_osm->subn.rescan_conf_file = TRUE; osm_opensm_sweep(p_osm); } } diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index fc7ceb9..87a5db9 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1042,6 +1042,13 @@ static void do_sweep(osm_sm_t * sm) ib_api_status_t status; osm_remote_sm_t *p_remote_sm; + if
(sm->p_subn->rescan_conf_file) { + if (osm_subn_rescan_conf_files(sm->p_subn) < 0) + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " + "osm_subn_rescan_conf_file failed\n"); + sm->p_subn->rescan_conf_file = FALSE; + } + if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER && sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING) return; From ogerlitz at Voltaire.com Tue Feb 3 01:43:31 2009 From: ogerlitz at Voltaire.com (Or Gerlitz) Date: Tue, 03 Feb 2009 11:43:31 +0200 Subject: [ofa-general] Re: [PATCH v2] opensm/include/iba/ib_types.h: Definition of Congestion Control MADs In-Reply-To: <498809E2.1050306@dev.mellanox.co.il> References: <4868B928.4070307@dev.mellanox.co.il> <49880205.7070605@voltaire.com> <498809E2.1050306@dev.mellanox.co.il> Message-ID: <498811C3.1020005@Voltaire.com> Yevgeny Kliteynik wrote: > The updated CC Annex that will contain many packets > format changes hasn't been published yet. OK, got it. Or. From o.w.saastad at usit.uio.no Tue Feb 3 01:44:02 2009 From: o.w.saastad at usit.uio.no (Ole Widar Saastad) Date: Tue, 03 Feb 2009 10:44:02 +0100 Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes Message-ID: <1233654242.1364.39.camel@pyren.uio.no> I have problems using the OFED 1.4 software on the Sun x4600 nodes and need help getting it to work. We plan to run GPFS over IB on these nodes in addition to MPI. Sun x4600 nodes with 8 quad-core CPUs, Quad-Core AMD Opteron(tm) Processor 8380 OS is Rocks release 4. centos-release-4-4.2/x86_64/ Linux compute-0-0.local 2.6.9-67.0.15.ELlargesmp #1 SMP Thu May 8 11:03:57 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux Our 300+ nodes (Sun x2200 with quad-core CPUs) run fine with OFED 1.4 (and 1.3); they have almost the same kernel: Linux compute-4-0.local 2.6.9-67.0.15.ELsmp #1 SMP Thu May 8 10:50:20 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux The only difference is ELsmp instead of ELlargesmp. More information: dmesg prints out the following error message: Losing some ticks...
checking if CPU frequency changed. modulecmd[17499]: segfault at 0000007fc0b01688 rip 000000000060aa38 rsp 0000007fbfffcfd8 error 6 mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008) mlx4_core: Initializing 0000:02:00.0 ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 19 (level, low) -> IRQ 193 PCI: Setting latency timer of device 0000:02:00.0 to 64 mlx4_core 0000:02:00.0: Requested number of MACs is too much for port 1, reducing to 1. MSI INIT SUCCESS mlx4_core 0000:02:00.0: command 0x13 failed: fw status = 0x1 mlx4_core 0000:02:00.0: SW2HW_EQ failed (-5) mlx4_core 0000:02:00.0: Failed to initialize event queue table, aborting. mlx4_core: probe of 0000:02:00.0 failed with error -5 The following software is installed: Select Option [1-5]:3 kernel-ib libibverbs libibverbs-devel libibverbs-utils libmthca libmlx4 libcxgb3 libnes libipathverbs libibcommon libibcommon-devel libibumad libibumad-devel ofed-docs ofed-scripts ibvexdmtools qlgc_vnic_daemon To confirm that the card is present, lspci returns: 02:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0) -- Ole W. Saastad, dr. scient.
Scientific Computing Group, USIT, University of Oslo http://hpc.uio.no From vlad at lists.openfabrics.org Tue Feb 3 03:18:10 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 3 Feb 2009 03:18:10 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090203-0200 daily build status Message-ID: <20090203111810.EC436E6114B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with 
linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Tue Feb 3 04:24:50 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 3 Feb 2009 14:24:50 +0200 Subject: [ofa-general] [PATCH 4/4] opensm/osm_subnet.c support subnet configuration rescan and update In-Reply-To: <497DC9FC.2050907@gmail.com> References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com> Message-ID: <20090203122450.GB11874@sashak.voltaire.com> On 16:34 Mon 26 Jan , Eli Dorfman (Voltaire) wrote: > support subnet configuration rescan and update > subnet configuration parameters are rescanned every heavy sweep. > every parameter is parsed by parse function according to its type. > some params require special post update function to setup them. > every parameter has also a flag that specifies whether it > can be updated during runtime. > > Signed-off-by: Eli Dorfman I'm applying this with several changes: - disable the update option and setup function for all string parameters: as I commented originally, opts_parse_charp() will leak memory, and this cannot be ignored if the config file is rescanned. The exception is the QoS string parameters, where the memory leak is handled. - small fixes I mentioned in the original review.
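The leak Sasha describes is typical of a string-option parser that stores a fresh strdup() result without releasing the value kept from the previous scan. A minimal sketch of a rescan-safe variant follows. `parse_str_opt()` and its return convention are hypothetical and are not OpenSM's actual `opts_parse_charp()`. The sketch frees the previous value only after the new copy succeeds, and it reports whether the value actually changed.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical rescan-safe string option parser (not OpenSM's
 * opts_parse_charp()). *p_opt holds the value kept from the previous
 * scan, or NULL. Returns 1 if the stored value changed, 0 if it is
 * unchanged, and -1 if allocation failed (the old value is kept). */
static int parse_str_opt(char **p_opt, const char *new_val)
{
	char *copy = NULL;

	assert(p_opt != NULL);

	/* Same value as in the last scan: keep the existing allocation. */
	if ((*p_opt == NULL && new_val == NULL) ||
	    (*p_opt != NULL && new_val != NULL && strcmp(*p_opt, new_val) == 0))
		return 0;

	if (new_val != NULL) {
		copy = strdup(new_val);
		if (copy == NULL)
			return -1;
	}

	/* Release the previous scan's copy so repeated rescans do not leak. */
	free(*p_opt);
	*p_opt = copy;
	return 1;
}
```

A caller could use the return value to run a parameter's post-update setup function only when the value really changed, which avoids redoing that work on every heavy sweep. The caller still owns the final string and must free() it at shutdown.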
Sasha From sashak at voltaire.com Tue Feb 3 04:32:49 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 3 Feb 2009 14:32:49 +0200 Subject: [ofa-general] [PATCH 1/4] opensm/osm_opensm.[ch] make setup and destroy routing engines fucntions global In-Reply-To: <497DC937.7020102@gmail.com> References: <497DC87F.2090308@gmail.com> <497DC937.7020102@gmail.com> Message-ID: <20090203123249.GC11874@sashak.voltaire.com> On 16:31 Mon 26 Jan , Eli Dorfman (Voltaire) wrote: > make setup and destroy routing engines fucntions global. > change setup_routing_engines() and destroy_routing_engines() > declaration Below is a comment about this patch. I'm not applying this yet and will comment separately about its usage. > > Signed-off-by: Eli Dorfman > --- > opensm/include/opensm/osm_opensm.h | 53 ++++++++++++++++++++++++++++++++++++ > opensm/opensm/osm_opensm.c | 5 ++- > 2 files changed, 56 insertions(+), 2 deletions(-) > > diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h > index c121be4..5b0a1dd 100644 > --- a/opensm/include/opensm/osm_opensm.h > +++ b/opensm/include/opensm/osm_opensm.h > @@ -458,6 +458,59 @@ osm_opensm_wait_for_subnet_up(IN osm_opensm_t * const p_osm, > * SEE ALSO > *********/ > > +/****f* OpenSM: OpenSM/setup_routing_engines > +* NAME > +* setup_routing_engines > +* > +* DESCRIPTION > +* This function constructs an routing engines. > +* > +* SYNOPSIS > +*/ > +void setup_routing_engines(osm_opensm_t *osm, const char *name); > +/* > +* PARAMETERS > +* p_osm > +* [in] Pointer to a OpenSM object to construct. > +* > +* name > +* [in] Routing engine names. > +* > +* RETURN VALUE > +* This function does not return a value. > +* > +* NOTES > +* Setup of routing engines > +* > +* SEE ALSO > +* destroy_routing_engines > +*********/ > + > +/****f* OpenSM: OpenSM/destroy_routing_engines > +* NAME > +* destroy_routing_engines > +* > +* DESCRIPTION > +* This function constructs an routing engines. 
> +* > +* SYNOPSIS > +*/ > +void destroy_routing_engines(osm_opensm_t *osm); For public names we are using 'osm_' prefix in OpenSM. Sasha > +/* > +* PARAMETERS > +* p_osm > +* [in] Pointer to a OpenSM object to construct. > +* > +* RETURN VALUE > +* This function does not return a value. > +* > +* NOTES > +* Setup of routing engines > +* > +* SEE ALSO > +* setup_routing_engines > +*********/ > + > /****f* OpenSM: OpenSM/osm_routing_engine_type_str > * NAME > * osm_routing_engine_type_str > diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c > index 7de2e5b..8ecb942 100644 > --- a/opensm/opensm/osm_opensm.c > +++ b/opensm/opensm/osm_opensm.c > @@ -186,7 +186,7 @@ static void setup_routing_engine(osm_opensm_t *osm, const char *name) > "cannot find or setup routing engine \'%s\'", name); > } > > -static void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) > +void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) > { > char *name, *str, *p; > > @@ -224,7 +224,7 @@ void osm_opensm_construct(IN osm_opensm_t * const p_osm) > > /********************************************************************** > **********************************************************************/ > -static void destroy_routing_engines(osm_opensm_t *osm) > +void destroy_routing_engines(osm_opensm_t *osm) > { > struct osm_routing_engine *r, *next; > > @@ -236,6 +236,7 @@ static void destroy_routing_engines(osm_opensm_t *osm) > r->delete(r->context); > free(r); > } > + osm->routing_engine_list = NULL; > } > > /********************************************************************** > -- > 1.5.5 > From sashak at voltaire.com Tue Feb 3 04:37:06 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 3 Feb 2009 14:37:06 +0200 Subject: [ofa-general] [PATCH 4/4] opensm/osm_subnet.c support subnet configuration rescan and update In-Reply-To: <497DC9FC.2050907@gmail.com> References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com> 
Message-ID: <20090203123706.GD11874@sashak.voltaire.com> On 16:34 Mon 26 Jan , Eli Dorfman (Voltaire) wrote: [snip...] > + > +static void opts_setup_routing_engine(osm_subn_t *p_subn, void *p_val) > +{ > + char *routing_engine_names = (char *) p_val; > + > + destroy_routing_engines(p_subn->p_osm); > + setup_routing_engines(p_subn->p_osm, routing_engine_names); > +} This will probably work with updn and minhops, but it will certainly be destructive when the LASH routing engine is used. LASH keeps internal data between sweep cycles, which it uses to answer SA PathRecord queries with the correct SL value. So I think the routing engine "switch" should be a bit smarter. Sasha From sashak at voltaire.com Tue Feb 3 04:44:07 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 3 Feb 2009 14:44:07 +0200 Subject: [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet configuration after SIGHUP In-Reply-To: <49880E4D.2090107@gmail.com> References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com> <20090202205924.GF5910@sashak.voltaire.com> <49880E4D.2090107@gmail.com> Message-ID: <20090203124407.GE11874@sashak.voltaire.com> On 11:28 Tue 03 Feb , Eli Dorfman (Voltaire) wrote: > Sasha Khapyorsky wrote: > > On 16:32 Mon 26 Jan , Eli Dorfman (Voltaire) wrote: > >> rescan subnet configuration after SIGHUP > >> call osm_subn_rescan_conf_files() after SIGHUP. > >> this is important when priority is changed and SM is in standby. > >> in that case it will not send capability mask trap and will not become master.
> >> > >> Signed-off-by: Eli Dorfman > >> --- > >> opensm/opensm/main.c | 1 + > >> 1 files changed, 1 insertions(+), 0 deletions(-) > >> > >> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > >> index f786192..0f7b822 100644 > >> --- a/opensm/opensm/main.c > >> +++ b/opensm/opensm/main.c > >> @@ -507,6 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm) > >> osm_hup_flag = 0; > >> /* a HUP signal should only start a new heavy sweep */ > >> p_osm->subn.force_heavy_sweep = TRUE; > >> + osm_subn_rescan_conf_files(&p_osm->subn); > > > > Is it synchronized with sweep? If regular (scheduled by timer) sweep > > starts in a middle of osm_subn_rescan_conf_files() (when QoS parameters > > are freed..., etc.). I think it is not. > > > i assume it is not. > what about the the following (though it uses yet another flag...) > > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > index 8863e47..88c977d 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -169,6 +169,7 @@ typedef struct osm_subn_opt { > uint32_t polling_retry_number; > uint32_t max_msg_fifo_timeout; > boolean_t force_heavy_sweep; > + boolean_t rescan_conf_file; > uint8_t log_flags; > char *dump_files_dir; > char *log_file; > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > index de38056..f2d7846 100644 > --- a/opensm/opensm/main.c > +++ b/opensm/opensm/main.c > @@ -507,7 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm) > osm_hup_flag = 0; > /* a HUP signal should only start a new heavy sweep */ > p_osm->subn.force_heavy_sweep = TRUE; > - osm_subn_rescan_conf_files(&p_osm->subn); > + p_osm->subn.rescan_conf_file = TRUE; > osm_opensm_sweep(p_osm); > } > } > diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c > index fc7ceb9..87a5db9 100644 > --- a/opensm/opensm/osm_state_mgr.c > +++ b/opensm/opensm/osm_state_mgr.c > @@ -1042,6 +1042,13 @@ 
static void do_sweep(osm_sm_t * sm) > ib_api_status_t status; > osm_remote_sm_t *p_remote_sm; > > + if (sm->p_subn->rescan_conf_file) { > + if (osm_subn_rescan_conf_files(sm->p_subn) < 0) > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " > + "osm_subn_rescan_conf_file failed\n"); > + sm->p_subn->rescan_conf_file = FALSE; > + } > + What would be wrong with using exiting 'force_heavy_sweep' flag? Another issue with this patch - config file will be rescanned later again (during heavy sweep). It would be really nice to avoid such obviously unneeded double parsing. Sasha > if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER && > sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING) > return; From dorfman.eli at gmail.com Tue Feb 3 05:40:50 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 03 Feb 2009 15:40:50 +0200 Subject: [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet configuration after SIGHUP In-Reply-To: <20090203124407.GE11874@sashak.voltaire.com> References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com> <20090202205924.GF5910@sashak.voltaire.com> <49880E4D.2090107@gmail.com> <20090203124407.GE11874@sashak.voltaire.com> Message-ID: <49884962.5070601@gmail.com> Sasha Khapyorsky wrote: > On 11:28 Tue 03 Feb , Eli Dorfman (Voltaire) wrote: >> Sasha Khapyorsky wrote: >>> On 16:32 Mon 26 Jan , Eli Dorfman (Voltaire) wrote: >>>> rescan subnet configuration after SIGHUP >>>> call osm_subn_rescan_conf_files() after SIGHUP. >>>> this is important when priority is changed and SM is in standby. >>>> in that case it will not send capability mask trap and will not become master. 
>>>> >>>> Signed-off-by: Eli Dorfman >>>> --- >>>> opensm/opensm/main.c | 1 + >>>> 1 files changed, 1 insertions(+), 0 deletions(-) >>>> >>>> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c >>>> index f786192..0f7b822 100644 >>>> --- a/opensm/opensm/main.c >>>> +++ b/opensm/opensm/main.c >>>> @@ -507,6 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm) >>>> osm_hup_flag = 0; >>>> /* a HUP signal should only start a new heavy sweep */ >>>> p_osm->subn.force_heavy_sweep = TRUE; >>>> + osm_subn_rescan_conf_files(&p_osm->subn); >>> Is it synchronized with sweep? If regular (scheduled by timer) sweep >>> starts in a middle of osm_subn_rescan_conf_files() (when QoS parameters >>> are freed..., etc.). I think it is not. >>> >> i assume it is not. >> what about the the following (though it uses yet another flag...) >> >> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h >> index 8863e47..88c977d 100644 >> --- a/opensm/include/opensm/osm_subnet.h >> +++ b/opensm/include/opensm/osm_subnet.h >> @@ -169,6 +169,7 @@ typedef struct osm_subn_opt { >> uint32_t polling_retry_number; >> uint32_t max_msg_fifo_timeout; >> boolean_t force_heavy_sweep; >> + boolean_t rescan_conf_file; >> uint8_t log_flags; >> char *dump_files_dir; >> char *log_file; >> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c >> index de38056..f2d7846 100644 >> --- a/opensm/opensm/main.c >> +++ b/opensm/opensm/main.c >> @@ -507,7 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm) >> osm_hup_flag = 0; >> /* a HUP signal should only start a new heavy sweep */ >> p_osm->subn.force_heavy_sweep = TRUE; >> - osm_subn_rescan_conf_files(&p_osm->subn); >> + p_osm->subn.rescan_conf_file = TRUE; >> osm_opensm_sweep(p_osm); >> } >> } >> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c >> index fc7ceb9..87a5db9 100644 >> --- a/opensm/opensm/osm_state_mgr.c >> +++ b/opensm/opensm/osm_state_mgr.c >> 
@@ -1042,6 +1042,13 @@ static void do_sweep(osm_sm_t * sm) >> ib_api_status_t status; >> osm_remote_sm_t *p_remote_sm; >> >> + if (sm->p_subn->rescan_conf_file) { >> + if (osm_subn_rescan_conf_files(sm->p_subn) < 0) >> + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " >> + "osm_subn_rescan_conf_file failed\n"); >> + sm->p_subn->rescan_conf_file = FALSE; >> + } >> + > > What would be wrong with using exiting 'force_heavy_sweep' flag? > 'force_heavy_sweep' flag is set in other occasions as well > Another issue with this patch - config file will be rescanned later > again (during heavy sweep). It would be really nice to avoid such > obviously unneeded double parsing. > that is correct, but we need a special flag to handle the priority change when SM is in standby. In that case a rescan at the beginning of do_sweep is a must, otherwise it will simply return without doing anything. what was the reason of putting rescan not in the beginning of do_sweep(). If none then we can simply rescan as first step. Eli > Sasha > >> if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER && >> sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING) >> return; From dorfman.eli at gmail.com Tue Feb 3 05:43:21 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 03 Feb 2009 15:43:21 +0200 Subject: [ofa-general] [PATCH 4/4] opensm/osm_subnet.c support subnet configuration rescan and update In-Reply-To: <20090203123706.GD11874@sashak.voltaire.com> References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com> <20090203123706.GD11874@sashak.voltaire.com> Message-ID: <498849F9.1030700@gmail.com> Sasha Khapyorsky wrote: > On 16:34 Mon 26 Jan , Eli Dorfman (Voltaire) wrote: > > [snip...] 
>> +
>> +static void opts_setup_routing_engine(osm_subn_t *p_subn, void *p_val)
>> +{
>> +	char *routing_engine_names = (char *) p_val;
>> +
>> +	destroy_routing_engines(p_subn->p_osm);
>> +	setup_routing_engines(p_subn->p_osm, routing_engine_names);
>> +}
>
> This probably can work with updn and minhops, but it certainly will be
> destructive when LASH routing engine is used. LASH stores internal data
> between sweep cycles, it is used to answer correct SL value in SA
> PathRecord queries. So I think routing engine "switch" should be a bit
> smarter.
>
that means that destroy and setup routing engine functions should be improved.
what do you suggest in the meantime?
limit this to minhop/updn?

Eli

From sashak at voltaire.com Tue Feb 3 05:42:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 15:42:04 +0200
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_perfmgr_db.h: Remove unused typedef
In-Reply-To: <1233601610.8992.389.camel@bertha1.edm.orcorp.ca>
References: <1233601610.8992.389.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090203134204.GG11874@sashak.voltaire.com>

On 12:06 Mon 02 Feb , Hal Rosenstock wrote:
> Sasha,
>
> Trivial patch to remove an unused typedef in perfmgr.

Applied. Thanks.

Sasha

From sashak at voltaire.com Tue Feb 3 05:42:23 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 15:42:23 +0200
Subject: [ofa-general] Re: [PATCH][MINOR] opensm/osm_perfmgr.c: Eliminate memory leak on error
In-Reply-To: <1233601621.8992.390.camel@bertha1.edm.orcorp.ca>
References: <1233601621.8992.390.camel@bertha1.edm.orcorp.ca>
Message-ID: <20090203134223.GH11874@sashak.voltaire.com>

On 12:07 Mon 02 Feb , Hal Rosenstock wrote:
>
> Minor patch to osm_perfmgr.c to eliminate a memory leak on error in
> osm_perfmgr_init.

Applied. Thanks.
Sasha

From sashak at voltaire.com Tue Feb 3 05:48:31 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 15:48:31 +0200
Subject: [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet configuration after SIGHUP
In-Reply-To: <49884962.5070601@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com> <20090202205924.GF5910@sashak.voltaire.com> <49880E4D.2090107@gmail.com> <20090203124407.GE11874@sashak.voltaire.com> <49884962.5070601@gmail.com>
Message-ID: <20090203134831.GI11874@sashak.voltaire.com>

On 15:40 Tue 03 Feb , Eli Dorfman (Voltaire) wrote:
> >> --- a/opensm/opensm/osm_state_mgr.c
> >> +++ b/opensm/opensm/osm_state_mgr.c
> >> @@ -1042,6 +1042,13 @@ static void do_sweep(osm_sm_t * sm)
> >>  	ib_api_status_t status;
> >>  	osm_remote_sm_t *p_remote_sm;
> >>  
> >> +	if (sm->p_subn->rescan_conf_file) {
> >> +		if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
> >> +			OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
> >> +				"osm_subn_rescan_conf_file failed\n");
> >> +		sm->p_subn->rescan_conf_file = FALSE;
> >> +	}
> >> +
> >
> > What would be wrong with using exiting 'force_heavy_sweep' flag?
> >
> 'force_heavy_sweep' flag is set in other occasions as well

Yes. And file is rescanned on heavy sweep (later) anyway :)

>
> > Another issue with this patch - config file will be rescanned later
> > again (during heavy sweep). It would be really nice to avoid such
> > obviously unneeded double parsing.
> >
> that is correct, but we need a special flag to handle the priority change when SM
> is in standby.
> In that case a rescan at the beginning of do_sweep is a must, otherwise it will
> simply return without doing anything.
> what was the reason of putting rescan not in the beginning of do_sweep().

I don't remember many details, but originally it was used for updating
only selected parameters. Also (I guess) for eliminating rescanning on
light sweeps.

> If none then we can simply rescan as first step.
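To make the ordering question concrete, here is a minimal single-threaded model in C. It is a sketch under stated assumptions, not OpenSM code: `struct subnet`, `rescan_conf()`, `manager_loop_once()` and the priority fields are hypothetical, and the names only loosely echo `osm_hup_flag`, `force_heavy_sweep` and `do_sweep()` from the patches. The signal handler only sets a flag, the loop turns it into a heavy sweep request, and the sweep rescans the configuration as its very first step, before the early return taken by a non-master SM, which is the placement Eli proposes here.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Hypothetical, single-threaded model of the SIGHUP -> sweep hand-off.
 * Names loosely echo OpenSM, but none of this is OpenSM code. */
enum sm_state { SM_STANDBY, SM_MASTER };

struct subnet {
	enum sm_state state;
	int force_heavy_sweep;
	int priority;       /* SM priority currently in effect */
	int conf_priority;  /* SM priority currently written in the config file */
	int rescans;        /* number of times the config was parsed */
};

static volatile sig_atomic_t hup_flag;

/* Signal handler: only sets a flag; all real work is deferred. */
static void on_sighup(int sig)
{
	(void)sig;
	hup_flag = 1;
}

/* Stand-in for re-parsing the config file; always succeeds here. */
static int rescan_conf(struct subnet *s)
{
	s->priority = s->conf_priority;
	s->rescans++;
	return 0;
}

static void do_sweep(struct subnet *s)
{
	assert(s != NULL);
	/* Rescan first, and only when a heavy sweep was requested, so that a
	 * standby SM sees a new priority before the early return below. */
	if (s->force_heavy_sweep && rescan_conf(s) < 0)
		fprintf(stderr, "config rescan failed\n");
	if (s->state != SM_MASTER)
		return;
	/* ... discovery, routing, etc. would run here ... */
	s->force_heavy_sweep = 0;
}

/* One pass of the manager loop: turn a pending HUP into a heavy sweep
 * request. The sweep itself does the rescan, so in this model the config
 * is never parsed concurrently with a sweep. */
static void manager_loop_once(struct subnet *s)
{
	if (hup_flag) {
		hup_flag = 0;
		s->force_heavy_sweep = 1;
		do_sweep(s);
	}
}
```

Sasha's synchronization concern comes from the real sweep being able to run concurrently with the main loop; this model sidesteps that by calling `do_sweep()` directly, so it only illustrates where the rescan sits relative to the standby early return, not how OpenSM wakes its sweeper.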
After a quick thought, it could be an option; just verify that it will not break functionality a lot.

Sasha

From sashak at voltaire.com Tue Feb 3 05:53:50 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 15:53:50 +0200
Subject: [ofa-general] [PATCH 4/4] opensm/osm_subnet.c support subnet configuration rescan and update
In-Reply-To: <498849F9.1030700@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com> <20090203123706.GD11874@sashak.voltaire.com> <498849F9.1030700@gmail.com>
Message-ID: <20090203135350.GJ11874@sashak.voltaire.com>

On 15:43 Tue 03 Feb , Eli Dorfman (Voltaire) wrote:
> >
> > This probably can work with updn and minhops, but it certainly will be
> > destructive when LASH routing engine is used. LASH stores internal data
> > between sweep cycles, it is used to answer correct SL value in SA
> > PathRecord queries. So I think routing engine "switch" should be a bit
> > smarter.
> >
>
> that means that destroy and setup routing engine functions should be improved.
> what do you suggest in the meantime?

I meant that instead of destroy/setup pair we need an update function
which will carefully compare a current routing engine list against
requested one and in any case will not destroy a routing engine(s)
which is in use.

> limit this to minhop/updn?

No.
Sasha

From dorfman.eli at gmail.com Tue Feb 3 06:03:08 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 03 Feb 2009 16:03:08 +0200
Subject: [ofa-general] [PATCH 4/4] opensm/osm_subnet.c support subnet configuration rescan and update
In-Reply-To: <20090203135350.GJ11874@sashak.voltaire.com>
References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com> <20090203123706.GD11874@sashak.voltaire.com> <498849F9.1030700@gmail.com> <20090203135350.GJ11874@sashak.voltaire.com>
Message-ID: <49884E9C.1090704@gmail.com>

Sasha Khapyorsky wrote:
> On 15:43 Tue 03 Feb , Eli Dorfman (Voltaire) wrote:
>>> This probably can work with updn and minhops, but it certainly will be
>>> destructive when LASH routing engine is used. LASH stores internal data
>>> between sweep cycles, it is used to answer correct SL value in SA
>>> PathRecord queries. So I think routing engine "switch" should be a bit
>>> smarter.
>>>
>> that means that destroy and setup routing engine functions should be improved.
>> what do you suggest in the meantime?
>
> I meant that instead of destroy/setup pair we need an update function
> which will carefully compare a current routing engine list against
> requested one and in any case will not destroy a routing engine(s)
> which is in use.

so if we find the diff between the old routing engine list and the new one and
use destroy to remove unused engines and setup for new engines, is it good enough?

>
>> limit this to minhop/updn?
>
> No.
>
> Sasha

From sashak at voltaire.com Tue Feb 3 06:05:46 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 16:05:46 +0200
Subject: [ofa-general] [PATCH 4/4] opensm/osm_subnet.c support subnet configuration rescan and update
In-Reply-To: <49884E9C.1090704@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com> <20090203123706.GD11874@sashak.voltaire.com> <498849F9.1030700@gmail.com> <20090203135350.GJ11874@sashak.voltaire.com> <49884E9C.1090704@gmail.com>
Message-ID: <20090203140546.GL11874@sashak.voltaire.com>

On 16:03 Tue 03 Feb , Eli Dorfman (Voltaire) wrote:
> >
> > I meant that instead of destroy/setup pair we need an update function
> > which will carefully compare a current routing engine list against
> > requested one and in any case will not destroy a routing engine(s)
> > which is in use.
>
> so if we find the diff between the old routing engine list and the new one and
> use destroy to remove unused engines and setup for new engines, is it good enough?

Maybe (I didn't look at the code yet). Also, if the currently used routing
engine is going to be switched, it should be cleaned up after the switch too.
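A minimal sketch of the "update instead of destroy/setup" idea discussed in this thread, in C. It does not use OpenSM's real structures: `struct engine`, `engine_take()`, `engines_update()` and the `retired` out-parameter are hypothetical names chosen for illustration. Engines named in both the old list and the requested list are moved rather than recreated, so any per-engine state (such as the SL data LASH keeps between sweeps) survives. Engines that are no longer requested are freed, except the one currently in use, which is handed back to the caller to be cleaned up after the switch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a routing engine entry; OpenSM's real
 * struct osm_routing_engine has different fields and callbacks. */
struct engine {
	char name[32];
	struct engine *next;
};

static struct engine *engine_new(const char *name)
{
	struct engine *e = calloc(1, sizeof(*e));

	if (e)
		strncpy(e->name, name, sizeof(e->name) - 1);
	return e;
}

static struct engine *engine_find(struct engine *list, const char *name)
{
	for (; list; list = list->next)
		if (strcmp(list->name, name) == 0)
			return list;
	return NULL;
}

/* Unlink and return the engine called 'name' from '*list', or NULL. */
static struct engine *engine_take(struct engine **list, const char *name)
{
	for (; *list; list = &(*list)->next)
		if (strcmp((*list)->name, name) == 0) {
			struct engine *e = *list;

			*list = e->next;
			e->next = NULL;
			return e;
		}
	return NULL;
}

/*
 * Build a new list in the order of the comma-separated 'requested'
 * string, reusing engines already in 'old'. Unrequested engines are
 * freed, except 'in_use', which is returned through '*retired' so the
 * caller can free it only after it has switched away from it.
 */
static struct engine *engines_update(struct engine *old, const char *requested,
				     const struct engine *in_use,
				     struct engine **retired)
{
	struct engine *head = NULL, **tail = &head, *e;
	const char *p = requested;

	assert(requested != NULL && retired != NULL);
	*retired = NULL;
	while (*p) {
		char tok[32];
		size_t span = strcspn(p, ",");
		size_t len = span < sizeof(tok) - 1 ? span : sizeof(tok) - 1;

		memcpy(tok, p, len);
		tok[len] = '\0';
		p += span;
		if (*p == ',')
			p++;
		if (len == 0 || engine_find(head, tok))
			continue;           /* empty entry or duplicate name */
		e = engine_take(&old, tok); /* reuse: internal state kept */
		if (!e && !(e = engine_new(tok)))
			continue;           /* allocation failed: skip it */
		*tail = e;
		tail = &e->next;
	}
	while (old) {                       /* leftovers were not requested */
		e = old;
		old = old->next;
		e->next = NULL;
		if (e == in_use)
			*retired = e;       /* freed by the caller after the switch */
		else
			free(e);
	}
	return head;
}
```

For example, with an old list `minhop,lash` where `minhop` is in use, requesting `"lash,updn"` keeps the same `lash` object, creates `updn`, and returns `minhop` through `retired` instead of freeing it. Whether the SA PathRecord code can tolerate the switch at an arbitrary point is outside what this sketch models.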
Sasha

From dorfman.eli at gmail.com Tue Feb 3 06:11:46 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 03 Feb 2009 16:11:46 +0200
Subject: Re: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet configuration after SIGHUP
In-Reply-To: <20090203134831.GI11874@sashak.voltaire.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com> <20090202205924.GF5910@sashak.voltaire.com> <49880E4D.2090107@gmail.com> <20090203124407.GE11874@sashak.voltaire.com> <49884962.5070601@gmail.com> <20090203134831.GI11874@sashak.voltaire.com>
Message-ID: <498850A2.8090701@gmail.com>

rescan configuration as first step on every heavy sweep
this is a must in case of priority change (increase) for standby SM

Signed-off-by: Eli Dorfman
---
 opensm/opensm/osm_state_mgr.c | 11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index fc7ceb9..622867b 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1042,6 +1042,12 @@ static void do_sweep(osm_sm_t * sm)
 	ib_api_status_t status;
 	osm_remote_sm_t *p_remote_sm;
 
+	if (sm->p_subn->force_heavy_sweep &&
+	    osm_subn_rescan_conf_files(sm->p_subn) < 0) {
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
+			"osm_subn_rescan_conf_file failed\n");
+	}
+
 	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER &&
 	    sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING)
 		return;
@@ -1131,11 +1137,6 @@ _repeat_discovery:
 	sm->p_subn->force_reroute = FALSE;
 	sm->p_subn->subnet_initialization_error = FALSE;
 
-	/* rescan configuration updates */
-	if (osm_subn_rescan_conf_files(sm->p_subn) < 0)
-		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: "
-			"osm_subn_rescan_conf_file failed\n");
-
 	if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER)
 		sm->p_subn->need_update = 1;
 
-- 
1.5.5

From devesh28 at gmail.com Tue Feb 3 06:09:01 2009
From: devesh28 at gmail.com (Devesh Sharma)
Date: Tue, 3 Feb 2009 19:39:01 +0530
Subject: ***SPAM*** Re: ***SPAM*** [ofa-general][PATCH v1] compiling OFED-1.2 with RHEL5.1 Message-ID: <309a667c0902030609m4ba4a685pa18a14d8fd34f7f2@mail.gmail.com> Following is the patch that must be applied to ofa_kernel-1.2 to be able to compile it with RHEL5.1, whole there is one more patch I will be posting after this. It deals with the declarations of kmem_cache_create(). One configuration script also written, derived from ofed_patch.sh to add backport directory 2.6.28-EL5.1 in ofa_kernel-1.2 and build rpm with changes. The scripts assumes the names of patches are OFED-1.2_RHEL5.1_fix.patch for this patch kmem_cache_create_fix.patch for kmem_cache related patch. diff -ruN ofa_kernel-1.2/configure ofa_kernel-1.2_try2/configure --- ofa_kernel-1.2/configure 2009-02-03 02:12:23.000000000 +0530 +++ ofa_kernel-1.2_try2/configure 2009-02-03 00:46:15.000000000 +0530 @@ -218,9 +218,12 @@ 2.6.17*) echo 2.6.17 ;; - 2.6.18-*fc[56]*|2.6.18-*el5*) + 2.6.18-*fc[56]*|2.6.18-8.el5) echo 2.6.18_FC6 ;; + 2.6.18-53.el5) + echo 2.6.18-EL5.1 + ;; 2.6.18*) echo 2.6.18 ;; diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/asm/prom.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/asm/prom.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/asm/prom.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/asm/prom.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,8 @@ +#ifndef ASM_PROM_BACKPORT_TO_2_6_21_H +#define ASM_PROM_BACKPORT_TO_2_6_21_H + +#include_next + +#define of_get_property(a, b, c) get_property((a), (b), (c)) + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/asm/scatterlist.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/asm/scatterlist.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/asm/scatterlist.h 1970-01-01 05:30:00.000000000 +0530 +++ 
ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/asm/scatterlist.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,5 @@ +#if defined(__ia64__) +#include +#endif +#include +#include_next diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/compiler.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/compiler.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/compiler.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/compiler.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,8 @@ +#ifndef BACKPORT_LINUX_COMPILER_TO_2_6_22_H +#define BACKPORT_LINUX_COMPILER_TO_2_6_22_H + +#include_next + +#define uninitialized_var(x) x = x + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/crypto.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/crypto.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/crypto.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/crypto.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,54 @@ +#ifndef BACKPORT_LINUX_CRYPTO_H +#define BACKPORT_LINUX_CRYPTO_H + +#include_next + +#define CRYPTO_ALG_ASYNC 0x00000080 + +struct hash_desc +{ + struct crypto_tfm *tfm; + u32 flags; +}; + +static inline int crypto_hash_init(struct hash_desc *desc) +{ + crypto_digest_init(desc->tfm); + return 0; +} + +static inline int crypto_hash_digest(struct hash_desc *desc, + struct scatterlist *sg, + unsigned int nbytes, u8 *out) +{ + crypto_digest_digest(desc->tfm, sg, 1, out); + return nbytes; +} + +static inline int crypto_hash_update(struct hash_desc *desc, + struct scatterlist *sg, + unsigned int nbytes) +{ + crypto_digest_update(desc->tfm, sg, 1); + return nbytes; +} + +static inline int crypto_hash_final(struct hash_desc *desc, u8 *out) +{ + crypto_digest_final(desc->tfm, out); + return 0; +} + +static 
inline struct crypto_tfm *crypto_alloc_hash(const char *alg_name, + u32 type, u32 mask) +{ + struct crypto_tfm *ret = crypto_alloc_tfm(alg_name ,type); + return ret ? ret : ERR_PTR(-ENOMEM); +} + +static inline void crypto_free_hash(struct crypto_tfm *tfm) +{ + crypto_free_tfm(tfm); +} + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/etherdevice.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/etherdevice.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/etherdevice.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/etherdevice.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,15 @@ +#ifndef BACKPORT_LINUX_ETHERDEVICE +#define BACKPORT_LINUX_ETHERDEVICE + +#include_next + +static inline unsigned short backport_eth_type_trans(struct sk_buff *skb, + struct net_device *dev) +{ + skb->dev = dev; + return eth_type_trans(skb, dev); +} + +#define eth_type_trans backport_eth_type_trans + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/genalloc.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/genalloc.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/genalloc.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/genalloc.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,42 @@ +/* + * Basic general purpose allocator for managing special purpose memory + * not managed by the regular kmalloc/kfree interface. + * Uses for this includes on-device special memory, uncached memory + * etc. + * + * This source code is licensed under the GNU General Public License, + * Version 2. See the file COPYING for more details. + */ + + +/* + * General purpose special memory pool descriptor. 
+ */ +struct gen_pool { + rwlock_t lock; + struct list_head chunks; /* list of chunks in this pool */ + int min_alloc_order; /* minimum allocation order */ +}; + +/* + * General purpose special memory pool chunk descriptor. + */ +struct gen_pool_chunk { + spinlock_t lock; + struct list_head next_chunk; /* next chunk in pool */ + unsigned long start_addr; /* starting address of memory chunk */ + unsigned long end_addr; /* ending address of memory chunk */ + unsigned long bits[0]; /* bitmap for allocating memory chunk */ +}; + +extern struct gen_pool *ib_gen_pool_create(int, int); +extern int ib_gen_pool_add(struct gen_pool *, unsigned long, size_t, int); +extern void ib_gen_pool_destroy(struct gen_pool *); +extern unsigned long ib_gen_pool_alloc(struct gen_pool *, size_t); +extern void ib_gen_pool_free(struct gen_pool *, unsigned long, size_t); + +#define gen_pool_create ib_gen_pool_create +#define gen_pool_add ib_gen_pool_add +#define gen_pool_destroy ib_gen_pool_destroy +#define gen_pool_alloc ib_gen_pool_alloc +#define gen_pool_free ib_gen_pool_free diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_ether.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_ether.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_ether.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_ether.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,8 @@ +#ifndef __BACKPORT_LINUX_IF_ETHER_H_TO_2_6_21__ +#define __BACKPORT_LINUX_IF_ETHER_H_TO_2_6_21__ + +#include_next + +#define ETH_FCS_LEN 4 /* Octets in the FCS */ + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_vlan.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_vlan.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_vlan.h 1970-01-01 05:30:00.000000000 +0530 +++ 
ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/if_vlan.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,17 @@ +#ifndef __BACKPORT_LINUX_IF_VLAN_H_TO_2_6_20__ +#define __BACKPORT_LINUX_IF_VLAN_H_TO_2_6_20__ + +#include_next + +static inline struct net_device *vlan_group_get_device(struct vlan_group *vg, int vlan_id) +{ + return vg->vlan_devices[vlan_id]; +} + +static inline void vlan_group_set_device(struct vlan_group *vg, int vlan_id, + struct net_device *dev) +{ + vg->vlan_devices[vlan_id] = dev; +} + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/interrupt.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/interrupt.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/interrupt.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/interrupt.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,20 @@ +#ifndef BACKPORT_LINUX_INTERRUPT_TO_2_6_18 +#define BACKPORT_LINUX_INTERRUPT_TO_2_6_18 +#include_next + +typedef irqreturn_t (*backport_irq_handler_t)(int, void *); + +static inline int +backport_request_irq(unsigned int irq, + irqreturn_t (*handler)(int, void *), + unsigned long flags, const char *dev_name, void *dev_id) +{ + return request_irq(irq, + (irqreturn_t (*)(int, void *, struct pt_regs *))handler, + flags, dev_name, dev_id); +} + +#define request_irq backport_request_irq +#define irq_handler_t backport_irq_handler_t + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/ip.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/ip.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/ip.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/ip.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,11 @@ +#ifndef __LINUX_IP_BACKPORT_TO_2_6_21__ +#define __LINUX_IP_BACKPORT_TO_2_6_21__ + 
+#include_next + +static inline struct iphdr *ip_hdr(const struct sk_buff *skb) +{ + return (struct iphdr *)skb_network_header(skb); +} + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/kernel.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/kernel.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/kernel.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/kernel.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,14 @@ +#ifndef BACKPORT_KERNEL_H_2_6_22 +#define BACKPORT_KERNEL_H_2_6_22 + +#include_next + +#define upper_32_bits(n) ((u32)(((n) >> 16) >> 16)) + +#endif +#ifndef BACKPORT_KERNEL_H_2_6_19 +#define BACKPORT_KERNEL_H_2_6_19 + +#include + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/log2.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/log2.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/log2.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/log2.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,169 @@ +/* Integer base 2 logarithm calculation + * + * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved. + * Written by David Howells (dhowells at redhat.com) + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. 
+ */ + +#ifndef _LINUX_LOG2_H +#define _LINUX_LOG2_H + +#include_next +#include +#include + +/* + * deal with unrepresentable constant logarithms + */ +extern __attribute__((const, noreturn)) +int ____ilog2_NaN(void); + +/* + * non-constant log of base 2 calculators + * - the arch may override these in asm/bitops.h if they can be implemented + * more efficiently than using fls() and fls64() + * - the arch is not required to handle n==0 if implementing the fallback + */ +#ifndef CONFIG_ARCH_HAS_ILOG2_U32 +static inline __attribute__((const)) +int __ilog2_u32(u32 n) +{ + return fls(n) - 1; +} +#endif + +#ifndef CONFIG_ARCH_HAS_ILOG2_U64 +static inline __attribute__((const)) +int __ilog2_u64(u64 n) +{ + return fls64(n) - 1; +} +#endif + +/* + * Determine whether some value is a power of two, where zero is + * *not* considered a power of two. + */ + +static inline __attribute__((const)) +bool is_power_of_2(unsigned long n) +{ + return (n != 0 && ((n & (n - 1)) == 0)); +} + +/* + * round up to nearest power of two + */ +static inline __attribute__((const)) +unsigned long __roundup_pow_of_two(unsigned long n) +{ + return 1UL << fls_long(n - 1); +} + +/** + * ilog2 - log of base 2 of 32-bit or a 64-bit unsigned value + * @n - parameter + * + * constant-capable log of base 2 calculation + * - this can be used to initialise global variables from constant data, hence + * the massive ternary operator construction + * + * selects the appropriately-sized optimised version depending on sizeof(n) + */ +#define ilog2(n) \ +( \ + __builtin_constant_p(n) ? ( \ + (n) < 1 ? ____ilog2_NaN() : \ + (n) & (1ULL << 63) ? 63 : \ + (n) & (1ULL << 62) ? 62 : \ + (n) & (1ULL << 61) ? 61 : \ + (n) & (1ULL << 60) ? 60 : \ + (n) & (1ULL << 59) ? 59 : \ + (n) & (1ULL << 58) ? 58 : \ + (n) & (1ULL << 57) ? 57 : \ + (n) & (1ULL << 56) ? 56 : \ + (n) & (1ULL << 55) ? 55 : \ + (n) & (1ULL << 54) ? 54 : \ + (n) & (1ULL << 53) ? 53 : \ + (n) & (1ULL << 52) ? 52 : \ + (n) & (1ULL << 51) ? 
51 : \ + (n) & (1ULL << 50) ? 50 : \ + (n) & (1ULL << 49) ? 49 : \ + (n) & (1ULL << 48) ? 48 : \ + (n) & (1ULL << 47) ? 47 : \ + (n) & (1ULL << 46) ? 46 : \ + (n) & (1ULL << 45) ? 45 : \ + (n) & (1ULL << 44) ? 44 : \ + (n) & (1ULL << 43) ? 43 : \ + (n) & (1ULL << 42) ? 42 : \ + (n) & (1ULL << 41) ? 41 : \ + (n) & (1ULL << 40) ? 40 : \ + (n) & (1ULL << 39) ? 39 : \ + (n) & (1ULL << 38) ? 38 : \ + (n) & (1ULL << 37) ? 37 : \ + (n) & (1ULL << 36) ? 36 : \ + (n) & (1ULL << 35) ? 35 : \ + (n) & (1ULL << 34) ? 34 : \ + (n) & (1ULL << 33) ? 33 : \ + (n) & (1ULL << 32) ? 32 : \ + (n) & (1ULL << 31) ? 31 : \ + (n) & (1ULL << 30) ? 30 : \ + (n) & (1ULL << 29) ? 29 : \ + (n) & (1ULL << 28) ? 28 : \ + (n) & (1ULL << 27) ? 27 : \ + (n) & (1ULL << 26) ? 26 : \ + (n) & (1ULL << 25) ? 25 : \ + (n) & (1ULL << 24) ? 24 : \ + (n) & (1ULL << 23) ? 23 : \ + (n) & (1ULL << 22) ? 22 : \ + (n) & (1ULL << 21) ? 21 : \ + (n) & (1ULL << 20) ? 20 : \ + (n) & (1ULL << 19) ? 19 : \ + (n) & (1ULL << 18) ? 18 : \ + (n) & (1ULL << 17) ? 17 : \ + (n) & (1ULL << 16) ? 16 : \ + (n) & (1ULL << 15) ? 15 : \ + (n) & (1ULL << 14) ? 14 : \ + (n) & (1ULL << 13) ? 13 : \ + (n) & (1ULL << 12) ? 12 : \ + (n) & (1ULL << 11) ? 11 : \ + (n) & (1ULL << 10) ? 10 : \ + (n) & (1ULL << 9) ? 9 : \ + (n) & (1ULL << 8) ? 8 : \ + (n) & (1ULL << 7) ? 7 : \ + (n) & (1ULL << 6) ? 6 : \ + (n) & (1ULL << 5) ? 5 : \ + (n) & (1ULL << 4) ? 4 : \ + (n) & (1ULL << 3) ? 3 : \ + (n) & (1ULL << 2) ? 2 : \ + (n) & (1ULL << 1) ? 1 : \ + (n) & (1ULL << 0) ? 0 : \ + ____ilog2_NaN() \ + ) : \ + (sizeof(n) <= 4) ? \ + __ilog2_u32(n) : \ + __ilog2_u64(n) \ + ) + +/** + * roundup_pow_of_two - round the given value up to nearest power of two + * @n - parameter + * + * round the given balue up to the nearest power of two + * - the result is undefined when n == 0 + * - this can be used to initialise global variables from constant data + */ +#define roundup_pow_of_two(n) \ +( \ + __builtin_constant_p(n) ? ( \ + (n == 1) ? 
1 : \ + (1UL << (ilog2((n) - 1) + 1)) \ + ) : \ + __roundup_pow_of_two(n) \ + ) + +#endif /* _LINUX_LOG2_H */ diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netdevice.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netdevice.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netdevice.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netdevice.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,16 @@ +#ifndef BACKPORT_LINUX_NETDEVICE_TO_2_6_18 +#define BACKPORT_LINUX_NETDEVICE_TO_2_6_18 +#include_next + +static inline int skb_checksum_help_to_2_6_18(struct sk_buff *skb) +{ + return skb_checksum_help(skb, 0); +} + +#define skb_checksum_help skb_checksum_help_to_2_6_18 + +#undef SET_ETHTOOL_OPS +#define SET_ETHTOOL_OPS(netdev, ops) \ + (netdev)->ethtool_ops = (struct ethtool_ops *)(ops) + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/net.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/net.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/net.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/net.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_LINUX_NET_H +#define BACKPORT_LINUX_NET_H + +#include_next +#include + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netlink.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netlink.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netlink.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/netlink.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,14 @@ +#ifndef BACKPORT_LINUX_NETLINK_H +#define BACKPORT_LINUX_NETLINK_H + +#include_next + +/*#define 
netlink_kernel_create(net, uint, groups, input,
mutex, mod) \ + netlink_kernel_create(uint, groups, input, mod)*/ + +static inline struct nlmsghdr *nlmsg_hdr(const struct sk_buff *skb) +{ + return (struct nlmsghdr *)skb->data; +} + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/notifier.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/notifier.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/notifier.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/notifier.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,19 @@ +#ifndef LINUX_NOTIFIER_BACKPORT_TO_2_6_21_H +#define LINUX_NOTIFIER_BACKPORT_TO_2_6_21_H + +#include_next + + +/* Used for CPU hotplug events occurring while tasks are frozen due to a suspend + * operation in progress + */ +#define CPU_TASKS_FROZEN 0x0010 + +#define CPU_ONLINE_FROZEN (CPU_ONLINE | CPU_TASKS_FROZEN) +#define CPU_UP_PREPARE_FROZEN (CPU_UP_PREPARE | CPU_TASKS_FROZEN) +#define CPU_UP_CANCELED_FROZEN (CPU_UP_CANCELED | CPU_TASKS_FROZEN) +#define CPU_DOWN_PREPARE_FROZEN (CPU_DOWN_PREPARE | CPU_TASKS_FROZEN) +#define CPU_DOWN_FAILED_FROZEN (CPU_DOWN_FAILED | CPU_TASKS_FROZEN) +#define CPU_DEAD_FROZEN (CPU_DEAD | CPU_TASKS_FROZEN) + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/pci.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/pci.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/pci.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/pci.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,21 @@ +#ifndef __BACKPORT_LINUX_PCI_TO_2_6_19__ +#define __BACKPORT_LINUX_PCI_TO_2_6_19__ + +#include_next + +/** + * PCI_VDEVICE - macro used to describe a specific pci device in short form + * @vendor: the vendor name + * @device: the 16 bit PCI Device ID + * + * This macro is used to create a struct 
pci_device_id that matches a
specific PCI device. The subvendor, and subdevice fields will be set + * to PCI_ANY_ID. The macro allows the next field to follow as the device + * private data. + */ + +#define PCI_VDEVICE(vendor, device) \ + PCI_VENDOR_ID_##vendor, (device), \ + PCI_ANY_ID, PCI_ANY_ID, 0, 0 + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/random.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/random.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/random.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/random.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,8 @@ +#ifndef BACKPORT_LINUX_RANDOM_TO_2_6_18 +#define BACKPORT_LINUX_RANDOM_TO_2_6_18 +#include_next +#include_next + +#define random32() net_random() + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/rbtree.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/rbtree.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/rbtree.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/rbtree.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,10 @@ +#ifndef BACKPORT_LINUX_RBTREE_TO_2_6_18 +#define BACKPORT_LINUX_RBTREE_TO_2_6_18 +#include_next + +/* Band-aid for buggy rbtree.h */ +#undef RB_EMPTY_NODE +#define RB_EMPTY_NODE(node) (rb_parent(node) == node) + +#endif + diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,33 @@ +#ifndef __BACKPORT_LINUX_SCATTERLIST_H_TO_2_6_23__ +#define 
__BACKPORT_LINUX_SCATTERLIST_H_TO_2_6_23__ +#include_next + +static inline void sg_set_page(struct scatterlist *sg, struct page *page, + unsigned int len, unsigned int offset) +{ + sg->page = page; + sg->offset = offset; + sg->length = len; +} + +static inline void sg_assign_page(struct scatterlist *sg, struct page *page) +{ + sg->page = page; +} + +#define sg_page(a) (a)->page +#define sg_init_table(a, b) + +#define for_each_sg(sglist, sg, nr, __i) \ + for (__i = 0, sg = (sglist); __i < (nr); __i++, sg++) + +static inline struct scatterlist *sg_next(struct scatterlist *sg) +{ + if (!sg) { + BUG(); + return NULL; + } + return sg + 1; +} + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/skbuff.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/skbuff.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/skbuff.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/skbuff.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,85 @@ +#ifndef LINUX_SKBUFF_H_BACKPORT +#define LINUX_SKBUFF_H_BACKPORT + +#include_next + +#define CHECKSUM_PARTIAL CHECKSUM_HW +#define CHECKSUM_COMPLETE CHECKSUM_HW + +#endif +#ifndef __BACKPORT_LINUX_SKBUFF_H_TO_2_6_21__ +#define __BACKPORT_LINUX_SKBUFF_H_TO_2_6_21__ + +#include_next + +#define transport_header h.raw +#define network_header nh.raw + +static inline void skb_reset_mac_header(struct sk_buff *skb) +{ + skb->mac.raw = skb->data; +} + +static inline void skb_reset_network_header(struct sk_buff *skb) +{ + skb->network_header = skb->data; +} + +#if 0 +static inline void skb_copy_from_linear_data(const struct sk_buff *skb, + void *to, + const unsigned int len) +{ + memcpy(to, skb->data, len); +} + +static inline void skb_copy_to_linear_data(struct sk_buff *skb, + const void *from, + const unsigned int len) +{ + memcpy(skb->data, from, len); +} +#endif + +static inline unsigned char *skb_end_pointer(const struct 
sk_buff *skb) +{ + return skb->end; +} + +static inline unsigned char *skb_transport_header(const struct sk_buff *skb) +{ + return skb->transport_header; +} + +static inline unsigned char *skb_network_header(const struct sk_buff *skb) +{ + return skb->network_header; +} + +static inline void skb_reset_transport_header(struct sk_buff *skb) +{ + skb->transport_header = skb->data; +} + +static inline int skb_transport_offset(const struct sk_buff *skb) +{ + return skb_transport_header(skb) - skb->data; +} + +static inline int skb_network_offset(const struct sk_buff *skb) +{ + return skb_network_header(skb) - skb->data; +} +static inline void skb_set_transport_header(struct sk_buff *skb, + const int offset) +{ + skb->h.raw = skb->data + offset; +} + +static inline void skb_set_network_header(struct sk_buff *skb, + const int offset) +{ + skb->nh.raw = skb->data + offset; +} + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/slab.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/slab.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/slab.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/slab.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,20 @@ +#include_next + +#ifndef LINUX_SLAB_BACKPORT_tO_2_6_22_H +#define LINUX_SLAB_BACKPORT_tO_2_6_22_H + +#include_next + +static inline +struct kmem_cache * +kmem_cache_create_for_2_6_22 (const char *name, size_t size, size_t align, + unsigned long flags, + void (*ctor)(void*, struct kmem_cache *, unsigned long) + ) +{ + return kmem_cache_create(name, size, align, flags, ctor, NULL); +} + +#define kmem_cache_create kmem_cache_create_for_2_6_22 + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/tcp.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/tcp.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/tcp.h 1970-01-01 
05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/tcp.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,11 @@ +#ifndef __BACKPORT_LINUX_TCP_TO_2_6_21__ +#define __BACKPORT_LINUX_TCP_TO_2_6_21__ + +#include_next + +static inline struct tcphdr *tcp_hdr(const struct sk_buff *skb) +{ + return (struct tcphdr *)skb_transport_header(skb); +} + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/types.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/types.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/types.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/types.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,9 @@ +#ifndef BACKPORT_LINUX_TYPES_TO_2_6_19 +#define BACKPORT_LINUX_TYPES_TO_2_6_19 + +#include_next + +typedef _Bool bool; +typedef __u16 __sum16; + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/workqueue.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/workqueue.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/linux/workqueue.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/linux/workqueue.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,62 @@ +#ifndef BACKPORT_LINUX_WORKQUEUE_TO_2_6_19 +#define BACKPORT_LINUX_WORKQUEUE_TO_2_6_19 + +#include_next + +struct delayed_work { + struct work_struct work; +}; + +static inline void +backport_INIT_WORK(struct work_struct *work, void *func) +{ + INIT_WORK(work, func, work); +} + +static inline int backport_queue_delayed_work(struct workqueue_struct *wq, + struct delayed_work *work, + unsigned long delay) +{ + if (likely(!delay)) + return queue_work(wq, &work->work); + else + return queue_delayed_work(wq, &work->work, delay); +} + +static inline int +backport_cancel_delayed_work(struct delayed_work *work) +{ + 
return cancel_delayed_work(&work->work); +} + +static inline void +backport_cancel_rearming_delayed_workqueue(struct workqueue_struct *wq, struct delayed_work *work) +{ + cancel_rearming_delayed_workqueue(wq, &work->work); +} + +static inline +int backport_schedule_delayed_work(struct delayed_work *work, unsigned long delay) +{ + if (likely(!delay)) + return schedule_work(&work->work); + else + return schedule_delayed_work(&work->work, delay); +} + +#undef INIT_WORK +#define INIT_WORK(_work, _func) backport_INIT_WORK(_work, _func) +#define INIT_DELAYED_WORK(_work, _func) INIT_WORK(&(_work)->work, _func) + +#undef DECLARE_WORK +#define DECLARE_WORK(n, f) \ + struct work_struct n = __WORK_INITIALIZER(n, (void (*)(void *))f, &(n)) +#define DECLARE_DELAYED_WORK(n, f) \ + struct delayed_work n = { .work = __WORK_INITIALIZER(n.work, (void (*)(void *))f, &(n.work)) } + +#define queue_delayed_work backport_queue_delayed_work +#define cancel_delayed_work backport_cancel_delayed_work +#define cancel_rearming_delayed_workqueue backport_cancel_rearming_delayed_workqueue +#define schedule_delayed_work backport_schedule_delayed_work + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/net/ip.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/net/ip.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/net/ip.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/net/ip.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,7 @@ +#ifndef __BACKPORT_NET_IP_H_TO_2_6_23__ +#define __BACKPORT_NET_IP_H_TO_2_6_23__ + +#include_next +#define inet_get_local_port_range(a, b) { *(a) = sysctl_local_port_range[0]; *(b) = sysctl_local_port_range[1]; } + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/net/neighbour.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/net/neighbour.h --- 
ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/net/neighbour.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/net/neighbour.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,8 @@ +#ifndef __BACKPORT_NET_NEIGHBOUR_TO_2_6_20__ +#define __BACKPORT_NET_NEIGHBOUR_TO_2_6_20__ + +#include_next + +#define neigh_cleanup neigh_destructor + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/scsi/scsi_cmnd.h ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/scsi/scsi_cmnd.h --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/scsi/scsi_cmnd.h 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/scsi/scsi_cmnd.h 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,23 @@ +#ifndef SCSI_SCSI_CMND_BACKPORT_TO_2_6_22_H +#define SCSI_SCSI_CMND_BACKPORT_TO_2_6_22_H + +#include_next + +#define scsi_sg_count(cmd) ((cmd)->use_sg) +#define scsi_sglist(cmd) ((struct scatterlist *)(cmd)->request_buffer) +#define scsi_bufflen(cmd) ((cmd)->request_bufflen) + +static inline void scsi_set_resid(struct scsi_cmnd *cmd, int resid) +{ + cmd->resid = resid; +} + +static inline int scsi_get_resid(struct scsi_cmnd *cmd) +{ + return cmd->resid; +} + +#define scsi_for_each_sg(cmd, sg, nseg, __i) \ + for (__i = 0, sg = scsi_sglist(cmd); __i < (nseg); __i++, (sg)++) + +#endif diff -ruN ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/src/genalloc.c ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/src/genalloc.c --- ofa_kernel-1.2/kernel_addons/backport/2.6.18-EL5.1/include/src/genalloc.c 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_addons/backport/2.6.18-EL5.1/include/src/genalloc.c 2009-02-03 00:42:05.000000000 +0530 @@ -0,0 +1,198 @@ +/* + * Basic general purpose allocator for managing special purpose memory + * not managed by the regular kmalloc/kfree interface. 
+ * Uses for this include on-device special memory, uncached memory + * etc. + * + * Copyright 2005 (C) Jes Sorensen + * + * This source code is licensed under the GNU General Public License, + * Version 2. See the file COPYING for more details. + */ + +#include +#include + + +/** + * gen_pool_create - create a new special memory pool + * @min_alloc_order: log base 2 of number of bytes each bitmap bit represents + * @nid: node id of the node the pool structure should be allocated on, or -1 + * + * Create a new special memory pool that can be used to manage special purpose + * memory not managed by the regular kmalloc/kfree interface. + */ +struct gen_pool *gen_pool_create(int min_alloc_order, int nid) +{ + struct gen_pool *pool; + + pool = kmalloc_node(sizeof(struct gen_pool), GFP_KERNEL, nid); + if (pool != NULL) { + rwlock_init(&pool->lock); + INIT_LIST_HEAD(&pool->chunks); + pool->min_alloc_order = min_alloc_order; + } + return pool; +} +EXPORT_SYMBOL(gen_pool_create); + +/** + * gen_pool_add - add a new chunk of special memory to the pool + * @pool: pool to add new memory chunk to + * @addr: starting address of memory chunk to add to pool + * @size: size in bytes of the memory chunk to add to pool + * @nid: node id of the node the chunk structure and bitmap should be + * allocated on, or -1 + * + * Add a new chunk of special memory to the specified pool.
+ */ +int gen_pool_add(struct gen_pool *pool, unsigned long addr, size_t size, + int nid) +{ + struct gen_pool_chunk *chunk; + int nbits = size >> pool->min_alloc_order; + int nbytes = sizeof(struct gen_pool_chunk) + + (nbits + BITS_PER_BYTE - 1) / BITS_PER_BYTE; + + chunk = kmalloc_node(nbytes, GFP_KERNEL, nid); + if (unlikely(chunk == NULL)) + return -1; + + memset(chunk, 0, nbytes); + spin_lock_init(&chunk->lock); + chunk->start_addr = addr; + chunk->end_addr = addr + size; + + write_lock(&pool->lock); + list_add(&chunk->next_chunk, &pool->chunks); + write_unlock(&pool->lock); + + return 0; +} +EXPORT_SYMBOL(gen_pool_add); + +/** + * gen_pool_destroy - destroy a special memory pool + * @pool: pool to destroy + * + * Destroy the specified special memory pool. Verifies that there are no + * outstanding allocations. + */ +void gen_pool_destroy(struct gen_pool *pool) +{ + struct list_head *_chunk, *_next_chunk; + struct gen_pool_chunk *chunk; + int order = pool->min_alloc_order; + int bit, end_bit; + + + write_lock(&pool->lock); + list_for_each_safe(_chunk, _next_chunk, &pool->chunks) { + chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk); + list_del(&chunk->next_chunk); + + end_bit = (chunk->end_addr - chunk->start_addr) >> order; + bit = find_next_bit(chunk->bits, end_bit, 0); + BUG_ON(bit < end_bit); + + kfree(chunk); + } + kfree(pool); + return; +} +EXPORT_SYMBOL(gen_pool_destroy); + +/** + * gen_pool_alloc - allocate special memory from the pool + * @pool: pool to allocate from + * @size: number of bytes to allocate from the pool + * + * Allocate the requested number of bytes from the specified pool. + * Uses a first-fit algorithm. 
+ */ +unsigned long gen_pool_alloc(struct gen_pool *pool, size_t size) +{ + struct list_head *_chunk; + struct gen_pool_chunk *chunk; + unsigned long addr, flags; + int order = pool->min_alloc_order; + int nbits, bit, start_bit, end_bit; + + if (size == 0) + return 0; + + nbits = (size + (1UL << order) - 1) >> order; + + read_lock(&pool->lock); + list_for_each(_chunk, &pool->chunks) { + chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk); + + end_bit = (chunk->end_addr - chunk->start_addr) >> order; + end_bit -= nbits + 1; + + spin_lock_irqsave(&chunk->lock, flags); + bit = -1; + while (bit + 1 < end_bit) { + bit = find_next_zero_bit(chunk->bits, end_bit, bit + 1); + if (bit >= end_bit) + break; + + start_bit = bit; + if (nbits > 1) { + bit = find_next_bit(chunk->bits, bit + nbits, + bit + 1); + if (bit - start_bit < nbits) + continue; + } + + addr = chunk->start_addr + + ((unsigned long)start_bit << order); + while (nbits--) + __set_bit(start_bit++, &chunk->bits); + spin_unlock_irqrestore(&chunk->lock, flags); + read_unlock(&pool->lock); + return addr; + } + spin_unlock_irqrestore(&chunk->lock, flags); + } + read_unlock(&pool->lock); + return 0; +} +EXPORT_SYMBOL(gen_pool_alloc); + +/** + * gen_pool_free - free allocated special memory back to the pool + * @pool: pool to free to + * @addr: starting address of memory to free back to pool + * @size: size in bytes of memory to free + * + * Free previously allocated special memory back to the specified pool. 
+ */ +void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size) +{ + struct list_head *_chunk; + struct gen_pool_chunk *chunk; + unsigned long flags; + int order = pool->min_alloc_order; + int bit, nbits; + + nbits = (size + (1UL << order) - 1) >> order; + + read_lock(&pool->lock); + list_for_each(_chunk, &pool->chunks) { + chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk); + + if (addr >= chunk->start_addr && addr < chunk->end_addr) { + BUG_ON(addr + size > chunk->end_addr); + spin_lock_irqsave(&chunk->lock, flags); + bit = (addr - chunk->start_addr) >> order; + while (nbits--) + __clear_bit(bit++, &chunk->bits); + spin_unlock_irqrestore(&chunk->lock, flags); + break; + } + } + BUG_ON(nbits > 0); + read_unlock(&pool->lock); +} +EXPORT_SYMBOL(gen_pool_free); diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/1_struct_path_revert_to_2_6_19.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/1_struct_path_revert_to_2_6_19.patch --- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/1_struct_path_revert_to_2_6_19.patch 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/1_struct_path_revert_to_2_6_19.patch 2009-02-03 00:44:23.000000000 +0530 @@ -0,0 +1,82 @@ +diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c +index a617ca7..4e16314 100644 +--- a/drivers/infiniband/core/uverbs_main.c ++++ b/drivers/infiniband/core/uverbs_main.c +@@ -534,9 +534,9 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file, + * module reference. 
+ */ + filp->f_op = fops_get(&uverbs_event_fops); +- filp->f_path.mnt = mntget(uverbs_event_mnt); +- filp->f_path.dentry = dget(uverbs_event_mnt->mnt_root); +- filp->f_mapping = filp->f_path.dentry->d_inode->i_mapping; ++ filp->f_vfsmnt = mntget(uverbs_event_mnt); ++ filp->f_dentry = dget(uverbs_event_mnt->mnt_root); ++ filp->f_mapping = filp->f_dentry->d_inode->i_mapping; + filp->f_flags = O_RDONLY; + filp->f_mode = FMODE_READ; + filp->private_data = ev_file; +diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c +index b932bcb..ddbcabd 100644 +--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c ++++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c +@@ -1744,9 +1744,9 @@ static int ipath_assign_port(struct file *fp, + goto done; + } + +- i_minor = iminor(fp->f_path.dentry->d_inode) - IPATH_USER_MINOR_BASE; ++ i_minor = iminor(fp->f_dentry->d_inode) - IPATH_USER_MINOR_BASE; + ipath_cdbg(VERBOSE, "open on dev %lx (minor %d)\n", +- (long)fp->f_path.dentry->d_inode->i_rdev, i_minor); ++ (long)fp->f_dentry->d_inode->i_rdev, i_minor); + + if (i_minor) + ret = find_free_port(i_minor - 1, fp, uinfo); +diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c +index 79a60f0..d9ff283 100644 +--- a/drivers/infiniband/hw/ipath/ipath_fs.c ++++ b/drivers/infiniband/hw/ipath/ipath_fs.c +@@ -118,7 +118,7 @@ static ssize_t atomic_counters_read(struct file *file, char __user *buf, + u16 i; + struct ipath_devdata *dd; + +- dd = file->f_path.dentry->d_inode->i_private; ++ dd = file->f_dentry->d_inode->i_private; + + for (i = 0; i < NUM_COUNTERS; i++) + counters[i] = ipath_snap_cntr(dd, i); +@@ -138,7 +138,7 @@ static ssize_t atomic_node_info_read(struct file *file, char __user *buf, + struct ipath_devdata *dd; + u64 guid; + +- dd = file->f_path.dentry->d_inode->i_private; ++ dd = file->f_dentry->d_inode->i_private; + + guid = be64_to_cpu(dd->ipath_guid); + +@@ -177,7 +177,7 @@ static ssize_t 
atomic_port_info_read(struct file *file, char __user *buf, + u32 tmp, tmp2; + struct ipath_devdata *dd; + +- dd = file->f_path.dentry->d_inode->i_private; ++ dd = file->f_dentry->d_inode->i_private; + + /* so we only initialize non-zero fields. */ + memset(portinfo, 0, sizeof portinfo); +@@ -324,7 +324,7 @@ static ssize_t flash_read(struct file *file, char __user *buf, + goto bail; + } + +- dd = file->f_path.dentry->d_inode->i_private; ++ dd = file->f_dentry->d_inode->i_private; + if (ipath_eeprom_read(dd, pos, tmp, count)) { + ipath_dev_err(dd, "failed to read from flash\n"); + ret = -ENXIO; +@@ -377,7 +377,7 @@ static ssize_t flash_write(struct file *file, const char __user *buf, + goto bail_tmp; + } + +- dd = file->f_path.dentry->d_inode->i_private; ++ dd = file->f_dentry->d_inode->i_private; + if (ipath_eeprom_write(dd, pos, tmp, count)) { + ret = -ENXIO; + ipath_dev_err(dd, "failed to write to flash\n"); diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/2_misc_device_to_2_6_19.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/2_misc_device_to_2_6_19.patch --- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/2_misc_device_to_2_6_19.patch 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/2_misc_device_to_2_6_19.patch 2009-02-03 00:44:23.000000000 +0530 @@ -0,0 +1,50 @@ +>Post a replacement to 2_misc_device_to_2_6_19.patch, we'll test. + +I did not test this patch, but you can try replacing the contents of +the 2_misc_device_to_2_6_19.patch with the changes below. (It's +possible that this may lead to some conflict further down in the patch +chain...) The function prototype for show_abi_version changed between +2.6.19 and 2.6.20; this was the missing piece in the original backport +patch. I would have expected a build warning for this.
+ +Signed-off-by: Sean Hefty + +--- +--- ofa_kernel-1.2/drivers/infiniband/core/ucma.c 2007-03-08 12:11:37.000000000 -0800 ++++ b/drivers/infiniband/core/ucma.c 2007-03-08 12:13:13.000000000 -0800 +@@ -847,13 +847,11 @@ static struct miscdevice ucma_misc = { + .fops = &ucma_fops, + }; + +-static ssize_t show_abi_version(struct device *dev, +- struct device_attribute *attr, +- char *buf) ++static ssize_t show_abi_version(struct class_device *class_dev, char *buf) + { + return sprintf(buf, "%d\n", RDMA_USER_CM_ABI_VERSION); + } +-static DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); ++static CLASS_DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + + static int __init ucma_init(void) + { +@@ -863,7 +861,8 @@ static int __init ucma_init(void) + if (ret) + return ret; + +- ret = device_create_file(ucma_misc.this_device, &dev_attr_abi_version); ++ ret = class_device_create_file(ucma_misc.class, ++ &class_device_attr_abi_version); + if (ret) { + printk(KERN_ERR "rdma_ucm: couldn't create abi_version attr\n"); + goto err; +@@ -876,7 +875,8 @@ err: + + static void __exit ucma_cleanup(void) + { +- device_remove_file(ucma_misc.this_device, &dev_attr_abi_version); ++ class_device_remove_file(ucma_misc.class, ++ &class_device_attr_abi_version); + misc_deregister(&ucma_misc); + idr_destroy(&ctx_idr); + } diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/cxgb3_makefile_to_2_6_19.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/cxgb3_makefile_to_2_6_19.patch --- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/cxgb3_makefile_to_2_6_19.patch 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/cxgb3_makefile_to_2_6_19.patch 2009-02-03 00:44:23.000000000 +0530 @@ -0,0 +1,12 @@ +diff --git a/drivers/net/cxgb3/Makefile b/drivers/net/cxgb3/Makefile +index 3434679..bb008b6 100755 +--- a/drivers/net/cxgb3/Makefile ++++ b/drivers/net/cxgb3/Makefile +@@ -1,6 +1,7 @@ + # + # Chelsio T3 driver + # 
++NOSTDINC_FLAGS:= $(NOSTDINC_FLAGS) $(LINUXINCLUDE) + + obj-$(CONFIG_CHELSIO_T3) += cxgb3.o + diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/ipath-16-htirq-2.6.18.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/ipath-16-htirq-2.6.18.patch --- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/ipath-16-htirq-2.6.18.patch 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/ipath-16-htirq-2.6.18.patch 2009-02-03 00:44:23.000000000 +0530 @@ -0,0 +1,352 @@ +BACKPORT - use old IRQ infrastructure on 2.6.18 and earlier + +diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/Kconfig +--- a/drivers/infiniband/hw/ipath/Kconfig Thu Mar 08 14:02:44 2007 -0800 ++++ b/drivers/infiniband/hw/ipath/Kconfig Thu Mar 08 14:04:08 2007 -0800 +@@ -1,6 +1,6 @@ config INFINIBAND_IPATH + config INFINIBAND_IPATH + tristate "QLogic InfiniPath Driver" +- depends on (PCI_MSI || HT_IRQ) && 64BIT && INFINIBAND && NET ++ depends on PCI_MSI && 64BIT && INFINIBAND && NET + ---help--- + This is a driver for QLogic InfiniPath host channel adapters, + including InfiniBand verbs support. 
This driver allows these +diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/Makefile +--- a/drivers/infiniband/hw/ipath/Makefile Thu Mar 08 14:02:44 2007 -0800 ++++ b/drivers/infiniband/hw/ipath/Makefile Thu Mar 08 14:04:08 2007 -0800 +@@ -32,7 +32,7 @@ ib_ipath-y := \ + ipath_verbs_mcast.o \ + ipath_verbs.o + +-ib_ipath-$(CONFIG_HT_IRQ) += ipath_iba6110.o ++ib_ipath-y += ipath_iba6110.o + ib_ipath-$(CONFIG_PCI_MSI) += ipath_iba6120.o + + ib_ipath-$(CONFIG_X86_64) += ipath_wc_x86_64.o +diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/ipath_driver.c +--- a/drivers/infiniband/hw/ipath/ipath_driver.c Thu Mar 08 14:02:44 2007 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_driver.c Thu Mar 08 14:04:08 2007 -0800 +@@ -42,6 +42,8 @@ + #include "ipath_verbs.h" + #include "ipath_common.h" + ++#define CONFIG_HT_IRQ ++ + static void ipath_update_pio_bufs(struct ipath_devdata *); + + const char *ipath_get_unit_name(int unit) +@@ -347,7 +349,7 @@ static int __devinit ipath_init_one(stru + } + addr = pci_resource_start(pdev, 0); + len = pci_resource_len(pdev, 0); +- ipath_cdbg(VERBOSE, "regbase (0) %llx len %d pdev->irq %d, vend %x/%x " ++ ipath_cdbg(VERBOSE, "regbase (0) %llx len %d irq %x, vend %x/%x " + "driver_data %lx\n", addr, len, pdev->irq, ent->vendor, + ent->device, ent->driver_data); + +@@ -530,15 +532,15 @@ static int __devinit ipath_init_one(stru + * check 0 irq after we return from chip-specific bus setup, since + * that can affect this due to setup + */ +- if (!dd->ipath_irq) ++ if (!pdev->irq) + ipath_dev_err(dd, "irq is 0, BIOS error? 
Interrupts won't " + "work\n"); + else { +- ret = request_irq(dd->ipath_irq, ipath_intr, IRQF_SHARED, ++ ret = request_irq(pdev->irq, ipath_intr, IRQF_SHARED, + IPATH_DRV_NAME, dd); + if (ret) { + ipath_dev_err(dd, "Couldn't setup irq handler, " +- "irq=%d: %d\n", dd->ipath_irq, ret); ++ "irq=%d: %d\n", pdev->irq, ret); + goto bail_iounmap; + } + } +@@ -709,10 +711,11 @@ static void __devexit ipath_remove_one(s + * free up port 0 (kernel) rcvhdr, egr bufs, and eventually tid bufs + * for all versions of the driver, if they were allocated + */ +- if (dd->ipath_irq) { +- ipath_cdbg(VERBOSE, "unit %u free irq %d\n", +- dd->ipath_unit, dd->ipath_irq); +- dd->ipath_f_free_irq(dd); ++ if (pdev->irq) { ++ ipath_cdbg(VERBOSE, ++ "unit %u free_irq of irq %x\n", ++ dd->ipath_unit, pdev->irq); ++ free_irq(pdev->irq, dd); + } else + ipath_dbg("irq is 0, not doing free_irq " + "for unit %u\n", dd->ipath_unit); +diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/ipath_iba6110.c +--- a/drivers/infiniband/hw/ipath/ipath_iba6110.c Thu Mar 08 14:02:44 2007 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c Thu Mar 08 14:04:08 2007 -0800 +@@ -38,7 +38,6 @@ + + #include + #include +-#include + + #include "ipath_kernel.h" + #include "ipath_registers.h" +@@ -914,40 +913,49 @@ static void slave_or_pri_blk(struct ipat + } + } + +-static int ipath_ht_intconfig(struct ipath_devdata *dd) +-{ +- int ret; +- +- if (dd->ipath_intconfig) { +- ipath_write_kreg(dd, dd->ipath_kregs->kr_interruptconfig, +- dd->ipath_intconfig); /* interrupt address */ +- ret = 0; +- } else { +- ipath_dev_err(dd, "No interrupts enabled, couldn't setup " +- "interrupt address\n"); +- ret = -EINVAL; +- } +- +- return ret; +-} +- +-static void ipath_ht_irq_update(struct pci_dev *dev, int irq, +- struct ht_irq_msg *msg) +-{ +- struct ipath_devdata *dd = pci_get_drvdata(dev); +- u64 prev_intconfig = dd->ipath_intconfig; +- +- dd->ipath_intconfig = msg->address_lo; +- dd->ipath_intconfig |= ((u64) msg->address_hi) << 
32; +- +- /* +- * If the previous value of dd->ipath_intconfig is zero, we're +- * getting configured for the first time, and must not program the +- * intconfig register here (it will be programmed later, when the +- * hardware is ready). Otherwise, we should. +- */ +- if (prev_intconfig) +- ipath_ht_intconfig(dd); ++static int set_int_handler(struct ipath_devdata *dd, struct pci_dev *pdev, ++ int pos) ++{ ++ u32 int_handler_addr_lower; ++ u32 int_handler_addr_upper; ++ u64 ihandler; ++ u32 intvec; ++ ++ /* use indirection register to get the intr handler */ ++ pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX, 0x10); ++ pci_read_config_dword(pdev, pos + 4, &int_handler_addr_lower); ++ pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX, 0x11); ++ pci_read_config_dword(pdev, pos + 4, &int_handler_addr_upper); ++ ++ ihandler = (u64) int_handler_addr_lower | ++ ((u64) int_handler_addr_upper << 32); ++ ++ /* ++ * kernels with CONFIG_PCI_MSI set the vector in the irq field of ++ * struct pci_device, so we use that to program the internal ++ * interrupt register (not config space) with that value. The BIOS ++ * must still have done the basic MSI setup. ++ */ ++ intvec = pdev->irq; ++ /* ++ * clear any vector bits there; normally not set but we'll overload ++ * this for some debug purposes (setting the HTC debug register ++ * value from software, rather than GPIOs), so it might be set on a ++ * driver reload. 
++ */ ++ ihandler &= ~0xff0000; ++ /* x86 vector goes in intrinfo[23:16] */ ++ ihandler |= intvec << 16; ++ ipath_cdbg(VERBOSE, "ihandler lower %x, upper %x, intvec %x, " ++ "interruptconfig %llx\n", int_handler_addr_lower, ++ int_handler_addr_upper, intvec, ++ (unsigned long long) ihandler); ++ ++ /* can't program yet, so save for interrupt setup */ ++ dd->ipath_intconfig = ihandler; ++ /* keep going, so we find link control stuff also */ ++ ++ return ihandler != 0; + } + + /** +@@ -963,19 +971,12 @@ static int ipath_setup_ht_config(struct + static int ipath_setup_ht_config(struct ipath_devdata *dd, + struct pci_dev *pdev) + { +- int pos, ret; +- +- ret = __ht_create_irq(pdev, 0, ipath_ht_irq_update); +- if (ret < 0) { +- ipath_dev_err(dd, "Couldn't create interrupt handler: " +- "err %d\n", ret); +- goto bail; +- } +- dd->ipath_irq = ret; +- ret = 0; +- +- /* +- * Handle clearing CRC errors in linkctrl register if necessary. We ++ int pos, ret = 0; ++ int ihandler = 0; ++ ++ /* ++ * Read the capability info to find the interrupt info, and also ++ * handle clearing CRC errors in linkctrl register if necessary. We + * do this early, before we ever enable errors or hardware errors, + * mostly to avoid causing the chip to enter freeze mode. 
+ */ +@@ -999,8 +1000,16 @@ static int ipath_setup_ht_config(struct + } + if (!(cap_type & 0xE0)) + slave_or_pri_blk(dd, pdev, pos, cap_type); ++ else if (cap_type == HT_INTR_DISC_CONFIG) ++ ihandler = set_int_handler(dd, pdev, pos); + } while ((pos = pci_find_next_capability(pdev, pos, + PCI_CAP_ID_HT))); ++ ++ if (!ihandler) { ++ ipath_dev_err(dd, "Couldn't find interrupt handler in " ++ "config space\n"); ++ ret = -ENODEV; ++ } + + bail: + return ret; +@@ -1351,6 +1360,25 @@ static void ipath_ht_quiet_serdes(struct + ipath_write_kreg(dd, dd->ipath_kregs->kr_serdesconfig0, val); + } + ++static int ipath_ht_intconfig(struct ipath_devdata *dd) ++{ ++ int ret; ++ ++ if (!dd->ipath_intconfig) { ++ ipath_dev_err(dd, "No interrupts enabled, couldn't setup " ++ "interrupt address\n"); ++ ret = 1; ++ goto bail; ++ } ++ ++ ipath_write_kreg(dd, dd->ipath_kregs->kr_interruptconfig, ++ dd->ipath_intconfig); /* interrupt address */ ++ ret = 0; ++ ++bail: ++ return ret; ++} ++ + /** + * ipath_pe_put_tid - write a TID in chip + * @dd: the infinipath device +@@ -1546,14 +1574,6 @@ static int ipath_ht_get_base_info(struct + return 0; + } + +-static void ipath_ht_free_irq(struct ipath_devdata *dd) +-{ +- free_irq(dd->ipath_irq, dd); +- ht_destroy_irq(dd->ipath_irq); +- dd->ipath_irq = 0; +- dd->ipath_intconfig = 0; +-} +- + /** + * ipath_init_iba6110_funcs - set up the chip-specific function pointers + * @dd: the infinipath device +@@ -1577,7 +1597,6 @@ void ipath_init_iba6110_funcs(struct ipa + dd->ipath_f_cleanup = ipath_setup_ht_cleanup; + dd->ipath_f_setextled = ipath_setup_ht_setextled; + dd->ipath_f_get_base_info = ipath_ht_get_base_info; +- dd->ipath_f_free_irq = ipath_ht_free_irq; + + /* + * initialize chip-specific variables +diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/ipath_iba6120.c +--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c Thu Mar 08 14:02:44 2007 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c Thu Mar 08 14:04:08 2007 -0800 +@@ -856,7 +856,6 @@ 
static int ipath_setup_pe_config(struct + ipath_dev_err(dd, "pci_enable_msi failed: %d, " + "interrupts may not work\n", ret); + /* continue even if it fails, we may still be OK... */ +- dd->ipath_irq = pdev->irq; + + if ((pos = pci_find_capability(dd->pcidev, PCI_CAP_ID_MSI))) { + u16 control; +@@ -1324,12 +1323,6 @@ done: + return 0; + } + +-static void ipath_pe_free_irq(struct ipath_devdata *dd) +-{ +- free_irq(dd->ipath_irq, dd); +- dd->ipath_irq = 0; +-} +- + /** + * ipath_init_iba6120_funcs - set up the chip-specific function pointers + * @dd: the infinipath device +@@ -1356,7 +1349,6 @@ void ipath_init_iba6120_funcs(struct ipa + dd->ipath_f_cleanup = ipath_setup_pe_cleanup; + dd->ipath_f_setextled = ipath_setup_pe_setextled; + dd->ipath_f_get_base_info = ipath_pe_get_base_info; +- dd->ipath_f_free_irq = ipath_pe_free_irq; + + /* initialize chip-specific variables */ + dd->ipath_f_tidtemplate = ipath_pe_tidtemplate; +diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/ipath_intr.c +--- a/drivers/infiniband/hw/ipath/ipath_intr.c Thu Mar 08 14:02:44 2007 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_intr.c Thu Mar 08 14:04:08 2007 -0800 +@@ -732,14 +732,14 @@ static void ipath_bad_intr(struct ipath_ + * linuxbios development work, and it may happen in + * the future again. 
+ */ +- if (dd->pcidev && dd->ipath_irq) { ++ if (dd->pcidev && dd->pcidev->irq) { + ipath_dev_err(dd, "Now %u unexpected " + "interrupts, unregistering " + "interrupt handler\n", + *unexpectp); +- ipath_dbg("free_irq of irq %d\n", +- dd->ipath_irq); +- dd->ipath_f_free_irq(dd); ++ ipath_dbg("free_irq of irq %x\n", ++ dd->pcidev->irq); ++ free_irq(dd->pcidev->irq, dd); + } + } + if (ipath_read_kreg32(dd, dd->ipath_kregs->kr_intmask)) { +@@ -775,7 +775,7 @@ static void ipath_bad_regread(struct ipa + if (allbits == 2) { + ipath_dev_err(dd, "Still bad interrupt status, " + "unregistering interrupt\n"); +- dd->ipath_f_free_irq(dd); ++ free_irq(dd->pcidev->irq, dd); + } else if (allbits > 2) { + if ((allbits % 10000) == 0) + printk("."); +diff -r 0a8c1ca4ad6d drivers/infiniband/hw/ipath/ipath_kernel.h +--- a/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Mar 08 14:02:44 2007 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Thu Mar 08 14:04:08 2007 -0800 +@@ -213,8 +213,6 @@ struct ipath_devdata { + void (*ipath_f_setextled)(struct ipath_devdata *, u64, u64); + /* fill out chip-specific fields */ + int (*ipath_f_get_base_info)(struct ipath_portdata *, void *); +- /* free irq */ +- void (*ipath_f_free_irq)(struct ipath_devdata *); + struct ipath_ibdev *verbs_dev; + struct timer_list verbs_timer; + /* total dwords sent (summed from counter) */ +@@ -332,8 +330,6 @@ struct ipath_devdata { + /* so we can rewrite it after a chip reset */ + u32 ipath_pcibar1; + +- /* interrupt number */ +- int ipath_irq; + /* HT/PCI Vendor ID (here for NodeInfo) */ + u16 ipath_vendorid; + /* HT/PCI Device ID (here for NodeInfo) */ diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/ipath-17-ipath_intr-2.6.18.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/ipath-17-ipath_intr-2.6.18.patch --- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/ipath-17-ipath_intr-2.6.18.patch 1970-01-01 05:30:00.000000000 +0530 +++ 
ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/ipath-17-ipath_intr-2.6.18.patch 2009-02-03 00:44:23.000000000 +0530 @@ -0,0 +1,26 @@ +BACKPORT - interrupt handler signature changed in 2.6.19 + +diff -r 8e3a2c4c9490 drivers/infiniband/hw/ipath/ipath_intr.c +--- a/drivers/infiniband/hw/ipath/ipath_intr.c Wed Jan 31 16:04:27 2007 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_intr.c Wed Jan 31 16:11:22 2007 -0800 +@@ -897,7 +897,7 @@ static void handle_urcv(struct ipath_dev + } + } + +-irqreturn_t ipath_intr(int irq, void *data) ++irqreturn_t ipath_intr(int irq, void *data, struct pt_regs *ignored) + { + struct ipath_devdata *dd = data; + u32 istat, chk0rcv = 0; +diff -r 8e3a2c4c9490 drivers/infiniband/hw/ipath/ipath_kernel.h +--- a/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Jan 31 16:04:27 2007 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Jan 31 16:11:22 2007 -0800 +@@ -637,7 +637,7 @@ struct sk_buff *ipath_alloc_skb(struct i + + extern int ipath_diag_inuse; + +-irqreturn_t ipath_intr(int irq, void *devid); ++irqreturn_t ipath_intr(int irq, void *devid, struct pt_regs *); + int ipath_decode_err(char *buf, size_t blen, ipath_err_t err); + #if __IPATH_INFO || __IPATH_DBG + extern const char *ipath_ibcstatus_str[]; diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/linux_genalloc_to_2_6_20.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/linux_genalloc_to_2_6_20.patch --- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/linux_genalloc_to_2_6_20.patch 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/linux_genalloc_to_2_6_20.patch 2009-02-03 00:44:23.000000000 +0530 @@ -0,0 +1,17 @@ +diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile +index 163d991..2cd239f 100644 +--- a/drivers/infiniband/core/Makefile ++++ b/drivers/infiniband/core/Makefile +@@ -30,3 +30,5 @@ ib_ucm-y := ucm.o + + ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ 
+ uverbs_marshall.o ++ ++ib_core-y += genalloc.o +diff --git a/drivers/infiniband/core/genalloc.c b/drivers/infiniband/core/genalloc.c +new file mode 100644 +index 0000000..96a48fe +--- /dev/null ++++ b/drivers/infiniband/core/genalloc.c +@@ -0,0 +1 @@ ++#include "src/genalloc.c" diff -ruN ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/open-iscsi-tx-hash-fixes.patch ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/open-iscsi-tx-hash-fixes.patch --- ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/open-iscsi-tx-hash-fixes.patch 1970-01-01 05:30:00.000000000 +0530 +++ ofa_kernel-1.2_try2/kernel_patches/backport/2.6.18-EL5.1/open-iscsi-tx-hash-fixes.patch 2009-02-03 00:44:23.000000000 +0530 @@ -0,0 +1,277 @@ +Index: gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check/drivers/scsi/iscsi_tcp.c +=================================================================== +--- gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check.orig/drivers/scsi/iscsi_tcp.c ++++ gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check/drivers/scsi/iscsi_tcp.c +@@ -108,8 +108,8 @@ iscsi_hdr_digest(struct iscsi_conn *conn + { + struct iscsi_tcp_conn *tcp_conn = conn->dd_data; + +- crypto_hash_digest(&tcp_conn->tx_hash, &buf->sg, buf->sg.length, crc); +- buf->sg.length = tcp_conn->hdr_size; ++ crypto_digest_digest(tcp_conn->tx_tfm, &buf->sg, 1, crc); ++ buf->sg.length += sizeof(uint32_t); + } + + static inline int +@@ -468,8 +468,7 @@ iscsi_tcp_hdr_recv(struct iscsi_conn *co + + sg_init_one(&sg, (u8 *)hdr, + sizeof(struct iscsi_hdr) + ahslen); +- crypto_hash_digest(&tcp_conn->rx_hash, &sg, sg.length, +- (u8 *)&cdgst); ++ crypto_digest_digest(tcp_conn->rx_tfm, &sg, 1, (u8 *)&cdgst); + rdgst = *(uint32_t*)((char*)hdr + sizeof(struct iscsi_hdr) + + ahslen); + if (cdgst != rdgst) { +@@ -649,9 +648,10 @@ iscsi_ctask_copy(struct iscsi_tcp_conn * + * byte counters. 
+ **/ + static inline int +-iscsi_tcp_copy(struct iscsi_conn *conn, int buf_size) ++iscsi_tcp_copy(struct iscsi_conn *conn) + { + struct iscsi_tcp_conn *tcp_conn = conn->dd_data; ++ int buf_size = tcp_conn->in.datalen; + int buf_left = buf_size - tcp_conn->data_copied; + int size = min(tcp_conn->in.copy, buf_left); + int rc; +@@ -676,7 +676,7 @@ iscsi_tcp_copy(struct iscsi_conn *conn, + } + + static inline void +-partial_sg_digest_update(struct hash_desc *desc, struct scatterlist *sg, ++partial_sg_digest_update(struct crypto_tfm *tfm, struct scatterlist *sg, + int offset, int length) + { + struct scatterlist temp; +@@ -684,7 +684,7 @@ partial_sg_digest_update(struct hash_des + memcpy(&temp, sg, sizeof(struct scatterlist)); + temp.offset = offset; + temp.length = length; +- crypto_hash_update(desc, &temp, length); ++ crypto_digest_update(tfm, &temp, 1); + } + + static void +@@ -693,7 +693,7 @@ iscsi_recv_digest_update(struct iscsi_tc + struct scatterlist tmp; + + sg_init_one(&tmp, buf, len); +- crypto_hash_update(&tcp_conn->rx_hash, &tmp, len); ++ crypto_digest_update(tcp_conn->rx_tfm, &tmp, 1); + } + + static int iscsi_scsi_data_in(struct iscsi_conn *conn) +@@ -747,12 +747,12 @@ static int iscsi_scsi_data_in(struct isc + if (!rc) { + if (conn->datadgst_en) { + if (!offset) +- crypto_hash_update( +- &tcp_conn->rx_hash, ++ crypto_digest_update( ++ &tcp_conn->rx_tfm, + &sg[i], sg[i].length); + else + partial_sg_digest_update( +- &tcp_conn->rx_hash, ++ &tcp_conn->rx_tfm, + &sg[i], + sg[i].offset + offset, + sg[i].length - offset); +@@ -766,10 +766,9 @@ static int iscsi_scsi_data_in(struct isc + /* + * data-in is complete, but buffer not... 
+ */ +- partial_sg_digest_update(&tcp_conn->rx_hash, +- &sg[i], +- sg[i].offset, +- sg[i].length-rc); ++ partial_sg_digest_update(tcp_conn->rx_tfm, ++ &sg[i], ++ sg[i].offset, sg[i].length-rc); + rc = 0; + break; + } +@@ -813,7 +812,7 @@ iscsi_data_recv(struct iscsi_conn *conn) + * Collect data segment to the connection's data + * placeholder + */ +- if (iscsi_tcp_copy(conn, tcp_conn->in.datalen)) { ++ if (iscsi_tcp_copy(conn)) { + rc = -EAGAIN; + goto exit; + } +@@ -887,7 +886,7 @@ more: + rc = iscsi_tcp_hdr_recv(conn); + if (!rc && tcp_conn->in.datalen) { + if (conn->datadgst_en) +- crypto_hash_init(&tcp_conn->rx_hash); ++ crypto_digest_init(tcp_conn->rx_tfm); + tcp_conn->in_progress = IN_PROGRESS_DATA_RECV; + } else if (rc) { + iscsi_conn_failure(conn, rc); +@@ -900,15 +899,10 @@ more: + + debug_tcp("extra data_recv offset %d copy %d\n", + tcp_conn->in.offset, tcp_conn->in.copy); +- rc = iscsi_tcp_copy(conn, sizeof(uint32_t)); +- if (rc) { +- if (rc == -EAGAIN) +- goto again; +- iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED); +- return 0; +- } +- +- memcpy(&recv_digest, conn->data, sizeof(uint32_t)); ++ skb_copy_bits(tcp_conn->in.skb, tcp_conn->in.offset, ++ &recv_digest, 4); ++ tcp_conn->in.offset += 4; ++ tcp_conn->in.copy -= 4; + if (recv_digest != tcp_conn->in.datadgst) { + debug_tcp("iscsi_tcp: data digest error!" 
+ "0x%x != 0x%x\n", recv_digest, +@@ -944,14 +938,13 @@ more: + tcp_conn->in.padding); + memset(pad, 0, tcp_conn->in.padding); + sg_init_one(&sg, pad, tcp_conn->in.padding); +- crypto_hash_update(&tcp_conn->rx_hash, +- &sg, sg.length); ++ crypto_digest_update(tcp_conn->rx_tfm, ++ &sg, 1); + } +- crypto_hash_final(&tcp_conn->rx_hash, +- (u8 *) &tcp_conn->in.datadgst); ++ crypto_digest_final(tcp_conn->rx_tfm, ++ (u8 *) & tcp_conn->in.datadgst); + debug_tcp("rx digest 0x%x\n", tcp_conn->in.datadgst); + tcp_conn->in_progress = IN_PROGRESS_DDIGEST_RECV; +- tcp_conn->data_copied = 0; + } else + tcp_conn->in_progress = IN_PROGRESS_WAIT_HEADER; + } +@@ -1193,7 +1186,7 @@ static inline void + iscsi_data_digest_init(struct iscsi_tcp_conn *tcp_conn, + struct iscsi_tcp_cmd_task *tcp_ctask) + { +- crypto_hash_init(&tcp_conn->tx_hash); ++ crypto_digest_init(tcp_conn->tx_tfm); + tcp_ctask->digest_count = 4; + } + +@@ -1449,9 +1442,8 @@ iscsi_send_padding(struct iscsi_conn *co + iscsi_buf_init_iov(&tcp_ctask->sendbuf, (char*)&tcp_ctask->pad, + tcp_ctask->pad_count); + if (conn->datadgst_en) +- crypto_hash_update(&tcp_conn->tx_hash, +- &tcp_ctask->sendbuf.sg, +- tcp_ctask->sendbuf.sg.length); ++ crypto_digest_update(tcp_conn->tx_tfm, ++ &tcp_ctask->sendbuf.sg, 1); + } else if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_PAD)) + return 0; + +@@ -1483,7 +1475,7 @@ iscsi_send_digest(struct iscsi_conn *con + tcp_conn = conn->dd_data; + + if (!(tcp_ctask->xmstate & XMSTATE_W_RESEND_DATA_DIGEST)) { +- crypto_hash_final(&tcp_conn->tx_hash, (u8*)digest); ++ crypto_digest_final(tcp_conn->tx_tfm, (u8*)digest); + iscsi_buf_init_iov(buf, (char*)digest, 4); + } + tcp_ctask->xmstate &= ~XMSTATE_W_RESEND_DATA_DIGEST; +@@ -1517,7 +1509,7 @@ iscsi_send_data(struct iscsi_cmd_task *c + rc = iscsi_sendpage(conn, sendbuf, count, &buf_sent); + *sent = *sent + buf_sent; + if (buf_sent && conn->datadgst_en) +- partial_sg_digest_update(&tcp_conn->tx_hash, ++ partial_sg_digest_update(tcp_conn->tx_tfm, + 
&sendbuf->sg, sendbuf->sg.offset + offset, + buf_sent); + if (!iscsi_buf_left(sendbuf) && *sg != tcp_ctask->bad_sg) { +@@ -1774,22 +1766,18 @@ iscsi_tcp_conn_create(struct iscsi_cls_s + /* initial operational parameters */ + tcp_conn->hdr_size = sizeof(struct iscsi_hdr); + +- tcp_conn->tx_hash.tfm = crypto_alloc_hash("crc32c", 0, +- CRYPTO_ALG_ASYNC); +- tcp_conn->tx_hash.flags = 0; +- if (IS_ERR(tcp_conn->tx_hash.tfm)) ++ tcp_conn->tx_tfm = crypto_alloc_tfm("crc32c", 0); ++ if (!tcp_conn->tx_tfm) + goto free_tcp_conn; + +- tcp_conn->rx_hash.tfm = crypto_alloc_hash("crc32c", 0, +- CRYPTO_ALG_ASYNC); +- tcp_conn->rx_hash.flags = 0; +- if (IS_ERR(tcp_conn->rx_hash.tfm)) ++ tcp_conn->rx_tfm = crypto_alloc_tfm("crc32c", 0); ++ if (!tcp_conn->rx_tfm) + goto free_tx_tfm; + + return cls_conn; + + free_tx_tfm: +- crypto_free_hash(tcp_conn->tx_hash.tfm); ++ crypto_free_tfm(tcp_conn->tx_tfm); + free_tcp_conn: + kfree(tcp_conn); + tcp_conn_alloc_fail: +@@ -1823,11 +1811,10 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_ + iscsi_tcp_release_conn(conn); + iscsi_conn_teardown(cls_conn); + +- if (tcp_conn->tx_hash.tfm) +- crypto_free_hash(tcp_conn->tx_hash.tfm); +- if (tcp_conn->rx_hash.tfm) +- crypto_free_hash(tcp_conn->rx_hash.tfm); +- ++ if (tcp_conn->tx_tfm) ++ crypto_free_tfm(tcp_conn->tx_tfm); ++ if (tcp_conn->rx_tfm) ++ crypto_free_tfm(tcp_conn->rx_tfm); + kfree(tcp_conn); + } + +@@ -1835,11 +1822,9 @@ static void + iscsi_tcp_conn_stop(struct iscsi_cls_conn *cls_conn, int flag) + { + struct iscsi_conn *conn = cls_conn->dd_data; +- struct iscsi_tcp_conn *tcp_conn = conn->dd_data; + + iscsi_conn_stop(cls_conn, flag); + iscsi_tcp_release_conn(conn); +- tcp_conn->hdr_size = sizeof(struct iscsi_hdr); + } + + static int +Index: gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check/drivers/scsi/iscsi_tcp.h +=================================================================== +--- gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check.orig/drivers/scsi/iscsi_tcp.h ++++ 
gen2_devel_kernel-20070129-1858_linux-2.6.18.6_check/drivers/scsi/iscsi_tcp.h +@@ -49,7 +49,6 @@ + #define ISCSI_SG_TABLESIZE SG_ALL + #define ISCSI_TCP_MAX_CMD_LEN 16 + +-struct crypto_hash; + struct socket; + + /* Socket connection recieve helper */ +@@ -82,7 +81,6 @@ struct iscsi_tcp_conn { + * stop to terminate */ + /* iSCSI connection-wide sequencing */ + int hdr_size; /* PDU header size */ +- + /* control data */ + struct iscsi_tcp_recv in; /* TCP receive context */ + int in_progress; /* connection state machine */ +@@ -93,8 +91,8 @@ struct iscsi_tcp_conn { + void (*old_write_space)(struct sock *); + + /* data and header digests */ +- struct hash_desc tx_hash; /* CRC32C (Tx) */ +- struct hash_desc rx_hash; /* CRC32C (Rx) */ ++ struct crypto_tfm *tx_tfm; /* CRC32C (Tx) */ ++ struct crypto_tfm *rx_tfm; /* CRC32C (Rx) */ + + /* MIB custom statistics */ + uint32_t sendpage_failures_cnt; diff -ruN ofa_kernel-1.2/ofed_scripts/configure ofa_kernel-1.2_try2/ofed_scripts/configure --- ofa_kernel-1.2/ofed_scripts/configure 2009-02-03 02:12:23.000000000 +0530 +++ ofa_kernel-1.2_try2/ofed_scripts/configure 2009-02-03 02:13:18.000000000 +0530 @@ -218,9 +218,12 @@ 2.6.17*) echo 2.6.17 ;; - 2.6.18-*fc[56]*|2.6.18-*el5*) + 2.6.18-*fc[56]*|2.6.18-8.el5) echo 2.6.18_FC6 ;; + 2.6.18-53.el5) + echo 2.6.18-EL5.1 + ;; 2.6.18*) echo 2.6.18 ;; -regards Devesh Sharma From devesh28 at gmail.com Tue Feb 3 06:11:34 2009 From: devesh28 at gmail.com (Devesh Sharma) Date: Tue, 3 Feb 2009 19:41:34 +0530 Subject: ***SPAM*** Re: ***SPAM*** [ofa-general][PATCH v2] compiling OFED-1.2 with RHEL5.1 Message-ID: <309a667c0902030611j6d0f23eav52afd361e378f968@mail.gmail.com> Hello list here is the second patch inline diff -ruN a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c --- a/drivers/infiniband/core/mad.c 2009-02-03 02:37:18.000000000 +0530 +++ b/drivers/infiniband/core/mad.c 2009-02-03 02:37:08.000000000 +0530 @@ -2955,7 +2955,6 @@ sizeof(struct ib_mad_private), 0, 
SLAB_HWCACHE_ALIGN, - NULL, NULL); if (!ib_mad_cache) { printk(KERN_ERR PFX "Couldn't create ib_mad cache\n"); diff -ruN a/drivers/infiniband/hw/amso1100/c2_vq.c b/drivers/infiniband/hw/amso1100/c2_vq.c --- a/drivers/infiniband/hw/amso1100/c2_vq.c 2009-02-03 02:31:19.000000000 +0530 +++ b/drivers/infiniband/hw/amso1100/c2_vq.c 2009-02-03 02:51:30.000000000 +0530 @@ -85,7 +85,7 @@ (char) ('0' + c2dev->devnum)); c2dev->host_msg_cache = kmem_cache_create(c2dev->vq_cache_name, c2dev->rep_vq.msg_size, 0, - SLAB_HWCACHE_ALIGN, NULL, NULL); + SLAB_HWCACHE_ALIGN, NULL); if (c2dev->host_msg_cache == NULL) { return -ENOMEM; } diff -ruN a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c --- a/drivers/infiniband/hw/ehca/ehca_av.c 2009-02-03 02:31:19.000000000 +0530 +++ b/drivers/infiniband/hw/ehca/ehca_av.c 2009-02-03 02:49:53.000000000 +0530 @@ -257,7 +257,7 @@ av_cache = kmem_cache_create("ehca_cache_av", sizeof(struct ehca_av), 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); + NULL); if (!av_cache) return -ENOMEM; return 0; diff -ruN a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c --- a/drivers/infiniband/hw/ehca/ehca_cq.c 2009-02-03 02:37:18.000000000 +0530 +++ b/drivers/infiniband/hw/ehca/ehca_cq.c 2009-02-03 02:49:08.000000000 +0530 @@ -396,7 +396,7 @@ cq_cache = kmem_cache_create("ehca_cache_cq", sizeof(struct ehca_cq), 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); + NULL); if (!cq_cache) return -ENOMEM; return 0; diff -ruN a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c --- a/drivers/infiniband/hw/ehca/ehca_main.c 2009-02-03 02:37:18.000000000 +0530 +++ b/drivers/infiniband/hw/ehca/ehca_main.c 2009-02-03 02:49:30.000000000 +0530 @@ -165,7 +165,7 @@ ctblk_cache = kmem_cache_create("ehca_cache_ctblk", EHCA_PAGESIZE, H_CB_ALIGNMENT, SLAB_HWCACHE_ALIGN, - NULL, NULL); + NULL); if (!ctblk_cache) { ehca_gen_err("Cannot create ctblk SLAB cache."); ehca_cleanup_mrmw_cache(); diff -ruN 
a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c 2009-02-03 02:37:18.000000000 +0530 +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c 2009-02-03 02:51:00.000000000 +0530 @@ -2234,13 +2234,13 @@ mr_cache = kmem_cache_create("ehca_cache_mr", sizeof(struct ehca_mr), 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); + NULL); if (!mr_cache) return -ENOMEM; mw_cache = kmem_cache_create("ehca_cache_mw", sizeof(struct ehca_mw), 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); + NULL); if (!mw_cache) { kmem_cache_destroy(mr_cache); mr_cache = NULL; diff -ruN a/drivers/infiniband/hw/ehca/ehca_pd.c b/drivers/infiniband/hw/ehca/ehca_pd.c --- a/drivers/infiniband/hw/ehca/ehca_pd.c 2009-02-03 02:31:19.000000000 +0530 +++ b/drivers/infiniband/hw/ehca/ehca_pd.c 2009-02-03 02:50:11.000000000 +0530 @@ -101,7 +101,7 @@ pd_cache = kmem_cache_create("ehca_cache_pd", sizeof(struct ehca_pd), 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); + NULL); if (!pd_cache) return -ENOMEM; return 0; diff -ruN a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c --- a/drivers/infiniband/hw/ehca/ehca_qp.c 2009-02-03 02:37:18.000000000 +0530 +++ b/drivers/infiniband/hw/ehca/ehca_qp.c 2009-02-03 02:50:29.000000000 +0530 @@ -1440,7 +1440,7 @@ qp_cache = kmem_cache_create("ehca_cache_qp", sizeof(struct ehca_qp), 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); + NULL); if (!qp_cache) return -ENOMEM; return 0; diff -ruN a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c --- a/drivers/infiniband/ulp/iser/iscsi_iser.c 2009-02-03 02:31:19.000000000 +0530 +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c 2009-02-03 02:51:56.000000000 +0530 @@ -625,7 +625,7 @@ ig.desc_cache = kmem_cache_create("iser_descriptors", sizeof (struct iser_desc), 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); + NULL); if (ig.desc_cache == NULL) return -ENOMEM; diff -ruN a/net/rds/connection.c b/net/rds/connection.c --- a/net/rds/connection.c 2009-02-03 
02:31:19.000000000 +0530 +++ b/net/rds/connection.c 2009-02-03 07:00:14.000000000 +0530 @@ -332,7 +332,7 @@ { rds_conn_slab = kmem_cache_create("rds_connection", sizeof(struct rds_connection), - 0, 0, NULL, NULL); + 0, 0, NULL); if (rds_conn_slab == NULL) return -ENOMEM; diff -ruN a/net/rds/ib_recv.c b/net/rds/ib_recv.c --- a/net/rds/ib_recv.c 2009-02-03 02:37:18.000000000 +0530 +++ b/net/rds/ib_recv.c 2009-02-03 07:00:50.000000000 +0530 @@ -752,13 +752,13 @@ rds_ib_incoming_slab = kmem_cache_create("rds_ib_incoming", sizeof(struct rds_ib_incoming), - 0, 0, NULL, NULL); + 0, 0, NULL); if (rds_ib_incoming_slab == NULL) goto out; rds_ib_frag_slab = kmem_cache_create("rds_ib_frag", sizeof(struct rds_page_frag), - 0, 0, NULL, NULL); + 0, 0, NULL); if (rds_ib_frag_slab == NULL) kmem_cache_destroy(rds_ib_incoming_slab); else diff -ruN a/net/rds/tcp.c b/net/rds/tcp.c --- a/net/rds/tcp.c 2009-02-03 02:31:19.000000000 +0530 +++ b/net/rds/tcp.c 2009-02-03 07:01:35.000000000 +0530 @@ -254,7 +254,7 @@ rds_tcp_conn_slab = kmem_cache_create("rds_tcp_connection", sizeof(struct rds_tcp_connection), - 0, 0, NULL, NULL); + 0, 0, NULL); if (rds_tcp_conn_slab == NULL) { ret = -ENOMEM; goto out; diff -ruN a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c --- a/net/rds/tcp_recv.c 2009-02-03 02:31:19.000000000 +0530 +++ b/net/rds/tcp_recv.c 2009-02-03 07:01:59.000000000 +0530 @@ -344,7 +344,7 @@ { rds_tcp_incoming_slab = kmem_cache_create("rds_tcp_incoming", sizeof(struct rds_tcp_incoming), - 0, 0, NULL, NULL); + 0, 0, NULL); if (rds_tcp_incoming_slab == NULL) return -ENOMEM; return 0; regards Devesh Sharma From tziporet at dev.mellanox.co.il Tue Feb 3 06:14:00 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 3 Feb 2009 16:14:00 +0200 Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes In-Reply-To: <1233654242.1364.39.camel@pyren.uio.no> References: <1233654242.1364.39.camel@pyren.uio.no> Message-ID: <3d47233f0902030614i29e567f8i46ea3df632936ac6@mail.gmail.com> 
I am looking into how to help you.
Can you tell us which FW version you are using?
Also, please make sure you have the most up-to-date BIOS for the AMD system.

Tziporet

On Tue, Feb 3, 2009 at 11:44 AM, Ole Widar Saastad wrote:
>
> I have problems using the OFED 1.4 software on the Sun x4600 nodes.
> Need help to get this to work. We plan to run GPFS over IB on these
> nodes in addition to MPI.
>
> Sun 4600 nodes with 8 quad core cpus,
> Quad-Core AMD Opteron(tm) Processor 8380
>
> OS is Rocks release 4.
> centos-release-4-4.2/x86_64/
>
> Linux compute-0-0.local 2.6.9-67.0.15.ELlargesmp #1 SMP Thu May 8
> 11:03:57 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
>
> Needless to say our 300+ nodes (SUN x2200 with quad core) run fine with
> OFED 1.4 (and 1.3); they have almost the same kernel:
> Linux compute-4-0.local 2.6.9-67.0.15.ELsmp #1 SMP Thu May 8 10:50:20
> EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> Same except ELsmp and not ELlargesmp.
>
> More information:
>
> dmesg prints out the following error message:
>
> Losing some ticks... checking if CPU frequency changed.
> modulecmd[17499]: segfault at 0000007fc0b01688 rip 000000000060aa38 rsp
> 0000007fbfffcfd8 error 6
> mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
> mlx4_core: Initializing 0000:02:00.0
> ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 19 (level, low) -> IRQ 193
> PCI: Setting latency timer of device 0000:02:00.0 to 64
> mlx4_core 0000:02:00.0: Requested number of MACs is too much for port 1,
> reducing to 1.
> MSI INIT SUCCESS
> mlx4_core 0000:02:00.0: command 0x13 failed: fw status = 0x1
> mlx4_core 0000:02:00.0: SW2HW_EQ failed (-5)
> mlx4_core 0000:02:00.0: Failed to initialize event queue table, aborting.
> mlx4_core: probe of 0000:02:00.0 failed with error -5
>
> The following software is installed:
>
> Select Option [1-5]:3
> kernel-ib
> libibverbs
> libibverbs-devel
> libibverbs-utils
> libmthca
> libmlx4
> libcxgb3
> libnes
> libipathverbs
> libibcommon
> libibcommon-devel
> libibumad
> libibumad-devel
> ofed-docs
> ofed-scripts
> ibvexdmtools
> qlgc_vnic_daemon
>
>
> Just to be sure the card is present:
> lspci returns:
> 02:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0)
>
>
> --
> Ole W. Saastad, dr. scient.
> Scientific Computing Group, USIT, University of Oslo
> http://hpc.uio.no

From devesh28 at gmail.com Tue Feb 3 06:14:42 2009
From: devesh28 at gmail.com (Devesh Sharma)
Date: Tue, 3 Feb 2009 19:44:42 +0530
Subject: ***SPAM*** Re: ***SPAM*** [ofa-general][CONFIG SCRIPT] compiling OFED-1.2 with RHEL5.1
Message-ID: <309a667c0902030614j7a5bf5b7k5342fd021948fc2@mail.gmail.com>

This configuration script has to be run before following the normal
compilation procedure. It must be run from the top-level OFED-1.2
directory, with both patches placed in that same directory.

#!/bin/bash
ofed_top_dir=$(pwd)
package_name=ofa_kernel
package=ofa_kernel-1.2
package_rel=0

# Install the ofa_kernel source RPM into ./SOURCES and ./SPECS
echo Installing ${package_name} source rpm:
if ! ( set -x && rpm -i --define "_topdir $(pwd)" SRPMS/${package}-${package_rel}.src.rpm && set +x ); then
    echo "Failed to install ${package}-${package_rel}.src.rpm"
    exit 1
fi

# Unpack the kernel sources, apply the RHEL5.1 fix, add the
# kmem_cache_create backport patch, then repack the tarball
cd SOURCES
tar zxf ofa_kernel-1.2.tgz
cd ofa_kernel-1.2
patch -p1<${ofed_top_dir}/OFED-1.2_RHEL5.1_fix.patch
cp ${ofed_top_dir}/kmem_cache_create_fix.patch ${ofed_top_dir}/SOURCES/ofa_kernel-1.2/kernel_patches/backport/2.6.18-EL5.1/
cd -
tar zcf ofa_kernel-1.2.tgz ofa_kernel-1.2
cd ${ofed_top_dir}

# Rebuild the source RPM from the patched tarball
echo Rebuilding ${package_name} source rpm:
if ! ( set -x && rpmbuild -bs --define "_topdir $(pwd)" SPECS/${package_name}.spec && set +x ); then
    echo Failed to create ${package}-${package_rel}.src.rpm
    exit 1
fi

# Remove the unpacked sources and the intermediate tarball
rm -rf SOURCES/${package}*

-regards
Devesh Sharma

From sashak at voltaire.com Tue Feb 3 06:22:48 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 3 Feb 2009 16:22:48 +0200
Subject: [ofa-general] Re: [ofw] saquery & osm vendor AL - ca_names missing from osm_vendor_t ?
In-Reply-To: <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com>
Message-ID: <20090203142248.GM11874@sashak.voltaire.com>

On 14:11 Mon 02 Feb , Sean Hefty wrote:
> Forwarding to general list and copying Sasha.
>
> >Hello,
> > The Windows OpenSM vendor AL struct 'osm_vendor_t' (defined in
> >opensm\user\include\vendor\osm_vendor_al.h) is missing
> >the entry 'ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN]'.
> >saquery.c expects to find ca_names in osm_vendor_t.
> >
> >A couple of observations:
> >1) Windows currently supports a much older version of opensm than what OFED 1.4
> >tools expect.
> >
> >2) saquery uses OpenSM mad interfaces while it 'could' be using libibmad
> >interfaces.
> > If libibmad from saquery, then OpenSM would not need libibmad references for
> >Windows.
> >
> >3) How bad is it to create libibmad dependencies for OpenSM?

Why would we need to?
Dependencies without a reason are not a good thing. > > > >4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces; > >the rest use > > libibmad. True. > > > >Most of the OFED diagnostic tools support the cmd-line option '-C ca_name'. > >This cmd-line input is resolved thru > >libibmad interfaces. > >Saquery is no exception in that it expects to match the '-C ca_name' against > >osm_vendor_t.ca_names[]. 'ibstat -l' lists > >CA names. > > > >The question becomes how best to resolve the missing ca_names? > > > >1) modify saquery to call libibmad interface to get CA names; That is possible, I guess. > > leaves > >osm_vendor_t unmodified. > > So far, saquery is the only diag pgm which uses OSM mad interfaces; > >expecting ca_names > > in osm_vendor_t. OpenSM (osm_vendor_ibumad layer) uses this too for port finding/choosing. > > > >2) Modify OpenSM vendor AL osm_vendor_t struct to include CA names and populate > >ca_names > > from OpenSM code? How does OpenSM in WinOF choose a port to use? > > Creates libibmad dependencies for opensm. ca_names[][] by itself doesn't create such dependencies. For instance, osm_vendor_ibumad.c has ca_names[][] and doesn't have any libibmad dependency. Sasha From sashak at voltaire.com Tue Feb 3 06:27:18 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 3 Feb 2009 16:27:18 +0200 Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names missing from osm_vendor_t ? In-Reply-To: References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com> Message-ID: <20090203142718.GN11874@sashak.voltaire.com> On 14:51 Mon 02 Feb , Sean Hefty wrote: > >>4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces; > >>the rest use libibmad. > > Looking briefly at the saquery code, I don't understand the benefit to using the > opensm vendor interfaces, versus using libibmad or even libibumad directly, and > switching to libibumad looks doable.
(It's not clear to me that there are > benefits to using libibmad over libibumad for saquery.) > > - osm_bind_handle_t looks like it could map to a libibumad port_id (int). > - osmv_query_sa() could map to umad_send(), followed by umad_recv() to > obtain the result. (Replace osmv_query_sa with a new function.) > - There are a couple other calls that are used to loop through all returned > attributes in a response MAD. We could use the MAD attribute offset > directly. (Update loops where osmv_get_query_* is called.) > > Are there technical reasons why the opensm vendor library was chosen for > saquery? AFAIK there are no such reasons. > Would there be any objection to changing saquery to use libibumad > directly? Not from me. Sasha From sashak at voltaire.com Tue Feb 3 06:30:40 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 3 Feb 2009 16:30:40 +0200 Subject: [ofa-general] RE: [ofw] saquery & osm vendor AL - ca_names missing from osm_vendor_t ? In-Reply-To: <9632920386E943489C39D8637052F404@amr.corp.intel.com> References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <964AF74A7D394FAE8385EBB8DC7449D5@amr.corp.intel.com> <20090202150658.0af72134.weiny2@llnl.gov> <9632920386E943489C39D8637052F404@amr.corp.intel.com> Message-ID: <20090203143032.GO11874@sashak.voltaire.com> On 15:19 Mon 02 Feb , Sean Hefty wrote: > > libibumad does require the user to provide the address to the SA. Providing a > libibumad helper function to fill out ib_mad_addr_t for the local SA seems > reasonable. I guess we can look at what it would take to convert it in detail > to see if anything is still missing from the lower libraries. There are ib_resolve_smlid() and ib_resolve_smlid_via() functions in libibmad already. 
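Sean's third point, walking the returned records through the MAD attribute offset instead of the osmv_get_query_* helpers, might look roughly like the C sketch that follows. It is hypothetical (it is neither saquery nor libibumad code) and assumes the SA MAD layout from the InfiniBand specification: a 24-byte common MAD header, a 12-byte RMPP header and an 8-byte SM_Key, then the 16-bit big-endian AttributeOffset at byte 44 (in 8-byte units), with SA data starting at byte 56. It also assumes `len` is the real length of the reassembled response (header plus record data), not the size of a padded receive buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed SA MAD layout (IBA SA MAD format):
 *   bytes  0..23  common MAD header
 *   bytes 24..35  RMPP header
 *   bytes 36..43  SM_Key
 *   bytes 44..45  AttributeOffset, big-endian, in 8-byte units
 *   bytes 46..47  reserved
 *   bytes 48..55  ComponentMask
 *   bytes 56..    SA data (records, back to back) */
#define SA_ATTR_OFFSET_POS 44
#define SA_DATA_POS        56

/* Size in bytes of one record, taken from AttributeOffset. */
static size_t sa_record_size(const uint8_t *mad)
{
    uint16_t units = (uint16_t)((mad[SA_ATTR_OFFSET_POS] << 8) |
                                mad[SA_ATTR_OFFSET_POS + 1]);
    return (size_t)units * 8;
}

/* Number of whole records in a response whose valid length is `len`. */
static size_t sa_record_count(const uint8_t *mad, size_t len)
{
    size_t rec = sa_record_size(mad);

    if (rec == 0 || len <= SA_DATA_POS)
        return 0;
    return (len - SA_DATA_POS) / rec;
}

/* Hypothetical replacement for an osmv_get_query_*()-style accessor:
 * returns a pointer to record `i`, or NULL if `i` is out of range. */
static const uint8_t *sa_record_at(const uint8_t *mad, size_t len, size_t i)
{
    if (i >= sa_record_count(mad, len))
        return NULL;
    return mad + SA_DATA_POS + i * sa_record_size(mad);
}
```

For example, a PathRecord response carries an AttributeOffset of 8 (64-byte records), so a loop from 0 to sa_record_count() visits each PathRecord in turn without any OpenSM vendor-layer calls.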
Sasha From halr at obsidianresearch.com Tue Feb 3 06:57:33 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Tue, 03 Feb 2009 07:57:33 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_node.h: osm_node_get_num_physp description fix Message-ID: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca> Sasha, Trivial description change to osm_node_get_num_physp. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-opensm-osm_node.h-osm_node_get_num_physp-descriptio.patch Type: application/mbox Size: 844 bytes Desc: not available URL: From halr at obsidianresearch.com Tue Feb 3 06:57:36 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Tue, 03 Feb 2009 07:57:36 -0700 Subject: [ofa-general] [PATCH] opensm/osm_perfmgr.c: Increase size of memory allocation in __collect_guids Message-ID: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca> Sasha, Patch to increase size of monitored node in osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual port number. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: 0002-opensm-osm_perfmgr.c-Increase-size-of-memory-alloca.patch Type: application/mbox Size: 1508 bytes Desc: not available URL: From halr at obsidianresearch.com Tue Feb 3 06:57:50 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Tue, 03 Feb 2009 07:57:50 -0700 Subject: [ofa-general] [PATCH] opensm/osm_perfmgr_db.c: In bad_node_port, allow queries on enhanced SP0 Message-ID: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca> Sasha, Patch to osm_perfmgr_db.c to only error port 0 queries when not enhanced SP0. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0003-opensm-osm_perfmgr_db.c-In-bad_node_port-allow-que.patch Type: application/mbox Size: 3685 bytes Desc: not available URL: From jon at opengridcomputing.com Tue Feb 3 07:31:25 2009 From: jon at opengridcomputing.com (Jon Mason) Date: Tue, 3 Feb 2009 09:31:25 -0600 Subject: [ofa-general] Support for CXGB3 RNIC on P6 In-Reply-To: References: Message-ID: <20090203153124.GA13472@opengridcomputing.com> On Tue, Feb 03, 2009 at 08:55:14AM +0530, Krishna Kumar2 wrote: > > Hi, > > My colleague (at a different site) is trying to get couple of Chelsio RNIC > adapters working on > p6 systems but for some reason the cards aren't recognized on bootup. The > same cards works > on my xseries systems, and following are the messages I get (there are no > messages on his p6 > systems): > > Feb 1 11:42:49 localhost kernel: Chelsio T3 Network Driver - version > 1.1.1-ko > Feb 1 11:42:49 localhost kernel: cxgb3 0000:22:00.0: PCI INT A -> GSI 17 > (level, low) -> IRQ 17 > Feb 1 11:42:49 localhost kernel: input: Power Button (FF) as > /class/input/input1 > Feb 1 11:42:49 localhost kernel: ACPI: Power Button (FF) [PWRF] > Feb 1 11:42:49 localhost kernel: cxgb3 0000:22:00.0: Port 0 using 4 queue > sets. > Feb 1 11:42:49 localhost kernel: eth2: Chelsio T310 10GBASE-R RNIC (rev 4) > PCI Express x8 MSI-X > Feb 1 11:42:49 localhost kernel: eth2: 128MB CM, 256MB PMTX, 256MB PMRX, > S/N: PT49070050 > > Is this revision of cxgb3 (rev4) not supported on p6? Or are we missing > something to get it to work? Does the adapter show up on his system when he runs lspci? 
> > thanks, > > - KK > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue Feb 3 08:12:39 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 03 Feb 2009 08:12:39 -0800 Subject: [ofa-general] Support for CXGB3 RNIC on P6 In-Reply-To: (Krishna Kumar2's message of "Tue, 3 Feb 2009 08:55:14 +0530") References: Message-ID: > My colleague (at a different site) is trying to get couple of Chelsio RNIC > adapters working on > p6 systems but for some reason the cards aren't recognized on bootup. What do you mean by "aren't recognized on bootup"? More details on the specific problem are needed to diagnose it. - R. From jackm at dev.mellanox.co.il Tue Feb 3 08:16:41 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 3 Feb 2009 18:16:41 +0200 Subject: [ofa-general] Kernel panic in IPoIB stability testing Message-ID: <200902031816.41784.jackm@dev.mellanox.co.il> We saw the following kernel panic when testing ipoib stability intensively by simultaneously (i.e., in separate processes, with random wait intervals) doing: - ifconfig up/down - opensm up/down - ipoib ping - arp delete - driver up/down ib0: ib_sa_path_rec_get failed: -11 ib0: ib_sa_path_rec_get failed: -11 Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: [] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec PGD 224ea0067 PUD 225ae9067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /class/infiniband/mlx4_0/ports/2/pkeys/0 CPU 2 Modules linked in: netconsole nfsd exportfs autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_mod video sbs i2c_ec i2c_core 
button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_core(U) ide_cd sg k8_edac cdrom edac_mc bnx2 shpchp serio_raw pcspkr sata_svw libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 2051, comm: ipoib Not tainted 2.6.18-8.el5 #1 RIP: 0010:[] [] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec RSP: 0018:ffff810121ee7de0 EFLAGS: 00010046 RAX: ffff810121ee8538 RBX: ffffffffffffff30 RCX: 0000000000000002 RDX: ffff8102237a1f90 RSI: ffff8102261e90c0 RDI: ffff810121ee8500 RBP: ffff810121ee8500 R08: ffff810121ee6000 R09: 0000000000000000 R10: ffff810005116400 R11: 0000000000000002 R12: ffffffffffffff30 R13: 0000000000000000 R14: ffff810121ee8688 R15: ffffffff883ae8b3 FS: 00002aaaaaace2a0(0000) GS:ffff810127c4f3c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000224eef000 CR4: 00000000000006e0 Process ipoib (pid: 2051, threadinfo ffff810121ee6000, task ffff810227ebb860) Stack: ffff810121ee8500 ffff810121ee84f0 ffff810121ee8000 ffffffff883ae850 ffffffffffffffff 7fffffffffffffff ffffffffffffffff ffff810121ee8688 ffff810121ee8690 ffff810125d932c0 0000000000000282 ffffffff8004b2b4 Call Trace: [] :ib_ipoib:__ipoib_ib_dev_flush+0x175/0x1b6 [] run_workqueue+0x94/0xe5 [] worker_thread+0x0/0x122 [] keventd_create_kthread+0x0/0x61 [] worker_thread+0xf0/0x122 [] default_wake_function+0x0/0xe [] keventd_create_kthread+0x0/0x61 [] keventd_create_kthread+0x0/0x61 [] kthread+0xfe/0x132 [] child_rip+0xa/0x11 [] keventd_create_kthread+0x0/0x61 [] kthread+0x0/0x132 [] child_rip+0x0/0x11 Code: 4d 8b a4 24 d0 00 00 00 48 8d 93 d0 00 00 00 48 8d 45 38 49 RIP [] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec RSP CR2: 0000000000000000 <0>Kernel panic - not syncing: Fatal exception In objdump -ld, we get: ipoib_mark_paths_invalid(): /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:365 13f7: c7 83 e0 00 00 00 00 movl $0x0,0xe0(%rbx) 13fe: 00 00 00 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:361
    1401: 4c 89 e3 mov %r12,%rbx
==> 1404: 4d 8b a4 24 d0 00 00 mov 0xd0(%r12),%r12
    140b: 00
    140c: 48 8d 93 d0 00 00 00 lea 0xd0(%rbx),%rdx
    1413: 48 8d 45 38 lea 0x38(%rbp),%rax
    1417: 49 81 ec d0 00 00 00 sub $0xd0,%r12
    141e: 48 39 c2 cmp %rax,%rdx
    1421: 0f 85 4b ff ff ff jne 1372
--------------------------------
and in the source code, we get:

void ipoib_mark_paths_invalid(struct net_device *dev)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);
	struct ipoib_path *path, *tp;

	spin_lock_irq(&priv->lock);

==>	list_for_each_entry_safe(path, tp, &priv->path_list, list) {
		ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n",
			  be16_to_cpu(path->pathrec.dlid),
			  IPOIB_GID_ARG(path->pathrec.dgid));
		path->valid = 0;
	}

	spin_unlock_irq(&priv->lock);
}
--------------------------------------------
Any ideas?

- Jack
From yosefe at Voltaire.COM Tue Feb 3 08:47:28 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Tue, 03 Feb 2009 18:47:28 +0200 Subject: [ofa-general] Kernel panic in IPoIB stability testing In-Reply-To: <200902031816.41784.jackm@dev.mellanox.co.il> References: <200902031816.41784.jackm@dev.mellanox.co.il> Message-ID: <49887520.9080906@Voltaire.COM> What kernel and ofed version is it?
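The faulting loop is list_for_each_entry_safe(), which reads the next entry's address from path->list.next before running the body. The registers are consistent with a NULL next pointer: R12 and RBX hold 0xffffffffffffff30, i.e. 0 - 0xd0, which is what the `sub $0xd0,%r12` step produces from a NULL list pointer, and the following load from 0xd0(%r12) faults at address 0 (CR2). The userspace C sketch below uses simplified stand-ins (not the kernel's list.h; `struct path` and `drop_path()` are hypothetical) to show the invariant that avoids this: an entry must be unlinked from the list before its memory is released. It only illustrates the invariant; it is not a proposed IPoIB patch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified userspace stand-ins for the kernel's <linux/list.h>
 * (assumption: not the real kernel definitions). */
struct list_head {
    struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

static void list_del(struct list_head *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = node->prev = NULL; /* the kernel uses poison values here */
}

/* Hypothetical stand-in for struct ipoib_path. */
struct path {
    int valid;
    struct list_head list;
};

/* Same shape as list_for_each_entry_safe(): `tmp` is loaded from the
 * current node's next pointer before the body runs, so every node that
 * is still linked must point at live memory. Returns entries visited. */
static int mark_paths_invalid(struct list_head *head)
{
    struct list_head *pos, *tmp;
    int visited = 0;

    for (pos = head->next, tmp = pos->next; pos != head;
         pos = tmp, tmp = pos->next) {
        container_of(pos, struct path, list)->valid = 0;
        visited++;
    }
    return visited;
}

/* Error-path cleanup that keeps the list consistent: unlink first, then
 * free. In the kernel this unlink would happen under priv->lock. */
static void drop_path(struct path *p)
{
    list_del(&p->list);
    free(p);
}
```

Freeing an entry without the list_del() step would leave its neighbours pointing at released memory, which is the kind of state the oops above suggests.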
Jack Morgenstein wrote: > We saw the following kernel panic when testing ipoib stability intensively > by simultaneously (i.e., in separate processes, with random wait intervals) doing: > - ifconfig up/down > - opensm up/down > - ipoib ping > - arp delete > - driver up/down > > ib0: ib_sa_path_rec_get failed: -11 > ib0: ib_sa_path_rec_get failed: -11 > Unable to handle kernel NULL pointer dereference at 0000000000000000 > RIP: [] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec > PGD 224ea0067 PUD 225ae9067 PMD 0 > Oops: 0000 [1] SMP > last sysfs file: /class/infiniband/mlx4_0/ports/2/pkeys/0 > CPU 2 > Modules linked in: netconsole nfsd exportfs autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth > sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) > ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_mod video sbs i2c_ec > i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_core(U) ide_cd sg k8_edac > cdrom edac_mc bnx2 shpchp serio_raw pcspkr sata_svw libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd > Pid: 2051, comm: ipoib Not tainted 2.6.18-8.el5 #1 > RIP: 0010:[] [] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec > RSP: 0018:ffff810121ee7de0 EFLAGS: 00010046 > RAX: ffff810121ee8538 RBX: ffffffffffffff30 RCX: 0000000000000002 > RDX: ffff8102237a1f90 RSI: ffff8102261e90c0 RDI: ffff810121ee8500 > RBP: ffff810121ee8500 R08: ffff810121ee6000 R09: 0000000000000000 > R10: ffff810005116400 R11: 0000000000000002 R12: ffffffffffffff30 > R13: 0000000000000000 R14: ffff810121ee8688 R15: ffffffff883ae8b3 > FS: 00002aaaaaace2a0(0000) GS:ffff810127c4f3c0(0000) knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 0000000000000000 CR3: 0000000224eef000 CR4: 00000000000006e0 > Process ipoib (pid: 2051, threadinfo ffff810121ee6000, task ffff810227ebb860) > Stack: ffff810121ee8500 ffff810121ee84f0 
ffff810121ee8000 ffffffff883ae850 ffffffffffffffff 7fffffffffffffff > ffffffffffffffff ffff810121ee8688 ffff810121ee8690 ffff810125d932c0 0000000000000282 ffffffff8004b2b4 > Call Trace: [] :ib_ipoib:__ipoib_ib_dev_flush+0x175/0x1b6 > [] run_workqueue+0x94/0xe5 > [] worker_thread+0x0/0x122 > [] keventd_create_kthread+0x0/0x61 > [] worker_thread+0xf0/0x122 > [] default_wake_function+0x0/0xe > [] keventd_create_kthread+0x0/0x61 > [] keventd_create_kthread+0x0/0x61 > [] kthread+0xfe/0x132 > [] child_rip+0xa/0x11 > [] keventd_create_kthread+0x0/0x61 > [] kthread+0x0/0x132 > [] child_rip+0x0/0x11 > > Code: 4d 8b a4 24 d0 00 00 00 48 8d 93 d0 00 00 00 48 8d 45 38 49 > RIP [] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec > RSP > CR2: 0000000000000000 > <0>Kernel panic - not syncing: Fatal exception > > In objdump -ld, we get: > ipoib_mark_paths_invalid(): > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:365 > 13f7: c7 83 e0 00 00 00 00 movl $0x0,0xe0(%rbx) > 13fe: 00 00 00 > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:361 > 1401: 4c 89 e3 mov %r12,%rbx > ==> 1404: 4d 8b a4 24 d0 00 00 mov 0xd0(%r12),%r12 > 140b: 00 > 140c: 48 8d 93 d0 00 00 00 lea 0xd0(%rbx),%rdx > 1413: 48 8d 45 38 lea 0x38(%rbp),%rax > 1417: 49 81 ec d0 00 00 00 sub $0xd0,%r12 > 141e: 48 39 c2 cmp %rax,%rdx > 1421: 0f 85 4b ff ff ff jne 1372 > -------------------------------- > and in the source code, we get: > > void ipoib_mark_paths_invalid(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct ipoib_path *path, *tp; > > spin_lock_irq(&priv->lock); > > ==> list_for_each_entry_safe(path, tp, &priv->path_list, list) { > ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n", > be16_to_cpu(path->pathrec.dlid), > IPOIB_GID_ARG(path->pathrec.dgid)); > path->valid = 0; > } > > spin_unlock_irq(&priv->lock); > } > -------------------------------------------- > Any ideas? 
> > - Jack > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- --Yossi From sean.hefty at intel.com Tue Feb 3 09:37:28 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 3 Feb 2009 09:37:28 -0800 Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: References: Message-ID: >cma_acquire_dev --> cma_set_qkey/ps=IPOIB --> ib_sa_get_mcmember_rec where the >latter returns EADDRNOTAVAIL since when the port went down the core multicast >code Why is ib_sa_get_mcmember_rec being called? Or is this issue separate from what udaddy is showing? >I assume there must be a way to defer this resolving to a later stage such >that binding would be possible when the port is down, thoughts? Can you determine what call in the kernel is actually failing during the bind? (I can try testing later, but I'm not near any systems currently.) I'm wondering if the failure is coming from rdma_translate_ip()->ip_dev_find(). - Sean From yosefe at Voltaire.COM Tue Feb 3 09:56:40 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Tue, 03 Feb 2009 19:56:40 +0200 Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing In-Reply-To: <200902031816.41784.jackm@dev.mellanox.co.il> References: <200902031816.41784.jackm@dev.mellanox.co.il> Message-ID: <49888558.3050506@Voltaire.COM> I think it comes from unicast_arp_send(). Consider this scenario:
- paths are flushed (opensm up/down).
- unicast_arp_send() is called with a path in priv->path_list. path->valid is 0.
- path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails (no SM AH yet; see the prints just before the panic).
- unicast_arp_send() calls path_free().
- path memory is overwritten.
- __ipoib_dev_flush() is called again.
- mark_paths_invalid() tries to iterate over priv->path_list and gets kernel panic because path->list became invalid. --Yossi Jack Morgenstein wrote: > We saw the following kernel panic when testing ipoib stability intensively > by simultaneously (i.e., in separate processes, with random wait intervals) doing: > - ifconfig up/down > - opensm up/down > - ipoib ping > - arp delete > - driver up/down > > ib0: ib_sa_path_rec_get failed: -11 > ib0: ib_sa_path_rec_get failed: -11 > Unable to handle kernel NULL pointer dereference at 0000000000000000 > RIP: [] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec > PGD 224ea0067 PUD 225ae9067 PMD 0 > Oops: 0000 [1] SMP > last sysfs file: /class/infiniband/mlx4_0/ports/2/pkeys/0 > CPU 2 > Modules linked in: netconsole nfsd exportfs autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth > sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) > ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_mod video sbs i2c_ec > i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_core(U) ide_cd sg k8_edac > cdrom edac_mc bnx2 shpchp serio_raw pcspkr sata_svw libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd > Pid: 2051, comm: ipoib Not tainted 2.6.18-8.el5 #1 > RIP: 0010:[] [] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec > RSP: 0018:ffff810121ee7de0 EFLAGS: 00010046 > RAX: ffff810121ee8538 RBX: ffffffffffffff30 RCX: 0000000000000002 > RDX: ffff8102237a1f90 RSI: ffff8102261e90c0 RDI: ffff810121ee8500 > RBP: ffff810121ee8500 R08: ffff810121ee6000 R09: 0000000000000000 > R10: ffff810005116400 R11: 0000000000000002 R12: ffffffffffffff30 > R13: 0000000000000000 R14: ffff810121ee8688 R15: ffffffff883ae8b3 > FS: 00002aaaaaace2a0(0000) GS:ffff810127c4f3c0(0000) knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 0000000000000000 CR3: 0000000224eef000 CR4: 00000000000006e0 > Process 
ipoib (pid: 2051, threadinfo ffff810121ee6000, task ffff810227ebb860) > Stack: ffff810121ee8500 ffff810121ee84f0 ffff810121ee8000 ffffffff883ae850 ffffffffffffffff 7fffffffffffffff > ffffffffffffffff ffff810121ee8688 ffff810121ee8690 ffff810125d932c0 0000000000000282 ffffffff8004b2b4 > Call Trace: [] :ib_ipoib:__ipoib_ib_dev_flush+0x175/0x1b6 > [] run_workqueue+0x94/0xe5 > [] worker_thread+0x0/0x122 > [] keventd_create_kthread+0x0/0x61 > [] worker_thread+0xf0/0x122 > [] default_wake_function+0x0/0xe > [] keventd_create_kthread+0x0/0x61 > [] keventd_create_kthread+0x0/0x61 > [] kthread+0xfe/0x132 > [] child_rip+0xa/0x11 > [] keventd_create_kthread+0x0/0x61 > [] kthread+0x0/0x132 > [] child_rip+0x0/0x11 > > Code: 4d 8b a4 24 d0 00 00 00 48 8d 93 d0 00 00 00 48 8d 45 38 49 > RIP [] :ib_ipoib:ipoib_mark_paths_invalid+0xbc/0xec > RSP > CR2: 0000000000000000 > <0>Kernel panic - not syncing: Fatal exception > > In objdump -ld, we get: > ipoib_mark_paths_invalid(): > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:365 > 13f7: c7 83 e0 00 00 00 00 movl $0x0,0xe0(%rbx) > 13fe: 00 00 00 > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/ulp/ipoib/ipoib_main.c:361 > 1401: 4c 89 e3 mov %r12,%rbx > ==> 1404: 4d 8b a4 24 d0 00 00 mov 0xd0(%r12),%r12 > 140b: 00 > 140c: 48 8d 93 d0 00 00 00 lea 0xd0(%rbx),%rdx > 1413: 48 8d 45 38 lea 0x38(%rbp),%rax > 1417: 49 81 ec d0 00 00 00 sub $0xd0,%r12 > 141e: 48 39 c2 cmp %rax,%rdx > 1421: 0f 85 4b ff ff ff jne 1372 > -------------------------------- > and in the source code, we get: > > void ipoib_mark_paths_invalid(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct ipoib_path *path, *tp; > > spin_lock_irq(&priv->lock); > > ==> list_for_each_entry_safe(path, tp, &priv->path_list, list) { > ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n", > be16_to_cpu(path->pathrec.dlid), > IPOIB_GID_ARG(path->pathrec.dgid)); > path->valid = 0; > } > 
> spin_unlock_irq(&priv->lock); > } > -------------------------------------------- > Any ideas? > > - Jack From krkumar2 at in.ibm.com Tue Feb 3 10:21:02 2009 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 3 Feb 2009 23:51:02 +0530 Subject: [ofa-general] Support for CXGB3 RNIC on P6 In-Reply-To: References: Message-ID: Sorry for the vagueness. There are no messages in the /var/log (or the console/dmesg). To answer Jon's question: lspci doesn't show the device. No lights come up on the adapter, which I guess is the normal behavior till the device is succesfully probed and recognized. thanks, - KK Roland Dreier wrote on 02/03/2009 09:42:39 PM: > Roland Dreier > 02/03/2009 09:42 PM > > To > > Krishna Kumar2/India/IBM at IBMIN > > cc > > openfabrics > > Subject > > Re: [ofa-general] Support for CXGB3 RNIC on P6 > > > My colleague (at a different site) is trying to get couple of Chelsio RNIC > > adapters working on > > p6 systems but for some reason the cards aren't recognized on bootup. > > What do you mean by "aren't recognized on bootup"? More details on the > specific problem are needed to diagnose it. > > - R. From jon at opengridcomputing.com Tue Feb 3 10:43:44 2009 From: jon at opengridcomputing.com (Jon Mason) Date: Tue, 3 Feb 2009 12:43:44 -0600 Subject: [ofa-general] Support for CXGB3 RNIC on P6 In-Reply-To: References: Message-ID: <20090203184344.GB13472@opengridcomputing.com> On Tue, Feb 03, 2009 at 11:51:02PM +0530, Krishna Kumar2 wrote: > Sorry for the vagueness. There are no messages in the /var/log (or the > console/dmesg). > To answer Jon's question: lspci doesn't show the device. Can you please include the output of dmesg of the failing system as well as lspci output (lspci -x). Thanks, Jon > > No lights come up on the adapter, which I guess is the normal behavior till > the device is > succesfully probed and recognized. 
> > thanks, > > - KK > > Roland Dreier wrote on 02/03/2009 09:42:39 PM: > > > Roland Dreier > > 02/03/2009 09:42 PM > > > > To > > > > Krishna Kumar2/India/IBM at IBMIN > > > > cc > > > > openfabrics > > > > Subject > > > > Re: [ofa-general] Support for CXGB3 RNIC on P6 > > > > > My colleague (at a different site) is trying to get couple of Chelsio > RNIC > > > adapters working on > > > p6 systems but for some reason the cards aren't recognized on bootup. > > > > What do you mean by "aren't recognized on bootup"? More details on the > > specific problem are needed to diagnose it. > > > > - R. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ralph.campbell at qlogic.com Tue Feb 3 11:26:12 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 3 Feb 2009 11:26:12 -0800 Subject: [ofa-general] Possible memory leak and null pointer dereference in local_completions() Message-ID: <1233689172.23327.155.camel@chromite.mv.qlogic.com> I was doing some tests with different MAD packets and then reading the infiniband/core/mad.c code. handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local on the mad_agent_priv->local_work work queue with local->mad_priv == NULL if device->process_mad() returns IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and (!ib_response_mad(&mad_priv->mad.mad) || !mad_agent_priv->agent.recv_handler). In this case, local_completions() will be called with local->mad_priv == NULL. The code does check for this case and skips calling recv_mad_agent->agent.recv_handler(). This means recv == 0 so kmem_cache_free() is called with a NULL pointer. Even if local->mad_priv != NULL, I don't see how local->mad_priv is freed when recv == 1. Thus, it appears to be a memory leak. 
So, I'm proposing the following patch:

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 5c54fc2..93d80e5 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2356,7 +2356,6 @@ static void local_completions(struct work_struct *work)
 	struct ib_mad_local_private *local;
 	struct ib_mad_agent_private *recv_mad_agent;
 	unsigned long flags;
-	int recv = 0;
 	struct ib_wc wc;
 	struct ib_mad_send_wc mad_send_wc;
 
@@ -2377,7 +2376,6 @@ static void local_completions(struct work_struct *work)
 				goto local_send_completion;
 			}
 
-			recv = 1;
 			/*
 			 * Defined behavior is to complete response
 			 * before request
@@ -2422,7 +2420,7 @@ local_send_completion:
 
 		spin_lock_irqsave(&mad_agent_priv->lock, flags);
 		atomic_dec(&mad_agent_priv->refcount);
-		if (!recv)
+		if (local->mad_priv)
 			kmem_cache_free(ib_mad_cache, local->mad_priv);
 		kfree(local);
 	}
From swise at opengridcomputing.com Tue Feb 3 13:05:24 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 03 Feb 2009 15:05:24 -0600 Subject: [ofa-general] Support for CXGB3 RNIC on P6 In-Reply-To: References: Message-ID: <4988B194.6010706@opengridcomputing.com> Krishna Kumar2 wrote: > Sorry for the vagueness. There are no messages in the /var/log (or the > console/dmesg). > To answer Jon's question: lspci doesn't show the device. > > No lights come up on the adapter, which I guess is the normal behavior till > the device is > succesfully probed and recognized. > True. > thanks, > > - KK > > What pci-e slot config? 4x? 8x?
From weiny2 at llnl.gov Tue Feb 3 15:47:32 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 3 Feb 2009 15:47:32 -0800 Subject: [ofa-general] [PATCH] libibmad: Declare some enums as typedefs for cleaner function interfaces In-Reply-To: <475BCB11F74B45BB8D8794BAEEC380C2@amr.corp.intel.com> References: <20090202185425.729a80b3.weiny2@llnl.gov> <475BCB11F74B45BB8D8794BAEEC380C2@amr.corp.intel.com> Message-ID: <20090203154732.1fc07a44.weiny2@llnl.gov> On Mon, 2 Feb 2009 21:29:16 -0800 "Sean Hefty" wrote: > >@@ -595,21 +595,21 @@ typedef struct ib_vendor_call { > > #define MAD_DEF_RETRIES 3 > > #define MAD_DEF_TIMEOUT_MS 1000 > > > >-enum { > >+typedef enum { > > IB_DEST_LID, > > IB_DEST_DRPATH, > > IB_DEST_GUID, > > IB_DEST_DRSLID, > >-}; > >+} mad_dest_t; > > > >-enum { > >+typedef enum { > > IB_NODE_CA = 1, > > IB_NODE_SWITCH, > > IB_NODE_ROUTER, > > NODE_RNIC, > > > > IB_NODE_MAX = NODE_RNIC > >-}; > >+} mad_node_type_t; > > For consistency, should these be named enums? (MAD_DEST and MAD_NODE_TYPE) Sure, patch attached. 
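Sean's suggestion of giving the enums tag names alongside the typedefs can be illustrated with a small self-contained C sketch. The enumerators are copied from the patch; `dest_name()` is a hypothetical helper, not a libibmad function. Both `enum MAD_DEST` and `mad_dest_t` name the same type, so code that spells out the enum tag and code that uses the typedef can be mixed freely.

```c
/* Named enum tag plus typedef, as in the revised patch. */
typedef enum MAD_DEST {
    IB_DEST_LID,
    IB_DEST_DRPATH,
    IB_DEST_GUID,
    IB_DEST_DRSLID,
} mad_dest_t;

/* Hypothetical helper: the typed parameter documents which values are
 * expected. There is deliberately no default label, so -Wswitch can
 * flag any enumerator that is added later but not handled here. */
static const char *dest_name(mad_dest_t dest)
{
    switch (dest) {
    case IB_DEST_LID:
        return "lid";
    case IB_DEST_DRPATH:
        return "drpath";
    case IB_DEST_GUID:
        return "guid";
    case IB_DEST_DRSLID:
        return "drslid";
    }
    return "unknown";
}
```

One practical benefit of the typed parameter: with GCC's -Wswitch (enabled by -Wall), a switch over a mad_dest_t that omits an enumerator and has no default label draws a warning. C still converts plain ints to the enum implicitly, so the typedef documents intent rather than enforcing it.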
Ira >From ec5d9def3e92ee7d5ac245401c99de49c5a90e0e Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Mon, 2 Feb 2009 10:21:18 -0800 Subject: [PATCH] Declare some enums as typedefs for cleaner function interfaces Signed-off-by: weiny2 at llnl.gov --- libibmad/include/infiniband/mad.h | 38 ++++++++++++++++++------------------ libibmad/src/fields.c | 22 ++++++++++---------- libibmad/src/resolve.c | 10 ++++---- 3 files changed, 35 insertions(+), 35 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 9ff4a3e..61d0a73 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -203,7 +203,7 @@ typedef struct ib_field { ib_mad_dump_fn *def_dump_fn; } ib_field_t; -enum MAD_FIELDS { +typedef enum MAD_FIELDS { IB_NO_FIELD, IB_GID_PREFIX_F, @@ -525,7 +525,7 @@ enum MAD_FIELDS { IB_GUID_GUID0_F, IB_FIELD_LAST_ /* must be last */ -}; +} mad_field_t; /* * SA RMPP section @@ -595,21 +595,21 @@ typedef struct ib_vendor_call { #define MAD_DEF_RETRIES 3 #define MAD_DEF_TIMEOUT_MS 1000 -enum { +typedef enum MAD_DEST { IB_DEST_LID, IB_DEST_DRPATH, IB_DEST_GUID, IB_DEST_DRSLID, -}; +} mad_dest_t; -enum { +typedef enum MAD_NODE_TYPE { IB_NODE_CA = 1, IB_NODE_SWITCH, IB_NODE_ROUTER, NODE_RNIC, IB_NODE_MAX = NODE_RNIC -}; +} mad_node_type_t; /******************************************************************************/ @@ -631,20 +631,20 @@ static inline int ib_portid_set(ib_portid_t * portid, int lid, int qp, int qkey) } /* fields.c */ -MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field); -MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field, +MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field); +MAD_EXPORT void mad_set_field(void *buf, int base_offs, mad_field_t field, uint32_t val); /* field must be byte aligned */ -MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field); -MAD_EXPORT void mad_set_field64(void *buf, int 
base_offs, int field, +MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field); +MAD_EXPORT void mad_set_field64(void *buf, int base_offs, mad_field_t field, uint64_t val); -MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val); -MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val); -MAD_EXPORT void mad_decode_field(uint8_t * buf, int field, void *val); -MAD_EXPORT void mad_encode_field(uint8_t * buf, int field, void *val); -MAD_EXPORT int mad_print_field(int field, const char *name, void *val); -MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val); -MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val); +MAD_EXPORT void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val); +MAD_EXPORT void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val); +MAD_EXPORT void mad_decode_field(uint8_t * buf, mad_field_t field, void *val); +MAD_EXPORT void mad_encode_field(uint8_t * buf, mad_field_t field, void *val); +MAD_EXPORT int mad_print_field(mad_field_t field, const char *name, void *val); +MAD_EXPORT char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val); +MAD_EXPORT char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val); /* mad.c */ MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, @@ -729,7 +729,7 @@ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout); MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid, ib_portid_t * sm_id, int timeout); MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, - int dest_type, ib_portid_t * sm_id); + mad_dest_t dest, ib_portid_t * sm_id); MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid); @@ -737,7 +737,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * 
guid, ib_portid_t * sm_id, int timeout, const void *srcport); int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, - int dest_type, ib_portid_t * sm_id, + mad_dest_t dest, ib_portid_t * sm_id, const void *srcport); int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, const void *srcport); diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index d5a1eb4..d435a2f 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -479,37 +479,37 @@ static void _get_array(void *buf, int base_offs, const ib_field_t * f, memcpy(val, (uint8_t *) buf + base_offs + bitoffs / 8, f->bitlen / 8); } -uint32_t mad_get_field(void *buf, int base_offs, int field) +uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field) { return _get_field(buf, base_offs, ib_mad_f + field); } -void mad_set_field(void *buf, int base_offs, int field, uint32_t val) +void mad_set_field(void *buf, int base_offs, mad_field_t field, uint32_t val) { _set_field(buf, base_offs, ib_mad_f + field, val); } -uint64_t mad_get_field64(void *buf, int base_offs, int field) +uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field) { return _get_field64(buf, base_offs, ib_mad_f + field); } -void mad_set_field64(void *buf, int base_offs, int field, uint64_t val) +void mad_set_field64(void *buf, int base_offs, mad_field_t field, uint64_t val) { _set_field64(buf, base_offs, ib_mad_f + field, val); } -void mad_set_array(void *buf, int base_offs, int field, void *val) +void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val) { _set_array(buf, base_offs, ib_mad_f + field, val); } -void mad_get_array(void *buf, int base_offs, int field, void *val) +void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val) { _get_array(buf, base_offs, ib_mad_f + field, val); } -void mad_decode_field(uint8_t * buf, int field, void *val) +void mad_decode_field(uint8_t * buf, mad_field_t field, void *val) { const ib_field_t *f = ib_mad_f + 
field; @@ -528,7 +528,7 @@ void mad_decode_field(uint8_t * buf, int field, void *val) _get_array(buf, 0, f, val); } -void mad_encode_field(uint8_t * buf, int field, void *val) +void mad_encode_field(uint8_t * buf, mad_field_t field, void *val) { const ib_field_t *f = ib_mad_f + field; @@ -602,21 +602,21 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val, valsz ? valsz : ALIGN(f->bitlen, 8) / 8); } -int mad_print_field(int field, const char *name, void *val) +int mad_print_field(mad_field_t field, const char *name, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return -1; return _mad_print_field(ib_mad_f + field, name, val, 0); } -char *mad_dump_field(int field, char *buf, int bufsz, void *val) +char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return 0; return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val); } -char *mad_dump_val(int field, char *buf, int bufsz, void *val) +char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return 0; diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index b62360b..faac1f9 100644 --- a/libibmad/src/resolve.c +++ b/libibmad/src/resolve.c @@ -92,7 +92,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, } int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, - int dest_type, ib_portid_t * sm_id, + mad_dest_t dest, ib_portid_t * sm_id, const void *srcport) { uint64_t guid; @@ -101,7 +101,7 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, ib_portid_t selfportid = { 0 }; int selfport = 0; - switch (dest_type) { + switch (dest) { case IB_DEST_LID: lid = strtol(addr_str, 0, 0); if (!IB_LID_VALID(lid)) @@ -136,16 +136,16 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, return 0; default: - IBWARN("bad dest_type %d", dest_type); + IBWARN("bad dest %d", 
dest);
 	}
 	return -1;
 }
 
-int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, int dest_type,
+int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, mad_dest_t dest,
 			  ib_portid_t * sm_id)
 {
-	return ib_resolve_portid_str_via(portid, addr_str, dest_type,
+	return ib_resolve_portid_str_via(portid, addr_str, dest,
 					 sm_id, NULL);
 }
-- 
1.5.4.5

From arlin.r.davis at intel.com Tue Feb 3 16:17:13 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Tue, 3 Feb 2009 16:17:13 -0800
Subject: [ofa-general] RE: dapl attribute bug
In-Reply-To: <49871E6A.9000901@opengridcomputing.com>
References: <49871E6A.9000901@opengridcomputing.com>
Message-ID:

>The DAPL dat_ia_attr->max_lmr_block_size is a u32, yet the dapl code
>maps this to the linux ib_device_attr->max_mr_size which is u64.
>
>This causes dapltest to fail in some cases when running over chelsio
>which sets max_mr_size to 0x100000000 (4GB). The dapl code truncates
>the value to 0. See dapl/openib_cma/dapl_ib_util.c.
>
>I'm not sure what the fix should be, but maybe the dapl code should set
>anything over 32 bits to 0xffffffff?
>

This attribute changed with DAT 2.0 to match the 32-bit ibv_sge length
field. Since there is no direct mapping for max lmr segments, I will need
to add some checks when setting max_lmr_block_size from max_mr_size.

Thanks.

-arlin

From purdy at sgi.com Tue Feb 3 18:09:08 2009
From: purdy at sgi.com (Dale Purdy)
Date: Tue, 3 Feb 2009 20:09:08 -0600
Subject: [ofa-general] ibdiagnet and ibdmchk credit loop checks
Message-ID: <20090204020908.GA29008@sgi.com>

The ibdiagnet and ibdmchk utilities can report on credit loops in the
topology, but are heavily oriented towards UpDown routing. Each of these
utilities tries to rank the switches and automatically determine root
nodes for an UpDown routing engine. It is important to check for credit
loops with other routing engines too, but with those engines these
utilities can report incorrect results.
If ibdmchk thinks it has found root nodes, it determines credit loops by
checking whether the up/down rules are followed with respect to those
roots, which can give the wrong answer. If ibdmchk fails to find root
nodes, it falls back to a real credit loop check, a DFS over the
dependency graph. The automatic root detection can be overridden by
supplying an explicit root_guids file. Why doesn't it just do the real
credit loop check in general? Presumably checking the up/down rules is
cheaper when UpDown routing is actually being used.

The following change fixes ibdmchk so that it only uses root nodes and
up/down rules when UpDown routing is being used (by specifying -u on the
command line), and otherwise does a real credit loop check:

diff --git a/ibdm/src/osm_check.cpp b/ibdm/src/osm_check.cpp
index 1c18c1c..d8a4202 100644
--- a/ibdm/src/osm_check.cpp
+++ b/ibdm/src/osm_check.cpp
@@ -568,7 +568,7 @@ int main (int argc, char **argv) {
       rootNodes = SubnMgtFindRootNodesByMinHop(&fabric);
     }
 
-    if (!rootNodes.empty()) {
+    if (UseUpDown && !rootNodes.empty()) {
       cout << "-I- Recognized " << rootNodes.size() << " root nodes:" << endl;
       for (list ::iterator nI = rootNodes.begin();
            nI != rootNodes.end(); nI++) {

ibdiagnet -r is a slightly different story. It is more useful for
checking a running machine. However, it doesn't seem to have any options
for indicating whether UpDown routing is being used, or for supplying a
root_guids file, and it just checks the up/down rules against its own
idea of the root nodes, whether that makes sense or not. Can ibdiagnet
be changed to just do the real credit loop check? I am not familiar with
tcl and haven't been able to determine what to change.
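Dale distinguishes checking up/down rules against a guessed set of roots from a real credit loop check, a DFS over the dependency graph. As a rough illustration of the latter, here is a minimal C sketch (not the ibdm code) that assumes a hypothetical adjacency-matrix channel dependency graph: node i is a directed link (channel), and `dep[i][j]` is nonzero when some route holds channel i while requesting channel j. A back edge found during the DFS means the graph has a cycle, which is the condition for a potential credit loop. `MAX_CH`, `has_credit_loop()` and the matrix layout are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CH 16  /* illustrative upper bound on channels */

/* DFS colors: WHITE = unvisited, GRAY = on the current DFS path, BLACK = finished. */
enum color { WHITE, GRAY, BLACK };

/* dep[i][j] != 0 means some route holds channel i while requesting channel j. */
static bool dfs_finds_cycle(int n, int dep[MAX_CH][MAX_CH],
                            enum color col[MAX_CH], int u)
{
    col[u] = GRAY;
    for (int v = 0; v < n; v++) {
        if (!dep[u][v])
            continue;
        if (col[v] == GRAY)  /* back edge: dependency cycle */
            return true;
        if (col[v] == WHITE && dfs_finds_cycle(n, dep, col, v))
            return true;
    }
    col[u] = BLACK;
    return false;
}

/* Returns true if the channel dependency graph contains a cycle. */
bool has_credit_loop(int n, int dep[MAX_CH][MAX_CH])
{
    enum color col[MAX_CH];

    assert(n >= 0 && n <= MAX_CH);
    for (int i = 0; i < n; i++)
        col[i] = WHITE;
    for (int i = 0; i < n; i++)
        if (col[i] == WHITE && dfs_finds_cycle(n, dep, col, i))
            return true;
    return false;
}
```

With `dep[0][1]`, `dep[1][2]` and `dep[2][0]` set, `has_credit_loop(3, dep)` returns true; clearing `dep[2][0]` breaks the ring and it returns false. Unlike the up/down check, this test needs no notion of root nodes, so it applies to any routing engine.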
-- 
Dale

From jackm at dev.mellanox.co.il Tue Feb 3 22:46:48 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 4 Feb 2009 08:46:48 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <49888558.3050506@Voltaire.COM>
References: <200902031816.41784.jackm@dev.mellanox.co.il>
	<49888558.3050506@Voltaire.COM>
Message-ID: <200902040846.48370.jackm@dev.mellanox.co.il>

On Tuesday 03 February 2009 19:56, Yossi Etigin wrote:
> I think it comes from unicast_arp_send.
> Consider this scenario:
> - paths are flushed (opensm up/down).
> - unicast_arp_send() is called with a path in priv->path_list. path->valid is 0.
> - path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails - no sm ah (yet)
>   (see the prints just before the panic).
> - unicast_arp_send() calls path_free().
> - path memory is overwritten.
> - __ipoib_dev_flush() is called again.
> - mark_paths_invalid() tries to iterate over priv->path_list and gets kernel panic
>   because path->list became invalid.

I think you are right.

In unicast_arp_send, we have the following code:

	path = __path_find(dev, phdr->hwaddr + 4);
	if (!path || !path->valid) {
		if (!path)
			path = path_rec_create(dev, phdr->hwaddr + 4);
		if (path) {
			/* put pseudoheader back on for next time */
			skb_push(skb, sizeof *phdr);
			__skb_queue_tail(&path->queue, skb);

			if (path_rec_start(dev, path)) {
				spin_unlock(&priv->lock);
				path_free(dev, path);
				return;
			} else
				__path_add(dev, path);
		} else {
			++dev->stats.tx_dropped;
			dev_kfree_skb_any(skb);
		}

		spin_unlock(&priv->lock);
		return;
	}

It was originally written without the path->valid check in the "if", and
so it assumed that the path record was allocated within the "if". In that
case, the path record had not yet been inserted into the path list. When
you added the "valid" processing, you did not take this into account.
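The failure Yossi describes, freeing a path that is still linked on `priv->path_list` and later walking that list, comes down to the order of unlink and free. The user-space C sketch below illustrates that ordering with a tiny intrusive doubly linked list modeled loosely on the kernel's `list_head`; `struct path`, `path_abort()` and the other names are hypothetical, and the sketch ignores locking and the rb-tree that IPoIB also keeps paths in. A node that was already on the list is unlinked before being freed; a freshly created node that was never added is freed directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Minimal intrusive doubly linked list, loosely modeled on list_head. */
struct list_node {
    struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void list_del(struct list_node *n)
{
    assert(n->prev != NULL && n->next != NULL);  /* not already unlinked */
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = NULL;
}

static int list_count(const struct list_node *head)
{
    int count = 0;
    for (const struct list_node *n = head->next; n != head; n = n->next)
        count++;
    return count;
}

/* Hypothetical stand-in for an IPoIB path record. */
struct path {
    struct list_node list;
    int valid;
};

/*
 * Error path when the path-record query cannot be started.  A path that
 * was found on the list (had_path) must be unlinked before it is freed;
 * a newly created path was never added, so it can be freed directly.
 */
static void path_abort(struct path *p, bool had_path)
{
    if (had_path)
        list_del(&p->list);
    free(p);
}
```

Calling `path_abort(p, false)` on a node that is still linked would reproduce the bug: the list would keep a pointer into freed memory, and the next walk over it would touch freed memory.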
You need code something like the following:

	path = __path_find(dev, phdr->hwaddr + 4);
	if (!path || !path->valid) {
		int had_path = 0;

		if (!path)
			path = path_rec_create(dev, phdr->hwaddr + 4);
		else
			had_path = 1;
		if (path) {
			/* put pseudoheader back on for next time */
			skb_push(skb, sizeof *phdr);
			__skb_queue_tail(&path->queue, skb);

			if (path_rec_start(dev, path)) {
				if (had_path) {
					/* detach from path list here under spinlock */
				}
				spin_unlock(&priv->lock);
				path_free(dev, path);
				return;
			} else if (!had_path)
				__path_add(dev, path);
		} else {
			++dev->stats.tx_dropped;
			dev_kfree_skb_any(skb);
		}

		spin_unlock(&priv->lock);
		return;
	}

- Jack

From mkatiyar at gmail.com Tue Feb 3 22:54:05 2009
From: mkatiyar at gmail.com (Manish Katiyar)
Date: Wed, 4 Feb 2009 12:24:05 +0530
Subject: [ofa-general] Re: [PATCH] : Define debugging variables only when CONFIG_INFINIBAND_NES_DEBUG is enabled
In-Reply-To:
References:
Message-ID:

On Tue, Jan 27, 2009 at 11:58 PM, Manish Katiyar wrote:
> Below patch removes following compilation warnings :
> drivers/infiniband/hw/nes/nes_cm.c:781: warning: unused variable 'tmp_addr'
> drivers/infiniband/hw/nes/nes_cm.c:820: warning: unused variable 'tmp_addr'
>

Hi,

Any feedback on this?
Thanks -
manish

>
> Signed-off-by: Manish Katiyar
> ---
>  drivers/infiniband/hw/nes/nes_cm.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/infiniband/hw/nes/nes_cm.c
> b/drivers/infiniband/hw/nes/nes_cm.c
> index a01b448..2b34859 100644
> --- a/drivers/infiniband/hw/nes/nes_cm.c
> +++ b/drivers/infiniband/hw/nes/nes_cm.c
> @@ -778,7 +778,9 @@ static struct nes_cm_node *find_node(struct
> nes_cm_core *cm_core,
>  	unsigned long flags;
>  	struct list_head *hte;
>  	struct nes_cm_node *cm_node;
> +#ifdef CONFIG_INFINIBAND_NES_DEBUG
>  	__be32 tmp_addr = cpu_to_be32(loc_addr);
> +#endif
>
>  	/* get a handle on the hte */
>  	hte = &cm_core->connected_nodes;
> @@ -817,7 +819,9 @@ static struct nes_cm_listener
> *find_listener(struct nes_cm_core *cm_core,
>  {
>  	unsigned long flags;
>  	struct nes_cm_listener *listen_node;
> +#ifdef CONFIG_INFINIBAND_NES_DEBUG
>  	__be32 tmp_addr = cpu_to_be32(dst_addr);
> +#endif
>
>  	/* walk list and find cm_node associated with this session ID */
>  	spin_lock_irqsave(&cm_core->listen_list_lock, flags);
> --
> 1.5.4.3
>
>
> Thanks -
> Manish
>

From ogerlitz at voltaire.com Tue Feb 3 23:11:43 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 04 Feb 2009 09:11:43 +0200
Subject: [ofa-general] Re: impossibility to bind a device/port with the rdma-cm when the port is down
In-Reply-To:
References:
Message-ID: <49893FAF.3090007@voltaire.com>

Sean Hefty wrote:
> Why is ib_sa_get_mcmember_rec being called? Or is this issue separate from what
> udaddy is showing?

The IPOIB port space allows for UD interoperability between IPoIB and
RDMA-CM based apps. To that end, among other parameters such as the MGID
derivation, the qkey used for the UD QP must be the same. To achieve that,
a query on the broadcast group is made to the core multicast database to
retrieve the associated record, from which the qkey is extracted.
When the port goes down, this db is flushed and ib_sa_get_mcmember_rec
returns EADDRNOTAVAIL, which is exactly what udaddy is getting (you can
also get it with mckey).

> Can you determine what call in the kernel is actually failing during the bind?

mcast_find, called from ib_sa_get_mcmember_rec.

Or.

From jackm at dev.mellanox.co.il Tue Feb 3 23:20:25 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 4 Feb 2009 09:20:25 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <200902040846.48370.jackm@dev.mellanox.co.il>
References: <200902031816.41784.jackm@dev.mellanox.co.il>
	<49888558.3050506@Voltaire.COM>
	<200902040846.48370.jackm@dev.mellanox.co.il>
Message-ID: <200902040920.26062.jackm@dev.mellanox.co.il>

On Wednesday 04 February 2009 08:46, Jack Morgenstein wrote:
> On Tuesday 03 February 2009 19:56, Yossi Etigin wrote:
> > I think it comes from unicast_arp_send.
> > Consider this scenario:
> > - paths are flushed (opensm up/down).
> > - unicast_arp_send() is called with a path in priv->path_list. path->valid is 0.
> > - path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails - no sm ah (yet)
> >   (see the prints just before the panic).
> > - unicast_arp_send() calls path_free().
> > - path memory is overwritten.
> > - __ipoib_dev_flush() is called again.
> > - mark_paths_invalid() tries to iterate over priv->path_list and gets kernel panic
> >   because path->list became invalid.
>
> I think you are right.
How about this:

	path = __path_find(dev, phdr->hwaddr + 4);
	if (!path || !path->valid) {
		int had_path = 0;

		if (!path)
			path = path_rec_create(dev, phdr->hwaddr + 4);
		else
			had_path = 1;
		if (path) {
			/* put pseudoheader back on for next time */
			skb_push(skb, sizeof *phdr);
			__skb_queue_tail(&path->queue, skb);

			if (path_rec_start(dev, path)) {
				if (had_path) {
					list_del(&path->list);
					rb_erase(&path->rb_node, &priv->path_tree);
				}
				spin_unlock(&priv->lock);
				path_free(dev, path);
				return;
			} else if (!had_path)
				__path_add(dev, path);
		} else {
			++dev->stats.tx_dropped;
			dev_kfree_skb_any(skb);
		}

		spin_unlock(&priv->lock);
		return;
	}

- Jack

From dorfman.eli at gmail.com Wed Feb 4 00:00:05 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Wed, 04 Feb 2009 10:00:05 +0200
Subject: [ofa-general] [PATCH] libibmad/src/dump.c fix dump functions for big endian machines
Message-ID: <49894B05.1090608@gmail.com>

fix dump functions for big endian machines

Signed-off-by: Eli Dorfman
---
 libibmad/src/dump.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c
index 1cf5232..3b49158 100644
--- a/libibmad/src/dump.c
+++ b/libibmad/src/dump.c
@@ -46,10 +46,10 @@ void mad_dump_int(char *buf, int bufsz, void *val, int valsz)
 {
 	switch (valsz) {
 	case 1:
-		snprintf(buf, bufsz, "%d", *(uint8_t *) val);
+		snprintf(buf, bufsz, "%d", *(uint32_t *) val & 0xff);
 		break;
 	case 2:
-		snprintf(buf, bufsz, "%d", *(uint16_t *) val);
+		snprintf(buf, bufsz, "%d", *(uint32_t *) val & 0xffff);
 		break;
 	case 3:
 	case 4:
@@ -71,10 +71,10 @@ void mad_dump_uint(char *buf, int bufsz, void *val, int valsz)
 {
 	switch (valsz) {
 	case 1:
-		snprintf(buf, bufsz, "%u", *(uint8_t *) val);
+		snprintf(buf, bufsz, "%u", *(uint32_t *) val & 0xff);
 		break;
 	case 2:
-		snprintf(buf, bufsz, "%u", *(uint16_t *) val);
+		snprintf(buf, bufsz, "%u", *(uint32_t *) val & 0xffff);
 		break;
 	case 3:
 	case 4:
@@ -96,10 +96,10 @@ void mad_dump_hex(char *buf, int
bufsz, void *val, int valsz) { switch (valsz) { case 1: - snprintf(buf, bufsz, "0x%02x", *(uint8_t *) val); + snprintf(buf, bufsz, "0x%02x", *(uint32_t *) val & 0xff); break; case 2: - snprintf(buf, bufsz, "0x%04x", *(uint16_t *) val); + snprintf(buf, bufsz, "0x%04x", *(uint32_t *) val & 0xffff); break; case 3: snprintf(buf, bufsz, "0x%06x", *(uint32_t *) val & 0xffffff); @@ -132,10 +132,10 @@ void mad_dump_rhex(char *buf, int bufsz, void *val, int valsz) { switch (valsz) { case 1: - snprintf(buf, bufsz, "%02x", *(uint8_t *) val); + snprintf(buf, bufsz, "%02x", *(uint32_t *) val & 0xff); break; case 2: - snprintf(buf, bufsz, "%04x", *(uint16_t *) val); + snprintf(buf, bufsz, "%04x", *(uint32_t *) val & 0xffff); break; case 3: snprintf(buf, bufsz, "%06x", *(uint32_t *) val & 0xffffff); -- 1.5.5 From kliteyn at dev.mellanox.co.il Wed Feb 4 02:19:08 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 04 Feb 2009 12:19:08 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c : Fixed bug on index port order incrementation In-Reply-To: <4981DC18.9030400@ext.bull.net> References: <4981DC18.9030400@ext.bull.net> Message-ID: <49896B9C.8040006@dev.mellanox.co.il> Hi Nicolas, Nicolas Morey Chaisemartin wrote: > Hello, > > While doing some routing analysis on fat tree using ibsim we found a > "bug" in the fat-tree algorithm. > Problem happens with a 4 level Fat tree as below: > > > L3 L3 > ___________________|__|____________________ > / / \ \ <= All > the L2 are connected on 2 L3 switches > L2-1 L2-2 L2-1 L2-2 > / / \ \ <== > The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning. > L1 L1 L1 L1 > /|\ /|\ /|\ /|\ > ==Fully mixed to L1== ==Fully mixed to L1== <=== We have > multiple set. In each set, all L0 lead to all L1 of their set. > > L0 L0 L0 L0 > / \ / \ / \ / \ > CN CN .. CN CN .... CN CN .. CN CN > > > To detail: > We have a bunch of sets. Each set contains compute node, L0 and L1 > switches. 
> Plus a common top of L2 and L3 switches.
>
> In each set, there are groups of compute nodes. Each group is connected
> to a single L0 switch.
> In a given set, all L0 are connected to all L1.
>
> The Nth L1 of a set is connected to the Nth L2 and only to this one. (so
> through a L2, the Nth L1 can only see the Nth L1 of the other sets)
> All the L2 are connected to a couple of L3.
>
>
> If we dont put the L3. We have a perfectly equilibrated fat tree and
> well equilibrated routes.
> But when we add the L3, it introduce a huge difference. As it is not
> necessary, no route is going through L3 (which is fine).
> However 1/4 of L2->L1 routes is not used at all, 1/2 is half used and
> 1/4 is twice overused (compared to the equilibrate state).
>
> This comes from the down_port_groups_idx which is incremented each time
> the algorithm goes down through a node whether it creates routes to HCA
> (port != switch)
> or not. As route coming up from a L1 reaches only one L2, the algorithm
> goes through all the other L2 while going down, incrementing their index.
> Our case here is a bit specific but in a case where your L1 doesn't have
> full connectivity to all your L2, and another switch rank above, the
> problem may appear.
>
> To avoid this problem, I've changed the
> __osm_ftree_fabric_route_upgoing_by_going_down function so it returns a
> value to indicate if routes to HCA (in fact to leaf switch) were created.
> With this information, we only increase the index when the algorithm has
> created routes to HCA.
> After applying this patch and measuring the link usage, we are at
> perfect equilibrium (L2<->L3 links are still not used but that is to be
> expected).

Great! I've actually seen this problem on real clusters, but couldn't
understand what was causing the imbalance.

See a couple of questions below.
> Signed-off-by: Nicolas Morey-Chaisemartin > > --- > opensm/opensm/osm_ucast_ftree.c | 23 ++++++++++++++--------- > 1 files changed, 14 insertions(+), 9 deletions(-) > > ------------------------------------------------------------------------ > > diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > index ebe6612..3474876 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item, > * assign-up-going-port-by-descending-down to r-port node (recursion) > */ > > -static void > +static int > __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, > IN ftree_sw_t * p_sw, > IN ftree_sw_t * p_prev_sw, > @@ -1932,21 +1932,23 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, > uint16_t i; > uint16_t j; > uint16_t k; > + uint8_t created_route=0; > > /* we shouldn't enter here if both real_lid and main_path are false */ > CL_ASSERT(is_real_lid || is_main_path); > > /* if there is no down-going ports */ > if (p_sw->down_port_groups_num == 0) > - return; > + return 1; Shouldn't it return 0? > - /* promote the index that indicates which group should we > - start with when going through all the downgoing groups */ > - p_sw->down_port_groups_idx = > - (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; > + /* If we are on a leaf switch we should be creating routes for real HCA */ > + /* This flag will be returned so upper layers will incrementent shift index */ > + if(p_sw->is_leaf == TRUE){ > + created_route=1; > + } The "is_leaf" flag will be TRUE only on leaf switches that have CNs connected to them. If we want to solve the problem for all routes (CNs, IO nodes, management nodes), the "created_route" flag should be updated elsewhere (see below). 
> /* foreach down-going port group (in indexing order) */ > - i = p_sw->down_port_groups_idx; > + i = (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; > for (k = 0; k < p_sw->down_port_groups_num; k++) { I think that since p_sw->down_port_groups_idx is promoted below, there is no need to increase the starting value of i. > + if(created_route) > + p_sw->down_port_groups_idx = > + (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; ... > + return created_route; > } /* __osm_ftree_fabric_route_upgoing_by_going_down() */ > > /***************************************************/ How about something like this: @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item, * assign-up-going-port-by-descending-down to r-port node (recursion) */ -static void +static boolean_t __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw, IN ftree_sw_t * p_prev_sw, @@ -1932,18 +1932,14 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, uint16_t i; uint16_t j; uint16_t k; + boolean_t created_route = FALSE; /* we shouldn't enter here if both real_lid and main_path are false */ CL_ASSERT(is_real_lid || is_main_path); /* if there is no down-going ports */ if (p_sw->down_port_groups_num == 0) - return; - - /* promote the index that indicates which group should we - start with when going through all the downgoing groups */ - p_sw->down_port_groups_idx = - (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; + return FALSE; /* foreach down-going port group (in indexing order) */ i = p_sw->down_port_groups_idx; @@ -1952,9 +1948,12 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, p_group = p_sw->down_port_groups[i]; i = (i + 1) % p_sw->down_port_groups_num; - /* Skip this port group unless it points to a switch */ - if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) + /* If this port group doesn't point to a switch, 
mark + that the route was created and skip to the next group */ + if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) { + created_route = TRUE; continue; + } if (p_prev_sw && (p_group->remote_base_lid == p_prev_sw->base_lid)) { @@ -2073,16 +2072,25 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, /* Recursion step: Assign upgoing ports by stepping down, starting on REMOTE switch */ - __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ - NULL, /* prev. position - NULL to mark that we went down and not up */ - target_lid, /* LID that we're routing to */ - target_rank, /* rank of the LID that we're routing to */ - is_real_lid, /* whether the target LID is real or dummy */ - is_main_path, /* whether this is path to HCA that should by tracked by counters */ - highest_rank_in_route); /* highest visited point in the tree before going down */ + created_route |= __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, + p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ + NULL, /* prev. 
position - NULL to mark that we went down and not up */ + target_lid, /* LID that we're routing to */ + target_rank, /* rank of the LID that we're routing to */ + is_real_lid, /* whether the target LID is real or dummy */ + is_main_path, /* whether this is path to HCA that should by tracked by counters */ + highest_rank_in_route); /* highest visited point in the tree before going down */ } /* done scanning all the down-going port groups */ + /* if the route was created, promote the index that + indicates which group should we start with when + going through all the downgoing groups */ + if (created_route) + p_sw->down_port_groups_idx = + (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; + + return created_route; } /* __osm_ftree_fabric_route_upgoing_by_going_down() */ /***************************************************/ From nicolas.morey-chaisemartin at ext.bull.net Wed Feb 4 02:37:43 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Wed, 04 Feb 2009 11:37:43 +0100 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c : Fixed bug on index port order incrementation In-Reply-To: <49896B9C.8040006@dev.mellanox.co.il> References: <4981DC18.9030400@ext.bull.net> <49896B9C.8040006@dev.mellanox.co.il> Message-ID: <49896FF7.8060908@ext.bull.net> Yevgeny Kliteynik wrote: > Hi Nicolas, > > Nicolas Morey Chaisemartin wrote: >> Hello, >> >> While doing some routing analysis on fat tree using ibsim we found a >> "bug" in the fat-tree algorithm. >> Problem happens with a 4 level Fat tree as below: >> >> >> L3 L3 >> ___________________|__|____________________ >> / / \ \ <= >> All the L2 are connected on 2 L3 switches >> L2-1 L2-2 L2-1 L2-2 >> / / \ \ >> <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some >> pruning. >> L1 L1 L1 L1 >> /|\ /|\ /|\ /|\ >> ==Fully mixed to L1== ==Fully mixed to L1== <=== We >> have multiple set. In each set, all L0 lead to all L1 of their set. 
>> >> L0 L0 L0 L0 >> / \ / \ / \ / \ >> CN CN .. CN CN .... CN CN .. CN CN >> >> >> To detail: >> We have a bunch of sets. Each set contains compute node, L0 and L1 >> switches. >> Plus a common top of L2 and L3 switches. >> >> In each set, there are groups of compute nodes. Each group is >> connected to a single L0 switch. >> In a given set, all L0 are connected to all L1. >> >> The Nth L1 of a set is connected to the Nth L2 and only to this one. >> (so through a L2, the Nth L1 can only see the Nth L1 of the other sets) >> All the L2 are connected to a couple of L3. >> >> >> If we dont put the L3. We have a perfectly equilibrated fat tree and >> well equilibrated routes. >> But when we add the L3, it introduce a huge difference. As it is not >> necessary, no route is going through L3 (which is fine). >> However 1/4 of L2->L1 routes is not used at all, 1/2 is half used and >> 1/4 is twice overused (compared to the equilibrate state). >> >> This comes from the down_port_groups_idx which is incremented each >> time the algorithm goes down through a node whether it creates routes >> to HCA (port != switch) >> or not. As route coming up from a L1 reaches only one L2, the >> algorithm goes through all the other L2 while going down, >> incrementing their index. >> Our case here is a bit specific but in a case where your L1 doesn't >> have full connectivity to all your L2, and another switch rank above, >> the problem may appear. >> >> To avoid this problem, I've changed the >> __osm_ftree_fabric_route_upgoing_by_going_down function so it returns >> a value to indicate if routes to HCA (in fact to leaf switch) were >> created. >> With this information, we only increase the index when the algorithm >> has created routes to HCA. >> After applying this patch and measuring the link usage, we are at >> perfect equilibrium (L2<->L3 links are still not used but that is to >> be expected). > > Great! 
> I've actually seen this problem on a real clusters, but
> couldn't understand what's cusing the lack of equilibrity.
>
> See couple of questions below.
>
>> Signed-off-by: Nicolas Morey-Chaisemartin
>>
>> ---
>>  opensm/opensm/osm_ucast_ftree.c |   23 ++++++++++++++---------
>>  1 files changed, 14 insertions(+), 9 deletions(-)
>>
>> ------------------------------------------------------------------------
>>
>> diff --git a/opensm/opensm/osm_ucast_ftree.c
>> b/opensm/opensm/osm_ucast_ftree.c
>> index ebe6612..3474876 100644
>> --- a/opensm/opensm/osm_ucast_ftree.c
>> +++ b/opensm/opensm/osm_ucast_ftree.c
>> @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN
>> cl_map_item_t * const p_map_item,
>>   * assign-up-going-port-by-descending-down to r-port node
>> (recursion)
>>   */
>>
>> -static void
>> +static int
>>  __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t *
>> p_ftree,
>>  					       IN ftree_sw_t * p_sw,
>>  					       IN ftree_sw_t * p_prev_sw,
>> @@ -1932,21 +1932,23 @@
>> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t *
>> p_ftree,
>>  	uint16_t i;
>>  	uint16_t j;
>>  	uint16_t k;
>> +	uint8_t created_route=0;
>>
>>  	/* we shouldn't enter here if both real_lid and main_path are
>> false */
>>  	CL_ASSERT(is_real_lid || is_main_path);
>>
>>  	/* if there is no down-going ports */
>>  	if (p_sw->down_port_groups_num == 0)
>> -		return;
>> +		return 1;
>
> Shouldn't it return 0?

Probably yes. I was thinking of the case where (taking the notation from
my scheme above) an L0 wouldn't have any CN (because they are shut down,
broken, or reserved for future extension). In that case, I think
returning 1 would smooth things a bit and not unbalance the network.
>> - /* promote the index that indicates which group should we >> - start with when going through all the downgoing groups */ >> - p_sw->down_port_groups_idx = >> - (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; >> + /* If we are on a leaf switch we should be creating routes for >> real HCA */ >> + /* This flag will be returned so upper layers will incrementent >> shift index */ >> + if(p_sw->is_leaf == TRUE){ >> + created_route=1; >> + } > > The "is_leaf" flag will be TRUE only on leaf switches that have CNs > connected to them. > If we want to solve the problem for all routes (CNs, IO nodes, > management nodes), > the "created_route" flag should be updated elsewhere (see below). I was not aware of this, so yes it should be done somewhere else (though it can also be done here). > >> /* foreach down-going port group (in indexing order) */ >> - i = p_sw->down_port_groups_idx; >> + i = (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; >> for (k = 0; k < p_sw->down_port_groups_num; k++) { > > I think that since p_sw->down_port_groups_idx is promoted below, > there is no need to increase the starting value of i. > I tried to but I had some problem (segfault probably due to the %p_sw->down_port_groups_num). If it works without incrementing, it's fine with me. >> + if(created_route) >> + p_sw->down_port_groups_idx = + >> (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; > ... 
>> + return created_route; >> } /* __osm_ftree_fabric_route_upgoing_by_going_down() */ >> >> /***************************************************/ > > How about something like this: > > @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN > cl_map_item_t * const p_map_item, > * assign-up-going-port-by-descending-down to r-port node > (recursion) > */ > > -static void > +static boolean_t > __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * > p_ftree, > IN ftree_sw_t * p_sw, > IN ftree_sw_t * p_prev_sw, > @@ -1932,18 +1932,14 @@ > __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * > p_ftree, > uint16_t i; > uint16_t j; > uint16_t k; > + boolean_t created_route = FALSE; > > /* we shouldn't enter here if both real_lid and main_path are > false */ > CL_ASSERT(is_real_lid || is_main_path); > > /* if there is no down-going ports */ > if (p_sw->down_port_groups_num == 0) > - return; > - > - /* promote the index that indicates which group should we > - start with when going through all the downgoing groups */ > - p_sw->down_port_groups_idx = > - (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; > + return FALSE; > > /* foreach down-going port group (in indexing order) */ > i = p_sw->down_port_groups_idx; > @@ -1952,9 +1948,12 @@ > __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * > p_ftree, > p_group = p_sw->down_port_groups[i]; > i = (i + 1) % p_sw->down_port_groups_num; > > - /* Skip this port group unless it points to a switch */ > - if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) > + /* If this port group doesn't point to a switch, mark > + that the route was created and skip to the next group */ > + if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) { > + created_route = TRUE; > continue; > + } > > if (p_prev_sw > && (p_group->remote_base_lid == p_prev_sw->base_lid)) { > @@ -2073,16 +2072,25 @@ > __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * > p_ftree, > > /* 
Recursion step: > Assign upgoing ports by stepping down, starting on REMOTE > switch */ > - __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, > p_remote_sw, /* remote switch - used as a route-upgoing alg. start > point */ > - NULL, /* prev. position - NULL > to mark that we went down and not up */ > - target_lid, /* LID that we're > routing to */ > - target_rank, /* rank of the LID > that we're routing to */ > - is_real_lid, /* whether the > target LID is real or dummy */ > - is_main_path, /* whether this > is path to HCA that should by tracked by counters */ > - highest_rank_in_route); /* > highest visited point in the tree before going down */ > + created_route |= > __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, > + p_remote_sw, /* remote switch - used as a > route-upgoing alg. start point */ > + NULL, /* prev. position - NULL to mark that we > went down and not up */ > + target_lid, /* LID that we're routing to */ > + target_rank, /* rank of the LID that we're routing to */ > + is_real_lid, /* whether the target LID is real or > dummy */ > + is_main_path, /* whether this is path to HCA that > should by tracked by counters */ > + highest_rank_in_route); /* highest visited point in > the tree before going down */ > } > /* done scanning all the down-going port groups */ > > + /* if the route was created, promote the index that > + indicates which group should we start with when > + going through all the downgoing groups */ > + if (created_route) > + p_sw->down_port_groups_idx = > + (p_sw->down_port_groups_idx + 1) % > p_sw->down_port_groups_num; > + > + return created_route; > } /* __osm_ftree_fabric_route_upgoing_by_going_down() */ > > /***************************************************/ > > > > > > That seems good. I'm going to think a bit more about the case where there are no downports. 
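The change discussed above makes the down-going routing step report whether it created a route, ORs in the result of each recursive call, and advances the switch's round-robin start index (down_port_groups_idx) only when a route was created. Below is a minimal user-space sketch of that control flow. The types `struct sw` and `struct group` and the function `route_going_down()` are hypothetical simplifications, not OpenSM's ftree structures. The real function also skips the switch it came from and performs the actual port assignment and counter updates, all of which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for ftree_sw_t and
 * ftree_port_group_t: only the fields needed for the index logic. */
struct sw;

struct group {
	bool remote_is_switch;  /* false: group leads to an end node (CA) */
	struct sw *remote;      /* valid only when remote_is_switch */
};

struct sw {
	struct group **down;    /* down-going port groups */
	unsigned down_num;      /* number of down-going groups */
	unsigned down_idx;      /* round-robin start index */
};

/* Walks the down-going groups starting at down_idx and returns true if
 * any route was created below this switch. The start index is promoted
 * only when a route was created, as in the proposed patch. */
static bool route_going_down(struct sw *s)
{
	bool created = false;
	unsigned i, k;

	assert(s != NULL);
	if (s->down_num == 0)
		return false;   /* no down-going ports: nothing routed */

	i = s->down_idx;
	for (k = 0; k < s->down_num; k++) {
		struct group *g = s->down[i];

		i = (i + 1) % s->down_num;
		if (!g->remote_is_switch) {
			created = true; /* group points at a non-switch node */
			continue;
		}
		/* Recursion step: propagate the child's result upward. */
		created |= route_going_down(g->remote);
	}

	if (created)
		s->down_idx = (s->down_idx + 1) % s->down_num;
	return created;
}
```

In this sketch a switch whose subtree produced no route keeps its current start index, so the next target begins from the same group; a switch with no down-going ports simply reports false, which is the case Nicolas says he still wants to think through.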
Best regards Nicolas From vlad at lists.openfabrics.org Wed Feb 4 03:11:13 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 4 Feb 2009 03:11:13 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090204-0200 daily build status Message-ID: <20090204111113.8A537E60DCF@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on 
ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From hal.rosenstock at gmail.com Wed Feb 4 04:29:36 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 4 Feb 2009 07:29:36 -0500 Subject: [ofa-general] Possible memory leak and null pointer dereference in local_completions() In-Reply-To: <1233689172.23327.155.camel@chromite.mv.qlogic.com> References: <1233689172.23327.155.camel@chromite.mv.qlogic.com> Message-ID: On Tue, Feb 3, 2009 at 2:26 PM, Ralph Campbell wrote: > I was doing some tests with different MAD packets and > then reading the infiniband/core/mad.c code. > > handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local > on the mad_agent_priv->local_work work queue with > local->mad_priv == NULL if device->process_mad() returns > IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and > (!ib_response_mad(&mad_priv->mad.mad) || > !mad_agent_priv->agent.recv_handler). > > In this case, local_completions() will be called with > local->mad_priv == NULL. The code does check for this > case and skips calling recv_mad_agent->agent.recv_handler(). > This means recv == 0 so kmem_cache_free() is called with a > NULL pointer. That could be fixed by changing the check for !recv prior to the kmem_cache_free there to a check for (!recv && local->mad_priv). > Even if local->mad_priv != NULL, I don't see how local->mad_priv > is freed when recv == 1. Thus, it appears to be a memory leak. For those cases, it's either freed in local_completions (as recv is set to 1 for local->mad_priv != NULL except when there is no mad recv agent but that is another bug (see below)) or earlier in the else clause of the IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY of handle_outgoing_dr_smp(). 
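The original code and the two patches in this exchange differ only in when `kmem_cache_free()` runs at the end of `local_completions()`. The user-space model below is hypothetical (`struct case_in` and the `frees_*()` helpers are not kernel code). It reduces each version to the two inputs that drive that decision, whether `local->mad_priv` is set and whether a receive MAD agent exists, so the cases can be compared side by side. It records only whether a free happens, not what the receive handler does with the buffer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the inputs to the cleanup at the end of
 * local_completions(); not the kernel structures. */
struct case_in {
	bool have_mad_priv;  /* local->mad_priv != NULL */
	bool have_agent;     /* local->recv_mad_agent != NULL */
};

/* Original code: recv = 1 only after the agent check passes;
 * "if (!recv)" then frees, even when mad_priv is NULL. */
static bool frees_original(struct case_in c)
{
	int recv = 0;

	if (c.have_mad_priv && c.have_agent)
		recv = 1;
	return !recv;
}

/* Hal's patch: recv = 1 before the agent check,
 * free guarded by "!recv && local->mad_priv". */
static bool frees_hal(struct case_in c)
{
	int recv = 0;

	if (c.have_mad_priv)
		recv = 1;
	return !recv && c.have_mad_priv;
}

/* Ralph's patch: no recv flag, free whenever mad_priv is set. */
static bool frees_ralph(struct case_in c)
{
	return c.have_mad_priv;
}

/* Compares the three versions on the three cases. */
static void compare_versions(void)
{
	const struct case_in null_priv  = { false, false };
	const struct case_in no_agent   = { true,  false };
	const struct case_in with_agent = { true,  true  };

	/* NULL mad_priv: only the original calls kmem_cache_free(). */
	assert(frees_original(null_priv));
	assert(!frees_hal(null_priv) && !frees_ralph(null_priv));

	/* mad_priv set, no receive agent. */
	assert(frees_original(no_agent));
	assert(!frees_hal(no_agent));
	assert(frees_ralph(no_agent));

	/* mad_priv set and passed to the receive agent's handler. */
	assert(!frees_original(with_agent));
	assert(!frees_hal(with_agent));
	assert(frees_ralph(with_agent));
}
```

The model does not settle who owns the buffer once the receive handler has run; that ownership question is exactly what Ralph and Hal are discussing.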
That's another issue that this points out where recv = 1 needs to be moved up in local_completions.

Would you try the untested patch below and see if it fixes the problem you found? Thanks.

-- Hal

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 5c54fc2..cca87e6 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2371,13 +2371,13 @@ static void local_completions(struct work_struct *work)
 		list_del(&local->completion_list);
 		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 		if (local->mad_priv) {
+			recv = 1;
 			recv_mad_agent = local->recv_mad_agent;
 			if (!recv_mad_agent) {
 				printk(KERN_ERR PFX "No receive MAD agent for lo
 				goto local_send_completion;
 			}
 
-			recv = 1;
 			/*
 			 * Defined behavior is to complete response
 			 * before request
@@ -2422,7 +2422,7 @@ local_send_completion:
 
 		spin_lock_irqsave(&mad_agent_priv->lock, flags);
 		atomic_dec(&mad_agent_priv->refcount);
-		if (!recv)
+		if (!recv && local->mad_priv)
 			kmem_cache_free(ib_mad_cache, local->mad_priv);
 		kfree(local);
 	}

> So, I'm proposing the following patch:
>
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> index 5c54fc2..93d80e5 100644
> --- a/drivers/infiniband/core/mad.c
> +++ b/drivers/infiniband/core/mad.c
> @@ -2356,7 +2356,6 @@ static void local_completions(struct work_struct *work)
>  	struct ib_mad_local_private *local;
>  	struct ib_mad_agent_private *recv_mad_agent;
>  	unsigned long flags;
> -	int recv = 0;
>  	struct ib_wc wc;
>  	struct ib_mad_send_wc mad_send_wc;
>  
> @@ -2377,7 +2376,6 @@ static void local_completions(struct work_struct *work)
>  				goto local_send_completion;
>  			}
>  
> -			recv = 1;
>  			/*
>  			 * Defined behavior is to complete response
>  			 * before request
> @@ -2422,7 +2420,7 @@ local_send_completion:
>  
>  		spin_lock_irqsave(&mad_agent_priv->lock, flags);
>  		atomic_dec(&mad_agent_priv->refcount);
> -		if (!recv)
> +		if (local->mad_priv)
>  			kmem_cache_free(ib_mad_cache, local->mad_priv);
>  		kfree(local);
>  	}
>
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From monis at Voltaire.COM Wed Feb 4 05:30:03 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 04 Feb 2009 15:30:03 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <200902040846.48370.jackm@dev.mellanox.co.il>
References: <200902031816.41784.jackm@dev.mellanox.co.il> <49888558.3050506@Voltaire.COM> <200902040846.48370.jackm@dev.mellanox.co.il>
Message-ID: <4989985B.6010707@Voltaire.COM>

> It was originally written without the path->valid check in the "if", and so was based on the path record
> being allocated within the "if". In this case, the path record was not yet inserted into the path list.
> When you added the "valid" processing, you did not take this into account.
>
> You need code something like the following:
>
> 	path = __path_find(dev, phdr->hwaddr + 4);
> 	if (!path || !path->valid) {
> 		int had_path = 0;
> 		if (!path)
> 			path = path_rec_create(dev, phdr->hwaddr + 4);
> 		else
> 			had_path = 1;
> 		if (path) {
> 			/* put pseudoheader back on for next time */
> 			skb_push(skb, sizeof *phdr);
> 			__skb_queue_tail(&path->queue, skb);
>
> 			if (path_rec_start(dev, path)) {
> 				if (had_path)
> 					/* detach from path list here under spinlock */
> 				spin_unlock(&priv->lock);
> 				path_free(dev, path);
> 				return;
> 			} else if (!had_path)
> 				__path_add(dev, path);
> 		} else {
> 			++dev->stats.tx_dropped;
> 			dev_kfree_skb_any(skb);
> 		}
>
> 		spin_unlock(&priv->lock);
> 		return;
> 	}

I hope I'm not missing something, but __path_rec() checks for path existence and returns -EEXIST if the path is not added.
		ret = memcmp(path->pathrec.dgid.raw, tpath->pathrec.dgid.raw,
			     sizeof (union ib_gid));
		if (ret < 0)
			n = &pn->rb_left;
		else if (ret > 0)
			n = &pn->rb_right;
		else
			return -EEXIST;
	}

so the code you suggest may improve performance but I don't see how it solves the bug.

From monis at Voltaire.COM Wed Feb 4 05:33:38 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 04 Feb 2009 15:33:38 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <49888558.3050506@Voltaire.COM>
References: <200902031816.41784.jackm@dev.mellanox.co.il> <49888558.3050506@Voltaire.COM>
Message-ID: <49899932.5060507@Voltaire.COM>

Yossi Etigin wrote:
> I think it comes from unicast_arp_send.
>
> Consider this scenario:
> - paths are flushed (opensm up/down).
> - unicast_arp_send() is called with a path in priv->path_list.
>   path->valid is 0.
> - path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails -
>   no sm ah (yet)
>   (see the prints just before the panic).
> - unicast_arp_send() calls path_free().
> - path memory is overwritten.
> - __ipoib_dev_flush() is called again.
> - mark_paths_invalid() tries to iterate over priv->path_list and gets
>   a kernel panic because path->list became invalid.
>
> --Yossi
>

I agree with Yossi's analysis.
Isn't the fix just as simple as this?
void ipoib_mark_paths_invalid(struct net_device *dev)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);
	struct ipoib_path *path, *tp;

	spin_lock_irq(&priv->lock);

	list_for_each_entry_safe(path, tp, &priv->path_list, list) {
		ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n",
			be16_to_cpu(path->pathrec.dlid),
			IPOIB_GID_ARG(path->pathrec.dgid));
-		path->valid = 0;
+		if (path)
+			path->valid = 0;
	}

	spin_unlock_irq(&priv->lock);
}

From jackm at dev.mellanox.co.il Wed Feb 4 05:45:22 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 4 Feb 2009 15:45:22 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <49899932.5060507@Voltaire.COM>
References: <200902031816.41784.jackm@dev.mellanox.co.il> <49888558.3050506@Voltaire.COM> <49899932.5060507@Voltaire.COM>
Message-ID: <200902041545.22662.jackm@dev.mellanox.co.il>

On Wednesday 04 February 2009 15:33, Moni Shoua wrote:
> Isn't the fix just as simple as this?
>
> void ipoib_mark_paths_invalid(struct net_device *dev)
> {
> 	struct ipoib_dev_priv *priv = netdev_priv(dev);
> 	struct ipoib_path *path, *tp;
>
> 	spin_lock_irq(&priv->lock);
>
> 	list_for_each_entry_safe(path, tp, &priv->path_list, list) {
> 		ipoib_dbg(priv, "mark path LID 0x%04x GID " IPOIB_GID_FMT " invalid\n",
> 			be16_to_cpu(path->pathrec.dlid),
> 			IPOIB_GID_ARG(path->pathrec.dgid));
> -		path->valid = 0;
> +		if (path)
> +			path->valid = 0;
> 	}
>
> 	spin_unlock_irq(&priv->lock);
> }
>
I doubt it. You are leaving a deleted path record as part of the path list.
This is list corruption (since the list pointers themselves are part of the
path record structure -- what if this returned storage is re-allocated?).
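Jack's objection rests on the list linkage being embedded in the path record itself: a record freed while still linked leaves its neighbours pointing into released memory. A NULL check inside the loop cannot catch this, because list_for_each_entry_safe() derives each entry pointer from the list node it is visiting, so a freed entry will not show up as NULL. The sketch below is a minimal user-space illustration built on a hand-rolled circular list; `struct list_node`, `struct path_rec` and all helper names are hypothetical and are not `<linux/list.h>`. It shows the safe order, unlink first and then free, after which traversal of the remaining records stays valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hand-rolled circular doubly linked list, modelled loosely on the
 * kernel's struct list_head (hypothetical, user-space only). */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head->prev = head;
}

static void list_append(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_unlink(struct list_node *n)
{
	assert(n->next && n->prev);     /* must currently be linked */
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

/* Hypothetical path record with the linkage embedded in it. */
struct path_rec {
	int valid;
	struct list_node list;
};

#define node_to_path(n) \
	((struct path_rec *)((char *)(n) - offsetof(struct path_rec, list)))

/* Correct teardown order: unlink from the list, then free. */
static void path_remove_and_free(struct path_rec *p)
{
	list_unlink(&p->list);
	free(p);
}

/* Analogue of ipoib_mark_paths_invalid(): visit every linked record. */
static void mark_all_invalid(struct list_node *head)
{
	for (struct list_node *it = head->next; it != head; it = it->next)
		node_to_path(it)->valid = 0;
}

static int list_count(const struct list_node *head)
{
	int n = 0;

	for (const struct list_node *it = head->next; it != head; it = it->next)
		n++;
	return n;
}
```

The kernel's list_del() performs the same unlinking step (and poisons the pointers afterwards), which is why the fix Jack proposes next calls list_del() and rb_erase() before path_free() when the record was already in the list.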
I think the correct fix (after your previous posted comment) is:

	path = __path_find(dev, phdr->hwaddr + 4);
	if (!path || !path->valid) {
		int had_path = 0;
		if (!path)
			path = path_rec_create(dev, phdr->hwaddr + 4);
		else
			had_path = 1;
		if (path) {
			/* put pseudoheader back on for next time */
			skb_push(skb, sizeof *phdr);
			__skb_queue_tail(&path->queue, skb);

			if (path_rec_start(dev, path)) {
				if (had_path) {
					list_del(&path->list);
					rb_erase(&path->rb_node, &priv->path_tree);
				}
				spin_unlock_irqrestore(&priv->lock, flags);
				path_free(dev, path);
				return;
			} else
				__path_add(dev, path);
		} else {
			++dev->stats.tx_dropped;
			dev_kfree_skb_any(skb);
		}

		spin_unlock_irqrestore(&priv->lock, flags);
		return;
	}

My only question here is:
Do we have to worry about netif_tx_lock_bh(dev) (as taken in ipoib_flush_paths)?
(If we do, we have a problem).

- Jack

From monis at Voltaire.COM Wed Feb 4 07:45:28 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 04 Feb 2009 17:45:28 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <200902041545.22662.jackm@dev.mellanox.co.il>
References: <200902031816.41784.jackm@dev.mellanox.co.il> <49888558.3050506@Voltaire.COM> <49899932.5060507@Voltaire.COM> <200902041545.22662.jackm@dev.mellanox.co.il>
Message-ID: <4989B818.102@Voltaire.COM>

> I doubt it. You are leaving a deleted path record as part of the path list.
> This is list corruption (since the list pointers themselves are part of the
> path record structure -- what if this returned storage is re-allocated?).
>
You are right.
> I think the correct fix (after your previous posted comment) is:
> 	path = __path_find(dev, phdr->hwaddr + 4);
> 	if (!path || !path->valid) {
> 		int had_path = 0;
> 		if (!path)
> 			path = path_rec_create(dev, phdr->hwaddr + 4);
> 		else
> 			had_path = 1;
> 		if (path) {
> 			/* put pseudoheader back on for next time */
> 			skb_push(skb, sizeof *phdr);
> 			__skb_queue_tail(&path->queue, skb);
>
> 			if (path_rec_start(dev, path)) {
> 				if (had_path) {
> 					list_del(&path->list);
> 					rb_erase(&path->rb_node,
> 						 &priv->path_tree);
> 				}
> 				spin_unlock_irqrestore(&priv->lock, flags);
> 				path_free(dev, path);
> 				return;
> 			} else
> 				__path_add(dev, path);
> 		} else {
> 			++dev->stats.tx_dropped;
> 			dev_kfree_skb_any(skb);
> 		}
>
> 		spin_unlock_irqrestore(&priv->lock, flags);
> 		return;
> 	}
>
> My only question here is:
> Do we have to worry about netif_tx_lock_bh(dev) (as taken in ipoib_flush_paths)?
> (If we do, we have a problem).
>
Besides the locking issue, which I hadn't thought about yet, this fix looks like the right thing to do.
But what if we leave the path without freeing it, even if path_rec_start() fails?
This would leave an invalid path in path_list, which, as I conclude, is not a forbidden state
(after all, this is the state in which the function was called).
In this way, I think that we don't have to worry about locks.
and the code will look like this:

	if (!path || !path->valid) {
		if (!path)
			path = path_rec_create(dev, phdr->hwaddr + 4);
		if (path) {
			/* put pseudoheader back on for next time */
			skb_push(skb, sizeof *phdr);
			__skb_queue_tail(&path->queue, skb);

			if (!path->query && path_rec_start(dev, path)) {
				spin_unlock_irqrestore(&priv->lock, flags);
-				path_free(dev, path);
				return;
			} else
				__path_add(dev, path);
		} else {
			++priv->stats.tx_dropped;
			dev_kfree_skb_any(skb);
		}

		spin_unlock_irqrestore(&priv->lock, flags);
		return;
	}

From jackm at dev.mellanox.co.il Wed Feb 4 08:03:57 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 4 Feb 2009 18:03:57 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <4989B818.102@Voltaire.COM>
References: <200902031816.41784.jackm@dev.mellanox.co.il> <200902041545.22662.jackm@dev.mellanox.co.il> <4989B818.102@Voltaire.COM>
Message-ID: <200902041803.57457.jackm@dev.mellanox.co.il>

On Wednesday 04 February 2009 17:45, Moni Shoua wrote:
> Besides the locking issue, which I hadn't thought about yet, this fix looks like the right thing to do.
> But what if we leave the path without freeing it, even if path_rec_start() fails?
> This would leave an invalid path in path_list, which, as I conclude, is not a forbidden state
> (after all, this is the state in which the function was called).
> In this way, I think that we don't have to worry about locks.
>
> and the code will look like this:
>
> 	if (!path || !path->valid) {
> 		if (!path)
> 			path = path_rec_create(dev, phdr->hwaddr + 4);
> 		if (path) {
> 			/* put pseudoheader back on for next time */
> 			skb_push(skb, sizeof *phdr);
> 			__skb_queue_tail(&path->queue, skb);
>
> 			if (!path->query && path_rec_start(dev, path)) {
> 				spin_unlock_irqrestore(&priv->lock, flags);
> -				path_free(dev, path);
> 				return;
> 			} else
> 				__path_add(dev, path);
> 		} else {
> 			++priv->stats.tx_dropped;
> 			dev_kfree_skb_any(skb);
> 		}
>
> 		spin_unlock_irqrestore(&priv->lock, flags);
> 		return;
> 	}
>
Still need some correction. If the path did not exist previously (i.e, !path = TRUE,
and, below, had_path = 0), then need to call path_free or we will have a leak.
Maybe the correct patch is:

	path = __path_find(dev, phdr->hwaddr + 4);
	if (!path || !path->valid) {
		int had_path = 0;
		if (!path)
			path = path_rec_create(dev, phdr->hwaddr + 4);
		else
			had_path = 1;
		if (path) {
			/* put pseudoheader back on for next time */
			skb_push(skb, sizeof *phdr);
			__skb_queue_tail(&path->queue, skb);

			if (!path->query && path_rec_start(dev, path)) {
				spin_unlock_irqrestore(&priv->lock, flags);
				if (!had_path)
					path_free(dev, path);
				return;
			} else
				__path_add(dev, path);
		} else {
			++dev->stats.tx_dropped;
			dev_kfree_skb_any(skb);
		}

From halr at obsidianresearch.com Wed Feb 4 08:14:48 2009
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Wed, 04 Feb 2009 09:14:48 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] opensm/include/iba/ib_types.h: Add xmit_wait for PortCounters
Message-ID: <1233764088.8992.458.camel@bertha1.edm.orcorp.ca>

Sasha,

Trivial patch to ib_types.h to add xmit_wait field to PortCounters.
Also, updated a reference from IBA 1.2 to 1.2.1.

-- Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-opensm-include-iba-ib_types.h-Add-xmit_wait-for-Por.patch Type: application/mbox Size: 1123 bytes Desc: not available URL: From halr at obsidianresearch.com Wed Feb 4 08:15:07 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Wed, 04 Feb 2009 09:15:07 -0700 Subject: [ofa-general] [PATCH 0/3] OpenSM/PerfMgr improvements Message-ID: <1233764107.8992.459.camel@bertha1.edm.orcorp.ca> Sasha, Following patch series improves PerfMgr: 1 - cosmetic cleanups 2 - Move ESP0 determination into __malloc_node 3 - Move ESP0 determination into monitored node These patches are based on previous PerfMgr patches sent over the last couple days. -- Hal From halr at obsidianresearch.com Wed Feb 4 08:15:10 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Wed, 04 Feb 2009 09:15:10 -0700 Subject: [ofa-general] [PATCH 1/3] opensm/PerfMgr: Mainly cosmetic changes Message-ID: <1233764110.8992.460.camel@bertha1.edm.orcorp.ca> Sasha, Cosmetic changes to PerfMgr: Eliminated unneeded extra parentheses Made some formatting consistent Simplified some internal names Also, removed inline from __init_monitored_nodes declaration -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-opensm-PerfMgr-Mainly-cosmetic-changes.patch Type: application/mbox Size: 19659 bytes Desc: not available URL: From halr at obsidianresearch.com Wed Feb 4 08:15:15 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Wed, 04 Feb 2009 09:15:15 -0700 Subject: [ofa-general] [PATCH 2/3] opensm/osm_perfmgr_db.(h c): Move ESP0 determination into __malloc_node Message-ID: <1233764115.8992.461.camel@bertha1.edm.orcorp.ca> Sasha, This patch moves the ESP0 determination once per db_node allocation rather than in bad_node_port in the PerfMgr. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0003-opensm-osm_perfmgr_db.-h-c-Move-ESP0-determination.patch Type: application/mbox Size: 5609 bytes Desc: not available URL: From halr at obsidianresearch.com Wed Feb 4 08:15:19 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Wed, 04 Feb 2009 09:15:19 -0700 Subject: [ofa-general] [PATCH 3/3] opensm/PerfMgr: Move ESP0 determination in monitored node Message-ID: <1233764119.8992.462.camel@bertha1.edm.orcorp.ca> Sasha, This patch moves the ESP0 determination into monitored node and copies into db_node when needed. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: 0004-opensm-PerfMgr-Move-ESP0-determination-in-monitored.patch Type: application/mbox Size: 5405 bytes Desc: not available URL: From monis at Voltaire.COM Wed Feb 4 08:16:15 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Wed, 04 Feb 2009 18:16:15 +0200 Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing In-Reply-To: <200902041803.57457.jackm@dev.mellanox.co.il> References: <200902031816.41784.jackm@dev.mellanox.co.il> <200902041545.22662.jackm@dev.mellanox.co.il> <4989B818.102@Voltaire.COM> <200902041803.57457.jackm@dev.mellanox.co.il> Message-ID: <4989BF4F.1060707@Voltaire.COM> > Still need some correction. If the path did not exist previously (i.e, !path = TRUE, > and, below, had_path = 0), then need to call path_free or we will have a leak. 
>
True

> Maybe the correct patch is:
> 	path = __path_find(dev, phdr->hwaddr + 4);
> 	if (!path || !path->valid) {
> 		int had_path = 0;
> 		if (!path)
> 			path = path_rec_create(dev, phdr->hwaddr + 4);
> 		else
> 			had_path = 1;
> 		if (path) {
> 			/* put pseudoheader back on for next time */
> 			skb_push(skb, sizeof *phdr);
> 			__skb_queue_tail(&path->queue, skb);
>
> 			if (!path->query && path_rec_start(dev, path)) {
> 				spin_unlock_irqrestore(&priv->lock, flags);
> 				if (!had_path)
> 					path_free(dev, path);
> 				return;
> 			} else
> 				__path_add(dev, path);
> 		} else {
> 			++dev->stats.tx_dropped;
> 			dev_kfree_skb_any(skb);
> 		}

This one looks good to me.
Are you going to make a patch and submit it?

I think it would be best if you run the same test on the patched IPoIB before submission.
Do you agree?

thanks

From jackm at dev.mellanox.co.il Wed Feb 4 08:25:16 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 4 Feb 2009 18:25:16 +0200
Subject: [ofa-general] Re: Kernel panic in IPoIB stability testing
In-Reply-To: <4989BF4F.1060707@Voltaire.COM>
References: <200902031816.41784.jackm@dev.mellanox.co.il> <200902041803.57457.jackm@dev.mellanox.co.il> <4989BF4F.1060707@Voltaire.COM>
Message-ID: <200902041825.16354.jackm@dev.mellanox.co.il>

On Wednesday 04 February 2009 18:16, Moni Shoua wrote:
> This one looks good to me.
> Are you going to make a patch and submit it?
>
> I think it would be best if you run the same test on the patched IPoIB before submission.
> Do you agree?
>
I'll do a patch tomorrow. We'll run the test over the weekend.
I'll submit it on Sunday if all is well.
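The patch the thread converges on separates two ownership cases when path_rec_start() fails. A record found by __path_find() is already linked into path_list and the path tree, so it is left there, still invalid. A record just returned by path_rec_create() was never added, so nothing else references it, and it must be freed to avoid the leak Jack points out. Otherwise (a query is already pending, or path_rec_start() succeeds), __path_add() is called, and for a record that is already present it returns -EEXIST without re-adding it, which is the check Moni quotes. The decision table can be sketched as a pure function. The enum and function names below are hypothetical; the sketch models only the control flow, not the locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of the unicast send path in the agreed patch. */
enum path_outcome {
	PATH_ADDED,        /* new record queued and linked via __path_add() */
	PATH_KEPT,         /* existing record stays linked (add returns -EEXIST) */
	PATH_FREED,        /* new record never linked: path_free() */
	PATH_KEPT_INVALID  /* existing record left in path_list, still invalid */
};

/* had_path:      record came from __path_find(), not path_rec_create()
 * query_pending: path->query is already set
 * start_fails:   path_rec_start() would return an error */
static enum path_outcome send_outcome(bool had_path, bool query_pending,
				      bool start_fails)
{
	/* path_rec_start() is only attempted when no query is pending. */
	if (!query_pending && start_fails)
		return had_path ? PATH_KEPT_INVALID : PATH_FREED;
	return had_path ? PATH_KEPT : PATH_ADDED;
}

/* A record that was never linked is the only one ever freed here. */
static void check_outcomes(void)
{
	assert(send_outcome(false, false, true) == PATH_FREED);
	assert(send_outcome(true, false, true) == PATH_KEPT_INVALID);
	assert(send_outcome(true, true, true) == PATH_KEPT);
	assert(send_outcome(false, false, false) == PATH_ADDED);
}
```

Because the existing record is never freed on this path, the list and tree never hold a pointer to released storage, which is the corruption that led to the original panic.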
- Jack From sean.hefty at intel.com Wed Feb 4 08:41:35 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 4 Feb 2009 08:41:35 -0800 Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: <49893FAF.3090007@voltaire.com> References: <49893FAF.3090007@voltaire.com> Message-ID: <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> I was mixing up ib_sa_get_mcmember_rec and ib_sa_mcmember_rec_query. I'm following you now. There may be some way to defer setting the qkey if it's not available when binding, but how does allowing the bind to proceed help? Without the qkey, the QP is basically unusable. - Sean From sashak at voltaire.com Wed Feb 4 09:43:33 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 19:43:33 +0200 Subject: [ofa-general] Re: [PATCHv2] libibmad/(mad.h fields.c): Add support for PerfMgt ClassPortInfo In-Reply-To: <1233601115.8992.380.camel@bertha1.edm.orcorp.ca> References: <1233601115.8992.380.camel@bertha1.edm.orcorp.ca> Message-ID: <20090204174333.GT11874@sashak.voltaire.com> On 11:58 Mon 02 Feb , Hal Rosenstock wrote: > > Attached is v2 of a patch to add support for PerfMgt ClassPortInfo attribute > into libibmad. Applied. Thanks. Sasha From sashak at voltaire.com Wed Feb 4 09:43:54 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 19:43:54 +0200 Subject: [ofa-general] Re: [PATCHv2] ibsim/sim_mad.c: Add sim support for PerfMgt ClassPortInfo In-Reply-To: <1233601126.8992.381.camel@bertha1.edm.orcorp.ca> References: <1233601126.8992.381.camel@bertha1.edm.orcorp.ca> Message-ID: <20090204174354.GU11874@sashak.voltaire.com> On 11:58 Mon 02 Feb , Hal Rosenstock wrote: > Sasha, > > Attached is v2 of a patch to add simulator support for PerfMgt ClassPortInfo > (subsequent to previous libibmad patch). Applied. Thanks. 
Sasha From sashak at voltaire.com Wed Feb 4 10:14:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 20:14:21 +0200 Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as typedefs for cleaner function interfaces In-Reply-To: <20090202185425.729a80b3.weiny2@llnl.gov> References: <20090202185425.729a80b3.weiny2@llnl.gov> Message-ID: <20090204181421.GV11874@sashak.voltaire.com> Hi Ira, On 18:54 Mon 02 Feb , Ira Weiny wrote: > Begining to clean up the libibmad interface. > > Ira > > > From 7e2f639905af92a6d4466d42af2e3e65bd717ffb Mon Sep 17 00:00:00 2001 > From: weiny2 at llnl.gov > Date: Mon, 2 Feb 2009 10:21:18 -0800 > Subject: [PATCH] Declare some enums as typedefs for cleaner function interfaces I don't understand how enum typedefing makes things cleaner - actually this will enforce me explicitly to verify an actual type in header files. Sometimes typedefs could help with porting, but it is not the case here. Sasha > > > Signed-off-by: weiny2 at llnl.gov > --- > libibmad/include/infiniband/mad.h | 38 ++++++++++++++++++------------------ > libibmad/src/fields.c | 22 ++++++++++---------- > libibmad/src/resolve.c | 10 ++++---- > 3 files changed, 35 insertions(+), 35 deletions(-) > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > index 9ff4a3e..f235ab0 100644 > --- a/libibmad/include/infiniband/mad.h > +++ b/libibmad/include/infiniband/mad.h > @@ -203,7 +203,7 @@ typedef struct ib_field { > ib_mad_dump_fn *def_dump_fn; > } ib_field_t; > > -enum MAD_FIELDS { > +typedef enum MAD_FIELDS { > IB_NO_FIELD, > > IB_GID_PREFIX_F, > @@ -525,7 +525,7 @@ enum MAD_FIELDS { > IB_GUID_GUID0_F, > > IB_FIELD_LAST_ /* must be last */ > -}; > +} mad_field_t; > > /* > * SA RMPP section > @@ -595,21 +595,21 @@ typedef struct ib_vendor_call { > #define MAD_DEF_RETRIES 3 > #define MAD_DEF_TIMEOUT_MS 1000 > > -enum { > +typedef enum { > IB_DEST_LID, > IB_DEST_DRPATH, > IB_DEST_GUID, > IB_DEST_DRSLID, > -}; > +} 
mad_dest_t; > > -enum { > +typedef enum { > IB_NODE_CA = 1, > IB_NODE_SWITCH, > IB_NODE_ROUTER, > NODE_RNIC, > > IB_NODE_MAX = NODE_RNIC > -}; > +} mad_node_type_t; > > /******************************************************************************/ > > @@ -631,20 +631,20 @@ static inline int ib_portid_set(ib_portid_t * portid, int lid, int qp, int qkey) > } > > /* fields.c */ > -MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field); > -MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field, > +MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field); > +MAD_EXPORT void mad_set_field(void *buf, int base_offs, mad_field_t field, > uint32_t val); > /* field must be byte aligned */ > -MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field); > -MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field, > +MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field); > +MAD_EXPORT void mad_set_field64(void *buf, int base_offs, mad_field_t field, > uint64_t val); > -MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val); > -MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val); > -MAD_EXPORT void mad_decode_field(uint8_t * buf, int field, void *val); > -MAD_EXPORT void mad_encode_field(uint8_t * buf, int field, void *val); > -MAD_EXPORT int mad_print_field(int field, const char *name, void *val); > -MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val); > -MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val); > +MAD_EXPORT void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val); > +MAD_EXPORT void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val); > +MAD_EXPORT void mad_decode_field(uint8_t * buf, mad_field_t field, void *val); > +MAD_EXPORT void mad_encode_field(uint8_t * buf, mad_field_t field, void *val); > +MAD_EXPORT int 
mad_print_field(mad_field_t field, const char *name, void *val); > +MAD_EXPORT char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val); > +MAD_EXPORT char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val); > > /* mad.c */ > MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, > @@ -729,7 +729,7 @@ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout); > MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid, > ib_portid_t * sm_id, int timeout); > MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > - int dest_type, ib_portid_t * sm_id); > + mad_dest_t dest, ib_portid_t * sm_id); > MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, > ibmad_gid_t * gid); > > @@ -737,7 +737,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); > int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > ib_portid_t * sm_id, int timeout, const void *srcport); > int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > - int dest_type, ib_portid_t * sm_id, > + mad_dest_t dest, ib_portid_t * sm_id, > const void *srcport); > int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > const void *srcport); > diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c > index d5a1eb4..d435a2f 100644 > --- a/libibmad/src/fields.c > +++ b/libibmad/src/fields.c > @@ -479,37 +479,37 @@ static void _get_array(void *buf, int base_offs, const ib_field_t * f, > memcpy(val, (uint8_t *) buf + base_offs + bitoffs / 8, f->bitlen / 8); > } > > -uint32_t mad_get_field(void *buf, int base_offs, int field) > +uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field) > { > return _get_field(buf, base_offs, ib_mad_f + field); > } > > -void mad_set_field(void *buf, int base_offs, int field, uint32_t val) > +void mad_set_field(void *buf, int base_offs, mad_field_t field, uint32_t val) > { > _set_field(buf, 
base_offs, ib_mad_f + field, val); > } > > -uint64_t mad_get_field64(void *buf, int base_offs, int field) > +uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field) > { > return _get_field64(buf, base_offs, ib_mad_f + field); > } > > -void mad_set_field64(void *buf, int base_offs, int field, uint64_t val) > +void mad_set_field64(void *buf, int base_offs, mad_field_t field, uint64_t val) > { > _set_field64(buf, base_offs, ib_mad_f + field, val); > } > > -void mad_set_array(void *buf, int base_offs, int field, void *val) > +void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val) > { > _set_array(buf, base_offs, ib_mad_f + field, val); > } > > -void mad_get_array(void *buf, int base_offs, int field, void *val) > +void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val) > { > _get_array(buf, base_offs, ib_mad_f + field, val); > } > > -void mad_decode_field(uint8_t * buf, int field, void *val) > +void mad_decode_field(uint8_t * buf, mad_field_t field, void *val) > { > const ib_field_t *f = ib_mad_f + field; > > @@ -528,7 +528,7 @@ void mad_decode_field(uint8_t * buf, int field, void *val) > _get_array(buf, 0, f, val); > } > > -void mad_encode_field(uint8_t * buf, int field, void *val) > +void mad_encode_field(uint8_t * buf, mad_field_t field, void *val) > { > const ib_field_t *f = ib_mad_f + field; > > @@ -602,21 +602,21 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val, > valsz ? 
valsz : ALIGN(f->bitlen, 8) / 8); > } > > -int mad_print_field(int field, const char *name, void *val) > +int mad_print_field(mad_field_t field, const char *name, void *val) > { > if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) > return -1; > return _mad_print_field(ib_mad_f + field, name, val, 0); > } > > -char *mad_dump_field(int field, char *buf, int bufsz, void *val) > +char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val) > { > if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) > return 0; > return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val); > } > > -char *mad_dump_val(int field, char *buf, int bufsz, void *val) > +char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val) > { > if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) > return 0; > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c > index b62360b..faac1f9 100644 > --- a/libibmad/src/resolve.c > +++ b/libibmad/src/resolve.c > @@ -92,7 +92,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > } > > int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > - int dest_type, ib_portid_t * sm_id, > + mad_dest_t dest, ib_portid_t * sm_id, > const void *srcport) > { > uint64_t guid; > @@ -101,7 +101,7 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > ib_portid_t selfportid = { 0 }; > int selfport = 0; > > - switch (dest_type) { > + switch (dest) { > case IB_DEST_LID: > lid = strtol(addr_str, 0, 0); > if (!IB_LID_VALID(lid)) > @@ -136,16 +136,16 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > return 0; > > default: > - IBWARN("bad dest_type %d", dest_type); > + IBWARN("bad dest %d", dest); > } > > return -1; > } > > -int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, int dest_type, > +int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, mad_dest_t dest, > ib_portid_t * sm_id) > { > - return ib_resolve_portid_str_via(portid, addr_str, dest_type, > + 
return ib_resolve_portid_str_via(portid, addr_str, dest, > sm_id, NULL); > } > > -- > 1.5.4.5 > From jgunthorpe at obsidianresearch.com Wed Feb 4 10:20:23 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 4 Feb 2009 11:20:23 -0700 Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as typedefs for cleaner function interfaces In-Reply-To: <20090204181421.GV11874@sashak.voltaire.com> References: <20090202185425.729a80b3.weiny2@llnl.gov> <20090204181421.GV11874@sashak.voltaire.com> Message-ID: <20090204182023.GP7618@obsidianresearch.com> On Wed, Feb 04, 2009 at 08:14:21PM +0200, Sasha Khapyorsky wrote: > I don't understand how enum typedefing makes things cleaner - actually > this will enforce me explicitly to verify an actual type in header > files. Sometimes typedefs could help with porting, but it is not the > case here. Not typedefing per say, but passing an enum through an int is not that great. You don't need the typedefs to do this, just 'enum MAD_FIELDS' for instance will do. Jason From sashak at voltaire.com Wed Feb 4 10:25:20 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 20:25:20 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h: osm_node_get_num_physp description fix In-Reply-To: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca> References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca> Message-ID: <20090204182520.GW11874@sashak.voltaire.com> Hi Hal, On 07:57 Tue 03 Feb , Hal Rosenstock wrote: > > Trivial description change to osm_node_get_num_physp. It makes some troubles for me to comment over attachments... :( In this comment line: +* Returns the number of physical ports (+1) for this node. "(+1)" will not be true for switch nodes. 
Sasha From sashak at voltaire.com Wed Feb 4 10:27:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 20:27:25 +0200 Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as typedefs for cleaner function interfaces In-Reply-To: <20090204182023.GP7618@obsidianresearch.com> References: <20090202185425.729a80b3.weiny2@llnl.gov> <20090204181421.GV11874@sashak.voltaire.com> <20090204182023.GP7618@obsidianresearch.com> Message-ID: <20090204182725.GX11874@sashak.voltaire.com> On 11:20 Wed 04 Feb , Jason Gunthorpe wrote: > On Wed, Feb 04, 2009 at 08:14:21PM +0200, Sasha Khapyorsky wrote: > > > I don't understand how enum typedefing makes things cleaner - actually > > this will enforce me explicitly to verify an actual type in header > > files. Sometimes typedefs could help with porting, but it is not the > > case here. > > Not typedefing per say, but passing an enum through an int is not that > great. You don't need the typedefs to do this, just 'enum MAD_FIELDS' > for instance will do. Yes, that would be fine to do. Sasha From weiny2 at llnl.gov Wed Feb 4 10:30:05 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 4 Feb 2009 10:30:05 -0800 Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as typedefs for cleaner function interfaces In-Reply-To: <20090204181421.GV11874@sashak.voltaire.com> References: <20090202185425.729a80b3.weiny2@llnl.gov> <20090204181421.GV11874@sashak.voltaire.com> Message-ID: <20090204103005.4ef9256a.weiny2@llnl.gov> On Wed, 4 Feb 2009 20:14:21 +0200 Sasha Khapyorsky wrote: > Hi Ira, > > On 18:54 Mon 02 Feb , Ira Weiny wrote: > > Begining to clean up the libibmad interface. 
> > > > Ira > > > > > > From 7e2f639905af92a6d4466d42af2e3e65bd717ffb Mon Sep 17 00:00:00 2001 > > From: weiny2 at llnl.gov > > Date: Mon, 2 Feb 2009 10:21:18 -0800 > > Subject: [PATCH] Declare some enums as typedefs for cleaner function interfaces > > I don't understand how enum typedefing makes things cleaner - actually > this will enforce me explicitly to verify an actual type in header > files. Sometimes typedefs could help with porting, but it is not the > case here. Yes, this will force you to use the correct type. But I was looking at it from the user standpoint. If I give the user a uint8_t buffer and tell them to use this library to decode fields how do they know which values to pass in this call. uint32_t mad_get_field(void *buf, int base_offs, int field); Using mad_field_t or even enum MAD_FIELDS allows one to use tags/cscope to find the valid values for that parameter easily. Grepping will work but is still cumbersome. Again, I am trying to write a library which makes it easier for someone who might not be familiar with IB to extract diagnostic data. I understand you wanting the decoding of the data to be more flexible and abstract but we should make the interface for decoding that data easier to use. I feel the following patch does this. Would you prefer to use: uint32_t mad_get_field(void *buf, int base_offs, enum MAD_FIELDS field); ? 
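Ira's proposed prototype can be illustrated with a self-contained C sketch. Everything here is hypothetical: the three-member enum MAD_FIELDS is a trimmed stand-in for the real one in infiniband/mad.h, and example_field_name() is not a libibmad function. Typing the parameter as enum MAD_FIELDS rather than int tells the reader (and tags/cscope) where the valid values are defined; it does not make a C compiler reject plain integers, so a range check modelled on the one in mad_print_field() is still kept.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily trimmed stand-in for libibmad's enum MAD_FIELDS. */
enum MAD_FIELDS {
	IB_NO_FIELD,
	IB_GID_PREFIX_F,
	IB_GUID_GUID0_F,
	IB_FIELD_LAST_		/* must be last */
};

/* Illustrative per-field names, indexed by the enum value. */
static const char *const example_field_names[IB_FIELD_LAST_] = {
	[IB_NO_FIELD] = "NoField",
	[IB_GID_PREFIX_F] = "GidPrefix",
	[IB_GUID_GUID0_F] = "Guid0",
};

/*
 * Hypothetical helper.  The enum-typed parameter documents the set of
 * valid values, but C still converts any integer to the enum implicitly,
 * so the runtime range check (same shape as in mad_print_field()) stays.
 */
const char *example_field_name(enum MAD_FIELDS field)
{
	if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_)
		return NULL;
	return example_field_names[field];
}
```

For example, example_field_name(IB_GUID_GUID0_F) returns "Guid0", while IB_NO_FIELD and IB_FIELD_LAST_ are both rejected with NULL.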
Ira > > Sasha > > > > > > > Signed-off-by: weiny2 at llnl.gov > > --- > > libibmad/include/infiniband/mad.h | 38 ++++++++++++++++++------------------ > > libibmad/src/fields.c | 22 ++++++++++---------- > > libibmad/src/resolve.c | 10 ++++---- > > 3 files changed, 35 insertions(+), 35 deletions(-) > > > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > > index 9ff4a3e..f235ab0 100644 > > --- a/libibmad/include/infiniband/mad.h > > +++ b/libibmad/include/infiniband/mad.h > > @@ -203,7 +203,7 @@ typedef struct ib_field { > > ib_mad_dump_fn *def_dump_fn; > > } ib_field_t; > > > > -enum MAD_FIELDS { > > +typedef enum MAD_FIELDS { > > IB_NO_FIELD, > > > > IB_GID_PREFIX_F, > > @@ -525,7 +525,7 @@ enum MAD_FIELDS { > > IB_GUID_GUID0_F, > > > > IB_FIELD_LAST_ /* must be last */ > > -}; > > +} mad_field_t; > > > > /* > > * SA RMPP section > > @@ -595,21 +595,21 @@ typedef struct ib_vendor_call { > > #define MAD_DEF_RETRIES 3 > > #define MAD_DEF_TIMEOUT_MS 1000 > > > > -enum { > > +typedef enum { > > IB_DEST_LID, > > IB_DEST_DRPATH, > > IB_DEST_GUID, > > IB_DEST_DRSLID, > > -}; > > +} mad_dest_t; > > > > -enum { > > +typedef enum { > > IB_NODE_CA = 1, > > IB_NODE_SWITCH, > > IB_NODE_ROUTER, > > NODE_RNIC, > > > > IB_NODE_MAX = NODE_RNIC > > -}; > > +} mad_node_type_t; > > > > /******************************************************************************/ > > > > @@ -631,20 +631,20 @@ static inline int ib_portid_set(ib_portid_t * portid, int lid, int qp, int qkey) > > } > > > > /* fields.c */ > > -MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field); > > -MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field, > > +MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field); > > +MAD_EXPORT void mad_set_field(void *buf, int base_offs, mad_field_t field, > > uint32_t val); > > /* field must be byte aligned */ > > -MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field); > > 
-MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field, > > +MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field); > > +MAD_EXPORT void mad_set_field64(void *buf, int base_offs, mad_field_t field, > > uint64_t val); > > -MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val); > > -MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val); > > -MAD_EXPORT void mad_decode_field(uint8_t * buf, int field, void *val); > > -MAD_EXPORT void mad_encode_field(uint8_t * buf, int field, void *val); > > -MAD_EXPORT int mad_print_field(int field, const char *name, void *val); > > -MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val); > > -MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val); > > +MAD_EXPORT void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val); > > +MAD_EXPORT void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val); > > +MAD_EXPORT void mad_decode_field(uint8_t * buf, mad_field_t field, void *val); > > +MAD_EXPORT void mad_encode_field(uint8_t * buf, mad_field_t field, void *val); > > +MAD_EXPORT int mad_print_field(mad_field_t field, const char *name, void *val); > > +MAD_EXPORT char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val); > > +MAD_EXPORT char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val); > > > > /* mad.c */ > > MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, > > @@ -729,7 +729,7 @@ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout); > > MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid, > > ib_portid_t * sm_id, int timeout); > > MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > > - int dest_type, ib_portid_t * sm_id); > > + mad_dest_t dest, ib_portid_t * sm_id); > > MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, > > ibmad_gid_t * gid); 
> > > > @@ -737,7 +737,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); > > int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > ib_portid_t * sm_id, int timeout, const void *srcport); > > int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > > - int dest_type, ib_portid_t * sm_id, > > + mad_dest_t dest, ib_portid_t * sm_id, > > const void *srcport); > > int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > > const void *srcport); > > diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c > > index d5a1eb4..d435a2f 100644 > > --- a/libibmad/src/fields.c > > +++ b/libibmad/src/fields.c > > @@ -479,37 +479,37 @@ static void _get_array(void *buf, int base_offs, const ib_field_t * f, > > memcpy(val, (uint8_t *) buf + base_offs + bitoffs / 8, f->bitlen / 8); > > } > > > > -uint32_t mad_get_field(void *buf, int base_offs, int field) > > +uint32_t mad_get_field(void *buf, int base_offs, mad_field_t field) > > { > > return _get_field(buf, base_offs, ib_mad_f + field); > > } > > > > -void mad_set_field(void *buf, int base_offs, int field, uint32_t val) > > +void mad_set_field(void *buf, int base_offs, mad_field_t field, uint32_t val) > > { > > _set_field(buf, base_offs, ib_mad_f + field, val); > > } > > > > -uint64_t mad_get_field64(void *buf, int base_offs, int field) > > +uint64_t mad_get_field64(void *buf, int base_offs, mad_field_t field) > > { > > return _get_field64(buf, base_offs, ib_mad_f + field); > > } > > > > -void mad_set_field64(void *buf, int base_offs, int field, uint64_t val) > > +void mad_set_field64(void *buf, int base_offs, mad_field_t field, uint64_t val) > > { > > _set_field64(buf, base_offs, ib_mad_f + field, val); > > } > > > > -void mad_set_array(void *buf, int base_offs, int field, void *val) > > +void mad_set_array(void *buf, int base_offs, mad_field_t field, void *val) > > { > > _set_array(buf, base_offs, ib_mad_f + field, val); > > } > > > > -void 
mad_get_array(void *buf, int base_offs, int field, void *val) > > +void mad_get_array(void *buf, int base_offs, mad_field_t field, void *val) > > { > > _get_array(buf, base_offs, ib_mad_f + field, val); > > } > > > > -void mad_decode_field(uint8_t * buf, int field, void *val) > > +void mad_decode_field(uint8_t * buf, mad_field_t field, void *val) > > { > > const ib_field_t *f = ib_mad_f + field; > > > > @@ -528,7 +528,7 @@ void mad_decode_field(uint8_t * buf, int field, void *val) > > _get_array(buf, 0, f, val); > > } > > > > -void mad_encode_field(uint8_t * buf, int field, void *val) > > +void mad_encode_field(uint8_t * buf, mad_field_t field, void *val) > > { > > const ib_field_t *f = ib_mad_f + field; > > > > @@ -602,21 +602,21 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val, > > valsz ? valsz : ALIGN(f->bitlen, 8) / 8); > > } > > > > -int mad_print_field(int field, const char *name, void *val) > > +int mad_print_field(mad_field_t field, const char *name, void *val) > > { > > if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) > > return -1; > > return _mad_print_field(ib_mad_f + field, name, val, 0); > > } > > > > -char *mad_dump_field(int field, char *buf, int bufsz, void *val) > > +char *mad_dump_field(mad_field_t field, char *buf, int bufsz, void *val) > > { > > if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) > > return 0; > > return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val); > > } > > > > -char *mad_dump_val(int field, char *buf, int bufsz, void *val) > > +char *mad_dump_val(mad_field_t field, char *buf, int bufsz, void *val) > > { > > if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) > > return 0; > > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c > > index b62360b..faac1f9 100644 > > --- a/libibmad/src/resolve.c > > +++ b/libibmad/src/resolve.c > > @@ -92,7 +92,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > } > > > > int ib_resolve_portid_str_via(ib_portid_t * 
portid, char *addr_str, > > - int dest_type, ib_portid_t * sm_id, > > + mad_dest_t dest, ib_portid_t * sm_id, > > const void *srcport) > > { > > uint64_t guid; > > @@ -101,7 +101,7 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > > ib_portid_t selfportid = { 0 }; > > int selfport = 0; > > > > - switch (dest_type) { > > + switch (dest) { > > case IB_DEST_LID: > > lid = strtol(addr_str, 0, 0); > > if (!IB_LID_VALID(lid)) > > @@ -136,16 +136,16 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > > return 0; > > > > default: > > - IBWARN("bad dest_type %d", dest_type); > > + IBWARN("bad dest %d", dest); > > } > > > > return -1; > > } > > > > -int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, int dest_type, > > +int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, mad_dest_t dest, > > ib_portid_t * sm_id) > > { > > - return ib_resolve_portid_str_via(portid, addr_str, dest_type, > > + return ib_resolve_portid_str_via(portid, addr_str, dest, > > sm_id, NULL); > > } > > > > -- > > 1.5.4.5 > > From weiny2 at llnl.gov Wed Feb 4 10:30:54 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 4 Feb 2009 10:30:54 -0800 Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as typedefs for cleaner function interfaces In-Reply-To: <20090204182725.GX11874@sashak.voltaire.com> References: <20090202185425.729a80b3.weiny2@llnl.gov> <20090204181421.GV11874@sashak.voltaire.com> <20090204182023.GP7618@obsidianresearch.com> <20090204182725.GX11874@sashak.voltaire.com> Message-ID: <20090204103054.177aa6e2.weiny2@llnl.gov> On Wed, 4 Feb 2009 20:27:25 +0200 Sasha Khapyorsky wrote: > On 11:20 Wed 04 Feb , Jason Gunthorpe wrote: > > On Wed, Feb 04, 2009 at 08:14:21PM +0200, Sasha Khapyorsky wrote: > > > > > I don't understand how enum typedefing makes things cleaner - actually > > > this will enforce me explicitly to verify an actual type in header > > > files. 
Sometimes typedefs could help with porting, but it is not the > > > case here. > > > > Not typedefing per say, but passing an enum through an int is not that > > great. You don't need the typedefs to do this, just 'enum MAD_FIELDS' > > for instance will do. > > Yes, that would be fine to do. I will redo the patch with 'enum MAD_FIELDS'. Ira > > Sasha From hal.rosenstock at gmail.com Wed Feb 4 10:41:03 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 4 Feb 2009 13:41:03 -0500 Subject: Re: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h: osm_node_get_num_physp description fix In-Reply-To: <20090204182520.GW11874@sashak.voltaire.com> References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca> <20090204182520.GW11874@sashak.voltaire.com> Message-ID: Sasha, On Wed, Feb 4, 2009 at 1:25 PM, Sasha Khapyorsky wrote: > Hi Hal, > > On 07:57 Tue 03 Feb , Hal Rosenstock wrote: >> >> Trivial description change to osm_node_get_num_physp. > > It makes some troubles for me to comment over attachments... :( > > In this comment line: > > +* Returns the number of physical ports (+1) for this node. > > "(+1)" will not be true for switch nodes. Are you sure about that ? It's not what I see regardless of whether base or enhanced SP0.
-- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Wed Feb 4 11:00:23 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 21:00:23 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c: Increase size of memory allocation in __collect_guids In-Reply-To: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca> References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca> Message-ID: <20090204190023.GY11874@sashak.voltaire.com> On 07:57 Tue 03 Feb , Hal Rosenstock wrote: > > Patch to increase size of monitored node in > osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual > port number. There are couple of validations like (port > p_mon_node->redir_tbl_size) in osm_perfmgr.c. Would it be correct after proposed change? Sasha From sashak at voltaire.com Wed Feb 4 11:02:56 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 21:02:56 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h: osm_node_get_num_physp description fix In-Reply-To: References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca> <20090204182520.GW11874@sashak.voltaire.com> Message-ID: <20090204190256.GZ11874@sashak.voltaire.com> On 13:41 Wed 04 Feb , Hal Rosenstock wrote: > Sasha, > > On Wed, Feb 4, 2009 at 1:25 PM, Sasha Khapyorsky wrote: > > Hi Hal, > > > > On 07:57 Tue 03 Feb , Hal Rosenstock wrote: > >> > >> Trivial description change to osm_node_get_num_physp. > > > > It makes some troubles for me to comment over attachments... :( > > > > In this comment line: > > > > +* Returns the number of physical ports (+1) for this node. > > > > "(+1)" will not be true for switch nodes. > > Are you sure about that ? It's not what I see regardless of whether > base or enhanced SP0. 
For switch it will be an actual number of allocated physical ports (struct osm_physp) - port 0 plus number of external ports. For non switch nodes entry '0' is not used. Sasha From yosefe at Voltaire.COM Wed Feb 4 11:04:54 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Wed, 04 Feb 2009 21:04:54 +0200 Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> References: <49893FAF.3090007@voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> Message-ID: <4989E6D6.5030109@Voltaire.COM> How about this patch? If no QKey - QP creation (and other stuff that need QKey) fail. However, rdma_resolve_addr() succeeds. --- When doing rdma_resolve_addr() and relevant port is down, the function fails and rdma_cm id is not bound to the device. Therefore, application does not have device handle and cannot wait for the port to become active. The function fails because ipoib is not joined to the multicast group and therefore sa does not have a multicast record to take a qkey from. The proposed patch is to make lazy qkey resolution - cma_set_qkey will set id_priv->qkey if it was not set, and will be called just before the qkey is really required.
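The lazy resolution described above amounts to a cached value that is filled in on first use. A minimal userspace sketch, assuming a QKey of 0 means "not resolved yet" (as the patch's if (id_priv->qkey) test implies); struct example_id, example_lookup_qkey() and the QKey value are hypothetical stand-ins for the rdma_cm id and the SA MCMemberRecord query, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of an rdma_cm id: qkey == 0 means "not resolved yet". */
struct example_id {
	uint32_t qkey;
	int port_up;		/* stands in for "SA has the multicast record" */
	int lookups;		/* counts resolver calls, for illustration */
};

/* Hypothetical resolver standing in for the SA query; fails while the port is down. */
static int example_lookup_qkey(struct example_id *id, uint32_t *qkey)
{
	id->lookups++;
	if (!id->port_up)
		return -EAGAIN;
	*qkey = 0xb1bu;		/* arbitrary illustrative value */
	return 0;
}

/*
 * Lazy resolution: address resolution can succeed without a QKey; the
 * lookup runs the first time a caller actually needs the value and the
 * result is cached.  Every path returns an int, including the early exit.
 */
int example_set_qkey(struct example_id *id)
{
	uint32_t qkey;
	int ret;

	if (id->qkey)
		return 0;	/* already resolved */
	ret = example_lookup_qkey(id, &qkey);
	if (!ret)
		id->qkey = qkey;
	return ret;
}
```

While the port is down the lookup fails and nothing is cached, so a later call retries; once it succeeds, further calls return the cached QKey without another query.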
Signed-off-by: Yossi Etigin --- drivers/infiniband/core/cma.c | 41 +++++++++++++++++++++++++++-------------- 1 file changed, 27 insertions(+), 14 deletions(-) Index: b/drivers/infiniband/core/cma.c =================================================================== --- a/drivers/infiniband/core/cma.c 2009-02-04 20:40:20.000000000 +0200 +++ b/drivers/infiniband/core/cma.c 2009-02-04 20:57:59.000000000 +0200 @@ -296,21 +296,25 @@ static void cma_detach_from_dev(struct r id_priv->cma_dev = NULL; } -static int cma_set_qkey(struct ib_device *device, u8 port_num, - enum rdma_port_space ps, - struct rdma_dev_addr *dev_addr, u32 *qkey) +static int cma_set_qkey(struct rdma_id_private *id_priv) { struct ib_sa_mcmember_rec rec; int ret = 0; - switch (ps) { + if (id_priv->qkey) + return; + + switch (id_priv->id.ps) { case RDMA_PS_UDP: - *qkey = RDMA_UDP_QKEY; + id_priv->qkey = RDMA_UDP_QKEY; break; case RDMA_PS_IPOIB: - ib_addr_get_mgid(dev_addr, &rec.mgid); - ret = ib_sa_get_mcmember_rec(device, port_num, &rec.mgid, &rec); - *qkey = be32_to_cpu(rec.qkey); + ib_addr_get_mgid(&id_priv->id.route.addr.dev_addr, &rec.mgid); + ret = ib_sa_get_mcmember_rec(id_priv->id.device, + id_priv->id.port_num, &rec.mgid, + &rec); + if (!ret) + id_priv->qkey = be32_to_cpu(rec.qkey); break; default: break; @@ -340,12 +344,7 @@ static int cma_acquire_dev(struct rdma_i ret = ib_find_cached_gid(cma_dev->device, &gid, &id_priv->id.port_num, NULL); if (!ret) { - ret = cma_set_qkey(cma_dev->device, - id_priv->id.port_num, - id_priv->id.ps, dev_addr, - &id_priv->qkey); - if (!ret) - cma_attach_to_dev(id_priv, cma_dev); + cma_attach_to_dev(id_priv, cma_dev); break; } } @@ -577,6 +576,10 @@ static int cma_ib_init_qp_attr(struct rd *qp_attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT; if (cma_is_ud_ps(id_priv->id.ps)) { + ret = cma_set_qkey(id_priv); + if (ret) + return ret; + qp_attr->qkey = id_priv->qkey; *qp_attr_mask |= IB_QP_QKEY; } else { @@ -2167,6 +2170,12 @@ static int 
cma_sidr_rep_handler(struct i event.status = ib_event->param.sidr_rep_rcvd.status; break; } + ret = cma_set_qkey(id_priv); + if (ret) { + event.event = RDMA_CM_EVENT_ADDR_ERROR; + event.status = -EINVAL; + break; + } if (id_priv->qkey != rep->qkey) { event.event = RDMA_CM_EVENT_UNREACHABLE; event.status = -EINVAL; @@ -2446,10 +2455,14 @@ static int cma_send_sidr_rep(struct rdma const void *private_data, int private_data_len) { struct ib_cm_sidr_rep_param rep; + int ret; memset(&rep, 0, sizeof rep); rep.status = status; if (status == IB_SIDR_SUCCESS) { + ret = cma_set_qkey(id_priv); + if (ret) + return ret; rep.qp_num = id_priv->qp_num; rep.qkey = id_priv->qkey; } -- --Yossi From sashak at voltaire.com Wed Feb 4 11:15:32 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 21:15:32 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: In bad_node_port, allow queries on enhanced SP0 In-Reply-To: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca> References: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca> Message-ID: <20090204191523.GA11874@sashak.voltaire.com> On 07:57 Tue 03 Feb , Hal Rosenstock wrote: > > Patch to osm_perfmgr_db.c to only error port 0 queries when not enhanced > SP0. This: + osm_node = osm_get_node_by_guid(pm->subn, cl_hton64(node->node_guid)); + if (!osm_node) + return (PERFMGR_EVENT_DB_GUIDNOTFOUND); + if ((!(osm_node_get_type(osm_node) == IB_NODE_TYPE_SWITCH) || + !osm_node->sw || + !ib_switch_info_is_enhanced_port0(&osm_node->sw->switch_info)) && + (port == 0)) + return (PERFMGR_EVENT_DB_PORTNOTFOUND); (osm_get_node_by_guid()) is expensive operation. If you only need to determine port 0 type - store it as part of struct monitored_node structure. Another (even more universal) approach would be to keep there a reference to related osm_node object. 
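Sasha's first suggestion, recording the port 0 type in struct monitored_node when the node is collected, could look roughly like the sketch below. The struct, its fields and example_bad_node_port() are hypothetical and only mirror the condition in the quoted patch: port 0 is rejected unless the node is a switch with enhanced switch port 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, trimmed monitored-node record.  The port 0 capability is
 * captured once when the node is collected, so later checks need no
 * GUID-to-node lookup.
 */
struct example_monitored_node {
	uint64_t guid;
	uint8_t num_ports;	/* highest valid external port number */
	bool esp0;		/* switch with enhanced switch port 0 */
};

/* Returns true if counters should not be queried on this port. */
bool example_bad_node_port(const struct example_monitored_node *node,
			   uint8_t port)
{
	if (port > node->num_ports)
		return true;
	/* Same rule as the quoted patch: port 0 only on enhanced SP0 switches. */
	if (port == 0 && !node->esp0)
		return true;
	return false;
}
```

The per-query cost becomes two comparisons instead of a GUID lookup. The alternative Sasha mentions, keeping a pointer to the osm_node, would avoid copying the flag but means that pointer must stay valid for as long as the record does.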
Sasha From sashak at voltaire.com Wed Feb 4 11:29:14 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 21:29:14 +0200 Subject: [ofa-general] Re: [PATCH] libibmad: Declare some enums as typedefs for cleaner function interfaces In-Reply-To: <20090204103005.4ef9256a.weiny2@llnl.gov> References: <20090202185425.729a80b3.weiny2@llnl.gov> <20090204181421.GV11874@sashak.voltaire.com> <20090204103005.4ef9256a.weiny2@llnl.gov> Message-ID: <20090204192914.GB11874@sashak.voltaire.com> On 10:30 Wed 04 Feb , Ira Weiny wrote: > > > > I don't understand how enum typedefing makes things cleaner - actually > > this will enforce me explicitly to verify an actual type in header > > files. Sometimes typedefs could help with porting, but it is not the > > case here. > > Yes, this will force you to use the correct type. Not "typedef" will do it, but proper prototypes. > Again, I am trying to write a library which makes it easier for someone who > might not be familiar with IB to extract diagnostic data. I understand you > wanting the decoding of the data to be more flexible and abstract but we should > make the interface for decoding that data easier to use. I feel the following > patch does this. > > Would you prefer to use: > > uint32_t mad_get_field(void *buf, int base_offs, enum MAD_FIELDS field); > > ? Yes, this would be correct and clear. Sasha From sashak at voltaire.com Wed Feb 4 11:38:34 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 21:38:34 +0200 Subject: [ofa-general] Re: [PATCH] libibmad/src/dump.c fix dump functions for big endian machines In-Reply-To: <49894B05.1090608@gmail.com> References: <49894B05.1090608@gmail.com> Message-ID: <20090204193834.GC11874@sashak.voltaire.com> On 10:00 Wed 04 Feb , Eli Dorfman (Voltaire) wrote: > fix dump functions for big endian machines > > Signed-off-by: Eli Dorfman Applied. Thanks. 
Sasha From sashak at voltaire.com Wed Feb 4 11:42:51 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 21:42:51 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/include/iba/ib_types.h: Add xmit_wait for PortCounters In-Reply-To: <1233764088.8992.458.camel@bertha1.edm.orcorp.ca> References: <1233764088.8992.458.camel@bertha1.edm.orcorp.ca> Message-ID: <20090204194251.GD11874@sashak.voltaire.com> On 09:14 Wed 04 Feb , Hal Rosenstock wrote: > > Trivial path to ib_types.h to add xmit_wait field to PortCounters. Also, > updated a reference from IBA 1.2 to 1.2.1. Applied, Thanks. Sasha From sashak at voltaire.com Wed Feb 4 11:56:28 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 21:56:28 +0200 Subject: [ofa-general] Re: [PATCH 1/3] opensm/PerfMgr: Mainly cosmetic changes In-Reply-To: <1233764110.8992.460.camel@bertha1.edm.orcorp.ca> References: <1233764110.8992.460.camel@bertha1.edm.orcorp.ca> Message-ID: <20090204195628.GE11874@sashak.voltaire.com> On 09:15 Wed 04 Feb , Hal Rosenstock wrote: > Sasha, > > Cosmetic changes to PerfMgr: > Eliminated unneeded extra parentheses > Made some formatting consistent > Simplified some internal names > Also, removed inline from __init_monitored_nodes declaration Applied. Thanks. Sasha From hal.rosenstock at gmail.com Wed Feb 4 11:54:41 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 4 Feb 2009 14:54:41 -0500 Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: In bad_node_port, allow queries on enhanced SP0 In-Reply-To: <20090204191523.GA11874@sashak.voltaire.com> References: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca> <20090204191523.GA11874@sashak.voltaire.com> Message-ID: On Wed, Feb 4, 2009 at 2:15 PM, Sasha Khapyorsky wrote: > On 07:57 Tue 03 Feb , Hal Rosenstock wrote: >> >> Patch to osm_perfmgr_db.c to only error port 0 queries when not enhanced >> SP0.
> > This: > > + osm_node = osm_get_node_by_guid(pm->subn, cl_hton64(node->node_guid)); > + if (!osm_node) > + return (PERFMGR_EVENT_DB_GUIDNOTFOUND); > + if ((!(osm_node_get_type(osm_node) == IB_NODE_TYPE_SWITCH) || > + !osm_node->sw || > + !ib_switch_info_is_enhanced_port0(&osm_node->sw->switch_info)) && > + (port == 0)) > + return (PERFMGR_EVENT_DB_PORTNOTFOUND); > > (osm_get_node_by_guid()) is expensive operation. If you only need to > determine port 0 type - store it as part of struct monitored_node > structure. Another (even more universal) approach would be to keep there > a reference to related osm_node object. This was done later in the patch series. -- Hal > Sasha From ralph.campbell at qlogic.com Wed Feb 4 11:58:05 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 04 Feb 2009 11:58:05 -0800 Subject: [ofa-general] Possible memory leak and null pointer dereference in local_completions() In-Reply-To: References: <1233689172.23327.155.camel@chromite.mv.qlogic.com> Message-ID: <1233777486.23327.172.camel@chromite.mv.qlogic.com> On Wed, 2009-02-04 at 04:29 -0800, Hal Rosenstock wrote: > On Tue, Feb 3, 2009 at 2:26 PM, Ralph Campbell > wrote: > > I was doing some tests with different MAD packets and > > then reading the infiniband/core/mad.c code. > > > > handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local > > on the mad_agent_priv->local_work work queue with > > local->mad_priv == NULL if device->process_mad() returns > > IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and > > (!ib_response_mad(&mad_priv->mad.mad) || > > !mad_agent_priv->agent.recv_handler). > > > > In this case, local_completions() will be called with > > local->mad_priv == NULL. The code does check for this
The code does check for this > > case and skips calling recv_mad_agent->agent.recv_handler(). > > This means recv == 0 so kmem_cache_free() is called with a > > NULL pointer. > > That could be fixed by changing the check for !recv prior to the > kmem_cache_free there to a check for (!recv && local->mad_priv). This is what we did to continue making progress so I know it works. > > Even if local->mad_priv != NULL, I don't see how local->mad_priv > > is freed when recv == 1. Thus, it appears to be a memory leak. > > For those cases, it's either freed in local_completions (as recv is > set to 1 for local->mad_priv != NULL except when there is no mad recv > agent but that is another bug (see below)) or earlier in the else > clause of the IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY of > handle_outgoing_dr_smp(). That's another issue that this points out > where recv = 1 needs to be moved up in local_completions. The other problem I noticed with setting recv = 1, is that recv = 0 is outside the while (!list_empty) loop so it is never reset back to zero. I'm not really following you about recv = 1 needs to be moved up in local_completions. What I was really looking for was a confirmation that the original code had a memory leak. I don't see any reason to special case the call to kmem_cache_free(). It seems to me that it is needed any time local->mad_priv != NULL. The NULL pointer bug is easily fixed in a number of different ways. > Would you try the untested patch below and see if it fixes the problem > you found ? Thanks. We are in the middle of moving our office so I won't be able to reproduce this until next week. 
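The buffer-ownership question under debate can be modelled outside the kernel. The sketch below is a hypothetical user-space model and does not reproduce the mad.c code: `struct fake_local`, `fake_recv_handler()` and `model_local_completions()` are invented names. It also assumes that a receive handler takes ownership of any buffer it is given. That is exactly the point under discussion, not a confirmed fact. Under that assumption, the model illustrates the two properties raised above: a NULL `mad_priv` never reaches the free path, and the free/no-free flag is reset for each list entry instead of being initialised once before the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the kernel's
 * struct ib_mad_local_private or MAD buffer types. */
struct fake_local {
	void *mad_priv;      /* may be NULL when no local response was built */
	bool has_recv_agent; /* models local->recv_mad_agent != NULL */
};

static int frees;      /* buffers released by the completion loop itself */
static int handed_off; /* buffers passed to the modelled receive handler */

/* Assumption for this model: the receive handler takes ownership of
 * the buffer it is given and releases it. */
static void fake_recv_handler(void *buf)
{
	handed_off++;
	free(buf);
}

static void model_local_completions(struct fake_local *list, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct fake_local *local = &list[i];
		/* Declared inside the loop so it is reset for every entry,
		 * unlike a flag initialised once before the loop. */
		bool free_mad = false;

		if (local->mad_priv) {
			if (local->has_recv_agent)
				fake_recv_handler(local->mad_priv); /* ownership moves */
			else
				free_mad = true; /* no one else will release it */
		}

		/* ... the send completion would be reported here ... */

		if (free_mad) {
			assert(local->mad_priv != NULL); /* never free NULL */
			free(local->mad_priv);
			frees++;
		}
		local->mad_priv = NULL; /* the entry no longer owns a buffer */
	}
}
```

The test uses three entries: one with a NULL buffer, one with a buffer and a receive agent, and one with a buffer but no agent. The loop frees exactly one buffer itself and hands one to the handler. Whether the real receive path owns the buffer is what testing of the posted patches would need to confirm.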
> -- Hal > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 5c54fc2..cca87e6 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -2371,13 +2371,13 @@ static void local_completions(struct work_struct *work) > list_del(&local->completion_list); > spin_unlock_irqrestore(&mad_agent_priv->lock, flags); > if (local->mad_priv) { > + recv = 1; > recv_mad_agent = local->recv_mad_agent; > if (!recv_mad_agent) { > printk(KERN_ERR PFX "No receive MAD agent for lo > goto local_send_completion; > } > > - recv = 1; > /* > * Defined behavior is to complete response > * before request > @@ -2422,7 +2422,7 @@ local_send_completion: > > spin_lock_irqsave(&mad_agent_priv->lock, flags); > atomic_dec(&mad_agent_priv->refcount); > - if (!recv) > + if (!recv && local->mad_priv) > kmem_cache_free(ib_mad_cache, local->mad_priv); > kfree(local); > } > > > So, I'm proposing the following patch: > > > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > > index 5c54fc2..93d80e5 100644 > > --- a/drivers/infiniband/core/mad.c > > +++ b/drivers/infiniband/core/mad.c > > @@ -2356,7 +2356,6 @@ static void local_completions(struct work_struct *work) > > struct ib_mad_local_private *local; > > struct ib_mad_agent_private *recv_mad_agent; > > unsigned long flags; > > - int recv = 0; > > struct ib_wc wc; > > struct ib_mad_send_wc mad_send_wc; > > > > @@ -2377,7 +2376,6 @@ static void local_completions(struct work_struct *work) > > goto local_send_completion; > > } > > > > - recv = 1; > > /* > > * Defined behavior is to complete response > > * before request > > @@ -2422,7 +2420,7 @@ local_send_completion: > > > > spin_lock_irqsave(&mad_agent_priv->lock, flags); > > atomic_dec(&mad_agent_priv->refcount); > > - if (!recv) > > + if (local->mad_priv) > > kmem_cache_free(ib_mad_cache, local->mad_priv); > > kfree(local); > > } > > > > > > _______________________________________________ > > general mailing list > 
> general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From hal.rosenstock at gmail.com Wed Feb 4 12:03:33 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 4 Feb 2009 15:03:33 -0500 Subject: Re: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h: osm_node_get_num_physp description fix In-Reply-To: <20090204190256.GZ11874@sashak.voltaire.com> References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca> <20090204182520.GW11874@sashak.voltaire.com> <20090204190256.GZ11874@sashak.voltaire.com> Message-ID: On Wed, Feb 4, 2009 at 2:02 PM, Sasha Khapyorsky wrote: > On 13:41 Wed 04 Feb , Hal Rosenstock wrote: >> Sasha, >> >> On Wed, Feb 4, 2009 at 1:25 PM, Sasha Khapyorsky wrote: >> > Hi Hal, >> > >> > On 07:57 Tue 03 Feb , Hal Rosenstock wrote: >> >> >> >> Trivial description change to osm_node_get_num_physp. >> > >> > It makes some troubles for me to comment over attachments... :( >> > >> > In this comment line: >> > >> > +* Returns the number of physical ports (+1) for this node. >> > >> > "(+1)" will not be true for switch nodes. >> >> Are you sure about that ? It's not what I see regardless of whether >> base or enhanced SP0. > > For switch it will be an actual number of allocated physical ports > (struct osm_physp) - port 0 olus number of external ports. For non > switch nodes entry '0' is not used. Right. In my terms, physical is another name for an external port and port 0 is not a physical (external) port so I think we're quibbling about words. What do you think it should say ?
-- Hal > Sasha > From swise at opengridcomputing.com Wed Feb 4 12:20:45 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 04 Feb 2009 14:20:45 -0600 Subject: [ofa-general] Re: dapl attribute bug In-Reply-To: References: <49871E6A.9000901@opengridcomputing.com> Message-ID: <4989F89D.8020905@opengridcomputing.com> Davis, Arlin R wrote: > > > >> The DAPL dat_ia_attr->max_lmr_block_size is a u32, yet the dapl code >> maps this to the linux ib_device_attr->max_mr_size which is u64. >> >> This causes dapltest to fail in some cases when running over chelsio >> which sets max_mr_size to 0x100000000 (4GB). The dapl code truncates >> the value to 0. See dapl/openib_cma/dapl_ib_util.c. >> >> I'm not sure what the fix should be, but maybe the dapl code >> should set >> anything over 32 bits to 0xffffffff? >> >> > > This attribute changed with DAT 2.0 to match the 32-bit ibv_sge > length field. Since there are no direct max lmr segments mappings > I will need add some checks when setting max_lmr_block_size from > max_mr_size. Thanks. > > -arlin I'll test your fix when its ready. Lemme know. Steve. From swise at opengridcomputing.com Wed Feb 4 12:26:12 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 04 Feb 2009 14:26:12 -0600 Subject: [ofa-general] [PATCH 2.6.30 1/2] RDMA/cxgb3: sgl/pbl offset calculation is 64b. Message-ID: <20090204202612.27031.78831.stgit@dell3.ogc.int> From: Steve Wise The variable 'offset' in iwch_sgl2pbl_map() needs to be a u64. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_qp.c | 7 ++----- 1 files changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 19661b2..2cf6f13 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -195,15 +195,12 @@ static int build_inv_stag(union t3_wr *wqe, struct ib_send_wr *wr, return 0; } -/* - * TBD: this is going to be moved to firmware. 
Missing pdid/qpid check for now. - */ static int iwch_sgl2pbl_map(struct iwch_dev *rhp, struct ib_sge *sg_list, u32 num_sgle, u32 * pbl_addr, u8 * page_size) { int i; struct iwch_mr *mhp; - u32 offset; + u64 offset; for (i = 0; i < num_sgle; i++) { mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8); @@ -235,7 +232,7 @@ static int iwch_sgl2pbl_map(struct iwch_dev *rhp, struct ib_sge *sg_list, return -EINVAL; } offset = sg_list[i].addr - mhp->attr.va_fbo; - offset += ((u32) mhp->attr.va_fbo) % + offset += ((u64) mhp->attr.va_fbo) % (1UL << (12 + mhp->attr.page_size)); pbl_addr[i] = ((mhp->attr.pbl_addr - rhp->rdev.rnic_info.pbl_base) >> 3) + From swise at opengridcomputing.com Wed Feb 4 12:26:14 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 04 Feb 2009 14:26:14 -0600 Subject: [ofa-general] [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection termination fixes. In-Reply-To: <20090204202612.27031.78831.stgit@dell3.ogc.int> References: <20090204202612.27031.78831.stgit@dell3.ogc.int> Message-ID: <20090204202614.27031.22248.stgit@dell3.ogc.int> From: Steve Wise The poll and flush code needs to handle all send opcodes: SEND, SEND_WITH_SE, SEND_WITH_INV, and SEND_WITH_SE_INV. Ignore TERM indications if the connection already gone. Ignore hw recv completions if the RQ is empty. 
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 13 +++++++++++-- drivers/infiniband/hw/cxgb3/cxio_wr.h | 6 ++++++ drivers/infiniband/hw/cxgb3/iwch_cm.c | 3 +++ drivers/infiniband/hw/cxgb3/iwch_ev.c | 5 ----- 4 files changed, 20 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 4dcf08b..c2740e7 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -450,7 +450,7 @@ static int cqe_completes_wr(struct t3_cqe *cqe, struct t3_wq *wq) if ((CQE_OPCODE(*cqe) == T3_READ_RESP) && SQ_TYPE(*cqe)) return 0; - if ((CQE_OPCODE(*cqe) == T3_SEND) && RQ_TYPE(*cqe) && + if (CQE_SEND_OPCODE(*cqe) && RQ_TYPE(*cqe) && Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) return 0; @@ -1204,11 +1204,12 @@ int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, } /* incoming SEND with no receive posted failures */ - if ((CQE_OPCODE(*hw_cqe) == T3_SEND) && RQ_TYPE(*hw_cqe) && + if (CQE_SEND_OPCODE(*hw_cqe) && RQ_TYPE(*hw_cqe) && Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) { ret = -1; goto skip_cqe; } + BUG_ON((*cqe_flushed == 0) && !SW_CQE(*hw_cqe)); goto proc_cqe; } @@ -1223,6 +1224,13 @@ int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, * then we complete this with TPT_ERR_MSN and mark the wq in * error. 
*/ + + if (Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) { + wq->error = 1; + ret = -1; + goto skip_cqe; + } + if (unlikely((CQE_WRID_MSN(*hw_cqe) != (wq->rq_rptr + 1)))) { wq->error = 1; hw_cqe->header |= htonl(V_CQE_STATUS(TPT_ERR_MSN)); @@ -1277,6 +1285,7 @@ proc_cqe: cxio_hal_pblpool_free(wq->rdev, wq->rq[Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)].pbl_addr, T3_STAG0_PBL_SIZE); + BUG_ON(Q_EMPTY(wq->rq_rptr, wq->rq_wptr)); wq->rq_rptr++; } diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h index 04618f7..ff9be1a 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_wr.h +++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h @@ -604,6 +604,12 @@ struct t3_cqe { #define CQE_STATUS(x) (G_CQE_STATUS(be32_to_cpu((x).header))) #define CQE_OPCODE(x) (G_CQE_OPCODE(be32_to_cpu((x).header))) +#define CQE_SEND_OPCODE(x)( \ + (G_CQE_OPCODE(be32_to_cpu((x).header)) == T3_SEND) || \ + (G_CQE_OPCODE(be32_to_cpu((x).header)) == T3_SEND_WITH_SE) || \ + (G_CQE_OPCODE(be32_to_cpu((x).header)) == T3_SEND_WITH_INV) || \ + (G_CQE_OPCODE(be32_to_cpu((x).header)) == T3_SEND_WITH_SE_INV)) + #define CQE_LEN(x) (be32_to_cpu((x).len)) /* used for RQ completion processing */ diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 44e936e..8699947 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1678,6 +1678,9 @@ static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) { struct iwch_ep *ep = ctx; + if (state_read(&ep->com) != FPDU_MODE) + return CPL_RET_BUF_DONE; + PDBG("%s ep %p\n", __func__, ep); skb_pull(skb, sizeof(struct cpl_rdma_terminate)); PDBG("%s saving %d bytes of term msg\n", __func__, skb->len); diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c index 7b67a67..743c5d8 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_ev.c +++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c @@ -179,11 +179,6 @@ void iwch_ev_dispatch(struct cxio_rdev 
*rdev_p, struct sk_buff *skb) case TPT_ERR_BOUND: case TPT_ERR_INVALIDATE_SHARED_MR: case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: - printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x " - "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __func__, - CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), - CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), - CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1); break; From sashak at voltaire.com Wed Feb 4 12:47:31 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 22:47:31 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h: osm_node_get_num_physp description fix In-Reply-To: References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca> <20090204182520.GW11874@sashak.voltaire.com> <20090204190256.GZ11874@sashak.voltaire.com> Message-ID: <20090204204731.GG11874@sashak.voltaire.com> On 15:03 Wed 04 Feb , Hal Rosenstock wrote: > > > > For switch it will be an actual number of allocated physical ports > > (struct osm_physp) - port 0 olus number of external ports. For non > > switch nodes entry '0' is not used. > > Right. In my terms, physical is another name for an external port and > port 0 is not a physical (external) port so I think we're quibbling > about words. What do you think it should say ? I don't really have a good opinion :(. Maybe something like: * Returns the number of osm_physp ports allocated for this for node * (for switches it is number of external physical ports plus port * 0 and number of physical ports + 1 for non-switch nodes). It is long... 
:( Sasha From hal.rosenstock at gmail.com Wed Feb 4 12:50:44 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 4 Feb 2009 15:50:44 -0500 Subject: Re: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h: osm_node_get_num_physp description fix In-Reply-To: <20090204204731.GG11874@sashak.voltaire.com> References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca> <20090204182520.GW11874@sashak.voltaire.com> <20090204190256.GZ11874@sashak.voltaire.com> <20090204204731.GG11874@sashak.voltaire.com> Message-ID: On Wed, Feb 4, 2009 at 3:47 PM, Sasha Khapyorsky wrote: > On 15:03 Wed 04 Feb , Hal Rosenstock wrote: >> > >> > For switch it will be an actual number of allocated physical ports >> > (struct osm_physp) - port 0 olus number of external ports. For non >> > switch nodes entry '0' is not used. >> >> Right. In my terms, physical is another name for an external port and >> port 0 is not a physical (external) port so I think we're quibbling >> about words. What do you think it should say ? > > I don't really have a good opinion :(. Maybe something like: > > * Returns the number of osm_physp ports allocated for this for node > * (for switches it is number of external physical ports plus port > * 0 and number of physical ports + 1 for non-switch nodes). Fine with me. Let me know if you want a patch for this. -- Hal > It is long... :( > > Sasha > From sashak at voltaire.com Wed Feb 4 12:55:20 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 22:55:20 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: In bad_node_port, allow queries on enhanced SP0 In-Reply-To: References: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca> <20090204191523.GA11874@sashak.voltaire.com> Message-ID: <20090204205520.GH11874@sashak.voltaire.com> On 14:54 Wed 04 Feb , Hal Rosenstock wrote: > > > > (osm_get_node_by_guid()) is expensive operation.
If you only need to > > determine port 0 type - store it as part of struct monitored_node > > structure. Another (even more universal) approach would be to keep there > > a reference to related osm_node object. > > This was done later in the patch series. Good, but why do we need this intermediate version then? It would be better to do right things from beginning I think (and also this patch depends on previous one where redirection table size was changed so I cannot apply it anyway until things will be clarified or fixed there). Sasha From sashak at voltaire.com Wed Feb 4 12:59:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 22:59:25 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_node.h: osm_node_get_num_physp description fix In-Reply-To: References: <1233673053.8992.406.camel@bertha1.edm.orcorp.ca> <20090204182520.GW11874@sashak.voltaire.com> <20090204190256.GZ11874@sashak.voltaire.com> <20090204204731.GG11874@sashak.voltaire.com> Message-ID: <20090204205917.GI11874@sashak.voltaire.com> On 15:50 Wed 04 Feb , Hal Rosenstock wrote: > On Wed, Feb 4, 2009 at 3:47 PM, Sasha Khapyorsky wrote: > > On 15:03 Wed 04 Feb , Hal Rosenstock wrote: > >> > > >> > For switch it will be an actual number of allocated physical ports > >> > (struct osm_physp) - port 0 olus number of external ports. For non > >> > switch nodes entry '0' is not used. > >> > >> Right. In my terms, physical is another name for an external port and > >> port 0 is not a physical (external) port so I think we're quibbling > >> about words. What do you think it should say ? > > > > I don't really have a good opinion :(. Maybe something like: > > > > * Returns the number of osm_physp ports allocated for this for node > > * (for switches it is number of external physical ports plus port > > * 0 and number of physical ports + 1 for non-switch nodes). > > Fine with me. Let me know if you want a patch for this. 
If we are out of ideas then yes, send a new patch (still hope that you will find better description during this... :)). Sasha From hal.rosenstock at gmail.com Wed Feb 4 12:58:10 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 4 Feb 2009 15:58:10 -0500 Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c: Increase size of memory allocation in __collect_guids In-Reply-To: <20090204190023.GY11874@sashak.voltaire.com> References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca> <20090204190023.GY11874@sashak.voltaire.com> Message-ID: On Wed, Feb 4, 2009 at 2:00 PM, Sasha Khapyorsky wrote: > On 07:57 Tue 03 Feb , Hal Rosenstock wrote: >> >> Patch to increase size of monitored node in >> osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual >> port number. > > There are couple of validations like (port > p_mon_node->redir_tbl_size) > in osm_perfmgr.c. Would it be correct after proposed change? I see an issue with those tests which I will fix in a subsequent patch. -- Hal > > Sasha > From hal.rosenstock at gmail.com Wed Feb 4 13:01:41 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 4 Feb 2009 16:01:41 -0500 Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: In bad_node_port, allow queries on enhanced SP0 In-Reply-To: <20090204205520.GH11874@sashak.voltaire.com> References: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca> <20090204191523.GA11874@sashak.voltaire.com> <20090204205520.GH11874@sashak.voltaire.com> Message-ID: On Wed, Feb 4, 2009 at 3:55 PM, Sasha Khapyorsky wrote: > On 14:54 Wed 04 Feb , Hal Rosenstock wrote: >> > >> > (osm_get_node_by_guid()) is expensive operation.
If you only need to >> > determine port 0 type - store it as part of struct monitored_node >> > structure. Another (even more universal) approach would be to keep there >> > a reference to related osm_node object. >> >> This was done later in the patch series. > > Good, but why do we need this intermediate version then? Just as a time saver; it's just the path I took in development. > It would be better to do right things from beginning I think Sure it's better but does it really matter ? > (and also this patch > depends on previous one where redirection table size was changed so I > cannot apply it anyway until things will be clarified or fixed there). I think that position is extreme. I don't think I broke anything that wasn't already broken. Anyhow, if you really want, I'll produce one patch for these changes. -- Hal > Sasha > From sashak at voltaire.com Wed Feb 4 13:11:06 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 23:11:06 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c: Increase size of memory allocation in __collect_guids In-Reply-To: References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca> <20090204190023.GY11874@sashak.voltaire.com> Message-ID: <20090204211106.GJ11874@sashak.voltaire.com> On 15:58 Wed 04 Feb , Hal Rosenstock wrote: > On Wed, Feb 4, 2009 at 2:00 PM, Sasha Khapyorsky wrote: > > On 07:57 Tue 03 Feb , Hal Rosenstock wrote: > >> > >> Patch to increase size of monitored node in > >> osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual > >> port number. > > > > There are couple of validations like (port > p_mon_node->redir_tbl_size) > > in osm_perfmgr.c. Would it be correct after proposed change? > > I see an issue with those tests which I will fix in a subsequent patch. Could you fix this and post v2? - putting bugs in a main stream is bad in general and practically also complicates things like bisecting. 
Sasha From hal.rosenstock at gmail.com Wed Feb 4 13:09:20 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 4 Feb 2009 16:09:20 -0500 Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c: Increase size of memory allocation in __collect_guids In-Reply-To: <20090204211106.GJ11874@sashak.voltaire.com> References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca> <20090204190023.GY11874@sashak.voltaire.com> <20090204211106.GJ11874@sashak.voltaire.com> Message-ID: On Wed, Feb 4, 2009 at 4:11 PM, Sasha Khapyorsky wrote: > On 15:58 Wed 04 Feb , Hal Rosenstock wrote: >> On Wed, Feb 4, 2009 at 2:00 PM, Sasha Khapyorsky wrote: >> > On 07:57 Tue 03 Feb , Hal Rosenstock wrote: >> >> >> >> Patch to increase size of monitored node in >> >> osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual >> >> port number. >> > >> > There are couple of validations like (port > p_mon_node->redir_tbl_size) >> > in osm_perfmgr.c. Would it be correct after proposed change? >> >> I see an issue with those tests which I will fix in a subsequent patch. > > Could you fix this and post v2? I can. > - putting bugs in a main stream is bad > in general and practically also complicates things like bisecting. It was leaving an old bug in rather than adding a new one. -- Hal > Sasha > From halr at obsidianresearch.com Wed Feb 4 13:26:08 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Wed, 04 Feb 2009 14:26:08 -0700 Subject: [ofa-general] [PATCHv2] opensm/osm_node.h: Fix osm_node_get_num_physp description Message-ID: <1233782768.8992.469.camel@bertha1.edm.orcorp.ca> Sasha, v2 of patch to update/fix opensm/include/opensm/osm_node.h as requested.
-- Hal -------------- next part -------------- opensm/include/opensm/osm_node.h: Fix osm_node_num_physp description Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_node.h b/opensm/include/opensm/osm_node.h index 50b3598..fec24ba 100644 --- a/opensm/include/opensm/osm_node.h +++ b/opensm/include/opensm/osm_node.h @@ -269,7 +269,10 @@ static inline uint8_t osm_node_get_type(IN const osm_node_t * const p_node) * osm_node_get_num_physp * * DESCRIPTION -* Returns the type of this node. +* Returns the number of osm_physp ports allocated for this node. +* For switches, it is the number of external physical ports plus +* port 0. For CAs and routers, it is the number of external physical +* ports plus 1. * * SYNOPSIS */ From sashak at voltaire.com Wed Feb 4 13:40:18 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 23:40:18 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: In bad_node_port, allow queries on enhanced SP0 In-Reply-To: References: <1233673070.8992.408.camel@bertha1.edm.orcorp.ca> <20090204191523.GA11874@sashak.voltaire.com> <20090204205520.GH11874@sashak.voltaire.com> Message-ID: <20090204214018.GK11874@sashak.voltaire.com> On 16:01 Wed 04 Feb , Hal Rosenstock wrote: > > I think that position is extreme. I don't think I broke anything that > wasn't already broken. At least after fast look: (port_num > p_mon_node->redir_tbl_size) and similar tests look broken, using osm_get_node_by_guid() likely slows down existing PerfMgr. :( Both triggered by those patches. I would be fine with subsequent patches if there would no degradations. > Anyhow, if you really want, I'll produce one patch for these changes. Thanks. 
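This exchange involves two validation rules. The redirection table is indexed directly by port number, and port 0 may only be queried on an enhanced switch port 0. Both rules can be sketched as standalone predicates. This is a hypothetical illustration, not OpenSM code: `struct mon_node_model`, `redir_index_ok()` and `counter_port_ok()` are invented names. The table size is assumed to include an entry for port 0, so it holds one entry per allocated physical port object.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the monitored-node bookkeeping
 * discussed in this thread; not the OpenSM structure. */
struct mon_node_model {
	uint32_t redir_tbl_size; /* entries in the table, indexed by port number */
	bool esp0;               /* switch with enhanced switch port 0 */
};

/* A table with redir_tbl_size entries has valid indices
 * 0 .. redir_tbl_size - 1, so the bound check must reject
 * port == redir_tbl_size, i.e. use >= rather than >. */
static bool redir_index_ok(const struct mon_node_model *n, uint8_t port)
{
	return port < n->redir_tbl_size;
}

/* Port 0 is only a valid counter target on an enhanced switch port 0. */
static bool counter_port_ok(const struct mon_node_model *n, uint8_t port)
{
	if (!redir_index_ok(n, port))
		return false;
	return port != 0 || n->esp0;
}
```

With a table of N entries, the valid indices run from 0 to N - 1. A `port > size` test therefore lets `port == size` through, one slot past the end, while `port >= size` rejects it.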
Sasha From hal.rosenstock at gmail.com Wed Feb 4 13:37:47 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 4 Feb 2009 16:37:47 -0500 Subject: Re: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c: Increase size of memory allocation in __collect_guids In-Reply-To: References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca> <20090204190023.GY11874@sashak.voltaire.com> <20090204211106.GJ11874@sashak.voltaire.com> Message-ID: On Wed, Feb 4, 2009 at 4:09 PM, Hal Rosenstock wrote: > On Wed, Feb 4, 2009 at 4:11 PM, Sasha Khapyorsky wrote: >> On 15:58 Wed 04 Feb , Hal Rosenstock wrote: >>> On Wed, Feb 4, 2009 at 2:00 PM, Sasha Khapyorsky wrote: >>> > On 07:57 Tue 03 Feb , Hal Rosenstock wrote: >>> >> >>> >> Patch to increase size of monitored node in >>> >> osm_perfmgr.c::__collect_guids. Redirection table is indexed by actual >>> >> port number. >>> > >>> > There are couple of validations like (port > p_mon_node->redir_tbl_size) >>> > in osm_perfmgr.c. Would it be correct after proposed change? >>> >>> I see an issue with those tests which I will fix in a subsequent patch. >> >> Could you fix this and post v2? > > I can. Would you push the latest changes you've accepted up to the management repo on the OFA server as they impact this ? -- Hal >> - putting bugs in a main stream is bad >> in general and practically also complicates things like bisecting. > > It was leaving an old bug in rather than adding a new one.
> > -- Hal > >> Sasha >> > From sashak at voltaire.com Wed Feb 4 13:42:53 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 23:42:53 +0200 Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c: Increase size of memory allocation in __collect_guids In-Reply-To: References: <1233673056.8992.407.camel@bertha1.edm.orcorp.ca> <20090204190023.GY11874@sashak.voltaire.com> <20090204211106.GJ11874@sashak.voltaire.com> Message-ID: <20090204214253.GL11874@sashak.voltaire.com> On 16:37 Wed 04 Feb , Hal Rosenstock wrote: > > Would you push the latest changes you've accepted up to the management > repo on the OFA server as they impact this ? Sure. Pushing now. Sasha From sashak at voltaire.com Wed Feb 4 13:45:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 4 Feb 2009 23:45:00 +0200 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_node.h: Fix osm_node_get_num_physp description In-Reply-To: <1233782768.8992.469.camel@bertha1.edm.orcorp.ca> References: <1233782768.8992.469.camel@bertha1.edm.orcorp.ca> Message-ID: <20090204214500.GM11874@sashak.voltaire.com> On 14:26 Wed 04 Feb , Hal Rosenstock wrote: > Sasha, > > v2 of patch to update/fix opensm/include/opensm/osm_node.h as requested. > > -- Hal > opensm/include/opensm/osm_node.h: Fix osm_node_num_physp description > > Signed-off-by: Hal Rosenstock Applied. Thanks.
Sasha From chien.tin.tung at intel.com Wed Feb 4 13:48:11 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Wed, 4 Feb 2009 14:48:11 -0700 Subject: [ofa-general] RE: [PATCH] : Define debugging variables only when CONFIG_INFINIBAND_NES_DEBUG is enabled In-Reply-To: References: Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA3830320A21FD5@azsmsx501.amr.corp.intel.com> >> Below patch removes following compilation warnings : >> drivers/infiniband/hw/nes/nes_cm.c:781: warning: unused >variable 'tmp_addr' >> drivers/infiniband/hw/nes/nes_cm.c:820: warning: unused >variable 'tmp_addr' >> > >Any feedback on this ? Manish, Thank you for the patch to take care of the warnings. Upon closer examination on the usage of tmp_addr in the subsequent NES_DEBUG, it seems to be nonsense. I am creating a patch to take out tmp_addr and the subsequent NES_DEBUG. Thanks, Chien From or.gerlitz at gmail.com Wed Feb 4 13:52:07 2009 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 4 Feb 2009 23:52:07 +0200 Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> References: <49893FAF.3090007@voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> Message-ID: <15ddcffd0902041352u5a7acaedl8b9485769cc90e7@mail.gmail.com> On Wed, Feb 4, 2009 at 6:41 PM, Sean Hefty wrote: > There may be some way to defer setting the qkey if it's not available when binding, but how > does allowing the bind to proceed help? Without the qkey, the QP is basically unusable. 
We have two usage cases: - an rdma-cm based app wants to determine if the route for a multicast group leads to IPoIB interface/device based on the outcome of rdma_bind_addr etc - for HA scheme, an app want to resolve the device/port and then use IB events as a trigger to actually start doing things such as QP creation, Joining multicast groups, etc Or From halr at obsidianresearch.com Wed Feb 4 14:06:06 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Wed, 04 Feb 2009 15:06:06 -0700 Subject: [ofa-general] [PATCHv2] opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager operation Message-ID: <1233785166.8992.473.camel@bertha1.edm.orcorp.ca> Sasha, Attached is a revised patch superceeding any outstanding perfmgr patches. This version fixes esp0 perfmgr operation. It determines ESP0 for the monitored node and subsequently copies this into the db node. Also, it fixes redirection table size and port number validation. -- Hal -------------- next part -------------- opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager operation Determine ESP0 for monitored node and copy into db node Also, fix redirection table size and port number validation Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h index c8a4add..87fae37 100644 --- a/opensm/include/opensm/osm_perfmgr.h +++ b/opensm/include/opensm/osm_perfmgr.h @@ -100,6 +100,7 @@ typedef struct _monitored_node { cl_map_item_t map_item; struct _monitored_node *next; uint64_t guid; + boolean_t esp0; char *name; uint32_t redir_tbl_size; redir_t redir_port[1]; /* redirection on a per port basis */ diff --git a/opensm/include/opensm/osm_perfmgr_db.h b/opensm/include/opensm/osm_perfmgr_db.h index 5c96378..cb5c40a 100644 --- a/opensm/include/opensm/osm_perfmgr_db.h +++ b/opensm/include/opensm/osm_perfmgr_db.h @@ -134,6 +134,7 @@ typedef struct _db_port { typedef struct _db_node { cl_map_item_t map_item; /* must be first */ 
uint64_t node_guid; + boolean_t esp0; _db_port_t *ports; uint8_t num_ports; char node_name[NODE_NAME_SIZE]; @@ -155,7 +156,8 @@ perfmgr_db_t *perfmgr_db_construct(struct osm_perfmgr *perfmgr); void perfmgr_db_destroy(perfmgr_db_t * db); perfmgr_db_err_t perfmgr_db_create_entry(perfmgr_db_t * db, uint64_t guid, - uint8_t num_ports, char *node_name); + boolean_t esp0, uint8_t num_ports, + char *node_name); perfmgr_db_err_t perfmgr_db_add_err_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port, diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index a2ce50f..b01d612 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -438,7 +438,7 @@ static void __collect_guids(cl_map_item_t * const p_map_item, void *context) if (cl_qmap_get(&pm->monitored_map, node_guid) == cl_qmap_end(&pm->monitored_map)) { /* if not already in our map add it */ - size = node->node_info.num_ports; + size = osm_node_get_num_physp(node); mon_node = malloc(sizeof(*mon_node) + sizeof(redir_t) * size); if (!mon_node) { OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C06: " @@ -449,7 +449,15 @@ static void __collect_guids(cl_map_item_t * const p_map_item, void *context) memset(mon_node, 0, sizeof(*mon_node) + sizeof(redir_t) * size); mon_node->guid = node_guid; mon_node->name = strdup(node->print_desc); - mon_node->redir_tbl_size = size + 1; + mon_node->redir_tbl_size = size; + /* check for enhanced switch port 0 */ + if (node && osm_node_get_type(node) == IB_NODE_TYPE_SWITCH && + node->sw && + ib_switch_info_is_enhanced_port0(&node->sw->switch_info)) + mon_node->esp0 = TRUE; + else + mon_node->esp0 = FALSE; + cl_qmap_insert(&(pm->monitored_map), node_guid, (cl_map_item_t *) mon_node); } @@ -491,8 +499,8 @@ __osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context) node_guid = cl_ntoh64(node->node_info.node_guid); /* make sure we have a database object ready to store this information */ - if (perfmgr_db_create_entry(pm->db, node_guid, 
num_ports, - node->print_desc) != + if (perfmgr_db_create_entry(pm->db, node_guid, mon_node->esp0, + num_ports, node->print_desc) != PERFMGR_EVENT_DB_SUCCESS) { OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C08: DB create entry failed for 0x%" @@ -501,10 +509,8 @@ __osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context) goto Exit; } - /* if switch, check for enhanced port 0 */ - if (osm_node_get_type(node) == IB_NODE_TYPE_SWITCH && - node->sw && - ib_switch_info_is_enhanced_port0(&node->sw->switch_info)) + /* check for switch enhanced port 0 */ + if (mon_node->esp0) startport = 0; /* issue the query for each port */ @@ -1136,7 +1142,7 @@ static void osm_pc_rcv_process(void *context, void *data) /* LID redirection support (easier than GID redirection) */ cl_plock_acquire(pm->lock); /* Now, validate port number */ - if (port > p_mon_node->redir_tbl_size) { + if (port >= p_mon_node->redir_tbl_size) { cl_plock_release(pm->lock); OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C13: " "Invalid port num %d for GUID 0x%016" diff --git a/opensm/opensm/osm_perfmgr_db.c b/opensm/opensm/osm_perfmgr_db.c index bff9a0f..ef47ce3 100644 --- a/opensm/opensm/osm_perfmgr_db.c +++ b/opensm/opensm/osm_perfmgr_db.c @@ -90,14 +90,15 @@ static inline perfmgr_db_err_t bad_node_port(_db_node_t * node, uint8_t port) { if (!node) return (PERFMGR_EVENT_DB_GUIDNOTFOUND); - if (port == 0 || port >= node->num_ports) + if (port >= node->num_ports || (!node->esp0 && port == 0)) return (PERFMGR_EVENT_DB_PORTNOTFOUND); return (PERFMGR_EVENT_DB_SUCCESS); } /** ========================================================================= */ -static _db_node_t *__malloc_node(uint64_t guid, uint8_t num_ports, char *name) +static _db_node_t *__malloc_node(uint64_t guid, boolean_t esp0, + uint8_t num_ports, char *name) { int i = 0; time_t cur_time = 0; @@ -110,6 +111,7 @@ static _db_node_t *__malloc_node(uint64_t guid, uint8_t num_ports, char *name) goto free_rc; rc->num_ports = num_ports; rc->node_guid = 
guid; + rc->esp0 = esp0; cur_time = time(NULL); for (i = 0; i < num_ports; i++) { @@ -151,14 +153,15 @@ static perfmgr_db_err_t __insert(perfmgr_db_t * db, _db_node_t * node) /********************************************************************** **********************************************************************/ perfmgr_db_err_t -perfmgr_db_create_entry(perfmgr_db_t * db, uint64_t guid, +perfmgr_db_create_entry(perfmgr_db_t * db, uint64_t guid, boolean_t esp0, uint8_t num_ports, char *name) { perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS; cl_plock_excl_acquire(&db->lock); if (!_get(db, guid)) { - _db_node_t *pc_node = __malloc_node(guid, num_ports, name); + _db_node_t *pc_node = __malloc_node(guid, esp0, num_ports, + name); if (!pc_node) { rc = PERFMGR_EVENT_DB_NOMEM; goto Exit; From hal.rosenstock at gmail.com Wed Feb 4 15:35:15 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 4 Feb 2009 18:35:15 -0500 Subject: [ofa-general] Possible memory leak and null pointer dereference in local_completions() In-Reply-To: <1233777486.23327.172.camel@chromite.mv.qlogic.com> References: <1233689172.23327.155.camel@chromite.mv.qlogic.com> <1233777486.23327.172.camel@chromite.mv.qlogic.com> Message-ID: On Wed, Feb 4, 2009 at 2:58 PM, Ralph Campbell wrote: > On Wed, 2009-02-04 at 04:29 -0800, Hal Rosenstock wrote: >> On Tue, Feb 3, 2009 at 2:26 PM, Ralph Campbell >> wrote: >> > I was doing some tests with different MAD packets and >> > then reading the infiniband/core/mad.c code. >> > >> > handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local >> > on the mad_agent_priv->local_work work queue with >> > local->mad_priv == NULL if device->process_mad() returns >> > IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and >> > (!ib_response_mad(&mad_priv->mad.mad) || >> > !mad_agent_priv->agent.recv_handler). >> > >> > In this case, local_completions() will be called with >> > local->mad_priv == NULL. 
The code does check for this >> > case and skips calling recv_mad_agent->agent.recv_handler(). >> > This means recv == 0 so kmem_cache_free() is called with a >> > NULL pointer. >> >> That could be fixed by changing the check for !recv prior to the >> kmem_cache_free there to a check for (!recv && local->mad_priv). > > This is what we did to continue making progress so I know > it works. > >> > Even if local->mad_priv != NULL, I don't see how local->mad_priv >> > is freed when recv == 1. Thus, it appears to be a memory leak. >> >> For those cases, it's either freed in local_completions (as recv is >> set to 1 for local->mad_priv != NULL except when there is no mad recv >> agent but that is another bug (see below)) or earlier in the else >> clause of the IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY of >> handle_outgoing_dr_smp(). That's another issue that this points out >> where recv = 1 needs to be moved up in local_completions. > > The other problem I noticed with setting recv = 1, is that recv = 0 > is outside the while (!list_empty) loop so it is never reset back > to zero. > > I'm not really following you about recv = 1 needs to be moved up in > local_completions. I was referring to handling the case where local->mad_priv != NULL and there is no mad recv agent: if (local->mad_priv) { recv_mad_agent = local->recv_mad_agent; if (!recv_mad_agent) { printk(KERN_ERR PFX "No receive MAD agent for lo cal completion\n"); goto local_send_completion; } That was another case where there was a leak so I moved recv = 1 from below this to above it just after the check of local->mad_priv in the patch I proposed. > What I was really looking for was a confirmation that the original > code had a memory leak. I need to look at this further for this. Haven't looked at this code much in the past couple years. > I don't see any reason to special case the > call to kmem_cache_free(). It seems to me that it is needed any time > local->mad_priv != NULL. 
> The NULL pointer bug is easily fixed in a number of different ways. I agree that if it turns out that this case was missed, then your patch is simpler but it will take me a little bit to check this out. >> Would you try the untested patch below and see if it fixes the problem >> you found ? Thanks. > > We are in the middle of moving our office so I won't be able to > reproduce this until next week. I no longer have any test bed setup for this. Any chance you can regress with the Mellanox HCAs to be sure this works there ? Part of that testing should be running OpenSM as it creates some of those cases. -- Hal >> -- Hal >> >> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c >> index 5c54fc2..cca87e6 100644 >> --- a/drivers/infiniband/core/mad.c >> +++ b/drivers/infiniband/core/mad.c >> @@ -2371,13 +2371,13 @@ static void local_completions(struct work_struct *work) >> list_del(&local->completion_list); >> spin_unlock_irqrestore(&mad_agent_priv->lock, flags); >> if (local->mad_priv) { >> + recv = 1; >> recv_mad_agent = local->recv_mad_agent; >> if (!recv_mad_agent) { >> printk(KERN_ERR PFX "No receive MAD agent for lo >> goto local_send_completion; >> } >> >> - recv = 1; >> /* >> * Defined behavior is to complete response >> * before request >> @@ -2422,7 +2422,7 @@ local_send_completion: >> >> spin_lock_irqsave(&mad_agent_priv->lock, flags); >> atomic_dec(&mad_agent_priv->refcount); >> - if (!recv) >> + if (!recv && local->mad_priv) >> kmem_cache_free(ib_mad_cache, local->mad_priv); >> kfree(local); >> } >> >> > So, I'm proposing the following patch: >> > >> > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c >> > index 5c54fc2..93d80e5 100644 >> > --- a/drivers/infiniband/core/mad.c >> > +++ b/drivers/infiniband/core/mad.c >> > @@ -2356,7 +2356,6 @@ static void local_completions(struct work_struct *work) >> > struct ib_mad_local_private *local; >> > struct ib_mad_agent_private *recv_mad_agent; >> > unsigned long 
flags; >> > - int recv = 0; >> > struct ib_wc wc; >> > struct ib_mad_send_wc mad_send_wc; >> > >> > @@ -2377,7 +2376,6 @@ static void local_completions(struct work_struct *work) >> > goto local_send_completion; >> > } >> > >> > - recv = 1; >> > /* >> > * Defined behavior is to complete response >> > * before request >> > @@ -2422,7 +2420,7 @@ local_send_completion: >> > >> > spin_lock_irqsave(&mad_agent_priv->lock, flags); >> > atomic_dec(&mad_agent_priv->refcount); >> > - if (!recv) >> > + if (local->mad_priv) >> > kmem_cache_free(ib_mad_cache, local->mad_priv); >> > kfree(local); >> > } >> > >> > >> > _______________________________________________ >> > general mailing list >> > general at lists.openfabrics.org >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> > >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > > > From chien.tin.tung at intel.com Wed Feb 4 15:44:34 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Wed, 4 Feb 2009 17:44:34 -0600 Subject: [ofa-general] [PATCH] RDMA/nes: ibv_devinfo displays 0 for vendor_id and vendor_part_id Message-ID: <20090204234434.GA1856@ctung-MOBL> ibv_devinfo displays 0 for vendor_id and vendor_part_id. Fill in OUI and device_id for those two fields. 
Signed-off-by: Chien Tung --- diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index cb4a5f3..da966a5 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -254,6 +254,7 @@ struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) { u32 adapter_size; u32 arp_table_size; u16 vendor_id; + u16 device_id; u8 OneG_Mode; u8 func_index; @@ -356,6 +357,13 @@ struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) { return NULL; } + nesadapter->vendor_id = (((u32) nesadapter->mac_addr_high) << 8) | + (nesadapter->mac_addr_low >> 24); + + pci_bus_read_config_word(nesdev->pcidev->bus, nesdev->pcidev->devfn, + PCI_DEVICE_ID, &device_id); + nesadapter->vendor_part_id = device_id; + if (nes_init_serdes(nesdev, hw_rev, port_count, nesadapter, OneG_Mode)) { kfree(nesadapter); -- 1.5.3.3 From sashak at voltaire.com Wed Feb 4 16:03:23 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 5 Feb 2009 02:03:23 +0200 Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet configuration after SIGHUP In-Reply-To: <498850A2.8090701@gmail.com> References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com> <20090202205924.GF5910@sashak.voltaire.com> <49880E4D.2090107@gmail.com> <20090203124407.GE11874@sashak.voltaire.com> <49884962.5070601@gmail.com> <20090203134831.GI11874@sashak.voltaire.com> <498850A2.8090701@gmail.com> Message-ID: <20090205000323.GN11874@sashak.voltaire.com> Hi Eli, On 16:11 Tue 03 Feb , Eli Dorfman (Voltaire) wrote: > rescan configuration as first step on every heavy sweep > this is a must in case of priority change (increase) for standby SM > > Signed-off-by: Eli Dorfman > --- > opensm/opensm/osm_state_mgr.c | 11 ++++++----- > 1 files changed, 6 insertions(+), 5 deletions(-) > > diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c > index fc7ceb9..622867b 100644 > --- a/opensm/opensm/osm_state_mgr.c > 
+++ b/opensm/opensm/osm_state_mgr.c > @@ -1042,6 +1042,12 @@ static void do_sweep(osm_sm_t * sm) > ib_api_status_t status; > osm_remote_sm_t *p_remote_sm; > > + if (sm->p_subn->force_heavy_sweep && > + osm_subn_rescan_conf_files(sm->p_subn) < 0) { > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " > + "osm_subn_rescan_conf_file failed\n"); > + } > + > if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER && > sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING) > return; > @@ -1131,11 +1137,6 @@ _repeat_discovery: > sm->p_subn->force_reroute = FALSE; > sm->p_subn->subnet_initialization_error = FALSE; > > - /* rescan configuration updates */ > - if (osm_subn_rescan_conf_files(sm->p_subn) < 0) > - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " > - "osm_subn_rescan_conf_file failed\n"); > - > if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER) > sm->p_subn->need_update = 1; 'force_heavy_sweep' flag can be raised during light sweep too. In this case you will miss config rescanning before incoming heavy sweep. 
I guess the patch should be similar to this (not tested): diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index aecfac6..f5d3837 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1041,11 +1041,14 @@ static void do_sweep(osm_sm_t * sm) { ib_api_status_t status; osm_remote_sm_t *p_remote_sm; + unsigned config_parsed = 0; - if (sm->p_subn->force_heavy_sweep && - osm_subn_rescan_conf_files(sm->p_subn) < 0) { - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " - "osm_subn_rescan_conf_file failed\n"); + if (sm->p_subn->force_heavy_sweep) { + if (osm_subn_rescan_conf_files(sm->p_subn) < 0) + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " + "osm_subn_rescan_conf_file failed\n"); + else + config_parsed = 1; } if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER && @@ -1137,6 +1140,11 @@ _repeat_discovery: sm->p_subn->force_reroute = FALSE; sm->p_subn->subnet_initialization_error = FALSE; + /* rescan configuration updates */ + if (!config_parsed && osm_subn_rescan_conf_files(sm->p_subn) < 0) + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " + "osm_subn_rescan_conf_file failed\n"); + if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER) sm->p_subn->need_update = 1; Sasha From sean.hefty at intel.com Wed Feb 4 16:32:04 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 4 Feb 2009 16:32:04 -0800 Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: <15ddcffd0902041352u5a7acaedl8b9485769cc90e7@mail.gmail.com> References: <49893FAF.3090007@voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> <15ddcffd0902041352u5a7acaedl8b9485769cc90e7@mail.gmail.com> Message-ID: >- an rdma-cm based app wants to determine if the route for a multicast >group leads to IPoIB interface/device based on the outcome of >rdma_bind_addr etc I'm not quite following this yet. Are you wanting a list of IP addresses that map to RDMA devices?
>- for HA scheme, an app want to resolve the device/port and then use >IB events as a trigger to actually start doing things such as QP >creation, Joining multicast groups, etc Thanks - I'll look at Yossi's patch in detail. The general principle looks fine to me. Is there some notification for IP addresses becoming usable that could be used instead? - Sean From sashak at voltaire.com Wed Feb 4 16:45:39 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 5 Feb 2009 02:45:39 +0200 Subject: [ofa-general] Re: [PATCHv2] opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager operation In-Reply-To: <1233785166.8992.473.camel@bertha1.edm.orcorp.ca> References: <1233785166.8992.473.camel@bertha1.edm.orcorp.ca> Message-ID: <20090205004539.GO11874@sashak.voltaire.com> On 15:06 Wed 04 Feb , Hal Rosenstock wrote: > Sasha, > > Attached is a revised patch superceeding any outstanding perfmgr > patches. This version fixes esp0 perfmgr operation. It determines ESP0 > for the monitored node and subsequently copies this into the db node. > Also, it fixes redirection table size and port number validation. > > -- Hal > > > opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager operation > > Determine ESP0 for monitored node and copy into db node > Also, fix redirection table size and port number validation > > Signed-off-by: Hal Rosenstock Applied. Thanks. With one change: there was one more (port > p_mon_node->redir_tbl_size) test in osm_perfmgr_mad_send_err_callback(); I fixed it to (port >= p_mon_node->redir_tbl_size).
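The following minimal C sketch makes the off-by-one concrete. It shows the two validity checks the PerfMgr patch ends up with. The helper names `redir_port_ok` and `db_port_ok` are hypothetical and are not OpenSM functions. They only mirror the conditions visible in the diff: `port >= redir_tbl_size` is rejected, and port 0 is rejected unless `esp0` is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the fixed redirection-table check:
 * a table with redir_tbl_size entries is indexed 0..redir_tbl_size-1,
 * so port == redir_tbl_size must be rejected. The old test
 * (port > redir_tbl_size) let exactly that out-of-range index through. */
static bool redir_port_ok(uint8_t port, uint32_t redir_tbl_size)
{
	return port < redir_tbl_size;
}

/* Hypothetical helper mirroring bad_node_port() after the patch:
 * ports at or beyond num_ports are invalid, and port 0 is valid
 * only when the switch implements enhanced port 0 (esp0). */
static bool db_port_ok(uint8_t port, uint8_t num_ports, bool esp0)
{
	if (port >= num_ports)
		return false;
	if (port == 0 && !esp0)
		return false;
	return true;
}
```

For example, with a 4-entry table, port 4 is now rejected. Port 0 on a switch without enhanced port 0 is still reported as not found.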
Sasha From ralph.campbell at qlogic.com Wed Feb 4 17:56:37 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 04 Feb 2009 17:56:37 -0800 Subject: [ofa-general] Possible memory leak and null pointer dereference in local_completions() In-Reply-To: References: <1233689172.23327.155.camel@chromite.mv.qlogic.com> <1233777486.23327.172.camel@chromite.mv.qlogic.com> Message-ID: <1233798997.23327.180.camel@chromite.mv.qlogic.com> > >> Would you try the untested patch below and see if it fixes the problem > >> you found ? Thanks. > > > > We are in the middle of moving our office so I won't be able to > > reproduce this until next week. > > I no longer have any test bed setup for this. Any chance you can > regress with the Mellanox HCAs to be sure this works there ? Part of > that testing should be running OpenSM as it creates some of those > cases. > > -- Hal We have a variety of Mellanox HCAs, so I can test it next week when our lab is back up. From mkatiyar at gmail.com Wed Feb 4 19:01:30 2009 From: mkatiyar at gmail.com (Manish Katiyar) Date: Thu, 5 Feb 2009 08:31:30 +0530 Subject: [ofa-general] Re: [PATCH] : Define debugging variables only when CONFIG_INFINIBAND_NES_DEBUG is enabled In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA3830320A21FD5@azsmsx501.amr.corp.intel.com> References: <60BEFF3FBD4C6047B0F13F205CAFA3830320A21FD5@azsmsx501.amr.corp.intel.com> Message-ID: On Thu, Feb 5, 2009 at 3:18 AM, Tung, Chien Tin wrote: > >>> Below patch removes following compilation warnings : >>> drivers/infiniband/hw/nes/nes_cm.c:781: warning: unused >>variable 'tmp_addr' >>> drivers/infiniband/hw/nes/nes_cm.c:820: warning: unused >>variable 'tmp_addr' >>> >> >>Any feedback on this ? > > > Manish, > > Thank you for the patch to take care of the warnings. Upon closer > examination on the usage of tmp_addr in the subsequent NES_DEBUG, > it seems to be nonsense. I am creating a patch to take out > tmp_addr and the subsequent NES_DEBUG.
Thanks a lot Chien Thanks - Manish > > Thanks, > > Chien From He.Huang at Sun.COM Wed Feb 4 20:47:28 2009 From: He.Huang at Sun.COM (Isaac Huang) Date: Wed, 04 Feb 2009 23:47:28 -0500 Subject: [ofa-general] troubleshooting IB_CM_REJ_INVALID_SERVICE_ID in RDMA_CM_EVENT_REJECTED at active side of the connection Message-ID: <20090205044728.GL18580@sun.com> Hi, I got some RDMA_CM_EVENT_REJECTED errors on the active sides (i.e. nodes doing rdma_connect), after RDMA_CM_EVENT_ADDR_RESOLVED and RDMA_CM_EVENT_ROUTE_RESOLVED. Poking around in the CM code told me that the passive side couldn't find a listener with the requested service_id on the device on which the connection request arrived. I suspected that either the active side or the passive side had been bound to the wrong IB device; both sides do have multiple IB interfaces on the fabric. Our code did bind to the correct local IP addresses on both sides (src_addr in rdma_resolve_addr, and rdma_bind_addr before rdma_listen). However, I seem to remember that some old OFED versions had issues in rdma_translate_ip that could cause the wrong interface to be returned, e.g. bugs 726 and 325. Also, the active side was running OFED 1.3.1, and the passive side could be running an older version. Could you guys give me some tips for troubleshooting? Are there any debugging options or /proc files to look at? Is there any netstat-like tool in OFED (e.g. something like "netstat -ltp" to find out who is listening on which device)? Another possible cause could be ARP flux, but unfortunately arping via IPoIB always segfaults on our systems. Is there any other way to troubleshoot possible ARP flux issues? BTW, pinging over IPoIB addresses worked fine. Your suggestions are greatly appreciated.
Thanks, Isaac From dorfman.eli at gmail.com Wed Feb 4 23:43:04 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 05 Feb 2009 09:43:04 +0200 Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet configuration after SIGHUP In-Reply-To: <20090205000323.GN11874@sashak.voltaire.com> References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com> <20090202205924.GF5910@sashak.voltaire.com> <49880E4D.2090107@gmail.com> <20090203124407.GE11874@sashak.voltaire.com> <49884962.5070601@gmail.com> <20090203134831.GI11874@sashak.voltaire.com> <498850A2.8090701@gmail.com> <20090205000323.GN11874@sashak.voltaire.com> Message-ID: <498A9888.5010003@gmail.com> Sasha Khapyorsky wrote: > Hi Eli, > > On 16:11 Tue 03 Feb , Eli Dorfman (Voltaire) wrote: >> rescan configuration as first step on every heavy sweep >> this is a must in case of priority change (increase) for standby SM >> >> Signed-off-by: Eli Dorfman >> --- >> opensm/opensm/osm_state_mgr.c | 11 ++++++----- >> 1 files changed, 6 insertions(+), 5 deletions(-) >> >> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c >> index fc7ceb9..622867b 100644 >> --- a/opensm/opensm/osm_state_mgr.c >> +++ b/opensm/opensm/osm_state_mgr.c >> @@ -1042,6 +1042,12 @@ static void do_sweep(osm_sm_t * sm) >> ib_api_status_t status; >> osm_remote_sm_t *p_remote_sm; >> >> + if (sm->p_subn->force_heavy_sweep && >> + osm_subn_rescan_conf_files(sm->p_subn) < 0) { >> + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " >> + "osm_subn_rescan_conf_file failed\n"); >> + } >> + >> if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER && >> sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING) >> return; >> @@ -1131,11 +1137,6 @@ _repeat_discovery: >> sm->p_subn->force_reroute = FALSE; >> sm->p_subn->subnet_initialization_error = FALSE; >> >> - /* rescan configuration updates */ >> - if (osm_subn_rescan_conf_files(sm->p_subn) < 0) >> - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " >> - 
"osm_subn_rescan_conf_file failed\n"); >> - >> if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER) >> sm->p_subn->need_update = 1; > > 'force_heavy_sweep' flag can be raised during light sweep too. In this > case you will miss config rescanning before incoming heavy sweep. I > guess the patch should similar to (not tested): > > > diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c > index aecfac6..f5d3837 100644 > --- a/opensm/opensm/osm_state_mgr.c > +++ b/opensm/opensm/osm_state_mgr.c > @@ -1041,11 +1041,14 @@ static void do_sweep(osm_sm_t * sm) > { > ib_api_status_t status; > osm_remote_AM_t *p_remote_sm; > + unsigned config_parsed = 0; > > - if (sm->p_subn->force_heavy_sweep && > - osm_subn_rescan_conf_files(sm->p_subn) < 0) { > - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " > - "osm_subn_rescan_conf_file failed\n"); > + if (sm->p_subn->force_heavy_sweep) { > + if (osm_subn_rescan_conf_files(sm->p_subn) < 0) > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " > + "osm_subn_rescan_conf_file failed\n"); > + else > + config_parsed = 1; > } > > if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER && > @@ -1137,6 +1140,11 @@ _repeat_discovery: > sm->p_subn->force_reroute = FALSE; > sm->p_subn->subnet_initialization_error = FALSE; > > + /* rescan configuration updates */ > + if (!config_parsed && osm_subn_rescan_conf_files(sm->p_subn) < 0) > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331A: " > + "osm_subn_rescan_conf_file failed\n"); > + > if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER) > sm->p_subn->need_update = 1; > ok. Please apply the fixed patch. Thanks, Eli From ruffing at motama.com Thu Feb 5 02:03:19 2009 From: ruffing at motama.com (Jan Ruffing) Date: Thu, 05 Feb 2009 11:03:19 +0100 Subject: [ofa-general] RDMA transfers: Buffer status communications? 
Message-ID: <498AB967.9010108@motama.com> Hello, when planning a data transfer system using InfiniBand's RDMA mechanisms, I stumbled upon the following question: Is there a standard approach to inform the sender after an RDMA write operation that the receiving buffer has been processed by the receiver and is now ready to receive new data? My understanding is as follows: - As soon as an IBV_WR_RDMA_WRITE[_WITH_IMM] operation has finished transferring data into the target buffer on the receiver side, a work completion is placed on the sender-side completion queue [and optionally on the receiver's completion queue, too]. - The receiver processes the data in the buffer without the sender side noticing. - If the receiver wants to inform the sender that the buffer has been processed and is ready to accept new data, the receiver has to explicitly send a message to the sender (e.g. by posting a send work request containing some kind of buffer identifier). Is my understanding of the mechanisms correct? Since locking and unlocking of receive buffers is a standard use case in most transport strategies, is there a more elegant way to manage this using the InfiniBand architecture, for example by delaying the sender-side work completion until the buffer has been processed by the receiver? Thanks, Jan -- Jan Ruffing Software Developer Motama GmbH Lortzingstraße 10 · 66111 Saarbrücken · Germany tel +49 681 940 85 50 · fax +49 681 940 85 49 ruffing at motama.com · www.motama.com Companies register · district council Saarbrücken · HRB 15249 CEOs · Dr.-Ing. Marco Lohse, Michael Repplinger
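The explicit buffer-return message Jan describes is commonly called credit-based flow control. The sketch below simulates only the sender-side bookkeeping in plain C and makes no verbs calls. It assumes a fixed pool of `NUM_BUFS` receiver buffers. Each RDMA write consumes one credit, and a slot becomes reusable only after the receiver's "buffer processed" message arrives.

All names (`sender_t`, `post_write`, `return_credit`) are hypothetical. In a real program, `post_write` would roughly correspond to posting an `IBV_WR_RDMA_WRITE_WITH_IMM` work request, with the immediate data possibly carrying the slot index. `return_credit` would correspond to handling an ordinary send from the receiver.

```c
#include <stdbool.h>

#define NUM_BUFS 4 /* assumed number of receiver-side buffers */

/* Sender-side bookkeeping for remote buffers (hypothetical sketch). */
typedef struct {
	bool busy[NUM_BUFS]; /* true: written, not yet returned by receiver */
	int credits;         /* number of remote buffers free to write into */
	int next;            /* next slot to try, round-robin */
} sender_t;

static void sender_init(sender_t *s)
{
	for (int i = 0; i < NUM_BUFS; i++)
		s->busy[i] = false;
	s->credits = NUM_BUFS;
	s->next = 0;
}

/* Stand-in for posting an RDMA write: picks a free remote slot and
 * consumes one credit. Returns the slot index, or -1 if the sender
 * must wait for the receiver to return a buffer. */
static int post_write(sender_t *s)
{
	if (s->credits == 0)
		return -1;
	/* credits > 0 guarantees at least one slot is not busy */
	while (s->busy[s->next])
		s->next = (s->next + 1) % NUM_BUFS;
	int slot = s->next;
	s->busy[slot] = true;
	s->credits--;
	s->next = (slot + 1) % NUM_BUFS;
	return slot;
}

/* Stand-in for handling the receiver's "buffer processed" message,
 * which carries the index of the slot it has finished with. */
static void return_credit(sender_t *s, int slot)
{
	if (slot < 0 || slot >= NUM_BUFS || !s->busy[slot])
		return; /* ignore invalid or duplicate returns */
	s->busy[slot] = false;
	s->credits++;
}
```

With four buffers, the fifth write fails until the receiver returns a slot. Once slot 2 is returned, the next write reuses slot 2.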
From vlad at lists.openfabrics.org Thu Feb 5 03:11:56 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 5 Feb 2009 03:11:56 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090205-0200 daily build status Message-ID: <20090205111156.AD4E4E611D7@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with 
linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From ogerlitz at voltaire.com Thu Feb 5 03:44:53 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 5 Feb 2009 13:44:53 +0200 (IST) Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for bind In-Reply-To: References: Message-ID: Hi Sean, It seems that even when the rdma-cm consumer binds to a specific address, the rdma-cm address resolution code follows the order of the devices/rules in the routing table. So the user can't really dictate the outgoing interface through the src address provided to rdma_resolve_addr. This problem seems to happen even if the user first called rdma_bind_addr, so it's either the same issue, or rdma_resolve_addr is somehow stepping on the device/port "resolved" by rdma_bind_addr. Consider this system, with two IPoIB interfaces on the same IP subnet using the same HCA, each on a different port. The first match for 192.168.10.0/24 would be ib3. Now I issue a ping with the -I flag, to have the ICMP socket bind to a different interface. First, I see that two neighbours have been created, each on a different interface, and second, from sampling the interface packet counters (not shown here), I see that each ping uses the correct interface. Repeating the same test with rds-ping -I (rds-ping is a user space utility provided by the rds-devel package, sending packets through the rds kernel driver), I can see that the two rds rdma-cm ids (rds would have two connections in that case) are using the same port, the one corresponding to ib3, the first routing match. Below is some info on my system. Or, when running with multiple HCAs on Linux - we run into a problem with RDS - in that rdma_resolve_addr does not pick the outgoing NIC based on the IP we bind to..
it seems to always be using the destination IP. We put this patch together - which solves the problem on Linux... note that this only fails on Linux - it works correctly on HPUX, as an example. Do you see a problem with proposing that this patch be picked up by OFED ? Rick Frank, who brought this to my attention, also handed me this patch, which is claimed to work around this issue; it's badly formatted and I couldn't really understand what it does. I hoped to be able to reproduce this with rping or ucmatose, but neither allows me to specify a -I address on the client side, and I don't have the time now for this enhancement. --- ofa_kernel-1.3.1.orig/drivers/infiniband/core/addr.c +++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c @@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so struct flowi fl; struct rtable *rt; struct neighbour *neigh; + struct net_device *dev; int ret; memset(&fl, 0, sizeof fl); fl.nl_u.ip4_u.daddr = dst_ip; fl.nl_u.ip4_u.saddr = src_ip; + + if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) { + fl.oif = dev->ifindex; + dev_put(dev); + + ret = ip_route_output_key(&rt, &fl); + if (ret == 0) + goto found; + /* Fall back to using any local device */ + fl.oif = 0; + } ret = ip_route_output_key(&rt, &fl); if (ret) goto out; +found: ; + /* If the device does ARP internally, return 'done' */ if (rt->idev->dev->flags & IFF_NOARP) { rdma_copy_addr(addr, rt->idev->dev, NULL); [root at anise ~]# route -n Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 192.168.10.0 0.0.0.0 255.255.255.0 U 0 0 0 ib3 192.168.10.0 0.0.0.0 255.255.255.0 U 0 0 0 ib2 [root at anise ~]# ip addr show ib2 11: ib2: mtu 2044 qdisc pfifo_fast qlen 256 link/infiniband 80:56:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c1 brd inet 192.168.10.60/24 brd 192.168.10.255 scope global ib2 inet6 fe80::202:c903:3:17c1/64 scope link [root at anise ~]# ip addr show ib3 12: ib3: mtu 2044 qdisc pfifo_fast qlen 256 link/infiniband
80:56:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c2 brd inet 192.168.10.61/24 brd 192.168.10.255 scope global ib3 inet6 fe80::202:c903:3:17c2/64 scope link [root at anise ~]# ping -I 192.168.10.60 192.168.10.89 2 packets transmitted, 2 received, 0% packet loss, time 999ms [root at anise ~]# ping -I 192.168.10.61 192.168.10.89 3 packets transmitted, 3 received, 0% packet loss, time 1999ms [root at anise ~]# ip n s 192.168.10.89 dev ib3 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE 192.168.10.89 dev ib2 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE [root at anise ~]# rds-ping -I 192.168.10.60 192.168.10.89 3: 33 usec [root at anise ~]# rds-ping -I 192.168.10.61 192.168.10.89 3: 33 usec [root at anise ~]# rds-info -I RDS IB Connections: LocalAddr RemoteAddr LocalDev RemoteDev 192.168.10.61 192.168.10.89 fe80::2:c903:3:17c2 fe80::2:c902:22:efe5 192.168.10.60 192.168.10.89 fe80::2:c903:3:17c2 fe80::2:c902:22:efe5 From ogerlitz at voltaire.com Thu Feb 5 04:03:42 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 05 Feb 2009 14:03:42 +0200 Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for bind In-Reply-To: References: Message-ID: <498AD59E.4030003@voltaire.com> Or Gerlitz wrote: > Rick Frank who brought this to my attention, also handed me this patch > which is claimed to workaround this issue, > --- ofa_kernel-1.3.1.orig/drivers/infiniband/core/addr.c > +++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c > @@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so > struct flowi fl; > struct rtable *rt; > struct neighbour *neigh; > + struct net_device *dev; > int ret; > > memset(&fl, 0, sizeof fl); > fl.nl_u.ip4_u.daddr = dst_ip; > fl.nl_u.ip4_u.saddr = src_ip; > + > + if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) { > + fl.oif = dev->ifindex; > + dev_put(dev); > + > + ret = ip_route_output_key(&rt, &fl); > + if (ret == 0) > + goto found; I assume the trick here is to 
somehow enforce the interface returned by ip_dev_find rather than the one resolved by the routing table. At least as I understand the addr.c code, it takes the interface later from neigh->dev, correct?

Or.

> + /* Fall back to using any local device */
> + fl.oif = 0;
> + }
> ret = ip_route_output_key(&rt, &fl);
> if (ret)
> goto out;
>
> +found: ;
> +
> /* If the device does ARP internally, return 'done' */
> if (rt->idev->dev->flags & IFF_NOARP) {
> rdma_copy_addr(addr, rt->idev->dev, NULL);

From sashak at voltaire.com Thu Feb 5 04:16:34 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 5 Feb 2009 14:16:34 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet configuration after SIGHUP
In-Reply-To: <498A9888.5010003@gmail.com>
References: <497DC87F.2090308@gmail.com> <497DC96F.3000902@gmail.com> <20090202205924.GF5910@sashak.voltaire.com> <49880E4D.2090107@gmail.com> <20090203124407.GE11874@sashak.voltaire.com> <49884962.5070601@gmail.com> <20090203134831.GI11874@sashak.voltaire.com> <498850A2.8090701@gmail.com> <20090205000323.GN11874@sashak.voltaire.com> <498A9888.5010003@gmail.com>
Message-ID: <20090205121634.GQ11874@sashak.voltaire.com>

On 09:43 Thu 05 Feb , Eli Dorfman (Voltaire) wrote:
>
> ok. Please apply the fixed patch.

Did you test it?

Sasha

From Alexr at voltaire.com Wed Feb 4 21:21:07 2009
From: Alexr at voltaire.com (Alex Rosenbaum)
Date: Thu, 5 Feb 2009 07:21:07 +0200
Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down
In-Reply-To: 
References: <49893FAF.3090007@voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> <15ddcffd0902041352u5a7acaedl8b9485769cc90e7@mail.gmail.com>
Message-ID: <39C75744D164D948A170E9792AF8E7CA01F19812@exil.voltaire.com>

>- I'm not quite following this yet. Are you wanting a list of IP addresses that map to RDMA devices?

Consider the case where the user specifies the IP address of a local interface it wants to work with.
The application does not know whether that IP address maps to an rdma-cm capable device (IB or iWARP) or not (e.g. 1GigE). In the current implementation, if the IB port is down (e.g. the cable is unplugged) but the interface is up, rdma_bind_addr fails. The same happens if rdma_bind_addr is called with the IP address of the 1GigE interface. So the application cannot tell whether the failure is due to binding to a 1GigE device, which is not rdma-cm capable, or to an rdma-cm capable device that is temporarily in a 'bad' state. Assuming this is an rdma-cm capable device in a 'bad' state, the user space application can wait for async ibv events (PORT_ACTIVE) from the device. Once the device is active again it can retry the rdma_create_qp or rdma_join_mc.

Alex

From nicolas.morey-chaisemartin at ext.bull.net Thu Feb 5 06:34:24 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Thu, 05 Feb 2009 15:34:24 +0100
Subject: [ofa-general] [ibsim][PATCH] Socket name can be forced by exporting IBSIM_SOCKNAME before starting ibsim and/or preloading umad2sim so multiple simulator can run on the same system at the same time
Message-ID: <498AF8F0.2080707@ext.bull.net>

As we do a lot of routing tests with ibsim, we needed to be able to launch multiple simulators on the same system.
With this patch, ibsim (and umad2sim) will try to read the socket basename using getenv("IBSIM_SOCKNAME"), which makes this possible.
If IBSIM_SOCKNAME is not set, SIM_BASENAME is still used.

Signed-off-by: Nicolas Morey-Chaisemartin
---
 ibsim/ibsim.c         |   10 ++++++++--
 umad2sim/sim_client.c |   14 +++++++++-----
 2 files changed, 17 insertions(+), 7 deletions(-)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 51201d225702489862648d1380d84c1570c11c71.diff
Type: text/x-patch
Size: 3286 bytes
Desc: not available
URL: 

From dorfman.eli at gmail.com Thu Feb 5 07:00:19 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Thu, 05 Feb 2009 17:00:19 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet.c enable log_max_size opt update
Message-ID: <498AFF03.7090903@gmail.com>

enable log_max_size opt update

Signed-off-by: Eli Dorfman
---
 opensm/opensm/osm_subnet.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index f589180..d6d39a6 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -132,7 +132,7 @@ static const opt_rec_t opt_tbl[] = {
 	{ "connect_roots", OPT_OFFSET(connect_roots), opts_parse_boolean, NULL, 1 },
 	{ "use_ucast_cache", OPT_OFFSET(use_ucast_cache), opts_parse_boolean, NULL, 1 },
 	{ "log_file", OPT_OFFSET(log_file), opts_parse_charp, NULL, 0 },
-	{ "log_max_size", OPT_OFFSET(log_max_size), opts_parse_uint32, opts_setup_log_max_size },
+	{ "log_max_size", OPT_OFFSET(log_max_size), opts_parse_uint32, opts_setup_log_max_size, 1 },
 	{ "log_flags", OPT_OFFSET(log_flags), opts_parse_uint8, opts_setup_log_flags, 1 },
 	{ "force_log_flush", OPT_OFFSET(force_log_flush), opts_parse_boolean, opts_setup_force_log_flush, 1 },
 	{ "accum_log_file", OPT_OFFSET(accum_log_file), opts_parse_boolean, opts_setup_accum_log_file, 1 },
-- 
1.5.5

From dorfman.eli at gmail.com Thu Feb 5 07:19:41 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Thu, 05 Feb 2009 17:19:41 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet.c fix parse functions for big endian machines
Message-ID: <498B038D.4020009@gmail.com>

fix parse functions for big endian machines

Signed-off-by: Eli Dorfman
---
 opensm/opensm/osm_subnet.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index d6d39a6..7b33659 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -710,14 +710,14 @@ opts_parse_net16(IN osm_subn_t *p_subn, IN void *p_v, IN setup_fn_t pfn) { uint16_t *p_val = p_v; - uint32_t val = strtoul(p_val_str, NULL, 0); + uint16_t val = strtoul(p_val_str, NULL, 0); CL_ASSERT(val < 0x10000); - if (cl_hton32(val) != *p_val) { + if (cl_hton16(val) != *p_val) { log_config_value(p_key, "0x%04x", val); if (pfn) pfn(p_subn, &val); - *p_val = cl_hton16((uint16_t) val); + *p_val = cl_hton16(val); } } @@ -729,14 +729,14 @@ opts_parse_uint8(IN osm_subn_t *p_subn, IN void *p_v, IN setup_fn_t pfn) { uint8_t *p_val = p_v; - uint32_t val = strtoul(p_val_str, NULL, 0); + uint8_t val = strtoul(p_val_str, NULL, 0); CL_ASSERT(val < 0x100); if (val != *p_val) { log_config_value(p_key, "%u", val); if (pfn) pfn(p_subn, &val); - *p_val = (uint8_t) val; + *p_val = val; } } -- 1.5.5 From chien.tin.tung at intel.com Thu Feb 5 07:21:06 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Thu, 5 Feb 2009 09:21:06 -0600 Subject: [ofa-general] [PATCH] RDMA/nes: tmp_addr compilation warning Message-ID: <20090205152106.GA2304@ctung-MOBL> As reported by Manish Katiyar , tmp_addr is causing a compilation warning when INFINIBAND_NES_DEBUG is not defined. tmp_addr is used in a NES_DEBUG and the print does not make sense. Taking out tmp_addr and the NES_DEBUG. 
Signed-off-by: Chien Tung
---

diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 6f42ab6..bd918df 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -778,14 +778,10 @@ static struct nes_cm_node *find_node(struct nes_cm_core *cm_core,
 	unsigned long flags;
 	struct list_head *hte;
 	struct nes_cm_node *cm_node;
-	__be32 tmp_addr = cpu_to_be32(loc_addr);
 
 	/* get a handle on the hte */
 	hte = &cm_core->connected_nodes;
 
-	nes_debug(NES_DBG_CM, "Searching for an owner node: %pI4:%x from core %p->%p\n",
-		  &tmp_addr, loc_port, cm_core, hte);
-
 	/* walk list and find cm_node associated with this session ID */
 	spin_lock_irqsave(&cm_core->ht_lock, flags);
 	list_for_each_entry(cm_node, hte, list) {
-- 
1.5.3.3

From richard.frank at oracle.com Thu Feb 5 07:23:53 2009
From: richard.frank at oracle.com (Richard Frank)
Date: Thu, 5 Feb 2009 10:23:53 -0500
Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for bind
References: 
Message-ID: 

FWIW - I tested with this patch to rdma_resolve_ip - and found no difference in behavior. At this point I do not think the addr.c patch resolves this... at one point we had two overlapping patches - both possibly solving the same problem... now that rds is explicitly binding to an IP, the resolve_ip patch appears not to be needed.

The original problem is that we were not getting to either the HCA or the port associated with an IP - even in a dual HCA configuration. Now that rds is explicitly binding we do get the correct HCA (based on Or's tests); however, we really want to resolve down to the port backing the IP.
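The open question in this thread is which interface the route lookup picks when two equal routes cover the same subnet. The following is a toy user-space model, not kernel code. The structs and the helpers route_lookup, dev_find and resolve_oif are hypothetical stand-ins for the FIB lookup, ip_dev_find() and the patched addr_resolve_remote(). It assumes, as the route -n and rds-info output in this thread suggest, that an unconstrained lookup returns the first matching route (ib3) whatever the source address. It then shows how pinning the output interface to the device that owns the source IP, with a fallback, changes the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy routing entry (hypothetical; not the kernel's fib structures). */
struct toy_route {
	uint32_t dst;	/* network prefix, host byte order */
	uint32_t mask;	/* netmask, host byte order */
	int ifindex;	/* outgoing interface */
};

/* Toy local address entry: which interface owns which IPv4 address. */
struct local_addr {
	uint32_t addr;
	int ifindex;
};

/*
 * Return the ifindex of the first route matching dst, or -1 if none.
 * If oif != 0, only routes through that interface are considered,
 * loosely mirroring what setting fl.oif does in the patch.
 */
int route_lookup(const struct toy_route *tbl, size_t n, uint32_t dst, int oif)
{
	for (size_t i = 0; i < n; i++) {
		if ((dst & tbl[i].mask) != tbl[i].dst)
			continue;
		if (oif != 0 && tbl[i].ifindex != oif)
			continue;
		return tbl[i].ifindex;
	}
	return -1;
}

/* Stand-in for ip_dev_find(): ifindex that owns src, or 0 if not local. */
int dev_find(const struct local_addr *addrs, size_t n, uint32_t src)
{
	for (size_t i = 0; i < n; i++)
		if (addrs[i].addr == src)
			return addrs[i].ifindex;
	return 0;
}

/*
 * Patched behaviour: first try a lookup pinned to the interface that owns
 * src; if src is 0, not local, or the pinned lookup fails, fall back to
 * an unconstrained lookup.
 */
int resolve_oif(const struct toy_route *tbl, size_t nr,
		const struct local_addr *addrs, size_t na,
		uint32_t src, uint32_t dst)
{
	int oif;

	assert(tbl != NULL && addrs != NULL);
	oif = src ? dev_find(addrs, na, src) : 0;
	if (oif != 0) {
		int r = route_lookup(tbl, nr, dst, oif);
		if (r >= 0)
			return r;
	}
	return route_lookup(tbl, nr, dst, 0);
}
```

With the table from the thread (ib3, ifindex 12, listed before ib2, ifindex 11), an unconstrained lookup for 192.168.10.89 yields ib3 for both sources, while resolve_oif() yields ib2 for 192.168.10.60 and ib3 for 192.168.10.61. Whether the real kernel ignores the source address this way also depends on policy routing rules, so treat this as an illustration of the patch's intent, not of ip_route_output_key() itself.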
----- Original Message ----- From: "Or Gerlitz" To: "Sean Hefty" Cc: ; ; "Richard Frank" Sent: Thursday, February 05, 2009 6:44 AM Subject: Re: pick the outgoing HCA based on the IP used for bind > Hi Sean, > > It seems that even when the rdma-cm consumer binds to a specific address, > the rdma-cm address resolution code follows the order of the devices/rules > in routing table. So the user can't really dictate an outgoing interface > based on the src address provided to rdma_resolve_addr. This problem seem > to > happen even if the user first called rdma_bind_addr, so its either same > issue or that rdma_resolve_addr somehow stepping on the device/port > "resolved" by rdma_bind_addr. > > Consider this system, with two IPoIB intefaces on the same IP subnet using > the same HCA, each on a different port. The first match for > 192.168.10.0/24 > would be ib3. Now I issue a ping with the -I flag, to have the ICMP socket > bind to a diffrent interface. First, I see that two neighbours has been > created, each on a different interface, and second from sampling the > interface > packet counters (not brought here) I see that each ping uses the correct > interface. > > Repeating the same test with rds-ping -I (rds-ping is a user space utility > provided > by the rds-devel package, sending packets through the rds kernel driver) - > I can see > that the two rds rdma-cm ids (rds would have two connections in that case) > is using > the same port, the one corresponding to ib3, the first routing match. > Below is some info on my system. > > Or, when running with multiple HCAs on Linux - we run into an problem with > RDS - in that > rdma_resolve_addr does not pick the outgoing NIC based on the IP we bind > to.. it seems > to always be using the destination IP. > > We put this patch together - which solves the problem on Linux... note > that this is > behavior only fails on Linux - it works correctly on HPUX...as an example. 
> > Do you see a problem with proposing that this patch be picked up by OFED ? > > Rick Frank who brought this to my attention, also handed me this patch > which is claimed to workaround this issue, its badly formatted and I > couldn't really understand what it does. I hoped to be able and reproduce > this with rping or ucmatose, but neither allow me to specify a -I address > to the client side, and I don't have the time now for this enhancement. > > --- ofa_kernel-1.3.1.orig/drivers/infiniband/core/addr.c > +++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c > @@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so > struct flowi fl; > struct rtable *rt; > struct neighbour *neigh; > + struct net_device *dev; > int ret; > > memset(&fl, 0, sizeof fl); > fl.nl_u.ip4_u.daddr = dst_ip; > fl.nl_u.ip4_u.saddr = src_ip; > + > + if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) { > + fl.oif = dev->ifindex; > + dev_put(dev); > + > + ret = ip_route_output_key(&rt, &fl); > + if (ret == 0) > + goto found; > + /* Fall back to using any local device */ > + fl.oif = 0; > + } > ret = ip_route_output_key(&rt, &fl); > if (ret) > goto out; > > +found: ; > + > /* If the device does ARP internally, return 'done' */ > if (rt->idev->dev->flags & IFF_NOARP) { > rdma_copy_addr(addr, rt->idev->dev, NULL); > > > > > [root at anise ~]# route -n > Kernel IP routing table > Destination Gateway Genmask Flags Metric Ref Use > Iface > 192.168.10.0 0.0.0.0 255.255.255.0 U 0 0 0 > ib3 > 192.168.10.0 0.0.0.0 255.255.255.0 U 0 0 0 > ib2 > > > [root at anise ~]# ip addr show ib2 > 11: ib2: mtu 2044 qdisc pfifo_fast qlen > 256 > link/infiniband > 80:56:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c1 brd > inet 192.168.10.60/24 brd 192.168.10.255 scope global ib2 > inet6 fe80::202:c903:3:17c1/64 scope link > > [root at anise ~]# ip addr show ib3 > 12: ib3: mtu 2044 qdisc pfifo_fast qlen > 256 > link/infiniband > 80:56:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c2 brd > inet 
192.168.10.61/24 brd 192.168.10.255 scope global ib3
>     inet6 fe80::202:c903:3:17c2/64 scope link
>
> [root at anise ~]# ping -I 192.168.10.60 192.168.10.89
> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>
> [root at anise ~]# ping -I 192.168.10.61 192.168.10.89
> 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
>
> [root at anise ~]# ip n s
> 192.168.10.89 dev ib3 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE
> 192.168.10.89 dev ib2 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE
>
> [root at anise ~]# rds-ping -I 192.168.10.60 192.168.10.89
> 3: 33 usec
>
> [root at anise ~]# rds-ping -I 192.168.10.61 192.168.10.89
> 3: 33 usec
>
> [root at anise ~]# rds-info -I
> RDS IB Connections:
> LocalAddr      RemoteAddr     LocalDev             RemoteDev
> 192.168.10.61  192.168.10.89  fe80::2:c903:3:17c2  fe80::2:c902:22:efe5
> 192.168.10.60  192.168.10.89  fe80::2:c903:3:17c2  fe80::2:c902:22:efe5
>

From PHF at zurich.ibm.com Thu Feb 5 08:22:50 2009
From: PHF at zurich.ibm.com (Philip Frey1)
Date: Thu, 5 Feb 2009 17:22:50 +0100
Subject: [ofa-general] Chelsio T3: Aggregate Throughput
Message-ID: 

Hello,

we are currently looking into the scalability of the T3 in terms of connections. We are using a 1-to-n scenario where one server holds a chunk of data and n clients fetch this chunk over and over again using RDMA reads (each 1MB in size). The clients pace their reads so that each gets an average data rate of about 9Mbps. Every second we connect a new client to the server and see how far it goes.

What puzzles us now is that after about 800 clients, they no longer seem to receive much data. The first interesting thing is that the aggregate throughput actually drops (we expected it to plateau). The second interesting thing is that it does so already at about 6.3Gbps, which is just a bit more than half of what the card can do.
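A quick check of the figures in the report, assuming each client really sustains its 9 Mbps target and taking the T3 as a 10 Gb/s adapter (the message itself only says "a bit more than half of what the card can do"):

```latex
\begin{aligned}
800 \times 9\ \text{Mb/s} &= 7200\ \text{Mb/s} = 7.2\ \text{Gb/s}
  && \text{(offered load at 800 clients)}\\
\frac{6.3\ \text{Gb/s}}{9\ \text{Mb/s}} &= \frac{6300\ \text{Mb/s}}{9\ \text{Mb/s}} = 700
  && \text{(clients served at full rate at the peak)}\\
\frac{6.3\ \text{Gb/s}}{10\ \text{Gb/s}} &= 63\,\%
  && \text{(fraction of an assumed 10 Gb/s line rate)}
\end{aligned}
```

Under these assumptions the delivered throughput stops keeping up with demand at roughly 700 clients, somewhat before the 800-client point where the clients visibly starve, and well below line rate.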
We do not see this behavior when far fewer clients RDMA-read the data at a much higher rate. Is there any limitation on the RNIC that would explain this?

(Setup: T3 RNICs, OFED-1.4, 2.6.26 kernel, MTU=9000)

Many thanks for your advice,
Philip

--
Philip Frey
IBM Zurich Research Laboratory
Saumerstrasse 4 | Phone: +41 44 724 8613
CH-8803 Rueschlikon/Switzerland | Email: phf at zurich.ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sean.hefty at intel.com Thu Feb 5 09:22:00 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Feb 2009 09:22:00 -0800
Subject: [ofa-general] RDMA transfers: Buffer status communications?
In-Reply-To: <498AB967.9010108@motama.com>
References: <498AB967.9010108@motama.com>
Message-ID: <748E9553FCD94EA498B89E46765F309D@amr.corp.intel.com>

>Is my understanding of the mechanisms correct? Since locking and unlocking of
>data receiving buffers is a standard use case in most transport strategies, I
>wanted to ask if there's a more elegant way to manage this using the Infiniband
>architecture? Like for example delaying the sender side work completion till
>the buffer has been processed by the receiver?

Application-level acks are needed to indicate when processing is complete. The hardware cannot determine this, so I don't know of any solution that's more elegant in the general case.

- Sean

From sean.hefty at intel.com Thu Feb 5 09:28:55 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Feb 2009 09:28:55 -0800
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used for bind
In-Reply-To: 
References: 
Message-ID: 

>Rick Frank who brought this to my attention, also handed me this patch
>which is claimed to workaround this issue, its badly formatted and I
>couldn't really understand what it does.
I hoped to be able and reproduce >this with rping or ucmatose, but neither allow me to specify a -I address >to the client side, and I don't have the time now for this enhancement. ucmatose allows binding to a specific address using -b. I haven't used rds-ping to know if it's the same as -I in that case. I don't have any systems myself with dual HCAs; I don't think they have enough slots to support more than one. - Sean From sashak at voltaire.com Thu Feb 5 09:44:49 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 5 Feb 2009 19:44:49 +0200 Subject: [ofa-general] Re: [ibsim][PATCH] Socket name can be forced by exporting IBSIM_SOCKNAME before starting ibsim and/or preloading umad2sim so multiple simulator can run on the same system at the same time In-Reply-To: <498AF8F0.2080707@ext.bull.net> References: <498AF8F0.2080707@ext.bull.net> Message-ID: <20090205174449.GH5910@sashak.voltaire.com> On 15:34 Thu 05 Feb , Nicolas Morey Chaisemartin wrote: > As we do a lot of routing tests with ibsim we had the need to be able to > launch multiple simulator on the same system. > With this patch, ibsim (and umad2sim) will try to read the socket basename > using a getenv("IBSIM_SOCKNAME") which makes it possible. > If IBSIM_SOCKNAME is not set, SIM_BASENAME is still used. > > > Signed-off-by: Nicolas Morey-Chaisemartin > Applied. Thanks. I just changed 'socket_basename' to be static in both ibsim and umad2sim. Sasha From sashak at voltaire.com Thu Feb 5 09:46:38 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 5 Feb 2009 19:46:38 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c enable log_max_size opt update In-Reply-To: <498AFF03.7090903@gmail.com> References: <498AFF03.7090903@gmail.com> Message-ID: <20090205174638.GI5910@sashak.voltaire.com> On 17:00 Thu 05 Feb , Eli Dorfman (Voltaire) wrote: > enable log_max_size opt update > > Signed-off-by: Eli Dorfman Applied. Thanks. 
Sasha From sashak at voltaire.com Thu Feb 5 10:04:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 5 Feb 2009 20:04:00 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c fix parse functions for big endian machines In-Reply-To: <498B038D.4020009@gmail.com> References: <498B038D.4020009@gmail.com> Message-ID: <20090205180400.GJ5910@sashak.voltaire.com> On 17:19 Thu 05 Feb , Eli Dorfman (Voltaire) wrote: > fix parse functions for big endian machines > > Signed-off-by: Eli Dorfman Applied. Thanks. I'm fine with this patch - the code looks cleaner than it was before. But could you please explain what was a problem with original code on big endian machines (I don't see)? Also it would be helpful to have more detailed patch comments. Sasha > --- > opensm/opensm/osm_subnet.c | 10 +++++----- > 1 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index d6d39a6..7b33659 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -710,14 +710,14 @@ opts_parse_net16(IN osm_subn_t *p_subn, > IN void *p_v, IN setup_fn_t pfn) > { > uint16_t *p_val = p_v; > - uint32_t val = strtoul(p_val_str, NULL, 0); > + uint16_t val = strtoul(p_val_str, NULL, 0); > > CL_ASSERT(val < 0x10000); > - if (cl_hton32(val) != *p_val) { > + if (cl_hton16(val) != *p_val) { > log_config_value(p_key, "0x%04x", val); > if (pfn) > pfn(p_subn, &val); > - *p_val = cl_hton16((uint16_t) val); > + *p_val = cl_hton16(val); > } > } > > @@ -729,14 +729,14 @@ opts_parse_uint8(IN osm_subn_t *p_subn, > IN void *p_v, IN setup_fn_t pfn) > { > uint8_t *p_val = p_v; > - uint32_t val = strtoul(p_val_str, NULL, 0); > + uint8_t val = strtoul(p_val_str, NULL, 0); > > CL_ASSERT(val < 0x100); > if (val != *p_val) { > log_config_value(p_key, "%u", val); > if (pfn) > pfn(p_subn, &val); > - *p_val = (uint8_t) val; > + *p_val = val; > } > } > > -- > 1.5.5 > From jgunthorpe at obsidianresearch.com Thu Feb 5 10:02:03 
2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 5 Feb 2009 11:02:03 -0700 Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for bind In-Reply-To: <498AD59E.4030003@voltaire.com> References: <498AD59E.4030003@voltaire.com> Message-ID: <20090205180203.GD3288@obsidianresearch.com> On Thu, Feb 05, 2009 at 02:03:42PM +0200, Or Gerlitz wrote: > Or Gerlitz wrote: > >Rick Frank who brought this to my attention, also handed me this patch > >which is claimed to workaround this issue, > >+++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c > >@@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so > > struct flowi fl; > > struct rtable *rt; > > struct neighbour *neigh; > >+ struct net_device *dev; > > int ret; > > > > memset(&fl, 0, sizeof fl); > > fl.nl_u.ip4_u.daddr = dst_ip; > > fl.nl_u.ip4_u.saddr = src_ip; > >+ > >+ if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) { > >+ fl.oif = dev->ifindex; > >+ dev_put(dev); > >+ > >+ ret = ip_route_output_key(&rt, &fl); > >+ if (ret == 0) > >+ goto found; > I assume the trick here is to somehow enforce the interface returned by > ip_dev_find and not the one resolved by the routing table. At least as I > understand the addr.c code, it takes the interface later from neigh->dev > , correct? That does seem to be what it is doing, but I can't see how that is correct? The output interface is selected by the routing table, except in very special cases (ie SO_BINDTODEVICE). Why doesn't the original code work? It passes src_ip into the route lookup which should be good enough.. Does 'ip route get from ' return the right thing? 
Jason From weiny2 at llnl.gov Thu Feb 5 10:03:31 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 5 Feb 2009 10:03:31 -0800 Subject: [ofa-general] [PATCH] libibmad: Use enum types for function parameters (WAS) Declare some enums as typedefs for cleaner function interfaces In-Reply-To: <20090204103054.177aa6e2.weiny2@llnl.gov> References: <20090202185425.729a80b3.weiny2@llnl.gov> <20090204181421.GV11874@sashak.voltaire.com> <20090204182023.GP7618@obsidianresearch.com> <20090204182725.GX11874@sashak.voltaire.com> <20090204103054.177aa6e2.weiny2@llnl.gov> Message-ID: <20090205100331.5ab5de76.weiny2@llnl.gov> Sasha, On Wed, 4 Feb 2009 10:30:54 -0800 Ira Weiny wrote: > On Wed, 4 Feb 2009 20:27:25 +0200 > Sasha Khapyorsky wrote: > > > On 11:20 Wed 04 Feb , Jason Gunthorpe wrote: > > > On Wed, Feb 04, 2009 at 08:14:21PM +0200, Sasha Khapyorsky wrote: > > > > > > > I don't understand how enum typedefing makes things cleaner - actually > > > > this will enforce me explicitly to verify an actual type in header > > > > files. Sometimes typedefs could help with porting, but it is not the > > > > case here. > > > > > > Not typedefing per say, but passing an enum through an int is not that > > > great. You don't need the typedefs to do this, just 'enum MAD_FIELDS' > > > for instance will do. > > > > Yes, that would be fine to do. > > I will redo the patch with 'enum MAD_FIELDS'. 
> Patch below, Ira >From 3a52d32d7c6964a8078402c3712a58d1e43975de Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Mon, 2 Feb 2009 10:21:18 -0800 Subject: [PATCH] Use enum types for function parameters Signed-off-by: weiny2 at llnl.gov --- libibmad/include/infiniband/mad.h | 30 +++++++++++++++--------------- libibmad/src/fields.c | 22 +++++++++++----------- libibmad/src/resolve.c | 6 +++--- 3 files changed, 29 insertions(+), 29 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 9ff4a3e..33a233c 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -595,14 +595,14 @@ typedef struct ib_vendor_call { #define MAD_DEF_RETRIES 3 #define MAD_DEF_TIMEOUT_MS 1000 -enum { +enum MAD_DEST { IB_DEST_LID, IB_DEST_DRPATH, IB_DEST_GUID, IB_DEST_DRSLID, }; -enum { +enum MAD_NODE_TYPE { IB_NODE_CA = 1, IB_NODE_SWITCH, IB_NODE_ROUTER, @@ -631,20 +631,20 @@ static inline int ib_portid_set(ib_portid_t * portid, int lid, int qp, int qkey) } /* fields.c */ -MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field); -MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field, +MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, enum MAD_FIELDS field); +MAD_EXPORT void mad_set_field(void *buf, int base_offs, enum MAD_FIELDS field, uint32_t val); /* field must be byte aligned */ -MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field); -MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field, +MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, enum MAD_FIELDS field); +MAD_EXPORT void mad_set_field64(void *buf, int base_offs, enum MAD_FIELDS field, uint64_t val); -MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val); -MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val); -MAD_EXPORT void mad_decode_field(uint8_t * buf, int field, void *val); -MAD_EXPORT void mad_encode_field(uint8_t 
* buf, int field, void *val); -MAD_EXPORT int mad_print_field(int field, const char *name, void *val); -MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val); -MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val); +MAD_EXPORT void mad_set_array(void *buf, int base_offs, enum MAD_FIELDS field, void *val); +MAD_EXPORT void mad_get_array(void *buf, int base_offs, enum MAD_FIELDS field, void *val); +MAD_EXPORT void mad_decode_field(uint8_t * buf, enum MAD_FIELDS field, void *val); +MAD_EXPORT void mad_encode_field(uint8_t * buf, enum MAD_FIELDS field, void *val); +MAD_EXPORT int mad_print_field(enum MAD_FIELDS field, const char *name, void *val); +MAD_EXPORT char *mad_dump_field(enum MAD_FIELDS field, char *buf, int bufsz, void *val); +MAD_EXPORT char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val); /* mad.c */ MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, @@ -729,7 +729,7 @@ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout); MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid, ib_portid_t * sm_id, int timeout); MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, - int dest_type, ib_portid_t * sm_id); + enum MAD_DEST dest, ib_portid_t * sm_id); MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid); @@ -737,7 +737,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, ib_portid_t * sm_id, int timeout, const void *srcport); int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, - int dest_type, ib_portid_t * sm_id, + enum MAD_DEST dest, ib_portid_t * sm_id, const void *srcport); int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, const void *srcport); diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index d5a1eb4..588c57f 100644 --- 
a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -479,37 +479,37 @@ static void _get_array(void *buf, int base_offs, const ib_field_t * f, memcpy(val, (uint8_t *) buf + base_offs + bitoffs / 8, f->bitlen / 8); } -uint32_t mad_get_field(void *buf, int base_offs, int field) +uint32_t mad_get_field(void *buf, int base_offs, enum MAD_FIELDS field) { return _get_field(buf, base_offs, ib_mad_f + field); } -void mad_set_field(void *buf, int base_offs, int field, uint32_t val) +void mad_set_field(void *buf, int base_offs, enum MAD_FIELDS field, uint32_t val) { _set_field(buf, base_offs, ib_mad_f + field, val); } -uint64_t mad_get_field64(void *buf, int base_offs, int field) +uint64_t mad_get_field64(void *buf, int base_offs, enum MAD_FIELDS field) { return _get_field64(buf, base_offs, ib_mad_f + field); } -void mad_set_field64(void *buf, int base_offs, int field, uint64_t val) +void mad_set_field64(void *buf, int base_offs, enum MAD_FIELDS field, uint64_t val) { _set_field64(buf, base_offs, ib_mad_f + field, val); } -void mad_set_array(void *buf, int base_offs, int field, void *val) +void mad_set_array(void *buf, int base_offs, enum MAD_FIELDS field, void *val) { _set_array(buf, base_offs, ib_mad_f + field, val); } -void mad_get_array(void *buf, int base_offs, int field, void *val) +void mad_get_array(void *buf, int base_offs, enum MAD_FIELDS field, void *val) { _get_array(buf, base_offs, ib_mad_f + field, val); } -void mad_decode_field(uint8_t * buf, int field, void *val) +void mad_decode_field(uint8_t * buf, enum MAD_FIELDS field, void *val) { const ib_field_t *f = ib_mad_f + field; @@ -528,7 +528,7 @@ void mad_decode_field(uint8_t * buf, int field, void *val) _get_array(buf, 0, f, val); } -void mad_encode_field(uint8_t * buf, int field, void *val) +void mad_encode_field(uint8_t * buf, enum MAD_FIELDS field, void *val) { const ib_field_t *f = ib_mad_f + field; @@ -602,21 +602,21 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val, valsz 
? valsz : ALIGN(f->bitlen, 8) / 8); } -int mad_print_field(int field, const char *name, void *val) +int mad_print_field(enum MAD_FIELDS field, const char *name, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return -1; return _mad_print_field(ib_mad_f + field, name, val, 0); } -char *mad_dump_field(int field, char *buf, int bufsz, void *val) +char *mad_dump_field(enum MAD_FIELDS field, char *buf, int bufsz, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return 0; return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val); } -char *mad_dump_val(int field, char *buf, int bufsz, void *val) +char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return 0; diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index b62360b..553949d 100644 --- a/libibmad/src/resolve.c +++ b/libibmad/src/resolve.c @@ -92,7 +92,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, } int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, - int dest_type, ib_portid_t * sm_id, + enum MAD_DEST dest_type, ib_portid_t * sm_id, const void *srcport) { uint64_t guid; @@ -142,8 +142,8 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, return -1; } -int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, int dest_type, - ib_portid_t * sm_id) +int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, + enum MAD_DEST dest_type, ib_portid_t * sm_id) { return ib_resolve_portid_str_via(portid, addr_str, dest_type, sm_id, NULL); -- 1.5.4.5 From sashak at voltaire.com Thu Feb 5 10:21:28 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 5 Feb 2009 20:21:28 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/common: use enum MAD_DEST as ibd_dest_type type In-Reply-To: <20090205100331.5ab5de76.weiny2@llnl.gov> References: <20090202185425.729a80b3.weiny2@llnl.gov> <20090204181421.GV11874@sashak.voltaire.com> 
<20090204182023.GP7618@obsidianresearch.com> <20090204182725.GX11874@sashak.voltaire.com> <20090204103054.177aa6e2.weiny2@llnl.gov> <20090205100331.5ab5de76.weiny2@llnl.gov> Message-ID: <20090205182128.GK5910@sashak.voltaire.com> Use introduced 'enum MAD_DEST' as type of ibd_dest_type variable. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/include/ibdiag_common.h | 2 +- infiniband-diags/src/ibdiag_common.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h index b92aa4d..4783b8e 100644 --- a/infiniband-diags/include/ibdiag_common.h +++ b/infiniband-diags/include/ibdiag_common.h @@ -41,7 +41,7 @@ extern int ibdebug; extern int ibverbose; extern char *ibd_ca; extern int ibd_ca_port; -extern int ibd_dest_type; +extern enum MAD_DEST ibd_dest_type; extern ib_portid_t *ibd_sm_id; extern int ibd_timeout; diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index 7d6e772..bda1efa 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -57,7 +57,7 @@ int ibdebug; int ibverbose; char *ibd_ca; int ibd_ca_port; -int ibd_dest_type = IB_DEST_LID; +enum MAD_DEST ibd_dest_type = IB_DEST_LID; ib_portid_t *ibd_sm_id; int ibd_timeout; -- 1.6.1.rc1.45.g123ed From sashak at voltaire.com Thu Feb 5 10:21:49 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 5 Feb 2009 20:21:49 +0200 Subject: [ofa-general] Re: [PATCH] libibmad: Use enum types for function parameters (WAS) Declare some enums as typedefs for cleaner function interfaces In-Reply-To: <20090205100331.5ab5de76.weiny2@llnl.gov> References: <20090202185425.729a80b3.weiny2@llnl.gov> <20090204181421.GV11874@sashak.voltaire.com> <20090204182023.GP7618@obsidianresearch.com> <20090204182725.GX11874@sashak.voltaire.com> <20090204103054.177aa6e2.weiny2@llnl.gov> <20090205100331.5ab5de76.weiny2@llnl.gov> Message-ID: 
<20090205182149.GL5910@sashak.voltaire.com> On 10:03 Thu 05 Feb , Ira Weiny wrote: > Sasha, > > On Wed, 4 Feb 2009 10:30:54 -0800 > Ira Weiny wrote: > > > On Wed, 4 Feb 2009 20:27:25 +0200 > > Sasha Khapyorsky wrote: > > > > > On 11:20 Wed 04 Feb , Jason Gunthorpe wrote: > > > > On Wed, Feb 04, 2009 at 08:14:21PM +0200, Sasha Khapyorsky wrote: > > > > > > > > > I don't understand how enum typedefing makes things cleaner - actually > > > > > this will enforce me explicitly to verify an actual type in header > > > > > files. Sometimes typedefs could help with porting, but it is not the > > > > > case here. > > > > > > > > Not typedefing per say, but passing an enum through an int is not that > > > > great. You don't need the typedefs to do this, just 'enum MAD_FIELDS' > > > > for instance will do. > > > > > > Yes, that would be fine to do. > > > > I will redo the patch with 'enum MAD_FIELDS'. > > > > Patch below, > Ira > > From 3a52d32d7c6964a8078402c3712a58d1e43975de Mon Sep 17 00:00:00 2001 > From: weiny2 at llnl.gov > Date: Mon, 2 Feb 2009 10:21:18 -0800 > Subject: [PATCH] Use enum types for function parameters > > > Signed-off-by: weiny2 at llnl.gov Applied. Thanks. Sasha From sean.hefty at intel.com Thu Feb 5 11:17:32 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Feb 2009 11:17:32 -0800 Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: <39C75744D164D948A170E9792AF8E7CA01F19812@exil.voltaire.com> References: <49893FAF.3090007@voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> <15ddcffd0902041352u5a7acaedl8b9485769cc90e7@mail.gmail.com> <39C75744D164D948A170E9792AF8E7CA01F19812@exil.voltaire.com> Message-ID: <6C22667CA9024C9780E41AA468FC9153@amr.corp.intel.com> >Assuming this is an rdma-cm capable device in a 'bad' state, the user >space application can wait for asyn ibv events (PORT_ACTIVE) from the >device. 
Once the device is active again it can retry the rdma_create_qp >or rdma_join_mc. Will this work? Even once the port goes active, what the application is really waiting for is for IPoIB to come back up and rejoin its 'broadcast' multicast group. I guess you could just continue to retry the operation until it succeeds... - Sean From sean.hefty at intel.com Thu Feb 5 11:17:54 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Feb 2009 11:17:54 -0800 Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: <4989E6D6.5030109@Voltaire.COM> References: <49893FAF.3090007@voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> <4989E6D6.5030109@Voltaire.COM> Message-ID: <3522BA7F49834878A674F2908834D747@amr.corp.intel.com> >@@ -2167,6 +2170,12 @@ static int cma_sidr_rep_handler(struct i > event.status = ib_event->param.sidr_rep_rcvd.status; > break; > } >+ ret = cma_set_qkey(id_priv); >+ if (ret) { >+ event.event = RDMA_CM_EVENT_ADDR_ERROR; >+ event.status = -EINVAL; >+ break; >+ } > if (id_priv->qkey != rep->qkey) { > event.event = RDMA_CM_EVENT_UNREACHABLE; > event.status = -EINVAL; >@@ -2446,10 +2455,14 @@ static int cma_send_sidr_rep(struct rdma > const void *private_data, int private_data_len) > { > struct ib_cm_sidr_rep_param rep; >+ int ret; > > memset(&rep, 0, sizeof rep); > rep.status = status; > if (status == IB_SIDR_SUCCESS) { >+ ret = cma_set_qkey(id_priv); >+ if (ret) >+ return ret; > rep.qp_num = id_priv->qp_num; > rep.qkey = id_priv->qkey; > } Looking at this, I keep wanting to set the qkey when sending or receiving the sidr req, not rep. This is earlier than the qkey is needed, but catching the error sooner in this case seems better to me than deferring. Thoughts? 
- Sean From yosefe at Voltaire.COM Thu Feb 5 11:26:54 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Thu, 05 Feb 2009 21:26:54 +0200 Subject: [ofa-general] Re: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: <3522BA7F49834878A674F2908834D747@amr.corp.intel.com> References: <49893FAF.3090007@voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> <4989E6D6.5030109@Voltaire.COM> <3522BA7F49834878A674F2908834D747@amr.corp.intel.com> Message-ID: <498B3D7E.6010300@Voltaire.COM> Sean Hefty wrote: >> @@ -2167,6 +2170,12 @@ static int cma_sidr_rep_handler(struct i >> event.status = ib_event->param.sidr_rep_rcvd.status; >> break; >> } >> + ret = cma_set_qkey(id_priv); >> + if (ret) { >> + event.event = RDMA_CM_EVENT_ADDR_ERROR; >> + event.status = -EINVAL; >> + break; >> + } >> if (id_priv->qkey != rep->qkey) { >> event.event = RDMA_CM_EVENT_UNREACHABLE; >> event.status = -EINVAL; >> @@ -2446,10 +2455,14 @@ static int cma_send_sidr_rep(struct rdma >> const void *private_data, int private_data_len) >> { >> struct ib_cm_sidr_rep_param rep; >> + int ret; >> >> memset(&rep, 0, sizeof rep); >> rep.status = status; >> if (status == IB_SIDR_SUCCESS) { >> + ret = cma_set_qkey(id_priv); >> + if (ret) >> + return ret; >> rep.qp_num = id_priv->qp_num; >> rep.qkey = id_priv->qkey; >> } > > Looking at this, I keep wanting to set the qkey when sending or receiving the > sidr req, not rep. This is earlier than the qkey is needed, but catching the > error sooner in this case seems better to me than deferring. Thoughts? > > - Sean > It might be better to catch errors earlier, but there is the risk that the flow might change somehow, and losing the (now obvious) logical connection between retrieving the qkey and actually using it. 
-- --Yossi From sean.hefty at intel.com Thu Feb 5 11:31:09 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Feb 2009 11:31:09 -0800 Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: <498B3D7E.6010300 at Voltaire.COM> References: <49893FAF.3090007 at voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A at amr.corp.intel.com> <4989E6D6.5030109 at Voltaire.COM> <3522BA7F49834878A674F2908834D747 at amr.corp.intel.com> <498B3D7E.6010300 at Voltaire.COM> Message-ID: >It might be better to catch errors earlier, but there is the risk that the >flow might change somehow, and losing the (now obvious) logical connection >between retrieving the qkey and actually using it. I can go with that. I don't have a strong preference. Have you tested the patch and verified that it works for you? - Sean From yosefe at Voltaire.COM Thu Feb 5 11:41:42 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Thu, 05 Feb 2009 21:41:42 +0200 Subject: [ofa-general] Re: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: References: <49893FAF.3090007 at voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A at amr.corp.intel.com> <4989E6D6.5030109 at Voltaire.COM> <3522BA7F49834878A674F2908834D747 at amr.corp.intel.com> <498B3D7E.6010300 at Voltaire.COM> Message-ID: <498B40F6.7060904 at Voltaire.COM> Sean Hefty wrote: >> It might be better to catch errors earlier, but there is the risk that the >> flow might change somehow, and losing the (now obvious) logical connection >> between retrieving the qkey and actually using it. > > I can go with that. I don't have a strong preference. Have you tested the > patch and verified that it works for you? > > - Sean > Yes I did, with mckey. When the HCA port is down: Without the patch, mckey fails on rdma_resolve_route (except when ipoib is trying to join at the same time - then there will be a join error).
With the patch, mckey fails on rdma_create_qp (again, except when ipoib is trying to join at the same time). When the HCA port is up, mckey works normally. -- --Yossi From sean.hefty at intel.com Thu Feb 5 11:49:24 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Feb 2009 11:49:24 -0800 Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: <4989E6D6.5030109 at Voltaire.COM> References: <49893FAF.3090007 at voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A at amr.corp.intel.com> <4989E6D6.5030109 at Voltaire.COM> Message-ID: From: Yossi Etigin When rdma_resolve_addr() is called and the relevant port is down, the function fails and the rdma_cm id is not bound to the device. Therefore, the application does not have a device handle and cannot wait for the port to become active. The function fails because ipoib is not joined to the multicast group, and therefore the SA does not have a multicast record to take a qkey from. The proposed patch makes qkey resolution lazy: cma_set_qkey will set id_priv->qkey if it was not set, and will be called just before the qkey is really required. Signed-off-by: Yossi Etigin Acked-by: Sean Hefty --- Roland, any objection to queuing this for 2.6.30?
> drivers/infiniband/core/cma.c | 41 +++++++++++++++++++++++++++-------------- > 1 file changed, 27 insertions(+), 14 deletions(-) > >Index: b/drivers/infiniband/core/cma.c >=================================================================== >--- a/drivers/infiniband/core/cma.c 2009-02-04 20:40:20.000000000 +0200 >+++ b/drivers/infiniband/core/cma.c 2009-02-04 20:57:59.000000000 +0200 >@@ -296,21 +296,25 @@ static void cma_detach_from_dev(struct r > id_priv->cma_dev = NULL; > } > >-static int cma_set_qkey(struct ib_device *device, u8 port_num, >- enum rdma_port_space ps, >- struct rdma_dev_addr *dev_addr, u32 *qkey) >+static int cma_set_qkey(struct rdma_id_private *id_priv) > { > struct ib_sa_mcmember_rec rec; > int ret = 0; > >- switch (ps) { >+ if (id_priv->qkey) >+ return 0; >+ >+ switch (id_priv->id.ps) { > case RDMA_PS_UDP: >- *qkey = RDMA_UDP_QKEY; >+ id_priv->qkey = RDMA_UDP_QKEY; > break; > case RDMA_PS_IPOIB: >- ib_addr_get_mgid(dev_addr, &rec.mgid); >- ret = ib_sa_get_mcmember_rec(device, port_num, &rec.mgid, &rec); >- *qkey = be32_to_cpu(rec.qkey); >+ ib_addr_get_mgid(&id_priv->id.route.addr.dev_addr, &rec.mgid); >+ ret = ib_sa_get_mcmember_rec(id_priv->id.device, >+ id_priv->id.port_num, &rec.mgid, >+ &rec); >+ if (!ret) >+ id_priv->qkey = be32_to_cpu(rec.qkey); > break; > default: > break; >@@ -340,12 +344,7 @@ static int cma_acquire_dev(struct rdma_i > ret = ib_find_cached_gid(cma_dev->device, &gid, > &id_priv->id.port_num, NULL); > if (!ret) { >- ret = cma_set_qkey(cma_dev->device, >- id_priv->id.port_num, >- id_priv->id.ps, dev_addr, >- &id_priv->qkey); >- if (!ret) >- cma_attach_to_dev(id_priv, cma_dev); >+ cma_attach_to_dev(id_priv, cma_dev); > break; > } > } >@@ -577,6 +576,10 @@ static int cma_ib_init_qp_attr(struct rd > *qp_attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT; > > if (cma_is_ud_ps(id_priv->id.ps)) { >+ ret = cma_set_qkey(id_priv); >+ if (ret) >+ return ret; >+ > qp_attr->qkey = id_priv->qkey; > *qp_attr_mask |= IB_QP_QKEY; > }
else { >@@ -2167,6 +2170,12 @@ static int cma_sidr_rep_handler(struct i > event.status = ib_event->param.sidr_rep_rcvd.status; > break; > } >+ ret = cma_set_qkey(id_priv); >+ if (ret) { >+ event.event = RDMA_CM_EVENT_ADDR_ERROR; >+ event.status = -EINVAL; >+ break; >+ } > if (id_priv->qkey != rep->qkey) { > event.event = RDMA_CM_EVENT_UNREACHABLE; > event.status = -EINVAL; >@@ -2446,10 +2455,14 @@ static int cma_send_sidr_rep(struct rdma > const void *private_data, int private_data_len) > { > struct ib_cm_sidr_rep_param rep; >+ int ret; > > memset(&rep, 0, sizeof rep); > rep.status = status; > if (status == IB_SIDR_SUCCESS) { >+ ret = cma_set_qkey(id_priv); >+ if (ret) >+ return ret; > rep.qp_num = id_priv->qp_num; > rep.qkey = id_priv->qkey; > } From brian at sun.com Thu Feb 5 11:54:02 2009 From: brian at sun.com (Brian J. Murrell) Date: Thu, 05 Feb 2009 14:54:02 -0500 Subject: [ofa-general] 1.3.1 and 1.4 compatibilty Message-ID: <1233863642.22864.3203.camel at pc.interlinx.bc.ca> I'm sure I know the answer to this, or will be floored if it's other than I think, but just to do due diligence... are OFED 1.3.1 and 1.4 compatible? That is, nodes running one version will talk to nodes of the other version without problem, yes? Is it complete compatibility or are there any known caveats? Thanx! b. From swise at opengridcomputing.com Thu Feb 5 13:05:49 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 05 Feb 2009 15:05:49 -0600 Subject: [ofa-general] Chelsio T3: Aggregate Throughput In-Reply-To: References: Message-ID: <498B54AD.1010802 at opengridcomputing.com> Philip Frey1 wrote: > > Hello, > > we are currently looking into the scalability of the T3 in terms of > connections. We are using a 1-to-n scenario where the one server > has a chunk of data and n clients that fetch this chunk over and over > again using RDMA reads (each 1MB in size). > > The clients do that such that they get an average data rate of about > 9Mbps each.
Every second we connect a new client to the server > and see how far it goes. > > What puzzles us now is that after about 800 clients, they no longer > seem to receive much data. > > The first interesting thing is that the aggregate throughput actually > drops > (we expected it to stall). And the second interesting thing is that it > does > so already at about 6.3Gbps which is just a bit more than half of what > the > card can do. We do not experience this kind of situation when using > far fewer clients that RDMA read the data at a much higher data rate. > > Is there any limitation on the RNIC that would give an explanation for > this? > Are the RNICs experiencing lots of pause frames during the test? ethtool -S ethX|grep Pause Also, are the iWARP stacks retransmitting a lot during the test? cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs Steve. From andy.grover at oracle.com Thu Feb 5 14:08:06 2009 From: andy.grover at oracle.com (Andy Grover) Date: Thu, 05 Feb 2009 14:08:06 -0800 Subject: [ofa-general] IB credit-based flow control Message-ID: <498B6346.7000208 at oracle.com> Hi, Steve and I have been working to debug RDS's credit-based flow control, and I happened to notice that IB already implements this (see IB spec section 9.7.7.2). So, why is it necessary for a ULP like RDS to implement its own flow control? It looks like IB's flow control should result in no RNR retries, yet without protocol-level FC, we see RNR retries. Thanks -- Regards -- Andy From sean.hefty at intel.com Thu Feb 5 14:23:10 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Feb 2009 14:23:10 -0800 Subject: [ofa-general] IB credit-based flow control In-Reply-To: <498B6346.7000208 at oracle.com> References: <498B6346.7000208 at oracle.com> Message-ID: <6964DFC8601A4569A59FB03747E52FF9 at amr.corp.intel.com> >So, why is it necessary for a ULP like RDS to implement its own flow >control?
It looks like IB's flow control should result in no RNR >retries, yet without protocol-level FC, we see RNR retries. If you're using a shared receive queue, end to end flow control is disabled. Also, see 9.7.7.2.5 C9-162 - an HCA is allowed to send up to one packet for a send request even if it doesn't have any credits available. - Sean From hal.rosenstock at gmail.com Thu Feb 5 14:55:22 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 5 Feb 2009 17:55:22 -0500 Subject: [ofa-general] [RFC] infiniband-diags/perfquery.c: Any objections to changing an option name ? Message-ID: In infiniband-diags/perfquery, -e is used for extended counters and covers up using the common errors option so I'd like to change this to be -x for xtended. Any objections ? Without this change when perfquery fails you can't get the more detailed error information which is very useful for debugging problems. -- Hal From halr at obsidianresearch.com Thu Feb 5 15:31:22 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 05 Feb 2009 16:31:22 -0700 Subject: [ofa-general] [PATCH] ibsim: Eliminate unused modified variable Message-ID: <1233876682.8992.492.camel at bertha1.edm.orcorp.ca> Sasha, Trivial patch to eliminate the unused 'modified' variable. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-ibsim-Eliminate-unused-modified-variable.patch Type: application/mbox Size: 1398 bytes Desc: not available URL: From halr at obsidianresearch.com Thu Feb 5 15:31:31 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 05 Feb 2009 16:31:31 -0700 Subject: [ofa-general] [PATCH] ibsim: Change lid print format to unsigned Message-ID: <1233876691.8992.494.camel at bertha1.edm.orcorp.ca> Sasha, Patch to change lid print format to unsigned to be consistent elsewhere. -- Hal -------------- next part -------------- A non-text attachment was scrubbed...
Name: 0003-ibsim-Change-lid-prints-to-unsigned.patch Type: application/mbox Size: 6637 bytes Desc: not available URL: From halr at obsidianresearch.com Thu Feb 5 15:41:39 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 05 Feb 2009 16:41:39 -0700 Subject: [ofa-general] [PATCH] opensm/doc/perf-manager-arch.txt: Fix some commentary typos Message-ID: <1233877299.8992.508.camel@bertha1.edm.orcorp.ca> Sasha, Trivial patch to fix some typos in this doc. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-opensm-doc-perf-manager-arch.txt-Fix-some-commentar.patch Type: application/mbox Size: 1761 bytes Desc: not available URL: From halr at obsidianresearch.com Thu Feb 5 15:42:23 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 05 Feb 2009 16:42:23 -0700 Subject: [ofa-general] [PATCH] opensm/PerfMgr: Add copyrights Message-ID: <1233877343.8992.510.camel@bertha1.edm.orcorp.ca> Sasha, This just adds copyrights missed in previous patches. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: 0002-opensm-PerfMgr-Add-copyright.patch Type: application/mbox Size: 2700 bytes Desc: not available URL: From halr at obsidianresearch.com Thu Feb 5 15:42:59 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 05 Feb 2009 16:42:59 -0700 Subject: [ofa-general] [PATCH] libibmad: lid print format changed to unsigned Message-ID: <1233877379.8992.511.camel@bertha1.edm.orcorp.ca> Sasha, This changes libibmad lid print format to unsigned to be consistent with OpenSM and diag tools. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0003-libibmad-lid-printing-changed-to-unsigned-as-was-d.patch Type: application/mbox Size: 1698 bytes Desc: not available URL: From halr at obsidianresearch.com Thu Feb 5 15:43:34 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 05 Feb 2009 16:43:34 -0700 Subject: [ofa-general] libibumad/umad.c: Change lid print format to unsigned Message-ID: <1233877414.8992.512.camel@bertha1.edm.orcorp.ca> Sasha, This patch changes umad.c lid print format to unsigned. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: 0007-libibumad-umad.c-Change-lid-prints-to-unsigned.patch Type: application/mbox Size: 1563 bytes Desc: not available URL: From halr at obsidianresearch.com Thu Feb 5 15:47:33 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 05 Feb 2009 16:47:33 -0700 Subject: [ofa-general] [PATCH] libibmad/rpc.c: In mad_rpc/mad_rpc_rmpp, set rpc attribute ID from response Message-ID: <1233877653.8992.516.camel@bertha1.edm.orcorp.ca> Sasha, This patch sets the attribute ID based on what is in the response. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: 0009-libibmad-rpc.c-In-mad_rpc-and-mad_rpc_rmpp-set-rpc.patch Type: application/mbox Size: 1458 bytes Desc: not available URL: From halr at obsidianresearch.com Thu Feb 5 15:48:08 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 05 Feb 2009 16:48:08 -0700 Subject: [ofa-general] [PATCH] libibmad/gs.c: Factor out common code Message-ID: <1233877688.8992.518.camel@bertha1.edm.orcorp.ca> Sasha, This patch factors out some common code in gs.c. common_query_setup is used by both pma_query_via and performance_reset_via. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0010-libibmad-gs.c-Factor-out-common-code.patch Type: application/mbox Size: 3036 bytes Desc: not available URL: From halr at obsidianresearch.com Thu Feb 5 16:00:02 2009 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 05 Feb 2009 17:00:02 -0700 Subject: [ofa-general] [PATCH] infiniband-diags/perfquery: Change option name for extended counters Message-ID: <1233878402.8992.523.camel at bertha1.edm.orcorp.ca> Sasha, Per the RFC, this patch changes the option name for extended counters so that it no longer covers up the common errors option. This changes it from -e/--extended to -x/--xtended so -e/--errors can be used to get error information as is common with the IB diags. -- Hal -------------- next part -------------- A non-text attachment was scrubbed... Name: 0012-infiniband-diags-perfquery-Change-option-name-for-e.patch Type: application/mbox Size: 4217 bytes Desc: not available URL: From andy.grover at oracle.com Thu Feb 5 16:26:14 2009 From: andy.grover at oracle.com (Andy Grover) Date: Thu, 05 Feb 2009 16:26:14 -0800 Subject: [ofa-general] IB credit-based flow control In-Reply-To: <6964DFC8601A4569A59FB03747E52FF9 at amr.corp.intel.com> References: <498B6346.7000208 at oracle.com> <6964DFC8601A4569A59FB03747E52FF9 at amr.corp.intel.com> Message-ID: <498B83A6.9030702 at oracle.com> Sean Hefty wrote: >> So, why is it necessary for a ULP like RDS to implement its own flow >> control? It looks like IB's flow control should result in no RNR >> retries, yet without protocol-level FC, we see RNR retries. > > If you're using a shared receive queue, end to end flow control is disabled. > Also, see 9.7.7.2.5 C9-162 - an HCA is allowed to send up to one packet for a > send request even if it doesn't have any credits available. Good point, but just looking at the non-SRQ case: I'm reading C9-162 and still not seeing why (according to the spec anyways) there should ever be RNR retries on a connection.
I would think the receiving HCA would not credit its last WQE to the sender, and thus retries should never happen? The whole point of this feature is to eliminate RNR retries, no? Thanks -- Regards -- Andy From sean.hefty at intel.com Thu Feb 5 16:57:27 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Feb 2009 16:57:27 -0800 Subject: [ofa-general] IB credit-based flow control In-Reply-To: <498B83A6.9030702@oracle.com> References: <498B6346.7000208@oracle.com> <6964DFC8601A4569A59FB03747E52FF9@amr.corp.intel.com> <498B83A6.9030702@oracle.com> Message-ID: <031DEB206CEA4802860C38861660EC87@amr.corp.intel.com> >I'm reading C9-162 and still not seeing why (according to the spec >anyways) there should ever be RNR retries on a connection. I would think >the receiving HCA would not credit its last WQE to the sender, and thus >retries should never happen? > >The whole point of this feature is to eliminate RNR retries, no? What I'm looking at for C9-162 is: C9-162: When the requester encounters a WQE on its send queue for which it has no available credits, that WQE is said to be limited. If the limited request WQE is a SEND request, the send queue shall transmit no more than a single packet of the request message before it must stop transmission and wait for an acknowledge packet. My assumption is that if no credits are available when the SEND request arrives, then the receiver generates a RNR message, but I didn't read through the entire section to verify this. This is totally a guess, but there needs to be some sort of recovery mechanism in place to handle a lost credit update message. Allowing the requester to issue a limited request in the absence of credits will force a credit update if any are available. Did you verify that the HCAs you're using implement e2e flow control? 
- Sean From andy.grover at oracle.com Thu Feb 5 18:28:25 2009 From: andy.grover at oracle.com (Andy Grover) Date: Thu, 05 Feb 2009 18:28:25 -0800 Subject: [ofa-general] IB credit-based flow control In-Reply-To: <031DEB206CEA4802860C38861660EC87 at amr.corp.intel.com> References: <498B6346.7000208 at oracle.com> <6964DFC8601A4569A59FB03747E52FF9 at amr.corp.intel.com> <498B83A6.9030702 at oracle.com> <031DEB206CEA4802860C38861660EC87 at amr.corp.intel.com> Message-ID: <498BA049.6090006 at oracle.com> Sean Hefty wrote: > My assumption is that if no credits are available when the SEND request arrives, > then the receiver generates a RNR message, but I didn't read through the entire > section to verify this. > > This is totally a guess, but there needs to be some sort of recovery mechanism > in place to handle a lost credit update message. Allowing the requester to > issue a limited request in the absence of credits will force a credit update if > any are available. > > Did you verify that the HCAs you're using implement e2e flow control? How would I verify that? I'm using current HCAs (mlx4), so I'm assuming if the spec says an HCA must support something, it is supported? We definitely still need ulp-level flow control for iwarp so it's not wasted work. But if IB doesn't, then it would be great to not incur the overhead.
Thanks -- Regards -- Andy From swise at opengridcomputing.com Thu Feb 5 19:36:53 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 05 Feb 2009 21:36:53 -0600 Subject: [ofa-general] IB credit-based flow control In-Reply-To: <498BA049.6090006 at oracle.com> References: <498B6346.7000208 at oracle.com> <6964DFC8601A4569A59FB03747E52FF9 at amr.corp.intel.com> <498B83A6.9030702 at oracle.com> <031DEB206CEA4802860C38861660EC87 at amr.corp.intel.com> <498BA049.6090006 at oracle.com> Message-ID: <498BB055.8040608 at opengridcomputing.com> Andy Grover wrote: > Sean Hefty wrote: >> My assumption is that if no credits are available when the SEND >> request arrives, >> then the receiver generates a RNR message, but I didn't read through >> the entire >> section to verify this. >> >> This is totally a guess, but there needs to be some sort of recovery >> mechanism >> in place to handle a lost credit update message. Allowing the >> requester to >> issue a limited request in the absence of credits will force a credit >> update if >> any are available. >> >> Did you verify that the HCAs you're using implement e2e flow control? > > How would I verify that? I'm using current HCAs (mlx4), so I'm > assuming if the spec says an HCA must support something, it is supported? > > We definitely still need ulp-level flow control for iwarp so it's not > wasted work. But if IB doesn't, then it would be great to not incur > the overhead. > From what I've seen in the various IB ULPs, the only way to remove RNRs is to do correct ULP flow control. But I never knew about this IB transport level credit stuff until you brought it up!
:) Steve From andy.grover at oracle.com Thu Feb 5 19:40:29 2009 From: andy.grover at oracle.com (Andy Grover) Date: Thu, 05 Feb 2009 19:40:29 -0800 Subject: [ofa-general] IB credit-based flow control In-Reply-To: <498BA049.6090006 at oracle.com> References: <498B6346.7000208 at oracle.com> <6964DFC8601A4569A59FB03747E52FF9 at amr.corp.intel.com> <498B83A6.9030702 at oracle.com> <031DEB206CEA4802860C38861660EC87 at amr.corp.intel.com> <498BA049.6090006 at oracle.com> Message-ID: <498BB12D.5080107 at oracle.com> Andy Grover wrote: > How would I verify that? I'm using current HCAs (mlx4), so I'm assuming > if the spec says an HCA must support something, it is supported? > > We definitely still need ulp-level flow control for iwarp so it's not > wasted work. But if IB doesn't, then it would be great to not incur the > overhead. Mystery solved: RDS has ulp-level flow control specifically to support iwarp, so this is not needed on IB connections, due to the HW FC we've been discussing. Thanks -- Regards -- Andy From rdreier at cisco.com Thu Feb 5 20:39:55 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Feb 2009 20:39:55 -0800 Subject: [ofa-general] IB credit-based flow control In-Reply-To: <498BA049.6090006 at oracle.com> (Andy Grover's message of "Thu, 05 Feb 2009 18:28:25 -0800") References: <498B6346.7000208 at oracle.com> <6964DFC8601A4569A59FB03747E52FF9 at amr.corp.intel.com> <498B83A6.9030702 at oracle.com> <031DEB206CEA4802860C38861660EC87 at amr.corp.intel.com> <498BA049.6090006 at oracle.com> Message-ID: > How would I verify that? I'm using current HCAs (mlx4), so I'm > assuming if the spec says an HCA must support something, it is > supported? > > We definitely still need ulp-level flow control for iwarp so it's not > wasted work. But if IB doesn't, then it would be great to not incur > the overhead. mlx4 HCAs do support end-to-end credits.
However, as you've discovered, that transport level flow control is not necessarily that useful: if a sender overruns the receives that are posted, then it triggers an RNR NAK which leads to a large delay in the connection, which can be very bad for throughput. So for best performance, application level flow control is required, even with IB end-to-end credit flow control at the transport level. - R. From PHF at zurich.ibm.com Fri Feb 6 01:59:27 2009 From: PHF at zurich.ibm.com (Philip Frey1) Date: Fri, 6 Feb 2009 10:59:27 +0100 Subject: [ofa-general] Chelsio T3: Aggregate Throughput In-Reply-To: <498B54AD.1010802@opengridcomputing.com> References: <498B54AD.1010802@opengridcomputing.com> Message-ID: > Are the RNICs experiencing lots of pause frames during the test? > > ethtool -S ethX|grep Pause (cheiron was the server and the others were RDMA reading from it) [root at ajax]$ ethtool -S eth2 | grep Pause TxPauseFrames : 248428611 RxPauseFrames : 0 [root at achilles]$ ethtool -S eth2 | grep Pause TxPauseFrames : 250937599 RxPauseFrames : 0 [root at bacchus]$ ethtool -S eth2 | grep Pause TxPauseFrames : 21153321 RxPauseFrames : 70 [root at borus]$ ethtool -S eth2 | grep Pause TxPauseFrames : 22056840 RxPauseFrames : 70 [root at car]$ ethtool -S eth2 | grep Pause TxPauseFrames : 23540619 RxPauseFrames : 70 [root at cheiron]$ ethtool -S eth2 | grep Pause TxPauseFrames : 0 RxPauseFrames : 26569935 > Also, are the iWARP stacks retransmitting a lot during the test? 
> > cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs [root at ajax]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs 0 [root at achilles]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs 0 [root at bacchus]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs 0 [root at borus]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs 0 [root at car]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs 0 [root at cheiron]$ cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs 0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Line.Holen at Sun.COM Fri Feb 6 02:14:13 2009 From: Line.Holen at Sun.COM (Line.Holen at Sun.COM) Date: Fri, 06 Feb 2009 11:14:13 +0100 Subject: [ofa-general] 1.4 git repository for the management SW Message-ID: <498C0D75.6090904@Sun.COM> Hi, I would like to get hold of the source for the 1.4 release of the management SW. I've tried to clone ofed_1_4/management.git, but that seems to be about 2 weeks newer than the release. Where / how can I find the correct version ? I was expecting to find OpenSM version 3.2.5 in the above source repository, but it shows up as 3.3.0. 
Line

From vlad at lists.openfabrics.org Fri Feb 6 03:11:51 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 6 Feb 2009 03:11:51 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090206-0200 daily build status
Message-ID: <20090206111151.B1134E610D5@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From sashak at voltaire.com Fri Feb 6 03:53:11 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 6 Feb 2009 13:53:11 +0200
Subject: [ofa-general] 1.4 git repository for the management SW
In-Reply-To: <498C0D75.6090904@Sun.COM>
References: <498C0D75.6090904@Sun.COM>
Message-ID: <20090206115311.GA17713@sashak.voltaire.com>

On 11:14 Fri 06 Feb , Line.Holen at Sun.COM wrote:
>
> I would like to get hold of the source for the 1.4 release of the
> management SW.
> I've tried to clone ofed_1_4/management.git, but that seems to be about 2
> weeks
> newer than the release.
> Where / how can I find the correct version ?

Get opensm-3.2 branch of git://git.openfabrics.org/~sashak/management tree (or opensm-3.2.5 tag).

Sasha

From mossy.boulders at gmail.com Fri Feb 6 04:34:10 2009
From: mossy.boulders at gmail.com (Markus Uhlmann)
Date: Fri, 6 Feb 2009 13:34:10 +0100
Subject: [ofa-general] ***SPAM*** debian/ofed-1.4 - mpi global communication performance
Message-ID: <962e48ae0902060434n19759d92x673e9289d9915059@mail.gmail.com>

Hi all,

we have been struggling with the performance of a supermicro (quad-core xeon) / qlogic (9024-FC) system running Debian, kernel 2.6.24-x86_64, and ofed-1.4 (from http://www.openfabrics.org/). There are 8 nodes attached to the switch.

What happens is that the performance of MPI global communication is extremely low (i.e. ~ factor 10 when 16 procs out of only 2 nodes communicate). This number comes from comparison with a *similar* system (dell/cisco).
Some tests we have performed:

* local memory bandwidth test ("stream" benchmark on 8-way node returns >8GB/s)
* firmware: since the HCAs are on-board supermicro (board_id: SM_2001000001; firmware-version: 1.2.0) I don't know how/where to check adequacy.
* openib low-level communication tests seem okay (see output from ib_write_lat, ib_write_bw below)
* However, I see errors of type "RcvSwRelayErrors" when checking "ibcheckerrors". Is this normal?
* MPI benchmarks reveal slow all-to-all communication (see output below for "osu_alltoall" test https://mvapich.cse.ohio-state.edu/svn/mpi-benchmarks/branches/OMB-3.1/osu_alltoall.c , compiled with openmpi-1.3 and intel compiler 11.0)

Some questions I have:

1) Do I have to configure the switch? So far I have not attempted to install the "ofed+" etc. software which came with the qlogic hardware. Is there any chance that it would be compatible with ofed-1.4? Or even installable under Debian (without too much tweaking)?

2) Is it okay for this system to run "opensm" on one of the nodes? NOTE: the version is "OpenSM 3.2.5_20081207"

Any other lead or things I should test?
Thanks in advance,
MU

==============================================================
------------------------------------------------------------------
                    RDMA_Write Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
Mtu : 2048
------------------------------------------------------------------
 #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
      2         1000         3.10        22.88             3.15
      4         1000         3.13         6.29             3.16
      8         1000         3.14         6.24             3.18
     16         1000         3.17         6.25             3.21
     32         1000         3.25         7.60             3.38
     64         1000         3.32         6.43             3.45
    128         1000         3.48         6.40             3.57
    256         1000         3.77         6.63             3.82
    512         1000         4.71         8.44             4.76
   1024         1000         5.58         7.53             5.63
   2048         1000         7.38         8.17             7.51
   4096         1000         8.64         9.04             8.77
   8192         1000        11.41        11.81            11.57
  16384         1000        16.55        17.27            16.71
  32768         1000        26.81        28.12            27.01
  65536         1000        47.41        49.43            47.62
 131072         1000        89.86        91.98            90.81
 262144         1000       174.25       176.34           175.35
 524288         1000       343.03       344.79           343.51
1048576         1000       679.04       680.57           679.72
2097152         1000      1350.88      1352.80          1351.75
4194304         1000      2693.31      2696.13          2694.50
8388608         1000      5380.45      5383.29          5381.62
------------------------------------------------------------------
------------------------------------------------------------------
                    RDMA_Write BW Test
Number of qp's running 1
Connection type : RC
Each Qp will post up to 100 messages each time
Mtu : 2048
------------------------------------------------------------------
 #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]
      2         5000             2.51                2.51
      4         5000             5.03                5.03
      8         5000            10.09               10.09
     16         5000            19.71               19.70
     32         5000            39.23               39.22
     64         5000            77.91               77.84
    128         5000           146.67              146.53
    256         5000           223.14              222.82
    512         5000           640.09              639.80
   1024         5000          1106.72             1106.22
   2048         5000          1271.61             1270.87
   4096         5000          1379.58             1379.44
   8192         5000          1446.01             1445.95
  16384         5000          1477.11             1477.09
  32768         5000          1498.18             1498.17
  65536         5000          1507.23             1507.22
 131072         5000          1511.83             1511.82
 262144         5000          1487.64             1487.62
 524288         5000          1485.76             1485.75
1048576         5000          1487.13             1486.54
2097152         5000          1487.95             1487.95
4194304         5000          1488.11             1488.10
8388608         5000          1488.22             1488.22
------------------------------------------------------------------

***************OUR-SYSTEM /supermicro-qlogic:********************
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1
# Size      Latency (us)
1                   7.87
2                   7.80
4                   7.77
8                   7.78
16                  7.81
32                  9.00
64                  9.00
128                10.15
256                11.75
512                15.55
1024               23.54
2048               40.57
4096              107.12
8192              187.28
16384             343.61
32768             602.17
65536            1135.20
131072           3086.28
262144           9086.50
524288          18713.30
1048576         37378.61
------------------------------------------------------------------

**************REFERENCE_SYSTEM / dell-cisco:***********************
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1
# Size      Latency (us)
1                  16.14
2                  15.93
4                  16.25
8                  16.60
16                 25.83
32                 28.66
64                 33.57
128                40.94
256                56.20
512                91.24
1024              156.13
2048              373.17
4096              696.95
8192             1464.89
16384            1367.96
32768            2499.21
65536            5686.46
131072          11065.98
262144          23922.69
524288          49294.71
1048576        101290.67
==============================================================

From mossy.boulders at gmail.com Fri Feb 6 04:38:06 2009
From: mossy.boulders at gmail.com (Markus Uhlmann)
Date: Fri, 6 Feb 2009 13:38:06 +0100
Subject: [ofa-general] ***SPAM*** debian/ofed-1.4 - mpi global communication performance
Message-ID: <962e48ae0902060438y26e58a7s649a7fa8cb8ac2d2@mail.gmail.com>

Sorry, the numbers for one of the tests were inserted wrongly. It should be:

***************OUR-SYSTEM /supermicro-qlogic:********************
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1
# Size      Latency (us)
1                 137.32
2                 136.23
4                 135.97
8                 135.63
16                138.00
32                139.19
64                139.26
128               140.06
256              1770.24
512              1772.94
1024             1776.16
2048             1811.75
4096              584.51
8192              746.64
16384            3927.21
32768            4576.17
65536            6052.26
131072           9898.08
262144          19566.90
524288          37515.47
1048576         74443.69
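The "~ factor 10" remark can be checked against the posted numbers. The following is an illustrative C sketch (the struct and function names are made up for this purpose, not part of any OFED or OSU tool) that copies four rows from the corrected supermicro/qlogic table and the dell/cisco reference table and computes the per-size latency ratio.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the posted osu_alltoall output: message size and latency. */
struct latency_row {
    size_t bytes;
    double ours_us;      /* corrected supermicro/qlogic numbers (us) */
    double reference_us; /* dell/cisco reference numbers (us) */
};

/* A subset of rows copied from the two tables in this thread. */
static const struct latency_row rows[] = {
    {1, 137.32, 16.14},
    {256, 1770.24, 56.20},
    {4096, 584.51, 696.95},
    {1048576, 74443.69, 101290.67},
};

/* Number of rows in the table. */
size_t row_count(void)
{
    return sizeof rows / sizeof rows[0];
}

/* Row i, or NULL when i is out of range. */
const struct latency_row *row_at(size_t i)
{
    return i < row_count() ? &rows[i] : NULL;
}

/* Latency ratio ours/reference: a value > 1 means our system is slower. */
double slowdown(const struct latency_row *r)
{
    assert(r != NULL && r->reference_us > 0.0);
    return r->ours_us / r->reference_us;
}
```

With these rows the ratio is about 8.5 at 1 byte and about 31.5 at 256 bytes, but below 1 at 4096 bytes (about 0.84) and at 1 MB (about 0.73), so the slowdown in the corrected data is concentrated in small and mid-size messages rather than being uniform.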
From or.gerlitz at gmail.com Fri Feb 6 08:43:51 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 6 Feb 2009 18:43:51 +0200
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used for bind
In-Reply-To:
References:
Message-ID: <15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>

>
> ucmatose allows binding to a specific address using -b. I haven't used
> rds-ping
> to know if it's the same as -I in that case. I don't have any systems
> myself
> with dual HCAs; I don't think they have enough slots to support more than
> one.

Hi Sean,

ucmatose doesn't do anything with the address provided with the -b param on its --active-- side, where this problem takes place. Yes, -I to rds-ping is the same as -I to ping (other than the fact that the former doesn't seem to work well). As I wrote you in detail, there's no need for two HCAs to get the problem reproduced: just have one node with two active ports, each assigned a different IP address.

Or.

From richard.frank at oracle.com Fri Feb 6 09:05:48 2009
From: richard.frank at oracle.com (Richard Frank)
Date: Fri, 06 Feb 2009 12:05:48 -0500
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used for bind
In-Reply-To: <15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
References: <15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
Message-ID: <498C6DEC.70805@oracle.com>

I played around with this a bit more yesterday - and it looks like rdma_bind_addr()->rdma_resolve_ip()->ip_dev_find() is always returning the first matching entry in the routing table... even though we are providing the source ip for the bind...

Keeping in mind that both IB ports have IPs on the same subnet...
[root at vosib8 rds-tools-1.1-2]# ip a s ib0
33: ib0: mtu 65520 qdisc pfifo_fast qlen 256
    link/infiniband 80:00:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:20:3b:61 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 11.0.0.8/24 brd 11.0.0.255 scope global ib0
    inet6 fe80::202:c902:20:3b61/64 scope link
       valid_lft forever preferred_lft forever
[root at vosib8 rds-tools-1.1-2]# ip a s ib1
34: ib1: mtu 65520 qdisc pfifo_fast qlen 256
    link/infiniband 80:00:04:05:fe:80:00:00:00:00:00:00:00:02:c9:02:00:20:3b:62 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 11.0.0.18/24 brd 11.0.0.255 scope global ib1
    inet6 fe80::202:c902:20:3b62/64 scope link
       valid_lft forever preferred_lft forever
[root at vosib8 rds-tools-1.1-2]# route
Kernel IP routing table
Destination     Gateway          Genmask         Flags Metric Ref    Use Iface
11.0.0.0        *                255.255.255.0   U     0      0        0 ib0
11.0.0.0        *                255.255.255.0   U     0      0        0 ib1
10.10.0.0       *                255.255.255.0   U     0      0        0 eth3
42.2.0.0        *                255.255.255.0   U     0      0        0 eth2
139.185.139.0   *                255.255.255.0   U     0      0        0 eth1
10.12.0.0       *                255.255.255.0   U     0      0        0 eth0
169.254.0.0     *                255.255.0.0     U     0      0        0 ib1
default         whq2op-swi-1-rt  0.0.0.0         UG    0      0        0 eth1

Or Gerlitz wrote:
>
>     ucmatose allows binding to a specific address using -b. I haven't
>     used rds-ping
>     to know if it's the same as -I in that case. I don't have any
>     systems myself
>     with dual HCAs; I don't think they have enough slots to support
>     more than one.
>
>
> Hi Sean,
>
> ucmatose doesn't do anything with the address provided with the -b
> param on its --active-- side, where this problem takes place. Yes, -I
> to rds-ping is the same as -I to ping (other then the fact of the
> former doesn't seem to work well). As I wrote you in detail, there's
> no need for two HCAs to get the problem reproduced, just have one node
> with two active port, each assigned with a different IP address.
>
> Or.
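To make the symptom concrete, here is a toy userspace model in C of the address and route setup shown above. It is only an illustration under stated assumptions: the types and function names are hypothetical, and it does not reproduce how ip_dev_find() or ip_route_output_key() work inside the kernel. It contrasts the behaviour Richard describes (the first route covering the address wins, so both 11.0.0.8 and 11.0.0.18 land on ib0) with what the caller of the bind expects (the port that owns the bound source address).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build an IPv4 address in host byte order. */
#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

enum iface { IF_NONE = -1, IF_IB0, IF_IB1 };

struct route {
    uint32_t dst;
    uint32_t mask;
    enum iface dev;
};

struct local_addr {
    uint32_t addr;
    enum iface dev;
};

/* Both IPoIB ports are on 11.0.0.0/24, with ib0 listed first. */
static const struct route routes[] = {
    {IP4(11, 0, 0, 0), IP4(255, 255, 255, 0), IF_IB0},
    {IP4(11, 0, 0, 0), IP4(255, 255, 255, 0), IF_IB1},
};

/* Addresses assigned to each port (from the `ip a s` output). */
static const struct local_addr addrs[] = {
    {IP4(11, 0, 0, 8), IF_IB0},
    {IP4(11, 0, 0, 18), IF_IB1},
};

/* Does route r cover address ip? */
static int covers(const struct route *r, uint32_t ip)
{
    assert(r != NULL);
    return (ip & r->mask) == r->dst;
}

/* Observed behaviour: the first covering route wins. */
enum iface first_matching_route(uint32_t ip)
{
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++)
        if (covers(&routes[i], ip))
            return routes[i].dev;
    return IF_NONE;
}

/* Expected behaviour: use the port that owns the bound source address. */
enum iface iface_owning_addr(uint32_t src)
{
    for (size_t i = 0; i < sizeof addrs / sizeof addrs[0]; i++)
        if (addrs[i].addr == src)
            return addrs[i].dev;
    return IF_NONE;
}
```

In this model, first_matching_route(IP4(11, 0, 0, 18)) returns IF_IB0 while iface_owning_addr(IP4(11, 0, 0, 18)) returns IF_IB1, which matches the mismatch reported in the thread.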
From sean.hefty at intel.com Fri Feb 6 10:10:11 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 6 Feb 2009 10:10:11 -0800
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used for bind
In-Reply-To: <15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
References: <15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com>
Message-ID:

>ucmatose doesn't do anything with the address provided with the -b param on its
>--active-- side, where this problem takes place.

It passes the address into rdma_resolve_addr() as the source address, which results in binding to that address.

- Sean

From hal.rosenstock at gmail.com Fri Feb 6 11:12:08 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 6 Feb 2009 14:12:08 -0500
Subject: [ofa-general] [RFC] OpenSM vendor layer
Message-ID:

Hi,

I'm looking at adding pkey support into the OpenSM vendor layer. The pkey table is a per-port structure and is part of ib_port_attr_t. That structure also includes num_pkeys. There is only one related API: osm_vendor_get_all_port_attr, which takes several pointers; the second one is a pointer to a preallocated array of port attributes (memory allocation for that is done by the client). ib_port_attr_t includes a pointer to the pkey table. So the only way this can work is if that allocation is also done by the client, which makes that a valid parameter on input (as well as output). Similarly for num_pkeys, so the vendor layer doesn't go past the end of the supplied table. So both num_pkeys and p_pkey_table in that struct need to be in/out parameters.
num_pkeys could always be returned as the total number of pkeys for the port when num_pkeys is set to 0 on input.

The same is true for the gid table in ib_port_attr_t.

I'm also not sure which vendor layers are important. I don't see how to fix them all (e.g. osm_vendor_al.c is one, there are some others) as some of them appear to do a straight memory-to-memory copy of the ib_port_attr_t structure (others are OK and fixable).

The only other alternative I see is to change this API and possibly this structure, which is way more disruptive and risky (especially with the inability to test anything but one of the vendor layers).

Thoughts?

-- Hal

From brian at sun.com Fri Feb 6 11:39:58 2009
From: brian at sun.com (Brian J. Murrell)
Date: Fri, 06 Feb 2009 14:39:58 -0500
Subject: [ofa-general] build warnings on rhel4 U6
Message-ID: <1233949198.3257.19.camel@pc.interlinx.bc.ca>

I get these warnings trying to build with RHEL4U6 and ofa_kernel from OFED 1.4:

include/linux/jbd.h:1204:1: warning: "assert_spin_locked" redefined
In file included from include/linux/wait.h:25,
                 from include/linux/fs.h:12,
                 from /cache/build/BUILD/lustre-kernel-2.6.9/lustre/kernel-ib-devel/usr/src/ofa_kernel/kernel_addons/backport/2.6.9_U6/include/linux/fs.h:4,
                 from /cache/build/BUILD/lustre-1.6.7.50/lustre/lvfs/fsfilt.c:42:
/cache/build/BUILD/lustre-kernel-2.6.9/lustre/kernel-ib-devel/usr/src/ofa_kernel/kernel_addons/backport/2.6.9_U6/include/linux/spinlock.h:8:1: warning: this is the location of the previous definition

The code in question is (from jbd.h):

#ifdef __KERNEL__

#ifdef CONFIG_SMP
#define assert_spin_locked(lock) J_ASSERT(spin_is_locked(lock))
#else
#define assert_spin_locked(lock) do {} while(0)
#endif

and (from the backport spinlock.h):

#ifndef BACKPORT_LINUX_SPINLOCK_H
#define BACKPORT_LINUX_SPINLOCK_H

#include_next

#define spin_lock_nested(lock, subclass) spin_lock(lock)
#define assert_spin_locked(lock) do { (void)(lock); } while(0)

#endif

Any thoughts on how to resolve?

b.
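One common way to silence a "redefined" warning of this kind, sketched here under the assumption that the later header (standing in for jbd.h) can be patched, is to guard its definition with #ifndef so that a definition already in scope is kept. This is a userspace illustration only, not a tested fix for the Lustre/OFED build. Note the trade-off: with the guard, the backport's no-op wins, so the SMP J_ASSERT check from jbd.h is lost; an alternative is an `#undef assert_spin_locked` placed just before the later definition, which keeps jbd.h's version instead.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the backport spinlock.h, which is seen first. */
#define assert_spin_locked(lock) do { (void)(lock); } while (0)

/* Stand-in for the later jbd.h definition. The #ifndef guard keeps the
 * earlier definition instead of redefining it, so the preprocessor no
 * longer warns. The body below is never expanded in this file. */
#ifndef assert_spin_locked
#define assert_spin_locked(lock) J_ASSERT(spin_is_locked(lock))
#endif

/* Hypothetical lock type, just so the macro has something to check. */
struct fake_lock {
    int held;
};

/* Uses the macro; with the guard above it expands to the backport no-op. */
int check_lock(struct fake_lock *l)
{
    assert(l != NULL);
    assert_spin_locked(l);
    return 1;
}
```

Compiling this file with warnings enabled produces no redefinition warning, because the guarded definition is skipped.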
From hal.rosenstock at gmail.com Fri Feb 6 11:47:17 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 6 Feb 2009 14:47:17 -0500
Subject: [ofa-general] Re: [RFC] OpenSM vendor layer
In-Reply-To:
References:
Message-ID:

On Fri, Feb 6, 2009 at 2:12 PM, Hal Rosenstock wrote:
> Hi,
>
> I'm looking at adding pkey support into the OpenSM vendor layer. The
> pkey table is a per port structure and is part of ib_port_attr_t. That
> structure also include num_pkeys. There is only related API:
> osm_vendor_get_all_port_attr which takes several pointers, the second
> one is a pointer to a preallocated array of port attributes (memory
> allocation for that is done by the client). ib_port_attr_t includes a
> pointer to the pkey table. So the only way this can work is if that
> allocation is also done by the client which makes that a valid
> parameter on input (as well as output). Similarly for num_pkeys so the
> vendor layer doesn't go past the end of the supplied table. So both
> num_pkeys and p_pkey_table in that struct need to be in/out
> parameters. num_pkeys could always be returned as the total number of
> pkeys for the port when num_pkeys is set to 0 on input.
>
> Similar thing is true for gid table in ib_port_attr_t.
>
> I'm also not sure which vendor layers are important. I don't see how
> to fix them all (e.g. osm_vendor_al.c is one, there are some others)
> as some of them appear to do a straight memory to memory copy of the
> ib_port_attr_t structure (others are OK and fixable).
>
> The only other alternative I see is to change this API and possibly
> this structure which is way more disruptive and risky (especially with
> the inability to test anything but one of the vendor layers).

Actually, although more disruptive, it might be cleaner (and safer in the long run) to add to the vendor API.
There could be additional osm vendor APIs for pkeys and gids and these could return some suitable IB_ error from ib_types in vendor layers where they are unimplemented. IB_UNSUPPORTED looks good to me. I'm likely to head down this approach unless I hear otherwise.

-- Hal

> Thoughts ?
>
> -- Hal
>

From or.gerlitz at gmail.com Fri Feb 6 13:11:24 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 6 Feb 2009 23:11:24 +0200
Subject: [ofa-general] troubleshooting IB_CM_REJ_INVALID_SERVICE_ID in RDMA_CM_EVENT_REJECTED at active side of the connection
In-Reply-To: <20090205044728.GL18580@sun.com>
References: <20090205044728.GL18580@sun.com>
Message-ID: <15ddcffd0902061311kbb4c4d7j24ad93dc51791609@mail.gmail.com>

On Thu, Feb 5, 2009 at 6:47 AM, Isaac Huang wrote:
> I got some RDMA_CM_EVENT_REJECTED errors at active sides (i.e. nodes
> Poking around in CM code told me that the passive side couldn't find a listener with
> requested service_id on the incoming device of the connection request.
for this rdma-cm event, the status field would be a value from the ib_cm_rej_reason, so I assume you were getting IB_CM_REJ_INVALID_SERVICE_ID

> Could you guys give me some tips for troubleshooting? Any
> debugging options or /proc file to look at? Is there any netstat-like
> tool (e.g. something like a "netstat -ltp" to find out who is
> listening on which device)?

yes, this is a pain in the ass, currently there's no netstat-like support for RDMA connections

> The other possible cause could be ARP flux, but unfortunately arping
> via IPoIB always segfault on our systems. Is there any other way to
> troubleshoot possible ARP flux issues?

yes, ping could serve you in that respect: just use it and then look at the resulting neighbours by doing

$ip neigh show

and comparing with

$ip addr show

on the system you are pinging. Your problem may be solved through correct setting of the arp_ignore sysctl attribute; take a look at the known issues section in the ipoib release notes provided with the ofed-docs package.

Or.

From rdreier at cisco.com Fri Feb 6 13:18:22 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:18:22 -0800
Subject: [ofa-general] [PATCH 2.6.30 1/2] RDMA/cxgb3: sgl/pbl offset calculation is 64b.
In-Reply-To: <20090204202612.27031.78831.stgit@dell3.ogc.int> (Steve Wise's message of "Wed, 04 Feb 2009 14:26:12 -0600")
References: <20090204202612.27031.78831.stgit@dell3.ogc.int>
Message-ID:

> The variable 'offset' in iwch_sgl2pbl_map() needs to be a u64.

I assume this fixes an overflow. What's the impact of this overflow, and when does it trigger? ie is this urgent enough for 2.6.29 maybe?

- R.

From rdreier at cisco.com Fri Feb 6 13:19:45 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:19:45 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection termination fixes.
In-Reply-To: <20090204202614.27031.22248.stgit@dell3.ogc.int> (Steve Wise's message of "Wed, 04 Feb 2009 14:26:14 -0600")
References: <20090204202612.27031.78831.stgit@dell3.ogc.int> <20090204202614.27031.22248.stgit@dell3.ogc.int>
Message-ID:

> + BUG_ON((*cqe_flushed == 0) && !SW_CQE(*hw_cqe));

BUG_ON()s are kind of nasty -- possibly killing the whole box because of a driver issue or an unanticipated HW quirk -- is there any way to report this problem and try to limp on?

- R.

From swise at opengridcomputing.com Fri Feb 6 13:27:10 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 06 Feb 2009 15:27:10 -0600
Subject: [ofa-general] [PATCH 2.6.30 1/2] RDMA/cxgb3: sgl/pbl offset calculation is 64b.
In-Reply-To:
References: <20090204202612.27031.78831.stgit@dell3.ogc.int>
Message-ID: <498CAB2E.2070105@opengridcomputing.com>

Roland Dreier wrote:
> > The variable 'offset' in iwch_sgl2pbl_map() needs to be a u64.
>
> I assume this fixes an overflow. What's the impact of this overflow,
> and when does it trigger? ie is this urgent enough for 2.6.29 maybe?
>
> - R.
>

This was actually found by a customer using another OS derived from the ofed code. 2.6.30 is ok with me.

Steve.

From swise at opengridcomputing.com Fri Feb 6 13:32:54 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 06 Feb 2009 15:32:54 -0600
Subject: [ofa-general] Re: [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection termination fixes.
In-Reply-To:
References: <20090204202612.27031.78831.stgit@dell3.ogc.int> <20090204202614.27031.22248.stgit@dell3.ogc.int>
Message-ID: <498CAC86.3090005@opengridcomputing.com>

Roland Dreier wrote:
> > + BUG_ON((*cqe_flushed == 0) && !SW_CQE(*hw_cqe));
>
> BUG_ON()s are kind of nasty -- possibly killing the whole box because of
> a driver issue or an unanticipated HW quirk -- is there any way to
> report this problem and try to limp on?
>

I'm not sure I agree with trying to limp on. This BUG_ON() doesn't indicate a HW quirk.
It indicates the driver logic is busted. Isn't that what BUG_ON() should be used for?

Steve.

From rdreier at cisco.com Fri Feb 6 13:38:00 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:38:00 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection termination fixes.
In-Reply-To: <498CAC86.3090005@opengridcomputing.com> (Steve Wise's message of "Fri, 06 Feb 2009 15:32:54 -0600")
References: <20090204202612.27031.78831.stgit@dell3.ogc.int> <20090204202614.27031.22248.stgit@dell3.ogc.int> <498CAC86.3090005@opengridcomputing.com>
Message-ID:

> I'm not sure I agree with trying to limp on. This BUG_ON() doesn't
> indicate a HW quirk. It indicates the driver logic is busted. Isn't
> that what BUG_ON() should be used for?

Yeah, I guess so -- the only issue is that it's very annoying for some buggy driver to kill the whole system when only some non-critical piece is busted. But it's not a killer.

From rdreier at cisco.com Fri Feb 6 13:39:49 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:39:49 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection termination fixes.
In-Reply-To: <20090204202614.27031.22248.stgit@dell3.ogc.int> (Steve Wise's message of "Wed, 04 Feb 2009 14:26:14 -0600")
References: <20090204202612.27031.78831.stgit@dell3.ogc.int> <20090204202614.27031.22248.stgit@dell3.ogc.int>
Message-ID:

applied 1-2

From rdreier at cisco.com Fri Feb 6 13:40:50 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:40:50 -0800
Subject: [ofa-general] [PATCH] RDMA/nes: ibv_devinfo displays 0 for vendor_id and vendor_part_id
In-Reply-To: <20090204234434.GA1856@ctung-MOBL> (Chien Tung's message of "Wed, 4 Feb 2009 17:44:34 -0600")
References: <20090204234434.GA1856@ctung-MOBL>
Message-ID:

thanks, applied

From rdreier at cisco.com Fri Feb 6 13:42:34 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:42:34 -0800
Subject: [ofa-general] Re: [PATCH] RDMA/nes: tmp_addr compilation warning
In-Reply-To: <20090205152106.GA2304@ctung-MOBL> (Chien Tung's message of "Thu, 5 Feb 2009 09:21:06 -0600")
References: <20090205152106.GA2304@ctung-MOBL>
Message-ID:

thanks, applied

From swise at opengridcomputing.com Fri Feb 6 13:44:24 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 06 Feb 2009 15:44:24 -0600
Subject: [ofa-general] Re: [PATCH 2.6.30 2/2] RDMA/cxgb3: Connection termination fixes.
In-Reply-To:
References: <20090204202612.27031.78831.stgit@dell3.ogc.int> <20090204202614.27031.22248.stgit@dell3.ogc.int> <498CAC86.3090005@opengridcomputing.com>
Message-ID: <498CAF38.80106@opengridcomputing.com>

Roland Dreier wrote:
> > I'm not sure I agree with trying to limp on. This BUG_ON() doesn't
> > indicate a HW quirk. It indicates the driver logic is busted. Isn't
> > that what BUG_ON() should be used for?
>
> Yeah, I guess so -- the only issue is that it's very annoying for some
> buggy driver to kill the whole system when only some non-critical piece
> is busted.
>

I agree with you there.
But I think trying to gracefully bail out on these conditions is painful and complicated and prone to resulting in a BUG_ON() somewhere else. :)

From jgunthorpe at obsidianresearch.com Fri Feb 6 13:52:52 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 6 Feb 2009 14:52:52 -0700
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used for bind
In-Reply-To: <498C6DEC.70805@oracle.com>
References: <15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com> <498C6DEC.70805@oracle.com>
Message-ID: <20090206215252.GE19892@obsidianresearch.com>

On Fri, Feb 06, 2009 at 12:05:48PM -0500, Richard Frank wrote:
> I played around with this a bit more yesterday - and it looks like
> rdma_bind_addr()->rdma_resolve_ip()->ip_dev_find() is always returning the
> first matching entry in the routing table... even though we are providing
> the source ip for the bind...

Right, that's the trouble: it shouldn't be calling ip_dev_find on the bind path with any address. ip_route_output_key needs to be used to get the device.

Just looking at the 2.6.27 upstream it looks like ip_dev_find is used in many places where a route lookup would probably be more appropriate.

Jason

From rdreier at cisco.com Fri Feb 6 13:59:59 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 13:59:59 -0800
Subject: [ofa-general] Re: [PATCH 1 of 2 for 2.6.28] core: Fix Raw Ethertype QP support
In-Reply-To: <200808121720.11878.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 12 Aug 2008 17:20:11 +0300")
References: <200808121720.11878.jackm@dev.mellanox.co.il>
Message-ID:

> @@ -752,6 +752,11 @@ struct ib_send_wr {
>                         int                     access_flags;
>                         u32                     rkey;
>                 } fast_reg;
> +               struct {
> +                       struct ib_unpacked_lrh  *lrh;
> +                       u32                     eth_type;
> +                       u8                      static_rate;
> +               } raw_ety;

Would it be more sensible to make eth_type __be16, since it's limited to 16 bits, and ethertype is usually specified in network endian?
Also, rather than an LRH structure, would it make more sense to give dlid, source path bits and service level? Otherwise it seems the consumer needs to keep track of the port's assigned LID to make sure the SLID field is correct (not to mention computing packet length, setting LNH properly, etc).

It seems there are some changes needed to the ib_wc structure to be able to return the ethertype on receive?

- R.

From rdreier at cisco.com Fri Feb 6 14:05:44 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 06 Feb 2009 14:05:44 -0800
Subject: [ofa-general] [PATCH 2 of 2 for 2.6.28] mlx4: Add Raw Ethertype QP support
In-Reply-To: <200812151312.53603.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 15 Dec 2008 13:12:53 +0200")
References: <200812151312.53603.jackm@dev.mellanox.co.il>
Message-ID:

> -           type != IB_QPT_SMI && type != IB_QPT_GSI)
> +           type != IB_QPT_SMI && type != IB_QPT_GSI && type != IB_QPT_RAW_ETY)

Seems we're at the point where mlx4 could use an "is_special_qpt()" helper maybe?

>                 err = create_qp_common(dev, pd, init_attr, udata,
>                                        dev->dev->caps.sqp_start +
> -                                      (init_attr->qp_type == IB_QPT_SMI ? 0 : 2) +
> +                                      (init_attr->qp_type == IB_QPT_RAW_ETY ? 4 :
> +                                      (init_attr->qp_type == IB_QPT_SMI ? 0 : 2)) +
>                                        init_attr->port_num - 1,

I think this is now way past the point where we should use a helper function to compute this?

> @@ -60,6 +60,7 @@ enum {
>         MLX4_DEV_CAP_FLAG_IPOIB_CSUM    = 1 <<  7,
>         MLX4_DEV_CAP_FLAG_BAD_PKEY_CNTR = 1 <<  8,
>         MLX4_DEV_CAP_FLAG_BAD_QKEY_CNTR = 1 <<  9,
> +       MLX4_DEV_CAP_FLAG_RAW_ETY       = 1 << 13,
>         MLX4_DEV_CAP_FLAG_MEM_WINDOW    = 1 << 16,
>         MLX4_DEV_CAP_FLAG_APM           = 1 << 17,
>         MLX4_DEV_CAP_FLAG_ATOMIC        = 1 << 18,

probably nice to add this in dump_dev_cap_flags() so someone can check dmesg output to see if raw ethertype is supported.

I don't see any changes to the poll cq side of things. Is there anything required to handle receiving raw ethertype datagrams?

- R.
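The two helpers suggested in the review could look roughly like the C sketch below. It is hypothetical: it uses a local stand-in enum rather than the kernel's enum ib_qp_type, and it is not the mlx4 code. It only restates the offsets visible in the quoted diff (SMI at 0, GSI at 2, RAW_ETY at 4, then one QP per port) so the nested ternary becomes a single named function.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the ib_qp_type values used in the patch
 * (illustrative only; the real enum lives in the kernel headers). */
enum qp_type { QPT_SMI, QPT_GSI, QPT_RC, QPT_UC, QPT_UD, QPT_RAW_ETY };

/* Hypothetical helper: QP types that live in the special-QP range. */
bool is_special_qpt(enum qp_type type)
{
    return type == QPT_SMI || type == QPT_GSI || type == QPT_RAW_ETY;
}

/* Hypothetical helper replacing the nested ternary in the patch:
 * SMI -> +0, GSI -> +2, RAW_ETY -> +4, then one slot per port. */
int special_qpn(int sqp_start, enum qp_type type, int port_num)
{
    int base;

    assert(is_special_qpt(type));
    assert(port_num >= 1);

    switch (type) {
    case QPT_SMI:
        base = 0;
        break;
    case QPT_GSI:
        base = 2;
        break;
    case QPT_RAW_ETY:
        base = 4;
        break;
    default:
        return -1; /* not a special QP type */
    }
    return sqp_start + base + port_num - 1;
}
```

For example, with a hypothetical sqp_start of 64, the GSI QP for port 2 would be 64 + 2 + 2 - 1 = 67.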
From or.gerlitz at gmail.com Fri Feb 6 15:02:28 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Sat, 7 Feb 2009 01:02:28 +0200
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad/rpc.c: In mad_rpc/mad_rpc_rmpp, set rpc attribute ID from response
In-Reply-To: <1233877653.8992.516.camel@bertha1.edm.orcorp.ca>
References: <1233877653.8992.516.camel@bertha1.edm.orcorp.ca>
Message-ID: <15ddcffd0902061502l6c59161bq994802624ed4e6d1@mail.gmail.com>

On Fri, Feb 6, 2009 at 1:47 AM, Hal Rosenstock wrote:
> Sasha,
> This patch sets the attribute ID based on what is in the response.

Hal,

Your patches can't really be reviewed when they are sent as attachments; any reason not to send them embedded within the email message?

Or.

From richard.frank at oracle.com Fri Feb 6 15:31:40 2009
From: richard.frank at oracle.com (Richard Frank)
Date: Fri, 06 Feb 2009 18:31:40 -0500
Subject: [ofa-general] RE: pick the outgoing HCA based on the IP used for bind
In-Reply-To: <20090206215252.GE19892@obsidianresearch.com>
References: <15ddcffd0902060843i1eceef42nca7af9acb9d191a5@mail.gmail.com> <498C6DEC.70805@oracle.com> <20090206215252.GE19892@obsidianresearch.com>
Message-ID: <498CC85C.8070903@oracle.com>

Interesting - Andy Grover pointed this out too - and I totally (as usual) missed the point. :(

Jason Gunthorpe wrote:
> On Fri, Feb 06, 2009 at 12:05:48PM -0500, Richard Frank wrote:
>
>> I played around with this a bit more yesterday - and it looks like
>> rdma_bind_addr()->rdma_resolve_ip()->ip_dev_find() is always returning the
>> first matching entry in the routing table... even though we are providing
>> the source ip for the bind...
>>
>
> Right, thats the trouble, it shouldn't be calling ip_dev_find on the
> bind path with any address.. ip_route_output_key needs to be used to
> get the device.
>
> Just looking at the 2.6.27 upstream it looks like ip_dev_find is used
> in many places where a route lookup would probably be more appropriate..
>
> Jason
>

From acceptany at gmail.com Fri Feb 6 19:01:45 2009
From: acceptany at gmail.com (=?GB2312?B?zfXUyrHy?=)
Date: Sat, 7 Feb 2009 11:01:45 +0800
Subject: [ofa-general] ***SPAM*** problem about the installation of the OPED
Message-ID: <91fe68d50902061901g2a409e50l5f6550f2c4159b84@mail.gmail.com>

I want to install OFED on a PC without any InfiniBand devices. Can this work? Or what hardware do I need if I want to install this software on a general-purpose computer?

From cameron at harr.org Fri Feb 6 19:34:29 2009
From: cameron at harr.org (Cameron Harr)
Date: Fri, 06 Feb 2009 20:34:29 -0700
Subject: [ofa-general] ***SPAM*** problem about the installation of the OPED
In-Reply-To: <91fe68d50902061901g2a409e50l5f6550f2c4159b84@mail.gmail.com>
References: <91fe68d50902061901g2a409e50l5f6550f2c4159b84@mail.gmail.com>
Message-ID: <498D0145.7030801@harr.org>

An HTML attachment was scrubbed...

From sashak at voltaire.com Sat Feb 7 01:43:24 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Feb 2009 11:43:24 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet.c: clean_val() remove trailing quotation
Message-ID: <20090207094324.GD17713@sashak.voltaire.com>

Remove trailing quotation character from parsed string values.
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_subnet.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 2b3f463..bd52f76 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -794,7 +794,7 @@ static char *clean_val(char *val) /* clean quotas */ if ((*val == '\"' && *p == '\"') || (*val == '\'' && *p == '\'')) { val++; - p--; + *p-- = '\0'; } return val; } -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Feb 7 01:43:56 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 11:43:56 +0200 Subject: [ofa-general] [PATCH] opensm/osm_subnet.c: break matching when config parameter already found Message-ID: <20090207094356.GE17713@sashak.voltaire.com> Break config parameter matching procedure when it is already found. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_subnet.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 42c5682..3324af9 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1165,6 +1165,7 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) p_field = (void *)p_opts + r->opt_offset; /* don't call setup function first time */ r->parse_fn(NULL, p_key, p_val, p_field, NULL); + break; } } fclose(opts_file); @@ -1216,6 +1217,7 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn) p_field = (void *)p_opts + r->opt_offset; r->parse_fn(p_subn, p_key, p_val, p_field, r->setup_fn); + break; } } fclose(opts_file); -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Feb 7 01:48:10 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 11:48:10 +0200 Subject: [ofa-general] [PATCH] opensm: avoid memory leaks on config parameters reloading In-Reply-To: <20090203122450.GB11874@sashak.voltaire.com> References: <497DC87F.2090308@gmail.com> <497DC9FC.2050907@gmail.com> 
<20090203122450.GB11874@sashak.voltaire.com> Message-ID: <20090207094810.GF17713@sashak.voltaire.com> When OpenSM string config parameters are loaded it will always allocate memory (except NULL value), and will free and reallocate on reloading. Signed-off-by: Sasha Khapyorsky --- On 14:24 Tue 03 Feb , Sasha Khapyorsky wrote: > > I'm applying this with several changes: > > - disable update option and setup function for all string parameter - > as I commented originally opts_parse_charp() will leak memory and this > cannot be ignored if config file is rescanned. Exception is QoS string > parameters where memory leak is handled. This probably solves an issue with potential memory leaks.... opensm/opensm/main.c | 33 +++++++++++++++----------- opensm/opensm/osm_subnet.c | 55 +++++++++++++------------------------------ 2 files changed, 36 insertions(+), 52 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index c09a54e..a8dc9e6 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -507,6 +507,11 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm) /********************************************************************** **********************************************************************/ +#define SET_STR_OPT(opt, val) do { \ + if (opt) free(opt); \ + opt = val ? strdup(val) : NULL ; \ +} while (0) + int main(int argc, char *argv[]) { osm_opensm_t osm; @@ -650,7 +655,7 @@ int main(int argc, char *argv[]) /* Specifies ignore guids file. 
*/ - opt.port_prof_ignore_file = optarg; + SET_STR_OPT(opt.port_prof_ignore_file, optarg); printf(" Ignore Guids File = %s\n", opt.port_prof_ignore_file); break; @@ -706,7 +711,7 @@ int main(int argc, char *argv[]) || strcmp(optarg, OSM_LOOPBACK_CONSOLE) == 0 #endif ) - opt.console = optarg; + SET_STR_OPT(opt.console, optarg); else printf("-console %s option not understood\n", optarg); @@ -763,7 +768,7 @@ int main(int argc, char *argv[]) break; case 'f': - opt.log_file = optarg; + SET_STR_OPT(opt.log_file, optarg); break; case 'L': @@ -778,7 +783,7 @@ int main(int argc, char *argv[]) break; case 'P': - opt.partition_config_file = optarg; + SET_STR_OPT(opt.partition_config_file, optarg); break; case 'N': @@ -790,7 +795,7 @@ int main(int argc, char *argv[]) break; case 'Y': - opt.qos_policy_file = optarg; + SET_STR_OPT(opt.qos_policy_file, optarg); printf(" QoS policy file \'%s\'\n", optarg); break; @@ -829,7 +834,7 @@ int main(int argc, char *argv[]) break; case 'R': - opt.routing_engine_names = optarg; + SET_STR_OPT(opt.routing_engine_names, optarg); printf(" Activate \'%s\' routing engine(s)\n", optarg); break; @@ -844,17 +849,17 @@ int main(int argc, char *argv[]) break; case 'M': - opt.lid_matrix_dump_file = optarg; + SET_STR_OPT(opt.lid_matrix_dump_file, optarg); printf(" Lid matrix dump file is \'%s\'\n", optarg); break; case 'U': - opt.lfts_file = optarg; + SET_STR_OPT(opt.lfts_file, optarg); printf(" LFTs file is \'%s\'\n", optarg); break; case 'S': - opt.sa_db_file = optarg; + SET_STR_OPT(opt.sa_db_file, optarg); printf(" SA DB file is \'%s\'\n", optarg); break; @@ -862,7 +867,7 @@ int main(int argc, char *argv[]) /* Specifies root guids file */ - opt.root_guid_file = optarg; + SET_STR_OPT(opt.root_guid_file, optarg); printf(" Root Guid File: %s\n", opt.root_guid_file); break; @@ -870,20 +875,20 @@ int main(int argc, char *argv[]) /* Specifies compute node guids file */ - opt.cn_guid_file = optarg; + SET_STR_OPT(opt.cn_guid_file, optarg); printf(" Compute 
Node Guid File: %s\n", opt.cn_guid_file); break; case 'm': /* Specifies ids guid file */ - opt.ids_guid_file = optarg; + SET_STR_OPT(opt.ids_guid_file, optarg); printf(" IDs Guid File: %s\n", opt.ids_guid_file); break; case 'X': /* Specifies guid routing order file */ - opt.guid_routing_order_file = optarg; + SET_STR_OPT(opt.guid_routing_order_file, optarg); printf(" GUID Routing Order File: %s\n", opt.guid_routing_order_file); break; @@ -912,7 +917,7 @@ int main(int argc, char *argv[]) #endif /* ENABLE_OSM_PERF_MGR */ case 3: - opt.prefix_routes_file = optarg; + SET_STR_OPT(opt.prefix_routes_file, optarg); break; case 4: opt.consolidate_ipv6_snm_req = TRUE; diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index bd52f76..42c5682 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -488,21 +488,15 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt) { opt->max_vls = 0; opt->high_limit = -1; - opt->vlarb_high = NULL; - opt->vlarb_low = NULL; - opt->sl2vl = NULL; -} - -static void subn_free_qos_options(IN osm_qos_options_t * opt) -{ if (opt->vlarb_high) free(opt->vlarb_high); - + opt->vlarb_high = NULL; if (opt->vlarb_low) free(opt->vlarb_low); - + opt->vlarb_low = NULL; if (opt->sl2vl) free(opt->sl2vl); + opt->sl2vl = NULL; } /********************************************************************** @@ -518,7 +512,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->m_key_lease_period = 0; p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS; p_opt->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE; - p_opt->console = OSM_DEFAULT_CONSOLE; + p_opt->console = strdup(OSM_DEFAULT_CONSOLE); p_opt->console_port = OSM_DEFAULT_CONSOLE_PORT; p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC; /* by default we will consider waiting for 50x transaction timeout normal */ @@ -566,13 +560,13 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->dump_files_dir = 
getenv("OSM_TMP_DIR"); if (!p_opt->dump_files_dir || !(*p_opt->dump_files_dir)) p_opt->dump_files_dir = OSM_DEFAULT_TMP_DIR; - - p_opt->log_file = OSM_DEFAULT_LOG_FILE; + p_opt->dump_files_dir = strdup(p_opt->dump_files_dir); + p_opt->log_file = strdup(OSM_DEFAULT_LOG_FILE); p_opt->log_max_size = 0; - p_opt->partition_config_file = OSM_DEFAULT_PARTITION_CONFIG_FILE; + p_opt->partition_config_file = strdup(OSM_DEFAULT_PARTITION_CONFIG_FILE); p_opt->no_partition_enforcement = FALSE; p_opt->qos = FALSE; - p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE; + p_opt->qos_policy_file = strdup(OSM_DEFAULT_QOS_POLICY_FILE); p_opt->accum_log_file = TRUE; p_opt->port_prof_ignore_file = NULL; p_opt->port_profile_switch_nodes = FALSE; @@ -591,7 +585,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->exit_on_fatal = TRUE; p_opt->enable_quirks = FALSE; p_opt->no_clients_rereg = FALSE; - p_opt->prefix_routes_file = OSM_DEFAULT_PREFIX_ROUTES_FILE; + p_opt->prefix_routes_file = strdup(OSM_DEFAULT_PREFIX_ROUTES_FILE); p_opt->consolidate_ipv6_snm_req = FALSE; subn_init_qos_options(&p_opt->qos_options); subn_init_qos_options(&p_opt->qos_ca_options); @@ -753,25 +747,16 @@ static void opts_parse_charp(IN osm_subn_t *p_subn, IN char *p_key, char **p_val = p_v; const char *current_str = *p_val ? *p_val : null_str ; - if (!p_val_str) - return; - - if (strcmp(p_val_str, current_str)) { + if (p_val_str && strcmp(p_val_str, current_str)) { + char *new; log_config_value(p_key, "%s", p_val_str); /* special case the "(null)" string */ - if (strcmp(null_str, p_val_str) == 0) { - if (pfn) - pfn(p_subn, NULL); - *p_val = NULL; - } else { - if (pfn) - pfn(p_subn, p_val_str); - /* - Ignore the possible memory leak here; - the pointer may be to a static default. - */ - *p_val = strdup(p_val_str); - } + new = strcmp(null_str, p_val_str) ? 
strdup(p_val_str) : NULL; + if (pfn) + pfn(p_subn, new); + if (*p_val) + free(*p_val); + *p_val = new; } } @@ -1211,12 +1196,6 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn) return -1; } - subn_free_qos_options(&p_opts->qos_options); - subn_free_qos_options(&p_opts->qos_ca_options); - subn_free_qos_options(&p_opts->qos_sw0_options); - subn_free_qos_options(&p_opts->qos_swe_options); - subn_free_qos_options(&p_opts->qos_rtr_options); - subn_init_qos_options(&p_opts->qos_options); subn_init_qos_options(&p_opts->qos_ca_options); subn_init_qos_options(&p_opts->qos_sw0_options); -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Feb 7 01:53:41 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 11:53:41 +0200 Subject: [ofa-general] [RFC] infiniband-diags/perfquery.c: Any objections to changing an option name ? In-Reply-To: References: Message-ID: <20090207095341.GG17713@sashak.voltaire.com> Hi Hal, On 17:55 Thu 05 Feb , Hal Rosenstock wrote: > In infiniband-diags/perfquery, -e is used for extended counters and > covers up using the common errors option so I'd like to change this to > be -x for xtended. Any objections ? AFAIK '-e' is not used in infiniband-diags scripts, and the proposed change likely will not break any known usage. I'm fine with the change. Sasha > Without this change when perfquery > fails you can't get the more detailed error information which is very > useful for debugging problems.
> > -- Hal > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Sat Feb 7 02:25:10 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 12:25:10 +0200 Subject: [ofa-general] Re: [PATCH] ibsim: Eliminate unused modified variable In-Reply-To: <1233876682.8992.492.camel@bertha1.edm.orcorp.ca> References: <1233876682.8992.492.camel@bertha1.edm.orcorp.ca> Message-ID: <20090207102510.GH17713@sashak.voltaire.com> On 16:31 Thu 05 Feb , Hal Rosenstock wrote: > Sasha, > > Trivial patch to eliminate the unused 'modified' variable. Applied. Thanks. Sasha From sashak at voltaire.com Sat Feb 7 02:44:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 12:44:21 +0200 Subject: [ofa-general] Re: [PATCH] ibsim: Change lid print format to unsigned In-Reply-To: <1233876691.8992.494.camel@bertha1.edm.orcorp.ca> References: <1233876691.8992.494.camel@bertha1.edm.orcorp.ca> Message-ID: <20090207104421.GI17713@sashak.voltaire.com> On 16:31 Thu 05 Feb , Hal Rosenstock wrote: > Sasha, > > Patch to change lid print format to unsigned to be consistent elsewhere. dev_sysfs_create() in umad2sim.c generates a simulated sysfs tree. In the native sysfs tree, the port lid and sm_lid files store LID values in hex form (core/sysfs.c lid_show() and sm_lid_show()), so I don't see any good reason for the simulation to behave differently (unless you are going to change this in the kernel first). I'm removing this part from the patch.
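To illustrate Sasha's point: a tool reading either the real or the simulated sysfs tree should parse the lid and sm_lid files as hex. A minimal C sketch, assuming each file holds a single value such as `0x1a` followed by a newline (the format the kernel's lid_show() is understood to produce); parse_sysfs_lid() and format_sysfs_lid() are hypothetical helpers, not ibsim or libibumad functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse the text of a sysfs "lid" or "sm_lid" file.
 * strtoul() with base 0 accepts the "0x" prefix, so "0x1a\n" yields 26.
 * (Base 0 would also read a leading "0" as octal; sysfs uses "0x".)
 * Returns 0 on success, -1 if the text is not a valid 16-bit LID. */
static int parse_sysfs_lid(const char *text, uint16_t *lid)
{
	char *end;
	unsigned long v;

	assert(text != NULL && lid != NULL);
	errno = 0;
	v = strtoul(text, &end, 0);
	if (errno != 0 || end == text || v > 0xffff)
		return -1;
	/* accept only trailing whitespace such as the newline sysfs appends */
	while (*end == '\n' || *end == ' ')
		end++;
	if (*end != '\0')
		return -1;
	*lid = (uint16_t)v;
	return 0;
}

/* Hypothetical helper: write a LID in the same hex form, so a simulated
 * tree looks like the native one to its readers. */
static int format_sysfs_lid(char *buf, size_t len, uint16_t lid)
{
	return snprintf(buf, len, "0x%x\n", (unsigned)lid);
}
```

Keeping the simulated files in the same hex form means the same parsing code works against both trees, which is the consistency argument made in the reply.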
Sasha From sashak at voltaire.com Sat Feb 7 02:47:24 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 12:47:24 +0200 Subject: [ofa-general] Re: [PATCH] opensm/doc/perf-manager-arch.txt: Fix some commentary typos In-Reply-To: <1233877299.8992.508.camel@bertha1.edm.orcorp.ca> References: <1233877299.8992.508.camel@bertha1.edm.orcorp.ca> Message-ID: <20090207104724.GJ17713@sashak.voltaire.com> On 16:41 Thu 05 Feb , Hal Rosenstock wrote: > Sasha, > > Trivial patch to fix some typos in this doc. Applied. Thanks. Sasha From sashak at voltaire.com Sat Feb 7 02:50:24 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 12:50:24 +0200 Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr: Add copyrights In-Reply-To: <1233877343.8992.510.camel@bertha1.edm.orcorp.ca> References: <1233877343.8992.510.camel@bertha1.edm.orcorp.ca> Message-ID: <20090207105024.GK17713@sashak.voltaire.com> On 16:42 Thu 05 Feb , Hal Rosenstock wrote: > Sasha, > > This just adds copyrights missed in previous patches. Applied. Thanks. Sasha From sashak at voltaire.com Sat Feb 7 02:55:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 12:55:45 +0200 Subject: [ofa-general] Re: libibumad/umad.c: Change lid print format to unsigned In-Reply-To: <1233877414.8992.512.camel@bertha1.edm.orcorp.ca> References: <1233877414.8992.512.camel@bertha1.edm.orcorp.ca> Message-ID: <20090207105545.GL17713@sashak.voltaire.com> On 16:43 Thu 05 Feb , Hal Rosenstock wrote: > Sasha, > > This patch changes umad.c lid print format to unsigned. Both applied. Thanks. 
Sasha From sashak at voltaire.com Sat Feb 7 03:10:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 13:10:00 +0200 Subject: [ofa-general] [PATCH] libibmad/rpc.c: In mad_rpc/mad_rpc_rmpp, set rpc attribute ID from response In-Reply-To: <15ddcffd0902061502l6c59161bq994802624ed4e6d1@mail.gmail.com> References: <1233877653.8992.516.camel@bertha1.edm.orcorp.ca> <15ddcffd0902061502l6c59161bq994802624ed4e6d1@mail.gmail.com> Message-ID: <20090207110953.GM17713@sashak.voltaire.com> On 01:02 Sat 07 Feb , Or Gerlitz wrote: > On Fri, Feb 6, 2009 at 1:47 AM, Hal Rosenstock > wrote: > > Sasha, > > This patch sets the attribute ID based on what is in the response. > > Hal, > > Your patches can't really be reviewed when being sent as attachment, This is true :(. At some point I will likely need to start rejecting such patches. Sasha From vlad at lists.openfabrics.org Sat Feb 7 03:14:17 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 7 Feb 2009 03:14:17 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090207-0200 daily build status Message-ID: <20090207111418.0BB91E61085@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with
linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Sat Feb 7 03:57:56 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 13:57:56 +0200 Subject: [ofa-general] Re: [PATCH] libibmad/gs.c: Factor out common code In-Reply-To: <1233877688.8992.518.camel@bertha1.edm.orcorp.ca> References: <1233877688.8992.518.camel@bertha1.edm.orcorp.ca> Message-ID: <20090207115750.GN17713@sashak.voltaire.com> On 16:48 Thu 05 Feb , Hal Rosenstock wrote: > > This patch factors out some common code in gs.c. common_query_setup is > used by both pma_query_via and performance_reset_via. Should rcvbuf be initialized in the common code? I'm not sure, but if that is valid, then the mad_rpc call could look like: mad_rpc(srcport, &rpc, dest, NULL, rcvbuf); to avoid copying an empty payload in mad_build_pkt().
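Passing NULL as the data argument only saves work if the packet builder skips the copy in that case, which is what Sasha's suggestion assumes of mad_build_pkt(). A minimal sketch of that convention; build_payload() and the SKETCH_* layout constants are hypothetical and do not describe libibmad's real MAD layout or code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout values, for illustration only. */
#define SKETCH_MAD_SIZE    256
#define SKETCH_DATA_OFFSET 64

/* Copy the caller's payload into the packet only when one is supplied.
 * With data == NULL the payload area keeps whatever the caller put there
 * (zeros, if the buffer was zero-initialized), so no empty buffer has to
 * be prepared and copied. Returns the number of bytes copied. */
static size_t build_payload(uint8_t pkt[SKETCH_MAD_SIZE],
			    const void *data, size_t data_len)
{
	size_t max = SKETCH_MAD_SIZE - SKETCH_DATA_OFFSET;

	assert(pkt != NULL);
	if (data == NULL)
		return 0;
	if (data_len > max)
		data_len = max;
	memcpy(pkt + SKETCH_DATA_OFFSET, data, data_len);
	return data_len;
}
```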
Sasha From sashak at voltaire.com Sat Feb 7 04:09:31 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 14:09:31 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags/perfquery: Change option name for extended counters In-Reply-To: <1233878402.8992.523.camel@bertha1.edm.orcorp.ca> References: <1233878402.8992.523.camel@bertha1.edm.orcorp.ca> Message-ID: <20090207120924.GO17713@sashak.voltaire.com> On 17:00 Thu 05 Feb , Hal Rosenstock wrote: > Sasha, > > Per the RFC, this patch changes the option name for extended counters to > to not cover up common errors option. This changes it from -e/--extended > to -x/--xtended so -e/--errors can be used to get error information as > is common with the IB diags. To avoid typos this can be done as -x/--extended and -e/--errors: { "extended", 'x', ... }, getopt*() will handle this properly. Sasha From sashak at voltaire.com Sat Feb 7 04:33:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 14:33:55 +0200 Subject: [ofa-general] [RFC] OpenSM vendor layer In-Reply-To: References: Message-ID: <20090207123355.GP17713@sashak.voltaire.com> On 14:12 Fri 06 Feb , Hal Rosenstock wrote: > > I'm looking at adding pkey support into the OpenSM vendor layer. The > pkey table is a per port structure and is part of ib_port_attr_t. That > structure also include num_pkeys. There is only related API: > osm_vendor_get_all_port_attr which takes several pointers, the second > one is a pointer to a preallocated array of port attributes (memory > allocation for that is done by the client). ib_port_attr_t includes a > pointer to the pkey table. So the only way this can work is if that > allocation is also done by the client which makes that a valid > parameter on input (as well as output). 
This could be a client choice: if the pkey table pointer is initialized as NULL, osm_vendor_get_all_port_attr() allocates memory and initializes the table and its size; otherwise it fills in only the pkey table entries provided by the client. Sasha From sashak at voltaire.com Sat Feb 7 04:38:37 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 14:38:37 +0200 Subject: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: References: Message-ID: <20090207123830.GQ17713@sashak.voltaire.com> On 14:47 Fri 06 Feb , Hal Rosenstock wrote: > > Actually, although more disruptive, it might be cleaner (and safer in > the long run) to add to the vendor API. There could be additional osm > vendor APIs for pkeys and gids I don't think so - as its name says, the existing osm_vendor_get_all_port_attr() call could/should already provide *all* port attributes; there is no need for new APIs. Sasha From hal.rosenstock at gmail.com Sat Feb 7 04:39:49 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 7 Feb 2009 07:39:49 -0500 Subject: [ofa-general] [RFC] OpenSM vendor layer In-Reply-To: <20090207123355.GP17713@sashak.voltaire.com> References: <20090207123355.GP17713@sashak.voltaire.com> Message-ID: On Sat, Feb 7, 2009 at 7:33 AM, Sasha Khapyorsky wrote: > On 14:12 Fri 06 Feb , Hal Rosenstock wrote: >> >> I'm looking at adding pkey support into the OpenSM vendor layer. The >> pkey table is a per port structure and is part of ib_port_attr_t. That >> structure also include num_pkeys. There is only related API: >> osm_vendor_get_all_port_attr which takes several pointers, the second >> one is a pointer to a preallocated array of port attributes (memory >> allocation for that is done by the client). ib_port_attr_t includes a >> pointer to the pkey table. So the only way this can work is if that >> allocation is also done by the client which makes that a valid >> parameter on input (as well as output).
> > This could be a client choice: if the pkey table pointer is initialized > as NULL, osm_vendor_get_all_port_attr() allocates memory and initializes > the table and its size; otherwise it fills in only the pkey table entries > provided by the client. Right; that's what I was trying to describe. The downside of this approach is that it breaks in and out of tree uses of this API as the passed in structure is uninitialized. I can fix the in tree ones (I know about). -- Hal > Sasha > From hal.rosenstock at gmail.com Sat Feb 7 04:41:12 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 7 Feb 2009 07:41:12 -0500 Subject: Re: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: <20090207123830.GQ17713@sashak.voltaire.com> References: <20090207123830.GQ17713@sashak.voltaire.com> Message-ID: On Sat, Feb 7, 2009 at 7:38 AM, Sasha Khapyorsky wrote: > On 14:47 Fri 06 Feb , Hal Rosenstock wrote: >> >> Actually, although more disruptive, it might be cleaner (and safer in >> the long run) to add to the vendor API. There could be additional osm >> vendor APIs for pkeys and gids > > I don't think so - as its name says, the existing > osm_vendor_get_all_port_attr() call could/should already provide *all* > port attributes; there is no need for new APIs. I can see cases where rather than getting all port attr, it would be useful to get the bound port's attributes without all the rest. -- Hal > Sasha > From sashak at voltaire.com Sat Feb 7 05:20:19 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 15:20:19 +0200 Subject: [ofa-general] [RFC] OpenSM vendor layer In-Reply-To: References: <20090207123355.GP17713@sashak.voltaire.com> Message-ID: <20090207132019.GR17713@sashak.voltaire.com> On 07:39 Sat 07 Feb , Hal Rosenstock wrote: > > Right; that's what I was trying to describe. The downside of this > approach is that it breaks in and out of tree uses of this API as the > passed in structure is uninitialized. I can fix the in tree ones (I > know about).
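The allocation contract discussed in this thread, and the hazard Hal points out, fit in a few lines of C. Everything here is hypothetical (struct sketch_port_attr and fill_pkeys() are not the OpenSM vendor API); the sketch only shows the convention: a NULL table pointer asks the callee to allocate, and a non-NULL pointer is a caller-provided table with room for num_pkeys entries. An existing caller that passes an uninitialized structure breaks this, because a garbage pointer is indistinguishable from a caller-provided table.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the pkey part of a port attribute structure. */
struct sketch_port_attr {
	uint16_t num_pkeys;     /* in: capacity of p_pkey_table if non-NULL;
				   out: number of entries filled */
	uint16_t *p_pkey_table;
};

/* Fill attr with the device's pkeys (src, src_len).
 * - p_pkey_table == NULL: allocate a table for all entries (caller frees).
 * - p_pkey_table != NULL: fill at most num_pkeys entries of the caller's
 *   table.
 * The structure must be initialized by the caller. Returns 0 or -1. */
static int fill_pkeys(struct sketch_port_attr *attr,
		      const uint16_t *src, uint16_t src_len)
{
	uint16_t n;

	assert(attr != NULL && (src != NULL || src_len == 0));
	if (attr->p_pkey_table == NULL) {
		/* calloc at least one entry so a 0-length table is not
		   mistaken for an allocation failure */
		attr->p_pkey_table = calloc(src_len ? src_len : 1,
					    sizeof(uint16_t));
		if (attr->p_pkey_table == NULL)
			return -1;
		n = src_len;
	} else {
		n = attr->num_pkeys < src_len ? attr->num_pkeys : src_len;
	}
	if (n)
		memcpy(attr->p_pkey_table, src, n * sizeof(uint16_t));
	attr->num_pkeys = n;
	return 0;
}
```

With this convention, a caller that takes the allocating path owns the returned table and must free() it.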
All OpenSM vendor layer users are opensm, osmtest, saquery and ibis. BTW, why and where do you need this? Sasha From sashak at voltaire.com Sat Feb 7 05:28:01 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 15:28:01 +0200 Subject: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: References: <20090207123830.GQ17713@sashak.voltaire.com> Message-ID: <20090207132753.GS17713@sashak.voltaire.com> On 07:41 Sat 07 Feb , Hal Rosenstock wrote: > > I can see cases where rather than getting all port attr, it would be > useful to get the bound port's attributes without all the rest. Then it is probably simpler just to use umad_get_port(). Why to bother with all those OpenSM vendor junks? Sasha From hal.rosenstock at gmail.com Sat Feb 7 05:40:03 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 7 Feb 2009 08:40:03 -0500 Subject: [ofa-general] Re: [PATCH] infiniband-diags/perfquery: Change option name for extended counters In-Reply-To: <20090207120924.GO17713@sashak.voltaire.com> References: <1233878402.8992.523.camel@bertha1.edm.orcorp.ca> <20090207120924.GO17713@sashak.voltaire.com> Message-ID: On Sat, Feb 7, 2009 at 7:09 AM, Sasha Khapyorsky wrote: > On 17:00 Thu 05 Feb , Hal Rosenstock wrote: >> Sasha, >> >> Per the RFC, this patch changes the option name for extended counters to >> to not cover up common errors option. This changes it from -e/--extended >> to -x/--xtended so -e/--errors can be used to get error information as >> is common with the IB diags. > > To avoid typos this can be done as -x/--extended and -e/--errors: > > { "extended", 'x', ... }, Do you want a revised patch for this ? -- Hal > > getopt*() will handle this properly. 
> > Sasha > From hal.rosenstock at gmail.com Sat Feb 7 05:42:33 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 7 Feb 2009 08:42:33 -0500 Subject: Re: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: <20090207132753.GS17713@sashak.voltaire.com> References: <20090207123830.GQ17713@sashak.voltaire.com> <20090207132753.GS17713@sashak.voltaire.com> Message-ID: On Sat, Feb 7, 2009 at 8:28 AM, Sasha Khapyorsky wrote: > On 07:41 Sat 07 Feb , Hal Rosenstock wrote: >> >> I can see cases where rather than getting all port attr, it would be >> useful to get the bound port's attributes without all the rest. > > Then it is probably simpler just to use umad_get_port(). Why to bother > with all those OpenSM vendor junks? Is bypassing it's vendor layer acceptable for OpenSM unless we are going to totally remove it and go straight to umad (which I'm not proposing) ? -- Hal > Sasha > From sashak at voltaire.com Sat Feb 7 05:56:18 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 15:56:18 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c : Fixed bug on index port order incrementation In-Reply-To: <49896FF7.8060908@ext.bull.net> References: <4981DC18.9030400@ext.bull.net> <49896B9C.8040006@dev.mellanox.co.il> <49896FF7.8060908@ext.bull.net> Message-ID: <20090207135618.GT17713@sashak.voltaire.com> Yevgeny and Nicolas, On 11:37 Wed 04 Feb , Nicolas Morey Chaisemartin wrote: > > That seems good. > I'm going to think a bit more about the case where there are no downports. I hope an updated version of the patch will eventually be posted to the list.
Sasha From sashak at voltaire.com Sat Feb 7 06:44:26 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 16:44:26 +0200 Subject: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: References: <20090207123830.GQ17713@sashak.voltaire.com> <20090207132753.GS17713@sashak.voltaire.com> Message-ID: <20090207144426.GU17713@sashak.voltaire.com> On 08:42 Sat 07 Feb , Hal Rosenstock wrote: > On Sat, Feb 7, 2009 at 8:28 AM, Sasha Khapyorsky wrote: > > On 07:41 Sat 07 Feb , Hal Rosenstock wrote: > >> > >> I can see cases where rather than getting all port attr, it would be > >> useful to get the bound port's attributes without all the rest. > > > > Then it is probably simpler just to use umad_get_port(). Why to bother > > with all those OpenSM vendor junks? > > Is bypassing it's vendor layer acceptable for OpenSM Sure, so it is why I asked where and for what purpose do you need pkey table and why is OpenSM vendor layer chosen there? > unless we are > going to totally remove it and go straight to umad (which I'm not > proposing) ? BTW, WinOF now has libibumad implemented too, it could be an option to switch. Sasha From sashak at voltaire.com Sat Feb 7 06:46:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 16:46:27 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags/perfquery: Change option name for extended counters In-Reply-To: References: <1233878402.8992.523.camel@bertha1.edm.orcorp.ca> <20090207120924.GO17713@sashak.voltaire.com> Message-ID: <20090207144627.GV17713@sashak.voltaire.com> On 08:40 Sat 07 Feb , Hal Rosenstock wrote: > On Sat, Feb 7, 2009 at 7:09 AM, Sasha Khapyorsky wrote: > > On 17:00 Thu 05 Feb , Hal Rosenstock wrote: > >> Sasha, > >> > >> Per the RFC, this patch changes the option name for extended counters to > >> to not cover up common errors option. This changes it from -e/--extended > >> to -x/--xtended so -e/--errors can be used to get error information as > >> is common with the IB diags. 
> > > > To avoid typos this can be done as -x/--extended and -e/--errors: > > > > { "extended", 'x', ... }, > > Do you want a revised patch for this ? I will fix it in my tree. Sasha From hal.rosenstock at gmail.com Sat Feb 7 07:24:04 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 7 Feb 2009 10:24:04 -0500 Subject: Re: [ofa-general] [RFC] OpenSM vendor layer In-Reply-To: <20090207132019.GR17713@sashak.voltaire.com> References: <20090207123355.GP17713@sashak.voltaire.com> <20090207132019.GR17713@sashak.voltaire.com> Message-ID: On Sat, Feb 7, 2009 at 8:20 AM, Sasha Khapyorsky wrote: > On 07:39 Sat 07 Feb , Hal Rosenstock wrote: >> >> Right; that's what I was trying to describe. The downside of this >> approach is that it breaks in and out of tree uses of this API as the >> passed in structure is uninitialized. I can fix the in tree ones (I >> know about). > > All OpenSM vendor layer users are opensm, osmtest, saquery and ibis. > > BTW, why and where do you need this? For some PerfMgr work I'm doing. -- Hal > > Sasha > From hal.rosenstock at gmail.com Sat Feb 7 07:27:17 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 7 Feb 2009 10:27:17 -0500 Subject: Re: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: <20090207144426.GU17713@sashak.voltaire.com> References: <20090207123830.GQ17713@sashak.voltaire.com> <20090207132753.GS17713@sashak.voltaire.com> <20090207144426.GU17713@sashak.voltaire.com> Message-ID: On Sat, Feb 7, 2009 at 9:44 AM, Sasha Khapyorsky wrote: > On 08:42 Sat 07 Feb , Hal Rosenstock wrote: >> On Sat, Feb 7, 2009 at 8:28 AM, Sasha Khapyorsky wrote: >> > On 07:41 Sat 07 Feb , Hal Rosenstock wrote: >> >> >> >> I can see cases where rather than getting all port attr, it would be >> >> useful to get the bound port's attributes without all the rest. >> > >> > Then it is probably simpler just to use umad_get_port(). Why to bother >> > with all those OpenSM vendor junks?
>> >> Is bypassing it's vendor layer acceptable for OpenSM > > Sure, so it is why I asked where and for what purpose do you need pkey > table and why is OpenSM vendor layer chosen there? > >> unless we are >> going to totally remove it and go straight to umad (which I'm not >> proposing) ? > > BTW, WinOF now has libibumad implemented too, Yes, it seems pretty far along now. > it could be an option to switch. Could be but what about the other vendor layers ? Would we orphan those ? -- Hal > Sasha > From sashak at voltaire.com Sat Feb 7 09:02:34 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 19:02:34 +0200 Subject: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: References: <20090207123830.GQ17713@sashak.voltaire.com> <20090207132753.GS17713@sashak.voltaire.com> <20090207144426.GU17713@sashak.voltaire.com> Message-ID: <20090207170234.GX17713@sashak.voltaire.com> On 10:27 Sat 07 Feb , Hal Rosenstock wrote: > >> Is bypassing it's vendor layer acceptable for OpenSM > > > > Sure, so it is why I asked where and for what purpose do you need pkey > > table and why is OpenSM vendor layer chosen there? > > > >> unless we are > >> going to totally remove it and go straight to umad (which I'm not > >> proposing) ? > > > > BTW, WinOF now has libibumad implemented too, > > Yes, it seems pretty far along now. > > > it could be an option to switch. > > Could be but what about the other vendor layers ? Would we orphan those ? Who needs this really, it is broken long time anyway. 
Sasha From hal.rosenstock at gmail.com Sat Feb 7 09:02:29 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 7 Feb 2009 12:02:29 -0500 Subject: Re: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: <20090207170234.GX17713@sashak.voltaire.com> References: <20090207123830.GQ17713@sashak.voltaire.com> <20090207132753.GS17713@sashak.voltaire.com> <20090207144426.GU17713@sashak.voltaire.com> <20090207170234.GX17713@sashak.voltaire.com> Message-ID: On Sat, Feb 7, 2009 at 12:02 PM, Sasha Khapyorsky wrote: > On 10:27 Sat 07 Feb , Hal Rosenstock wrote: >> >> Is bypassing it's vendor layer acceptable for OpenSM >> > >> > Sure, so it is why I asked where and for what purpose do you need pkey >> > table and why is OpenSM vendor layer chosen there? >> > >> >> unless we are >> >> going to totally remove it and go straight to umad (which I'm not >> >> proposing) ? >> > >> > BTW, WinOF now has libibumad implemented too, >> >> Yes, it seems pretty far along now. >> >> > it could be an option to switch. >> >> Could be but what about the other vendor layers ? Would we orphan those ? > > Who needs this really, AFAIK the carrying along of these came from Mellanox. If they no longer need these and Windows is ready to switch over officially to umad, then I don't see an issue. > it is broken long time anyway. What are you referring to as broken here ? -- Hal > Sasha > From sashak at voltaire.com Sat Feb 7 10:37:01 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 20:37:01 +0200 Subject: [ofa-general] Re: [PATCH OpenSM 1/3] Added io_guid_file options and variables in the different structures and functions.
In-Reply-To: <494A5396.5040106@ext.bull.net> References: <494A5339.9030304@ext.bull.net> <494A5396.5040106@ext.bull.net> Message-ID: <20090207183701.GA27757@sashak.voltaire.com> On 14:43 Thu 18 Dec , Nicolas Morey Chaisemartin wrote: > > Signed-off-by: Nicolas Morey-Chaisemartin > > --- > opensm/include/opensm/osm_subnet.h | 5 ++ > opensm/opensm/main.c | 13 ++++++ > opensm/opensm/osm_subnet.c | 9 ++++ > opensm/opensm/osm_ucast_ftree.c | 81 > ++++++++++++++++++++++++++++++++---- > 4 files changed, 100 insertions(+), 8 deletions(-) > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > index fe456d5..3f3d919 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -190,6 +190,7 @@ typedef struct osm_subn_opt { > char *lfts_file; > char *root_guid_file; > char *cn_guid_file; > + char *io_guid_file; > char *ids_guid_file; > char *guid_routing_order_file; > char *sa_db_file; > @@ -382,6 +383,10 @@ typedef struct osm_subn_opt { > * Name of the file that contains list of compute node guids that > * will be used by fat-tree routing (provided by User) > * > +* io_guid_file > +* Name of the file that contains list of I/O node guids that > +* will be used by fat-tree routing (provided by User) > +* > * ids_guid_file > * Name of the file that contains list of ids which should be > * used by Up/Down algorithm instead of node GUIDs > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > index 999e92f..3c1bcf2 100644 > --- a/opensm/opensm/main.c > +++ b/opensm/opensm/main.c > @@ -207,6 +207,9 @@ static void show_usage(void) > printf("--cn_guid_file, -u \n" > " Set the compute nodes for the Fat-Tree routing algorithm\n" > " to the guids provided in the given file (one to a line)\n\n"); > + printf("--io_guid_file, -G \n" > + " Set the I/O nodes for the Fat-Tree routing algorithm\n" > + " to the guids provided in the given file (one to a line)\n\n"); > printf("--ids_guid_file, -m \n" > " Name of the map 
file with set of the IDs which will be used\n" > " by Up/Down routing algorithm instead of node GUIDs\n" > @@ -570,6 +573,7 @@ int main(int argc, char *argv[]) > {"sadb_file", 1, NULL, 'S'}, > {"root_guid_file", 1, NULL, 'a'}, > {"cn_guid_file", 1, NULL, 'u'}, > + {"io_guid_file", 1, NULL, 'G'}, "G:" should be added to short_options too. > {"ids_guid_file", 1, NULL, 'm'}, > {"guid_routing_order_file", 1, NULL, 'X'}, > {"stay_on_fatal", 0, NULL, 'y'}, > @@ -880,6 +884,15 @@ int main(int argc, char *argv[]) > opt.cn_guid_file); > break; > > + case 'G': > + /* > + Specifies I/O node guids file > + */ > + opt.io_guid_file = optarg; > + printf(" I/O Node Guid File: %s\n", > + opt.io_guid_file); > + break; > + > case 'm': > /* Specifies ids guid file */ > opt.ids_guid_file = optarg; > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index 9136021..5bfb6ae 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -410,6 +410,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) > p_opt->lfts_file = NULL; > p_opt->root_guid_file = NULL; > p_opt->cn_guid_file = NULL; > + p_opt->io_guid_file = NULL; > p_opt->ids_guid_file = NULL; > p_opt->guid_routing_order_file = NULL; > p_opt->sa_db_file = NULL; > @@ -1163,6 +1164,9 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) > opts_unpack_charp("cn_guid_file", > p_key, p_val, &p_opts->cn_guid_file); > > + opts_unpack_charp("io_guid_file", > + p_key, p_val, &p_opts->io_guid_file); > + > opts_unpack_charp("ids_guid_file", > p_key, p_val, &p_opts->ids_guid_file); > > @@ -1465,6 +1469,11 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) > p_opts->cn_guid_file ? p_opts->cn_guid_file : null_str); > > fprintf(out, > + "# The file holding the fat-tree I/O node guids\n" > + "# One guid in each line\nio_guid_file %s\n\n", > + p_opts->io_guid_file ? 
p_opts->io_guid_file : null_str); > + > + fprintf(out, > "# The file holding the node ids which will be used by" > " Up/Down algorithm instead\n# of GUIDs (one guid and" > " id in each line)\nids_guid_file %s\n\n", > diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > index b7da20b..c24c517 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -155,6 +155,7 @@ typedef struct ftree_port_group_t_ { > ftree_hca_or_sw remote_hca_or_sw; /* pointer to remote hca/switch */ > cl_ptr_vector_t ports; /* vector of ports to the same lid */ > boolean_t is_cn; /* whether this port is a compute node */ > + boolean_t is_io; /* whether this port is an I/O node */ > uint32_t counter_down; /* number of allocated routs downwards */ > } ftree_port_group_t; > > @@ -205,6 +206,7 @@ typedef struct ftree_fabric_t_ { > cl_qmap_t sw_by_tuple_tbl; > cl_qlist_t root_guid_list; > cl_qmap_t cn_guid_tbl; > + cl_qmap_t io_guid_tbl; > unsigned cn_num; > uint8_t leaf_switch_rank; > uint8_t max_switch_rank; > @@ -392,7 +394,8 @@ __osm_ftree_port_group_create(IN ib_net16_t base_lid, > IN ib_net64_t remote_node_guid, > IN uint8_t remote_node_type, > IN void *p_remote_hca_or_sw, > - IN boolean_t is_cn) > + IN boolean_t is_cn, > + IN boolean_t is_io) > { > ftree_port_group_t *p_group = > (ftree_port_group_t *) malloc(sizeof(ftree_port_group_t)); > @@ -440,6 +443,7 @@ __osm_ftree_port_group_create(IN ib_net16_t base_lid, > cl_ptr_vector_init(&p_group->ports, 0, /* min size */ > 8); /* grow size */ > p_group->is_cn = is_cn; > + p_group->is_io = is_io; > return p_group; > } /* __osm_ftree_port_group_create() */ > > @@ -705,7 +709,7 @@ __osm_ftree_sw_add_port(IN ftree_sw_t * p_sw, > remote_node_guid, > remote_node_type, > p_remote_hca_or_sw, > - FALSE); > + FALSE,FALSE); Please don't break indentation. 
Also, here and in other places, a space after ',' is needed (you can look at opensm/doc/opensm-coding-style.txt and use opensm/opensm/osm_indent to get an idea of the desired formatting style). > CL_ASSERT(p_group); > > if (direction == FTREE_DIRECTION_UP) > @@ -836,7 +840,8 @@ __osm_ftree_hca_add_port(IN ftree_hca_t * p_hca, > IN ib_net64_t remote_port_guid, > IN ib_net64_t remote_node_guid, > IN uint8_t remote_node_type, > - IN void *p_remote_hca_or_sw, IN boolean_t is_cn) > + IN void *p_remote_hca_or_sw, IN boolean_t is_cn, > + IN boolean_t is_io) > { > ftree_port_group_t *p_group; > > @@ -859,7 +864,7 @@ __osm_ftree_hca_add_port(IN ftree_hca_t * p_hca, > remote_node_guid, > remote_node_type, > p_remote_hca_or_sw, > - is_cn); > + is_cn,is_io); > p_hca->up_port_groups[p_hca->up_port_groups_num++] = p_group; > } > __osm_ftree_port_group_add_port(p_group, port_num, remote_port_num); > @@ -885,6 +890,7 @@ static ftree_fabric_t *__osm_ftree_fabric_create() > cl_qmap_init(&p_ftree->sw_tbl); > cl_qmap_init(&p_ftree->sw_by_tuple_tbl); > cl_qmap_init(&p_ftree->cn_guid_tbl); > + cl_qmap_init(&p_ftree->io_guid_tbl); > > cl_qlist_init(&p_ftree->root_guid_list); > > @@ -953,6 +959,18 @@ static void __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree) > } > cl_qmap_remove_all(&p_ftree->cn_guid_tbl); > > + /* remove all the elements of io_guid_tbl */ > + p_next_guid_element = > + (name_map_item_t *) cl_qmap_head(&p_ftree->io_guid_tbl); > + while (p_next_guid_element != > + (name_map_item_t *) cl_qmap_end(&p_ftree->io_guid_tbl)) { > + p_guid_element = p_next_guid_element; > + p_next_guid_element = > + (name_map_item_t *) cl_qmap_next(&p_guid_element->item); > + free(p_guid_element); > + } > + cl_qmap_remove_all(&p_ftree->io_guid_tbl); > + > /* remove all the elements of root_guid_list */ > while (!cl_is_qlist_empty(&p_ftree->root_guid_list)) > free(cl_qlist_remove_head(&p_ftree->root_guid_list)); > @@ -1347,6 +1365,14 @@ static inline boolean_t __osm_ftree_fabric_cns_provided(IN
ftree_fabric_t * > > /***************************************************/ > > +static inline boolean_t __osm_ftree_fabric_ios_provided(IN ftree_fabric_t * > + p_ftree) > +{ > + return (p_ftree->p_osm->subn.opt.io_guid_file != NULL); > +} > + > +/***************************************************/ > + > static int __osm_ftree_fabric_mark_leaf_switches(IN ftree_fabric_t * p_ftree) > { > ftree_sw_t *p_sw; > @@ -2816,6 +2842,7 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree, > uint8_t i; > uint8_t remote_port_num; > boolean_t is_cn = FALSE; > + boolean_t is_io = FALSE; > int res = 0; > > for (i = 0; i < osm_node_get_num_physp(p_node); i++) { > @@ -2893,9 +2920,27 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree, > "Marking CN port GUID 0x%016" PRIx64 "\n", > cl_ntoh64(osm_physp_get_port_guid(p_osm_port))); > } else { > - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > - "Marking non-CN port GUID 0x%016" PRIx64 "\n", > - cl_ntoh64(osm_physp_get_port_guid(p_osm_port))); > + if (__osm_ftree_fabric_ios_provided(p_ftree)) { } else if (...) { .... 
> + name_map_item_t *p_elem = > + (name_map_item_t *) cl_qmap_get(&p_ftree-> > + io_guid_tbl, > + cl_ntoh64(osm_physp_get_port_guid > + (p_osm_port))); > + if (p_elem != > + (name_map_item_t *) cl_qmap_end(&p_ftree-> > + io_guid_tbl)) > + is_io = TRUE; > + > + > + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > + "Marking I/O port GUID 0x%016" PRIx64 "\n", > + cl_ntoh64(osm_physp_get_port_guid(p_osm_port))); > + > + } else { > + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > + "Marking non-CN port GUID 0x%016" PRIx64 "\n", > + cl_ntoh64(osm_physp_get_port_guid(p_osm_port))); > + } > } > > __osm_ftree_hca_add_port(p_hca, /* local ftree_hca object */ > @@ -2908,7 +2953,7 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree, > remote_node_guid, /* remote node guid */ > remote_node_type, /* remote node type */ > (void *)p_remote_sw, /* remote ftree_hca/sw object */ > - is_cn); /* whether this port is compute node */ > + is_cn,is_io); /* whether this port is compute node */ > } > > Exit: > @@ -3399,6 +3444,26 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree) > } > } > > + > + if (__osm_ftree_fabric_ios_provided(p_ftree)) { > + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > + "Fetching I/O nodes from file %s\n", > + p_ftree->p_osm->subn.opt.io_guid_file); > + > + if (parse_node_map(p_ftree->p_osm->subn.opt.io_guid_file, > + add_guid_item_to_map, > + &p_ftree->io_guid_tbl)) { > + status = -1; > + goto Exit; > + } > + > + if (!cl_qmap_count(&p_ftree->io_guid_tbl)) { > + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB23: " > + "I/O node guids file has no valid guids\n"); > + status = -1; > + goto Exit; > + } Should empty io_guids file be an error (I don't know)? 
Sasha > + } > Exit: > OSM_LOG_EXIT(&p_ftree->p_osm->log); > return status; > From sashak at voltaire.com Sat Feb 7 10:39:54 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 20:39:54 +0200 Subject: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: References: <20090207123830.GQ17713@sashak.voltaire.com> <20090207132753.GS17713@sashak.voltaire.com> <20090207144426.GU17713@sashak.voltaire.com> <20090207170234.GX17713@sashak.voltaire.com> Message-ID: <20090207183954.GB27757@sashak.voltaire.com> On 12:02 Sat 07 Feb , Hal Rosenstock wrote: > > > it is broken long time anyway. > > What are you referring to as broken here ? All those unused vendor implementations have not been supported for many years and likely will not work with the current version of OpenSM. Sasha From sashak at voltaire.com Sat Feb 7 10:47:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 20:47:25 +0200 Subject: [ofa-general] Re: [PATCH OpenSM 3/3] Added possible reverse hops for Ftree algorithm.
In-Reply-To: <494A53AE.8080706@ext.bull.net> References: <494A5339.9030304@ext.bull.net> <494A53AE.8080706@ext.bull.net> Message-ID: <20090207184725.GC27757@sashak.voltaire.com> On 14:44 Thu 18 Dec , Nicolas Morey Chaisemartin wrote: > > Signed-off-by: Nicolas Morey-Chaisemartin > > --- > opensm/opensm/osm_ucast_ftree.c | 102 > ++++++++++++++++++++++++++++++++------- > 1 files changed, 85 insertions(+), 17 deletions(-) > diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > index c24c517..d4d3e70 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -2131,7 +2131,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, > IN ib_net16_t target_lid, > IN uint8_t target_rank, > IN boolean_t is_real_lid, > - IN boolean_t is_main_path) > + IN boolean_t is_main_path, > + IN uint16_t reverse_hop_credit) > { > ftree_sw_t *p_remote_sw; > uint16_t ports_num; > @@ -2155,8 +2156,36 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, > p_sw->rank); /* the highest visited point in the tree before going down */ > > /* recursion stop condition - if it's a root switch, */ > - if (p_sw->rank == 0) > - return; > + if (p_sw->rank == 0){ > + if(reverse_hop_credit>0){ if (p_sw->rank == 0 && reverse_hop_credit > 0) { ... 
> + /* We go up by going down as we have some reverse_hop_credit left*/ > + /* We use the index to scatter a bit the reverse up routes */ > + p_sw->down_port_groups_idx = > + (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; > + i=p_sw->down_port_groups_idx; > + for (j = 0; j < p_sw->down_port_groups_num; j++) { > + > + p_group = p_sw->down_port_groups[i]; > + i = (i + 1) % p_sw->down_port_groups_num; > + > + /* Skip this port group unless it points to a switch */ > + if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) > + continue; > + p_remote_sw = p_group->remote_hca_or_sw.p_sw; > + > + __osm_ftree_fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */ > + p_sw, /* this switch - prev. position switch for the function */ > + target_lid, /* LID that we're routing to */ > + target_rank, /* rank of the LID that we're routing to */ > + is_real_lid, /* whether this target LID is real or dummy */ > + is_main_path,reverse_hop_credit-1); /* whether this is path to HCA that should by tracked by counters */ > + return; > + } > + > + } > + return; > + } > + > > /* Find the least loaded upgoing port group */ > p_min_group = NULL; > @@ -2242,14 +2271,17 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, > p_min_group->counter_down++; > p_min_port->counter_down++; > if (is_real_lid) { > - p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] = > - p_min_port->remote_port_num; > - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > - "Switch %s: set path to CA LID %u through port %u\n", > - __osm_ftree_tuple_to_str(p_remote_sw->tuple), > - cl_ntoh16(target_lid), > - p_min_port->remote_port_num); > - > + /* This LID may already be in the LFT in the reverse_hop feature is used */ > + /* We update the LFT only if this LID isn't already present. 
*/ > + if(p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] == OSM_NO_PATH) { > + p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] = > + p_min_port->remote_port_num; > + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > + "Switch %s: set path to CA LID %u through port %u\n", > + __osm_ftree_tuple_to_str(p_remote_sw->tuple), > + cl_ntoh16(target_lid), > + p_min_port->remote_port_num); > + } > /* On the remote switch that is pointed by the min_group, > set hops for ALL the ports in the remote group. */ > > @@ -2274,7 +2306,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, > target_lid, /* LID that we're routing to */ > target_rank, /* rank of the LID that we're routing to */ > is_real_lid, /* whether this target LID is real or dummy */ > - is_main_path); /* whether this is path to HCA that should by tracked by counters */ > + is_main_path, /* whether this is path to HCA that should by tracked by counters */ > + reverse_hop_credit); > } > > /* we're done for the third case */ > @@ -2360,9 +2393,39 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, > target_lid, /* LID that we're routing to */ > target_rank, /* rank of the LID that we're routing to */ > TRUE, /* whether the target LID is real or dummy */ > - FALSE); /* whether this is path to HCA that should by tracked by counters */ > + FALSE,reverse_hop_credit); /* whether this is path to HCA that should by tracked by counters */ > } > > + > + /* If we don't have any reverse hop credits, we are done */ > + if(reverse_hop_credit==0) > + return; > + > + /* We explore all the down group ports */ > + /* We try to reverse jump for each of them */ > + /* They already have a route to us from the upgoing_by_going_down started earlier */ > + /* This is only so it'll continue exploring up, after this step backwards*/ > + for (i = 0; i < p_sw->down_port_groups_num; i++) { > + p_group = p_sw->down_port_groups[i]; > + p_remote_sw = p_group->remote_hca_or_sw.p_sw; > 
+ > + > + /* Skip this port group unless it points to a switch */ > + if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) > + continue; > + > + > + /* Recursion step: > + Assign downgoing ports by stepping up, fter doing one step down starting on REMOTE switch. */ > + __osm_ftree_fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */ > + p_sw, /* this switch - prev. position switch for the function */ > + target_lid, /* LID that we're routing to */ > + target_rank, /* rank of the LID that we're routing to */ > + TRUE, /* whether the target LID is real or dummy */ > + TRUE,reverse_hop_credit-1); /* whether this is path to HCA that should by tracked by counters */ > + } > + > + > } /* ftree_fabric_route_downgoing_by_going_up() */ > > /***************************************************/ > @@ -2448,7 +2511,7 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree) > hca_lid, /* LID that we're routing to */ > p_sw->rank + 1, /* rank of the LID that we're routing to */ > TRUE, /* whether this HCA LID is real or dummy */ > - TRUE); /* whether this path to HCA should by tracked by counters */ > + TRUE,0); /* whether this path to HCA should by tracked by counters */ > > /* count how many real targets have been routed from this leaf switch */ > routed_targets_on_leaf++; > @@ -2473,7 +2536,7 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree) > 0, /* LID that we're routing to - ignored for dummy HCA */ > 0, /* rank of the LID that we're routing to - ignored for dummy HCA */ > FALSE, /* whether this HCA LID is real or dummy */ > - TRUE); /* whether this path to HCA should by tracked by counters */ > + TRUE,0); /* whether this path to HCA should by tracked by counters */ > } > } > } > @@ -2558,7 +2621,8 @@ static void __osm_ftree_fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree) > hca_lid, /* LID that we're routing to */ > p_sw->rank + 1, /* rank of the LID 
that we're routing to */ > TRUE, /* whether this HCA LID is real or dummy */ > - TRUE); /* whether this path to HCA should by tracked by counters */ > + TRUE, /* whether this path to HCA should by tracked by counters */ > + p_hca_port_group->is_io ? p_ftree->p_osm->subn.opt.max_reverse_hops :0 ); /* Number or reverse hops allowed*/ > } > /* done with all the port groups of this HCA - go to next HCA */ > } > @@ -2610,7 +2674,7 @@ static void __osm_ftree_fabric_route_to_switches(IN ftree_fabric_t * p_ftree) > p_sw->base_lid, /* LID that we're routing to */ > p_sw->rank, /* rank of the LID that we're routing to */ > TRUE, /* whether the target LID is a real or dummy */ > - FALSE); /* whether this path should by tracked by counters */ > + FALSE,0); /* whether this path should by tracked by counters */ > } > > OSM_LOG_EXIT(&p_ftree->p_osm->log); > @@ -3432,6 +3496,8 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree) > if (parse_node_map(p_ftree->p_osm->subn.opt.cn_guid_file, > add_guid_item_to_map, > &p_ftree->cn_guid_tbl)) { > + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB23: " > + "Problem parsin CN guid file\n"); > status = -1; > goto Exit; > } > @@ -3453,6 +3519,8 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree) > if (parse_node_map(p_ftree->p_osm->subn.opt.io_guid_file, > add_guid_item_to_map, > &p_ftree->io_guid_tbl)) { > + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB23: " > + "Problem parsin I/O guid file\n"); "ERR AB**" codes should be unique. 
Sasha > status = -1; > goto Exit; > } > From sashak at voltaire.com Sat Feb 7 10:55:51 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 20:55:51 +0200 Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes In-Reply-To: <494A5339.9030304@ext.bull.net> References: <494A5339.9030304@ext.bull.net> Message-ID: <20090207185551.GD27757@sashak.voltaire.com> Hi Nicolas, On 14:42 Thu 18 Dec , Nicolas Morey Chaisemartin wrote: > > We are current working on a Ftree topology where IO nodes are connected on > spine switches. > Using the cn_guid_file and root_guid_file works great. > It is possible to route the whole tree as a fat tree. All the CNs are > connected to the other CN and IO nodes. > However, we are missing some connectivity between IO nodes. This is the > expected behavior as the route between those IO nodes would have > to go down to go back up on another spine switch. > > However, we need at least a bit of connectivity between those nodes. There > won't be any real traffic but just some "ping" for HA purposes. > > Therefore, I have implemented two new options to openSM: io_guid_file and > max_reverse_hops. > The io_guid_file provides a list of all the IO guid (it may differs from > the list of non-CN nodes) "IO" is specific for your setup. Could we find more generic name for such nodes? > The max_reverse_hops gives the number of time IO nodes (described by > io_guid_file) are allowed to use a switch backward. Don't those two options duplicate each others somehow? If we want to connect io nodes anyway, why max_reverse_hops should be important? Or probably instead of having io nodes guids list we prefer to connect everything N hops from roots? Then sort of --connect-roots extension (--connect-roots=3) could work. No? > > According to my tests this has absolutely no effects on regular routing and > manages to connect the io nodes together, if max_reverse_hops is big > enough. 
> > This is a first draft for this feature. I'd be happy to have some feedback > about how to upgrade it and make it as clean as possible, wether it is > integrated in the mainstream or not. Since this functionality is optional, useful and shouldn't change a default behavior it can be suitable for main stream IMO. Sasha From devel at morey-chaisemartin.com Sat Feb 7 11:48:13 2009 From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Sat, 07 Feb 2009 20:48:13 +0100 Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes In-Reply-To: <20090207185551.GD27757@sashak.voltaire.com> References: <494A5339.9030304@ext.bull.net> <20090207185551.GD27757@sashak.voltaire.com> Message-ID: <498DE57D.4030501@morey-chaisemartin.com> Sasha Khapyorsky wrote: > Hi Nicolas, > > On 14:42 Thu 18 Dec , Nicolas Morey Chaisemartin wrote: > >> We are current working on a Ftree topology where IO nodes are connected on >> spine switches. >> Using the cn_guid_file and root_guid_file works great. >> It is possible to route the whole tree as a fat tree. All the CNs are >> connected to the other CN and IO nodes. >> However, we are missing some connectivity between IO nodes. This is the >> expected behavior as the route between those IO nodes would have >> to go down to go back up on another spine switch. >> >> However, we need at least a bit of connectivity between those nodes. There >> won't be any real traffic but just some "ping" for HA purposes. >> >> Therefore, I have implemented two new options to openSM: io_guid_file and >> max_reverse_hops. >> The io_guid_file provides a list of all the IO guid (it may differs from >> the list of non-CN nodes) >> > > "IO" is specific for your setup. Could we find more generic name for such > nodes? > > Sure. Any ideas? >> The max_reverse_hops gives the number of time IO nodes (described by >> io_guid_file) are allowed to use a switch backward. >> > > Don't those two options duplicate each others somehow?
If we want to > connect io nodes anyway, why max_reverse_hops should be important? > Because we may not want to connect all of them to all the nodes. By specifying a small max_reverse_hop you can restrain (depending on your topology) the effect of the io_guid_file so an "IO" node will only see the closests "IO" node through reverse routes but not all of them As the effect on credit loop is not certain yet, I think the less reverse route we create, the better it is. > Or probably instead of having io nodes guids list we prefer to connect > everything N hops from roots? Then sort of --connect-roots extension > (--connect-roots=3) could work. No? > > That should work too but it is less flexible than io_guid_file for tweaking the configuration and have the best routing scheme. >> According to my tests this has absolutely no effects on regular routing and >> manages to connect the io nodes together, if max_reverse_hops is big >> enough. >> >> This is a first draft for this feature. I'd be happy to have some feedback >> about how to upgrade it and make it as clean as possible, wether it is >> integrated in the mainstream or not. >> > > Since this functionality is optional, useful and shouldn't change a > default behavior it can be suitable for main stream IMO. > > Sasha Okay, I'll fix the indentation and few coding style error. There is also a bug in the current patch as the hops counter are not set to the right value when creating route which had reverse hops number of reverse hops*2 should be added. I'll have to rewrite the patches so they work with the current HEAD. Specially with the option system changes it won't merge cleanly. 
Nicolas From sashak at voltaire.com Sat Feb 7 12:23:19 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Feb 2009 22:23:19 +0200 Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes In-Reply-To: <498DE57D.4030501@morey-chaisemartin.com> References: <494A5339.9030304@ext.bull.net> <20090207185551.GD27757@sashak.voltaire.com> <498DE57D.4030501@morey-chaisemartin.com> Message-ID: <20090207202319.GE27757@sashak.voltaire.com> On 20:48 Sat 07 Feb , Nicolas Morey-Chaisemartin wrote: > > > > "IO" is specific for your setup. Could we find more generic name for such > > nodes? > > > > > Sure. Any ideas? No, I didn't think about it. > >> The max_reverse_hops gives the number of time IO nodes (described by > >> io_guid_file) are allowed to use a switch backward. > >> > > > > Don't those two options duplicate each others somehow? If we want to > > connect io nodes anyway, why max_reverse_hops should be important? > > > Because we may not want to connect all of them to all the nodes. By > specifying a small max_reverse_hop you can restrain (depending on your > topology) the effect of the io_guid_file so an "IO" node will only see > the closests "IO" node through reverse routes but not all of them > As the effect on credit loop is not certain yet, I think the less > reverse route we create, the better it is. > > Or probably instead of having io nodes guids list we prefer to connect > > everything N hops from roots? Then sort of --connect-roots extension > > (--connect-roots=3) could work. No? > > > > > That should work too but it is less flexible than io_guid_file for > tweaking the configuration and have the best routing scheme. > >> According to my tests this has absolutely no effects on regular routing and > >> manages to connect the io nodes together, if max_reverse_hops is big > >> enough. > >> > >> This is a first draft for this feature. 
I'd be happy to have some feedback > >> about how to upgrade it and make it as clean as possible, wether it is > >> integrated in the mainstream or not. > >> > > > > Since this functionality is optional, useful and shouldn't change a > > default behavior it can be suitable for main stream IMO. > > > > Sasha > Okay, I'll fix the indentation and few coding style error. There is also > a bug in the current patch as the hops counter are not set to the right > value when creating route which had reverse hops number of reverse > hops*2 should be added. > > I'll have to rewrite the patches so they work with the current HEAD. > Specially with the option system changes it won't merge cleanly. Use 'git-rebase master ' - it does the job with only two trivial conflicts. Sasha From ogerlitz at voltaire.com Sat Feb 7 22:53:23 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 08 Feb 2009 08:53:23 +0200 Subject: [ofa-general] Re: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: <498B40F6.7060904@Voltaire.COM> References: <49893FAF.3090007@voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> <4989E6D6.5030109@Voltaire.COM> <3522BA7F49834878A674F2908834D747@amr.corp.intel.com> <498B3D7E.6010300@Voltaire.COM> <498B40F6.7060904@Voltaire.COM> Message-ID: <498E8163.6090803@voltaire.com> Yossi Etigin wrote: >> Have you tested the patch and verified that it works for you? >> > Yes I did, with mckey. When the HCA port is down: Without the patch, > mckey fails on from rdma_resolve_route (except when ipoib is trying to > join at the same time - then there will be a join error). With the > patch, mckey fails on rdma_create_qp (again, except when ipoib is > trying to join at the same time). When the HCA port is up, mckey > works normally. mckey shouldn't be calling rdma_resolve_route, so I assume you referred to rdma_resolve_addr Or. 
From vlad at lists.openfabrics.org Sun Feb 8 03:11:34 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 8 Feb 2009 03:11:34 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090208-0200 daily build status Message-ID: <20090208111134.F1EFCE60F20@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with 
linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From tziporet at mellanox.co.il Sun Feb 8 07:57:34 2009 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 8 Feb 2009 17:57:34 +0200 Subject: [ofa-general] OFED (EWG) meeting agenda for tomorrow (Feb 09) Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01B1CBBA@mtlexch01.mtl.com> > These are the agenda items for the meeting tomorrow: > > 1. OFED 1.4.1 release status: * New OSes: * RH 5.3 - done, we still have an issue with Itanium * SLES 11 - schedule is OK. RC3 already available - Any volunteers to prepare the backports? * Open MPI 1.3 - I heard there are some critical bugs. What is the status of 1.3.1? - Jeff S. * RDS with iWARP support - Steve * NFS/RDMA backports - Steve * Critical bug fixes As far as I know these are the critical bugs that should be fixed: 1383 blo P3 jackm at mellanox.co.il Local protection error on transmit from ipoib datagram to... 1471 cri P3 amirv at mellanox.co.il Performance degradation in ofed 1.4 Please send more bugs that are critical for the release 2. Decide on 1.4.1 schedule: Proposal: * RC1 - Mar 3 * RC2 - Mar 17 * RC3 - Mar 31 * GA - Apr 7 3. Sonoma updates (if any) - Bill Boas > Tziporet > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dorfman.eli at gmail.com Sun Feb 8 09:07:45 2009 From: dorfman.eli at gmail.com (Eli Dorfman) Date: Sun, 8 Feb 2009 19:07:45 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c fix parse functions for big endian machines In-Reply-To: <20090205180400.GJ5910@sashak.voltaire.com> References: <498B038D.4020009@gmail.com> <20090205180400.GJ5910@sashak.voltaire.com> Message-ID: <694d48600902080907u3a6b40f7s7d0d612fd6a793ce@mail.gmail.com> On Thu, Feb 5, 2009 at 8:04 PM, Sasha Khapyorsky wrote: > On 17:19 Thu 05 Feb , Eli Dorfman (Voltaire) wrote: >> fix parse functions for big endian machines >> >> Signed-off-by: Eli Dorfman > > Applied. Thanks. > > I'm fine with this patch - the code looks cleaner than it was before. > > But could you please explain what the problem was with the original code on > big endian machines (I don't see it)? The problem was that the setup function called from the uint8 parse function assumed that void *p_val points to a uint8_t, but it actually pointed to a uint32_t. > > Also it would be helpful to have more detailed patch comments.
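Eli's explanation can be reproduced outside OpenSM. The sketch below is a standalone, hypothetical reduction (setup_uint8(), parse_uint8_buggy() and parse_uint8_fixed() are illustrative names, not OpenSM functions). The callback reads one byte through the void pointer, so with a uint32_t temporary it sees the byte at the lowest address: the low byte on a little-endian host, but the high byte (0 for small values) on a big-endian one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a setup callback: like the one Eli
 * describes, it assumes the void * it receives points to a uint8_t. */
static uint8_t seen_by_setup;

static void setup_uint8(void *p_val)
{
	seen_by_setup = *(const uint8_t *)p_val;
}

/* Pre-patch pattern: the parsed value is held in a uint32_t, but its
 * address goes to a callback that reads only the byte stored at the
 * lowest address of that uint32_t. */
static void parse_uint8_buggy(const char *str, void (*pfn)(void *))
{
	uint32_t val = strtoul(str, NULL, 0);

	pfn(&val);
}

/* Post-patch pattern: the temporary has the option's own type, so the
 * callback reads the right byte on any host. The range check is done
 * before narrowing, because a check on the uint8_t could never fail. */
static void parse_uint8_fixed(const char *str, void (*pfn)(void *))
{
	unsigned long parsed = strtoul(str, NULL, 0);
	uint8_t val;

	assert(parsed < 0x100);
	val = (uint8_t)parsed;
	pfn(&val);
}

/* Returns 1 when the most significant byte is stored first. */
static int host_is_big_endian(void)
{
	const uint16_t probe = 0x0102;

	return *(const uint8_t *)&probe == 0x01;
}

/* Prints what the callback sees for the same input with both parsers. */
static void show_difference(const char *str)
{
	parse_uint8_buggy(str, setup_uint8);
	printf("buggy: %u, ", (unsigned)seen_by_setup);
	parse_uint8_fixed(str, setup_uint8);
	printf("fixed: %u (%s-endian host)\n", (unsigned)seen_by_setup,
	       host_is_big_endian() ? "big" : "little");
}
```

For the input "7", both paths report 7 on a little-endian host, while on a big-endian host (for example ppc64) the buggy path reports 0. This matches why the original code only misbehaved on big endian machines.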
> > Sasha > >> --- >>  opensm/opensm/osm_subnet.c |   10 +++++----- >>  1 files changed, 5 insertions(+), 5 deletions(-) >> >> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c >> index d6d39a6..7b33659 100644 >> --- a/opensm/opensm/osm_subnet.c >> +++ b/opensm/opensm/osm_subnet.c >> @@ -710,14 +710,14 @@ opts_parse_net16(IN osm_subn_t *p_subn, >>                 IN void *p_v, IN setup_fn_t pfn) >>  { >>       uint16_t *p_val = p_v; >> -     uint32_t val = strtoul(p_val_str, NULL, 0); >> +     uint16_t val = strtoul(p_val_str, NULL, 0); >> >>       CL_ASSERT(val < 0x10000); >> -     if (cl_hton32(val) != *p_val) { >> +     if (cl_hton16(val) != *p_val) { >>               log_config_value(p_key, "0x%04x", val); >>               if (pfn) >>                       pfn(p_subn, &val); >> -             *p_val = cl_hton16((uint16_t) val); >> +             *p_val = cl_hton16(val); >>       } >>  } >> >> @@ -729,14 +729,14 @@ opts_parse_uint8(IN osm_subn_t *p_subn, >>                 IN void *p_v, IN setup_fn_t pfn) >>  { >>       uint8_t *p_val = p_v; >> -     uint32_t val = strtoul(p_val_str, NULL, 0); >> +     uint8_t val = strtoul(p_val_str, NULL, 0); >> >>       CL_ASSERT(val < 0x100); >>       if (val != *p_val) { >>               log_config_value(p_key, "%u", val); >>               if (pfn) >>                       pfn(p_subn, &val); >> -             *p_val = (uint8_t) val; >> +             *p_val = val; >>       } >>  } >> >> -- >> 1.5.5 >> > From dorfman.eli at gmail.com Sun Feb 8 11:23:27 2009 From: dorfman.eli at gmail.com (Eli Dorfman) Date: Sun, 8 Feb 2009 21:23:27 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet configuration after SIGHUP In-Reply-To: <20090205121634.GQ11874@sashak.voltaire.com> References: <497DC87F.2090308@gmail.com> <20090202205924.GF5910@sashak.voltaire.com> <49880E4D.2090107@gmail.com> <20090203124407.GE11874@sashak.voltaire.com> <49884962.5070601@gmail.com> 
<20090203134831.GI11874@sashak.voltaire.com> <498850A2.8090701@gmail.com> <20090205000323.GN11874@sashak.voltaire.com> <498A9888.5010003@gmail.com> <20090205121634.GQ11874@sashak.voltaire.com> Message-ID: <694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com> On Thu, Feb 5, 2009 at 2:16 PM, Sasha Khapyorsky wrote: > On 09:43 Thu 05 Feb , Eli Dorfman (Voltaire) wrote: >> >> ok. Please apply the fixed patch. > > Did you test it? Yes, but wouldn't it be better to separate the heavy sweep from the config rescan (due to SIGHUP)? I think the user should know when the configuration is updated rather than wait for a heavy sweep. Eli From sashak at voltaire.com Sun Feb 8 13:38:26 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 8 Feb 2009 23:38:26 +0200 Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet configuration after SIGHUP In-Reply-To: <694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com> References: <20090202205924.GF5910@sashak.voltaire.com> <49880E4D.2090107@gmail.com> <20090203124407.GE11874@sashak.voltaire.com> <49884962.5070601@gmail.com> <20090203134831.GI11874@sashak.voltaire.com> <498850A2.8090701@gmail.com> <20090205000323.GN11874@sashak.voltaire.com> <498A9888.5010003@gmail.com> <20090205121634.GQ11874@sashak.voltaire.com> <694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com> Message-ID: <20090208213826.GA24254@sashak.voltaire.com> Hi Eli, On 21:23 Sun 08 Feb , Eli Dorfman wrote: > > Yes, but wouldn't it be better to separate the heavy sweep from the > config rescan (due to SIGHUP)? The main purpose of SIGHUP has always been to trigger a heavy sweep. > I think the user should know when the configuration is updated rather than > wait for a heavy sweep. I'm not following - SIGHUP will cause both a heavy sweep and a config update, so where is the waiting?
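Sasha's point, that a single SIGHUP drives both the configuration update and the heavy sweep, can be pictured with the usual async-signal-safe pattern: the handler only records the request, and the main loop re-reads the configuration and then forces a heavy sweep. This is a hypothetical sketch, not OpenSM's code. reread_config(), request_heavy_sweep(), main_loop_iteration() and install_sighup_handler() are illustrative names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written only from the signal handler; volatile sig_atomic_t is the
 * object type a handler may portably write to. */
static volatile sig_atomic_t hup_requested;

/* Hypothetical counters standing in for the real work. */
static int config_rereads;
static int heavy_sweeps;

static void on_sighup(int signum)
{
	(void)signum;
	hup_requested = 1;	/* no I/O or allocation inside the handler */
}

static void reread_config(void) { config_rereads++; }
static void request_heavy_sweep(void) { heavy_sweeps++; }

/* One pass of a hypothetical main loop: when SIGHUP has arrived, the
 * configuration is re-read first and the heavy sweep is forced right
 * after, so the new settings are applied by that same sweep. */
static void main_loop_iteration(void)
{
	if (hup_requested) {
		hup_requested = 0;
		reread_config();
		request_heavy_sweep();
	}
}

static int install_sighup_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sighup;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGHUP, &sa, NULL);
}
```

In this model there is no separate waiting step: the re-read and the sweep happen in the same loop pass that notices the flag.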
Sasha From kliteyn at dev.mellanox.co.il Sun Feb 8 14:19:59 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 09 Feb 2009 00:19:59 +0200 Subject: [ofa-general] Re: saquery & osm vendor AL - ca_names missing from osm_vendor_t ? In-Reply-To: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> Message-ID: <498F5A8F.2000101@dev.mellanox.co.il> Hi Stan, Adding Sasha (OFED management maintainer) and the openib mailing list. Stan C. Smith wrote: > Hello, > The Windows OpenSM vendor AL struct 'osm_vendor_t' (defined in opensm\user\include\vendor\osm_vendor_al.h) is missing > the entry 'ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN]'. > saquery.c expects to find ca_names in osm_vendor_t. > > A couple of observations: > 1) Windows currently supports a much older version of opensm than what OFED 1.4 tools expect. Correct. Windows OpenSM is a ported pre-OFED 1.2 OpenSM with couple of minor fixes. > 2) saquery uses OpenSM mad interfaces while it 'could' be using libibmad interfaces. By "OpenSM mad interfaces" you mean libosmvendor? > If libibmad from saquery, then OpenSM would not need libibmad references for Windows. Not sure what you mean here. You mean removing libibmad dependency from saquery? > 3) How bad is it to create libibmad dependencies for OpenSM? Pretty bad. I don't think we should add a new dependency unless there's a really good reason for it. > 4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD interfaces; the rest use > libibmad. > > Most of the OFED diagnostic tools support the cmd-line option '-C ca_name'. This cmd-line input is resolved thru > libibmad interfaces. > Saquery is no exception in that it expects to match the '-C ca_name' against osm_vendor_t.ca_names[]. 'ibstat -l' lists > CA names. > > The question becomes how best to resolve the missing ca_names? > > 1) modify saquery to call libibmad interface to get CA names; leaves osm_vendor_t unmodified. 
> So far, saquery is the only diag pgm which uses OSM mad interfaces; expecting ca_names > in osm_vendor_t. > > 2) Modify OpenSM vendor AL osm_vendor_t struct to include CA names and populate ca_names > from OpenSM code? I'd say that this option is much better. > Creates libibmad dependencies for opensm. But it doesn't have to. Can IBAL expose some function to get these names, so that Win osmvendor will use this function instead of libibmad? Also, Linux osmvendor doesn't have libibmad dependency. It uses umad function umad_get_cas_names() to obtain the CA names. I know that there is a Windows version of umad, but I don't know what is its status. If we *have* to add an additional dependency, then it should be libibumad and not libibmad. At some point in the future we would really want to have the new version of OFED OpenSM ported to WinOF. If there will be a match between Linux and Windows libraries, then the whole vendor concept can be simplified and there won't be a need to have a separate vendor for IBAL. The things that would be different are platform-dependent issues like threads, locks, syslog, but not IB-related code. -- Yevgeny > Comments? > > Thanks, > > Stan. > > > > From kliteyn at dev.mellanox.co.il Sun Feb 8 14:36:43 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 09 Feb 2009 00:36:43 +0200 Subject: [ofa-general] Re: [ofw] Re: saquery & osm vendor AL - ca_names missing from osm_vendor_t ? In-Reply-To: <498F5A8F.2000101@dev.mellanox.co.il> References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <498F5A8F.2000101@dev.mellanox.co.il> Message-ID: <498F5E7B.6020208@dev.mellanox.co.il> Yevgeny Kliteynik wrote: > Hi Stan, Oops... Looks like I was having a problem with my mail client. By now my response is partially outdated... -- Yevgeny > Adding Sasha (OFED management maintainer) > and the openib mailing list. > > Stan C. 
Smith wrote: >> Hello, >> The Windows OpenSM vendor AL struct 'osm_vendor_t' (defined in >> opensm\user\include\vendor\osm_vendor_al.h) is missing >> the entry 'ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN]'. >> saquery.c expects to find ca_names in osm_vendor_t. >> >> A couple of observations: >> 1) Windows currently supports a much older version of opensm than what >> OFED 1.4 tools expect. > > Correct. Windows OpenSM is a ported pre-OFED 1.2 OpenSM with couple of > minor fixes. > >> 2) saquery uses OpenSM mad interfaces while it 'could' be using >> libibmad interfaces. > > By "OpenSM mad interfaces" you mean libosmvendor? > >> If libibmad from saquery, then OpenSM would not need libibmad >> references for Windows. > > Not sure what you mean here. You mean removing libibmad dependency from > saquery? > >> 3) How bad is it to create libibmad dependencies for OpenSM? > > Pretty bad. I don't think we should add a new dependency unless there's a > really good reason for it. > >> 4) saquery.c is the only diags pgms (so far) which uses OpenSM MAD >> interfaces; the rest use >> libibmad. >> >> Most of the OFED diagnostic tools support the cmd-line option '-C >> ca_name'. This cmd-line input is resolved thru >> libibmad interfaces. >> Saquery is no exception in that it expects to match the '-C ca_name' >> against osm_vendor_t.ca_names[]. 'ibstat -l' lists >> CA names. >> >> The question becomes how best to resolve the missing ca_names? >> >> 1) modify saquery to call libibmad interface to get CA names; leaves >> osm_vendor_t unmodified. >> So far, saquery is the only diag pgm which uses OSM mad interfaces; >> expecting ca_names >> in osm_vendor_t. >> >> 2) Modify OpenSM vendor AL osm_vendor_t struct to include CA names and >> populate ca_names >> from OpenSM code? > > I'd say that this option is much better. > >> Creates libibmad dependencies for opensm. > > But it doesn't have to. 
Can IBAL expose some function to get these names, > so that Win osmvendor will use this function instead of libibmad? > > Also, Linux osmvendor doesn't have libibmad dependency. > It uses umad function umad_get_cas_names() to obtain the CA names. > I know that there is a Windows version of umad, but I don't know what is > its status. If we *have* to add an additional dependency, then it should > be libibumad and not libibmad. > > At some point in the future we would really want to have the new version > of OFED OpenSM ported to WinOF. If there will be a match between Linux and > Windows libraries, then the whole vendor concept can be simplified and > there won't be a need to have a separate vendor for IBAL. The things > that would be different are platform-dependent issues like threads, locks, > syslog, but not IB-related code. > > -- Yevgeny > > >> Comments? >> >> Thanks, >> >> Stan. >> >> >> >> > > _______________________________________________ > ofw mailing list > ofw at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw > From jsquyres at cisco.com Sun Feb 8 14:43:35 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Sun, 8 Feb 2009 14:43:35 -0800 Subject: [ofa-general] OFED (EWG) meeting agenda for tomorrow (Feb 09) In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01B1CBBA@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD01B1CBBA@mtlexch01.mtl.com> Message-ID: <00B5AD34-1DFF-440E-8BDB-3C9DE98110AE@cisco.com> On Feb 8, 2009, at 7:57 AM, Tziporet Koren wrote: > • Open MPI 1.3 - I heard there are some critical bugs. What is the > status of 1.3.1? - Jeff S. > I'm unfortunately unable to make it to the call tomorrow. What bugs do you want to know about -- are there any in particular that you're asking about? OMPI v1.3.1 is readying for release; *possibly* this week (50/50 chance of that). 
-- Jeff Squyres Cisco Systems From sashak at voltaire.com Sun Feb 8 14:54:12 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 00:54:12 +0200 Subject: [ofa-general] [PATCH] opensm/qos_config: no invalid option message on default values Message-ID: <20090208225412.GA24514@sashak.voltaire.com> Don't complain about invalid QoS options when their default values are used. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_subnet.c | 18 +++++++++--------- 1 files changed, 9 insertions(+), 9 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 3324af9..69937c1 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -911,9 +911,11 @@ static ib_api_status_t osm_parse_prefix_routes_file(IN osm_subn_t * const p_subn **********************************************************************/ static void subn_verify_max_vls(unsigned *max_vls, const char *prefix, unsigned dflt) { - if (!(*max_vls) || *max_vls > 15) { - log_report(" Invalid Cached Option: %s_max_vls=%u: " - "Using Default = %u\n", prefix, *max_vls, dflt); + if (!*max_vls || *max_vls > 15) { + if (*max_vls) + log_report(" Invalid Cached Option: %s_max_vls=%u: " + "Using Default = %u\n", + prefix, *max_vls, dflt); *max_vls = dflt; } } @@ -921,8 +923,10 @@ static void subn_verify_max_vls(unsigned *max_vls, const char *prefix, unsigned static void subn_verify_high_limit(int *high_limit, const char *prefix, int dflt) { if (*high_limit < 0 || *high_limit > 255) { - log_report(" Invalid Cached Option: %s_high_limit=%d: " - "Using Default: %d\n", prefix, *high_limit, dflt); + if (*high_limit > 255) + log_report(" Invalid Cached Option: %s_high_limit=%d: " + "Using Default: %d\n", + prefix, *high_limit, dflt); *high_limit = dflt; } } @@ -934,8 +938,6 @@ static void subn_verify_vlarb(char **vlarb, const char *prefix, int count = 0; if (*vlarb == NULL) { - log_report(" Invalid Cached Option: %s_vlarb_%s: " - "Using Default\n", prefix, suffix); *vlarb
= strdup(dflt); return; } @@ -1003,8 +1005,6 @@ static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char *dflt) int count = 0; if (*sl2vl == NULL) { - log_report(" Invalid Cached Option: %s_sl2vl: Using Default\n", - prefix); *sl2vl = strdup(dflt); return; } -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sun Feb 8 15:01:54 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 01:01:54 +0200 Subject: [ofa-general] [PATCH] opensm: sort port order for routing by switch loads Message-ID: <20090208230154.GB24514@sashak.voltaire.com> This follows the "port order" routing load balancer improvements (implemented using the "--guid_routing_order_file" command line option). This patch is about the default behavior: it balances routing paths in an order such that the most loaded links enter the balancer first, which in most cases should provide better performance than plain random balancing (as is done by default now). The implementation is simple: the endport list for the load balancer is reverse sorted by the number of endport links of the leaf switches. Signed-off-by: Sasha Khapyorsky --- Changes from RFC version of this patch are: - ignore port state during endport_links counting, because initially links can be in states other than ACTIVE (INIT, ARMED); the existence of a remote port should be a good enough criterion by itself.
- store endport_links value in osm_switch structure and don't recount it during qsort() - minor simplifications opensm/include/opensm/osm_switch.h | 1 + opensm/opensm/osm_ucast_mgr.c | 62 +++++++++++++++++++++++++++++++++++- 2 files changed, 62 insertions(+), 1 deletions(-) diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index dbc22e5..6279727 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -104,6 +104,7 @@ typedef struct osm_switch { uint8_t *new_lft; osm_mcast_tbl_t mcast_tbl; uint32_t discovery_count; + unsigned endport_links; unsigned need_update; void *priv; } osm_switch_t; diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 96921a0..7232fbc 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -744,6 +744,65 @@ static void clear_prof_ignore_flag(cl_map_item_t * const p_map_item, void *ctx) } } +static void add_sw_endports_to_order_list(osm_switch_t *sw, osm_ucast_mgr_t *m) +{ + osm_port_t *port; + osm_physp_t *p; + int i; + + for (i = 1; i < sw->num_ports; i++) { + p = osm_node_get_physp_ptr(sw->p_node, i); + if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw) { + port = osm_get_port_by_guid(m->p_subn, + p->p_remote_physp->port_guid); + cl_qlist_insert_tail(&m->port_order_list, + &port->list_item); + port->flag = 1; + } + } +} + +static void sw_count_endport_links(osm_switch_t *sw) +{ + osm_physp_t *p; + int i; + + sw->endport_links = 0; + for (i = 1; i < sw->num_ports; i++) { + p = osm_node_get_physp_ptr(sw->p_node, i); + if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw) + sw->endport_links++; + } +} + +static int compar_sw_load(const void *s1, const void *s2) +{ +#define get_sw_endport_links(s) (*(osm_switch_t **)s)->endport_links + return get_sw_endport_links(s2) - get_sw_endport_links(s1); +} + +static void sort_ports_by_switch_load(osm_ucast_mgr_t *m) +{ + int i, num = 
cl_qmap_count(&m->p_subn->sw_guid_tbl); + void **s = malloc(num * sizeof(*s)); + if (!s) { + OSM_LOG(m->p_log, OSM_LOG_ERROR, "ERR: " + "No memory, skip by switch load sorting.\n"); + return; + } + s[0] = cl_qmap_head(&m->p_subn->sw_guid_tbl); + for (i = 1; i < num; i++) + s[i] = cl_qmap_next(s[i-1]); + + for (i = 0; i < num; i++) + sw_count_endport_links(s[i]); + + qsort(s, num, sizeof(*s), compar_sw_load); + + for (i = 0; i < num; i++) + add_sw_endports_to_order_list(s[i], m); +} + static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) { cl_qlist_init(&p_mgr->port_order_list); @@ -758,7 +817,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : " "cannot parse guid routing order file \'%s\'\n", p_mgr->p_subn->opt.guid_routing_order_file); - } + } else + sort_ports_by_switch_load(p_mgr); if (p_mgr->p_subn->opt.port_prof_ignore_file) { cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sun Feb 8 15:04:06 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 01:04:06 +0200 Subject: [ofa-general] [PATCH] opensm/ftree: cleanup ftree_sw_tbl_element_t use Message-ID: <20090208230406.GC24514@sashak.voltaire.com> cl_list() allocates memory needed for storing an object in the list - no need additional wrappers like ftree_sw_tbl_element_t. 
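The cleanup works because cl_list_t allocates its own item storage, so a BFS queue can hold the switch pointers themselves instead of wrapper elements. Below is a minimal standalone sketch of the same idea, assuming a hypothetical fixed-capacity pointer queue and a simplified struct sw (neither is complib or ftree code). Note that the switch enqueued inside the loop is the newly discovered remote switch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SW 16	/* hypothetical capacity for the sketch */

/* Hypothetical switch: only what the BFS needs. */
struct sw {
	int index;		/* assigned by the BFS, -1 = not yet */
	int n_down;
	struct sw *down[4];	/* switches reachable via down ports */
};

/* A queue of plain pointers: no per-element wrapper is allocated,
 * which is the point of the cl_list_t cleanup. */
struct ptr_queue {
	void *items[MAX_SW];
	size_t head, tail;
};

static void q_push(struct ptr_queue *q, void *p)
{
	assert(q->tail < MAX_SW);
	q->items[q->tail++] = p;
}

static void *q_pop(struct ptr_queue *q)
{
	return q->head == q->tail ? NULL : q->items[q->head++];
}

/* Assigns indexes in BFS order starting from root and returns how many
 * switches were indexed. */
static int bfs_index(struct sw *root)
{
	struct ptr_queue q = { .head = 0, .tail = 0 };
	struct sw *cur;
	int next = 0, i;

	root->index = next++;
	q_push(&q, root);
	while ((cur = q_pop(&q)) != NULL) {
		for (i = 0; i < cur->n_down; i++) {
			struct sw *remote = cur->down[i];

			if (remote->index >= 0)
				continue;	/* already discovered */
			remote->index = next++;
			/* enqueue the newly discovered switch, not cur */
			q_push(&q, remote);
		}
	}
	return next;
}
```

Each switch is pushed at most once, so the queue never needs more slots than there are switches.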
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_ftree.c | 17 ++++------------- 1 files changed, 4 insertions(+), 13 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 68900d8..10096c7 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -1418,7 +1418,6 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree) ftree_tuple_t new_tuple; uint32_t i; cl_list_t bfs_list; - ftree_sw_tbl_element_t *p_sw_tbl_element; OSM_LOG_ENTER(&p_ftree->p_osm->log); @@ -1465,14 +1464,10 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree) */ cl_list_init(&bfs_list, cl_qmap_count(&p_ftree->sw_tbl)); - cl_list_insert_tail(&bfs_list, - &__osm_ftree_sw_tbl_element_create(p_sw)->map_item); + cl_list_insert_tail(&bfs_list, p_sw); while (!cl_is_list_empty(&bfs_list)) { - p_sw_tbl_element = - (ftree_sw_tbl_element_t *) cl_list_remove_head(&bfs_list); - p_sw = p_sw_tbl_element->p_sw; - __osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element); + p_sw = (ftree_sw_t *) cl_list_remove_head(&bfs_list); /* Discover all the nodes from ports that are pointing down */ @@ -1509,9 +1504,7 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree) new_tuple); /* add the newly discovered switch to the BFS queue */ - cl_list_insert_tail(&bfs_list, - &__osm_ftree_sw_tbl_element_create - (p_remote_sw)->map_item); + cl_list_insert_tail(&bfs_list, p_sw); } /* Done assigning indexes to all the remote switches that are pointed by the downgoing ports. @@ -1547,9 +1540,7 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree) p_remote_sw, new_tuple); /* add the newly discovered switch to the BFS queue */ - cl_list_insert_tail(&bfs_list, - &__osm_ftree_sw_tbl_element_create - (p_remote_sw)->map_item); + cl_list_insert_tail(&bfs_list, p_sw); } /* Done assigning indexes to all the remote switches that are pointed by the upgoing ports. 
-- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sun Feb 8 15:08:30 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 01:08:30 +0200 Subject: [ofa-general] [PATCH] opensm/ftree: simplify root guids setup. Message-ID: <20090208230830.GD24514@sashak.voltaire.com> Eliminate root_guid_list storage - parse it directly to bfs list. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_ftree.c | 101 +++++++++++++------------------------- 1 files changed, 35 insertions(+), 66 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 10096c7..35f2ea1 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -100,11 +100,6 @@ struct ftree_fabric_t_; typedef uint8_t ftree_tuple_t[FTREE_TUPLE_LEN]; typedef uint64_t ftree_tuple_key_t; -struct guid_list_item { - cl_list_item_t list; - uint64_t guid; -}; - /*************************************************** ** ** ftree_sw_table_element_t definition @@ -203,7 +198,6 @@ typedef struct ftree_fabric_t_ { cl_qmap_t hca_tbl; cl_qmap_t sw_tbl; cl_qmap_t sw_by_tuple_tbl; - cl_qlist_t root_guid_list; cl_qmap_t cn_guid_tbl; unsigned cn_num; uint8_t leaf_switch_rank; @@ -886,8 +880,6 @@ static ftree_fabric_t *__osm_ftree_fabric_create() cl_qmap_init(&p_ftree->sw_by_tuple_tbl); cl_qmap_init(&p_ftree->cn_guid_tbl); - cl_qlist_init(&p_ftree->root_guid_list); - return p_ftree; } @@ -953,10 +945,6 @@ static void __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree) } cl_qmap_remove_all(&p_ftree->cn_guid_tbl); - /* remove all the elements of root_guid_list */ - while (!cl_is_qlist_empty(&p_ftree->root_guid_list)) - free(cl_qlist_remove_head(&p_ftree->root_guid_list)); - /* free the leaf switches array */ if ((p_ftree->leaf_switches_num > 0) && (p_ftree->leaf_switches)) free(p_ftree->leaf_switches); @@ -3045,16 +3033,41 @@ Exit: /*************************************************** ***************************************************/ +struct 
rank_root_cxt { + ftree_fabric_t *fabric; + cl_list_t *list; +}; + +static int rank_root_sw_by_guid(void *cxt, uint64_t guid, char *p) +{ + struct rank_root_cxt *c = cxt; + ftree_sw_t *sw; + + sw = __osm_ftree_fabric_get_sw_by_guid(c->fabric, cl_hton64(guid)); + if (!sw) { + /* the specified root guid wasn't found in the fabric */ + OSM_LOG(&c->fabric->p_osm->log, OSM_LOG_ERROR, "ERR AB24: " + "Root switch GUID 0x%" PRIx64 " not found\n", guid); + return 0; + } + + OSM_LOG(&c->fabric->p_osm->log, OSM_LOG_DEBUG, + "Ranking root switch with GUID 0x%" PRIx64 "\n", guid); + sw->rank = 0; + cl_list_insert_tail(c->list, sw); + + return 0; +} static int __osm_ftree_fabric_rank_from_roots(IN ftree_fabric_t * p_ftree) { + struct rank_root_cxt context; osm_node_t *p_osm_node; osm_node_t *p_remote_osm_node; osm_physp_t *p_osm_physp; ftree_sw_t *p_sw; ftree_sw_t *p_remote_sw; cl_list_t ranking_bfs_list; - struct guid_list_item *item; int res = 0; unsigned num_roots; unsigned max_rank = 0; @@ -3064,25 +3077,16 @@ static int __osm_ftree_fabric_rank_from_roots(IN ftree_fabric_t * p_ftree) cl_list_init(&ranking_bfs_list, 10); /* Rank all the roots and add them to list */ - for (item = (void *)cl_qlist_head(&p_ftree->root_guid_list); - item != (void *)cl_qlist_end(&p_ftree->root_guid_list); - item = (void *)cl_qlist_next(&item->list)) { - p_sw = - __osm_ftree_fabric_get_sw_by_guid(p_ftree, - cl_hton64(item->guid)); - if (!p_sw) { - /* the specified root guid wasn't found in the fabric */ - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB24: " - "Root switch GUID 0x%" PRIx64 " not found\n", - item->guid); - continue; - } + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "Fetching root nodes from file %s\n", + p_ftree->p_osm->subn.opt.root_guid_file); - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, - "Ranking root switch with GUID 0x%" PRIx64 "\n", - item->guid); - p_sw->rank = 0; - cl_list_insert_tail(&ranking_bfs_list, p_sw); + context.fabric = p_ftree; + context.list = 
&ranking_bfs_list; + if (parse_node_map(p_ftree->p_osm->subn.opt.root_guid_file, + rank_root_sw_by_guid, &context)) { + res = -1; + goto Exit; } num_roots = cl_list_count(&ranking_bfs_list); @@ -3314,21 +3318,6 @@ Exit: /*************************************************** ***************************************************/ -static int add_guid_item_to_list(void *cxt, uint64_t guid, char *p) -{ - cl_qlist_t *list = cxt; - struct guid_list_item *item; - - item = malloc(sizeof(*item)); - if (!item) - return -1; - - item->guid = guid; - cl_qlist_insert_tail(list, &item->list); - - return 0; -} - static int add_guid_item_to_map(void *cxt, uint64_t guid, char *p) { cl_qmap_t *map = cxt; @@ -3350,26 +3339,6 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree) OSM_LOG_ENTER(&p_ftree->p_osm->log); - if (__osm_ftree_fabric_roots_provided(p_ftree)) { - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, - "Fetching root nodes from file %s\n", - p_ftree->p_osm->subn.opt.root_guid_file); - - if (parse_node_map(p_ftree->p_osm->subn.opt.root_guid_file, - add_guid_item_to_list, - &p_ftree->root_guid_list)) { - status = -1; - goto Exit; - } - - if (!cl_qlist_count(&p_ftree->root_guid_list)) { - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB22: " - "Root guids file has no valid guids\n"); - status = -1; - goto Exit; - } - } - if (__osm_ftree_fabric_cns_provided(p_ftree)) { OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Fetching compute nodes from file %s\n", -- 1.6.1.2.319.gbd9e From nicolas.morey-chaisemartin at ext.bull.net Sun Feb 8 23:01:32 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Mon, 09 Feb 2009 08:01:32 +0100 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation Message-ID: <498FD4CC.8070900@ext.bull.net> Here is an updated version of the patch including Yevgeni's feedback. 
Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/opensm/osm_ucast_ftree.c | 39 +++++++++++++++++++++++---------------- 1 files changed, 23 insertions(+), 16 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... Name: 2f1d358f2bdf67838fe8776438b7757d9dcd6e15.diff Type: text/x-patch Size: 3805 bytes Desc: not available URL: From ofedrnicuser at yahoo.com Sun Feb 8 23:36:34 2009 From: ofedrnicuser at yahoo.com (Ofed User) Date: Sun, 8 Feb 2009 23:36:34 -0800 (PST) Subject: [ofa-general] non zero lkey in send(), write() with num_sge > 1? Message-ID: <661509.82751.qm@web111205.mail.gq1.yahoo.com> Hi, Can the stack pass num_sge > 1, with lkey != 0 in the sg_list[] elements, in a post_send() call? Regards, Bill From nicolas.morey-chaisemartin at ext.bull.net Sun Feb 8 23:40:05 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Mon, 09 Feb 2009 08:40:05 +0100 Subject: [ofa-general] [RFC] Fat-Tree upgrades Message-ID: <498FDDD5.1090204@ext.bull.net> Hi everyone, We have been working quite a lot at Bull lately on the Ftree algorithm and we have made some upgrades. However, as they modify the behavior of the ftree algorithm, we haven't pushed them until now. I'm going to describe the upgrades we have made and let you decide whether you are interested and, if so, how they should be pushed upstream (new routing algorithm, option in the ftree, etc.). Here is a simplified model of the topology we have been working on: L3 L3 ___________________|__|____________________ / / \ \ <= All the L2 are connected to 2 L3 switches L2-1 L2-2 L2-1 L2-2 <= There are service nodes connected directly to L2 switches / \S1 / \S2 S3/ \ S4/ \ <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning. L1 L1 L1 L1 /|\ /|\ /|\ /|\ ==Fully mixed to L1== ==Fully mixed to L1== <=== We have multiple sets. In each set, all L0 lead to all L1 of their set. L0 L0 L0 L0 / \ / \ / \ / \ CN CN .. CN CN .... CN CN .. 
CN CN To detail: We have a number of sets. Each set contains compute nodes plus L0 and L1 switches, and all sets share a common top of L2 and L3 switches. In each set, the compute nodes are organized in groups, and each group is connected to a single L0 switch. In a given set, every L0 is connected to every L1. The Nth L1 of a set is connected to the Nth L2 and only to that one (so through an L2, the Nth L1 can only see the Nth L1 of the other sets). There are service nodes connected to the L2 switches. All the L2 switches are connected to a pair of L3 switches. The problem we have seen when routing on this topology is that most of the routes from CNs to SNs (service nodes) go through the L3 switches. With the current algorithm, the least loaded link is chosen to go down by going up. Therefore, the primary path goes through an L2, then an L3, from which it covers the whole network. This wasn't acceptable for us, as the L3 switches would be overloaded even though less loaded/shorter paths to the same HCA existed. So what we have introduced is a "balanced min_hop" within the ftree algorithm. Basically, instead of just returning when we reach a switch whose LFT has already been configured for the target LID, we compare the switch's hop count toward this LID with the hop count of the path we came through. If we have found a shorter path, we update the LFT and minhop tables to use this new path. This means that the difference between the primary path and the secondary path is no longer so important. A secondary path may increment port counters, but only if routes to HCAs were created (see "opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation", which makes this possible). I acknowledge that the port counts may be slightly off: a primary path that is replaced by a shorter secondary path has already incremented the counters, and they won't be decremented. However, in most cases the primary path will have created other routes besides the one replaced, so the counters are fine.
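The "balanced min_hop" check described above can be sketched as follows. This is a hypothetical model, not the osm_ucast_ftree.c code: struct sw_routes, sw_routes_init() and route_balanced_min_hop() are illustrative names, and each switch is reduced to two per-LID arrays (out port and hop count). Where the original behavior stops as soon as the LFT entry is set, the sketch replaces the entry when the incoming path is strictly shorter. Continuing propagation after an update is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LID 64		/* hypothetical LID space for the sketch */
#define NO_PORT 0xFF		/* "LFT entry not configured" marker */

/* Hypothetical per-switch routing state. */
struct sw_routes {
	uint8_t lft[MAX_LID];	/* out port towards each LID */
	uint8_t hops[MAX_LID];	/* hop count of the route in lft[] */
};

static void sw_routes_init(struct sw_routes *sw)
{
	for (int i = 0; i < MAX_LID; i++) {
		sw->lft[i] = NO_PORT;
		sw->hops[i] = UINT8_MAX;
	}
}

/* Returns true if the route was installed or improved, i.e. the caller
 * should keep propagating from this switch (assumed behavior). The
 * original algorithm corresponds to returning false as soon as
 * lft[lid] is already set. */
static bool route_balanced_min_hop(struct sw_routes *sw, uint16_t lid,
				   uint8_t port, uint8_t hops)
{
	assert(lid < MAX_LID);
	if (sw->lft[lid] != NO_PORT && sw->hops[lid] <= hops)
		return false;	/* existing route is as short or shorter */
	sw->lft[lid] = port;
	sw->hops[lid] = hops;
	return true;
}
```

Ties keep the existing entry, so an equally long secondary path does not disturb the primary one; only a strictly shorter path (for example one that avoids the L3 level) replaces it.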
For all regular ftree topologies, I have seen no change with this update, but with topologies where two levels are not fully interconnected, it helps a lot! The other thing we have developed is better balancing of secondary paths. In the current algorithm, secondary down paths (going down by going up) are created in port group order. This means that if the primary path didn't reach the whole network (because a switch is broken, for example), all the missing routes are created through the first port group, which unbalances the network load a lot. To solve this, we create the secondary paths according to port group load. The previous patch makes us increment the port/port group counters when secondary routes towards an HCA are created, so these counters are meaningful even when creating secondary routes. At the beginning of the function, our patch sorts all the port groups from lowest to highest load, picks the first one for the primary path, and tries secondary paths from the second to the last. Once again, this seems to have no effect on regular topologies, but it made a real impact on our failover tests. Feel free to comment on this, especially on whether and how you would want these changes upstream. Thanks in advance Nicolas From nicolas.morey-chaisemartin at ext.bull.net Sun Feb 8 23:43:27 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Mon, 09 Feb 2009 08:43:27 +0100 Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes In-Reply-To: <20090207202319.GE27757@sashak.voltaire.com> References: <494A5339.9030304@ext.bull.net> <20090207185551.GD27757@sashak.voltaire.com> <498DE57D.4030501@morey-chaisemartin.com> <20090207202319.GE27757@sashak.voltaire.com> Message-ID: <498FDE9F.7080604@ext.bull.net> Sasha Khapyorsky wrote: > On 20:48 Sat 07 Feb , Nicolas Morey-Chaisemartin wrote: > >>> "IO" is specific for your setup. Could we find more generic name for such >>> nodes? >>> >>> >>> >> Sure. Any ideas?
>> > > No, I didn't think about it. > > >> >> Okay, I'll fix the indentation and few coding style error. There is also >> a bug in the current patch as the hops counter are not set to the right >> value when creating route which had reverse hops number of reverse >> hops*2 should be added. >> >> I'll have to rewrite the patches so they work with the current HEAD. >> Specially with the option system changes it won't merge cleanly. >> > > Use 'git-rebase master ' - it does the job with only two > trivial conflicts. > > Sasha > > > Well, I still need to rename the option and fix the hop counts, but most of all it will conflict with (and break) the fix I've just reposted about port incrementation (reverse_hop also needs to return a boolean value, which it doesn't right now). I have a working version in the Bull tree, but there have been too many modifications in the surrounding code to merge it cleanly. I'll rewrite it cleanly as soon as I get some time. Nicolas From nicolas.morey-chaisemartin at ext.bull.net Mon Feb 9 00:26:04 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Mon, 09 Feb 2009 09:26:04 +0100 Subject: [ofa-general] [PATCH] opensm/osm_console.c : Added getguid function to console to generate a list of guid matching one or more regexps Message-ID: <498FE89C.2020304@ext.bull.net> This add a getguid functionnality to openSM console which makes it really easy to generate cn_guid_file, root_guid_file and such by dumping into a file all port guids whom nodedesc contains at least one of the provided regexps Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/opensm/osm_console.c | 131 +++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 131 insertions(+), 0 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed...
Name: 006049bce16cd282d40dc9598f4baaa2aa5b0fdf.diff Type: text/x-patch Size: 4324 bytes Desc: not available URL: From vlad at lists.openfabrics.org Mon Feb 9 03:16:53 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 9 Feb 2009 03:16:53 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090209-0200 daily build status Message-ID: <20090209111653.A76E5E60F20@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 
with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From dorfman.eli at gmail.com Mon Feb 9 05:47:54 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Mon, 09 Feb 2009 15:47:54 +0200 Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet configuration after SIGHUP In-Reply-To: <20090208213826.GA24254@sashak.voltaire.com> References: <20090202205924.GF5910@sashak.voltaire.com> <49880E4D.2090107@gmail.com> <20090203124407.GE11874@sashak.voltaire.com> <49884962.5070601@gmail.com> <20090203134831.GI11874@sashak.voltaire.com> <498850A2.8090701@gmail.com> <20090205000323.GN11874@sashak.voltaire.com> <498A9888.5010003@gmail.com> <20090205121634.GQ11874@sashak.voltaire.com> <694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com> <20090208213826.GA24254@sashak.voltaire.com> Message-ID: <4990340A.10004@gmail.com> Sasha Khapyorsky wrote: > Hi Eli, > > On 21:23 Sun 08 Feb , Eli Dorfman wrote: >> yes, but wouldn't it be better to separate between heavy sweep and >> config rescan (due to SIGHUP). > > SIGHUP main purpose always was to trigger heavy sweep. > >> I think that user should know when configuration is updated and not >> wait for heavy sweep. > > I'm not following - SIGHUP will cause heavy sweep and config update, > where is a waiting? > i meant that if the user is changing config file and there is a heavy sweep then config may be updated, while using specific flag for config rescan will avoid this case. 
Eli From sashak at voltaire.com Mon Feb 9 06:17:32 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 16:17:32 +0200 Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet configuration after SIGHUP In-Reply-To: <4990340A.10004@gmail.com> References: <20090203124407.GE11874@sashak.voltaire.com> <49884962.5070601@gmail.com> <20090203134831.GI11874@sashak.voltaire.com> <498850A2.8090701@gmail.com> <20090205000323.GN11874@sashak.voltaire.com> <498A9888.5010003@gmail.com> <20090205121634.GQ11874@sashak.voltaire.com> <694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com> <20090208213826.GA24254@sashak.voltaire.com> <4990340A.10004@gmail.com> Message-ID: <20090209141732.GF26139@sashak.voltaire.com> On 15:47 Mon 09 Feb , Eli Dorfman (Voltaire) wrote: > Sasha Khapyorsky wrote: > > Hi Eli, > > > > On 21:23 Sun 08 Feb , Eli Dorfman wrote: > >> yes, but wouldn't it be better to separate between heavy sweep and > >> config rescan (due to SIGHUP). > > > > SIGHUP main purpose always was to trigger heavy sweep. > > > >> I think that user should know when configuration is updated and not > >> wait for heavy sweep. > > > > I'm not following - SIGHUP will cause heavy sweep and config update, > > where is a waiting? > > > > i meant that if the user is changing config file and there is a heavy sweep then > config may be updated, Are you talking about a race between file reading (by OpenSM) and writing (by the user)? Using a write lock when reading would solve the issue. > while using specific flag for config rescan will avoid this case. What do you mean by "specific flag"? A separate signal? Assuming so, it will not prevent the read/write race.
Sasha From sashak at voltaire.com Mon Feb 9 06:19:15 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 16:19:15 +0200 Subject: [ofa-general] [PATCH] ibsim: fix port initial state Message-ID: <20090209141915.GG26139@sashak.voltaire.com> The initial port state in the PortInfo template for connected ports was ACTIVE. This prevented OpenSM from making the INIT -> ARMED -> ACTIVE PortInfo transition typical of a real fabric. Signed-off-by: Sasha Khapyorsky --- ibsim/sim_net.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c index ee268e0..7a42cb6 100644 --- a/ibsim/sim_net.c +++ b/ibsim/sim_net.c @@ -80,7 +80,7 @@ static const uint8_t swport[] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0x03, 0x02, - 0x14, 0x52, 0x00, 0x11, 0x40, 0x40, 0x00, 0x08, + 0x12, 0x52, 0x00, 0x11, 0x40, 0x40, 0x00, 0x08, 0x08, 0x04, 0xFF, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, @@ -102,7 +102,7 @@ static const uint8_t hcaport[] = { 0xFE, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x01, 0x00, 0x50, 0x02, 0x48, 0x00, 0x00, 0x0F, 0xF9, 0x01, 0x03, 0x03, 0x02, - 0x14, 0x52, 0x00, 0x11, 0x40, 0x40, 0x00, 0x08, + 0x12, 0x52, 0x00, 0x11, 0x40, 0x40, 0x00, 0x08, 0x08, 0x04, 0xFF, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x1F, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Mon Feb 9 06:41:35 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 16:41:35 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation In-Reply-To: <498FD4CC.8070900@ext.bull.net> References: <498FD4CC.8070900@ext.bull.net> Message-ID: <20090209144135.GH26139@sashak.voltaire.com> Hi Nicolas, On 08:01 Mon 09 Feb , Nicolas
Morey Chaisemartin wrote: > Here is an updated version of the patch including Yevgeni's feedback. Could you provide a more descriptive commit message? This text will be stored in the OpenSM change history, and your current comment doesn't say much. If you need to put text in the patch message that should not go into the change log (such as details about differences from the previous version of the patch), place it after the '---' line below. > > Signed-off-by: Nicolas Morey-Chaisemartin > > --- Sasha > opensm/opensm/osm_ucast_ftree.c | 39 > +++++++++++++++++++++++---------------- > 1 files changed, 23 insertions(+), 16 deletions(-) > > > diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > index 68900d8..3ea61a1 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item, > * assign-up-going-port-by-descending-down to r-port node (recursion) > */ > > -static void > +static boolean_t > __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, > IN ftree_sw_t * p_sw, > IN ftree_sw_t * p_prev_sw, > @@ -1932,18 +1932,14 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, > uint16_t i; > uint16_t j; > uint16_t k; > + boolean_t created_route = FALSE; > > /* we shouldn't enter here if both real_lid and main_path are false */ > CL_ASSERT(is_real_lid || is_main_path); > > /* if there is no down-going ports */ > if (p_sw->down_port_groups_num == 0) > - return; > - > - /* promote the index that indicates which group should we > - start with when going through all the downgoing groups */ > - p_sw->down_port_groups_idx = > - (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; > + return FALSE;; > > /* foreach down-going port group (in indexing order) */ > i = p_sw->down_port_groups_idx; > @@ -1952,9 +1948,12 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN
ftree_fabric_t * p_ftree, > p_group = p_sw->down_port_groups[i]; > i = (i + 1) % p_sw->down_port_groups_num; > > - /* Skip this port group unless it points to a switch */ > - if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) > + /* If this port group doesn't point to a switch, mark > + that the route was created and skip to the next group */ > + if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH){ > + created_route = TRUE; > continue; > + } > > if (p_prev_sw > && (p_group->remote_base_lid == p_prev_sw->base_lid)) { > @@ -2073,16 +2072,24 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, > > /* Recursion step: > Assign upgoing ports by stepping down, starting on REMOTE switch */ > - __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ > - NULL, /* prev. position - NULL to mark that we went down and not up */ > - target_lid, /* LID that we're routing to */ > - target_rank, /* rank of the LID that we're routing to */ > - is_real_lid, /* whether the target LID is real or dummy */ > - is_main_path, /* whether this is path to HCA that should by tracked by counters */ > - highest_rank_in_route); /* highest visited point in the tree before going down */ > + created_route |= __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ > + NULL, /* prev. 
position - NULL to mark that we went down and not up */ > + target_lid, /* LID that we're routing to */ > + target_rank, /* rank of the LID that we're routing to */ > + is_real_lid, /* whether the target LID is real or dummy */ > + is_main_path, /* whether this is path to HCA that should by tracked by counters */ > + highest_rank_in_route); /* highest visited point in the tree before going down */ > } > /* done scanning all the down-going port groups */ > > + /* if the route was created, promote the index that > + indicates which group should we start with when > + going through all the downgoing groups */ > + if (created_route) > + p_sw->down_port_groups_idx = > + (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; > + > + return created_route; > } /* __osm_ftree_fabric_route_upgoing_by_going_down() */ > > /***************************************************/ > From sashak at voltaire.com Mon Feb 9 07:14:51 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 17:14:51 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c : Added getguid function to console to generate a list of guid matching one or more regexps In-Reply-To: <498FE89C.2020304@ext.bull.net> References: <498FE89C.2020304@ext.bull.net> Message-ID: <20090209151451.GI26139@sashak.voltaire.com> Hi Nicolas, Some initial comments... On 09:26 Mon 09 Feb , Nicolas Morey Chaisemartin wrote: > This add a getguid functionnality to openSM console which makes it really > easy to generate cn_guid_file, root_guid_file and such > by dumping into a file all port guids whom nodedesc contains at least one > of the provided regexps I see that this specific command is about port guids and not node guids. What is about better name such "dump_portguids"? (Another possibility would be implementation of single "dump" command with various parameters such as "config", "portguids", "nodeguids", etc.). 
> > Signed-off-by: Nicolas Morey-Chaisemartin > > --- > opensm/opensm/osm_console.c | 131 > +++++++++++++++++++++++++++++++++++++++++++ > 1 files changed, 131 insertions(+), 0 deletions(-) > > > diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c > index c6e8e59..e4dc6e9 100644 > --- a/opensm/opensm/osm_console.c > +++ b/opensm/opensm/osm_console.c > @@ -42,6 +42,7 @@ > #include > #include > #include > +#include > #ifdef ENABLE_OSM_CONSOLE_SOCKET > #include > #endif > @@ -1172,6 +1173,135 @@ static void version_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > fprintf(out, "%s build %s %s\n", p_osm->osm_version, __DATE__, __TIME__); > } > > +typedef struct _regexp_list { > + regex_t exp; > + struct _regexp_list* next; > +} regexp_list_t; > + > + > +static void getguid_parse(char **p_last, osm_opensm_t *p_osm, FILE *out) > +{ > + cl_qmap_t *p_port_guid_tbl; > + osm_port_t* p_port; > + osm_port_t* p_next_port; > + > + regexp_list_t* p_head_regexp=NULL; > + regexp_list_t* p_regexp; > + > + /* Option variables*/ > + char* p_cmd=NULL; > + FILE* output=out; > + int exit_after_run=0; > + extern volatile unsigned int osm_exit_flag; > + > + /* Read commande line */ > + > + while(1){ Try opensm/osm_indent (many places in the patch will be affected). > + p_cmd = next_token(p_last); > + if (p_cmd) { > + if (strcmp(p_cmd, "exit_after_run") == 0) { > + exit_after_run = 1; > + } else if (strcmp(p_cmd, "file") == 0) { > + p_cmd=next_token(p_last); > + if(p_cmd){ > + output = fopen(p_cmd,"w+"); > + if(output == NULL){ > + fprintf(out,"Could not open file %s: %s\n",p_cmd,strerror(errno)); > + output = out; > + } > + } else { > + /* No file name passed */ > + fprintf(out,"No file name passed\n"); > + } > + } else { > + p_regexp = malloc(sizeof(*p_regexp)); > + if(regcomp(&(p_regexp->exp),p_cmd,REG_NOSUB|REG_EXTENDED)!=0){ > + fprintf(out,"Couldn't parse regular expression %s. 
Skipping it.\n",p_cmd); > + } > + p_regexp->next = p_head_regexp; > + p_head_regexp = p_regexp; > + } > + } else { > + /* No more tokens */ > + break; > + } Here and in other places - no need braces about single operation. > + } > + > + /* Check we have at least one expression to match */ > + if(p_head_regexp == NULL){ > + fprintf(out,"No valid expression provided. Aborting\n"); > + return; > + } > + > + /* Ensure this SM is master (so we have the LFT) */ > + > + getguid_wait_init: > + if(osm_exit_flag) > + return; > + cl_spinlock_acquire(&p_osm->sm.state_lock); > + /* If the subnet struct is not properly initialized, we exit */ > + if(p_osm->sm.p_subn == NULL){ > + cl_spinlock_release(&p_osm->sm.state_lock); > + sleep(1); > + goto getguid_wait_init; > + } The console is initialized after osm_subnet. When will the case (p_osm->sm.p_subn == NULL) be valid? > + if(p_osm->sm.p_subn->sm_state != IB_SMINFO_STATE_MASTER){ > + cl_spinlock_release(&p_osm->sm.state_lock); > + sleep(1); > + goto getguid_wait_init; > + } This will cause to endless loop when OpenSM is in Standby or Inactive states. > + cl_spinlock_release(&p_osm->sm.state_lock); > + if(p_osm->sm.p_subn->need_update != 0){ > + sleep(1); > + goto getguid_wait_init; > + } Subnet discovery/setup could take some time. An user may want to use console for other things in this time. I don't think that sleeping is suitable here, better to print "try later" message or like this. > + > + /* Subnet doesn't need to be updated so we can carry on */ > + > + > + CL_PLOCK_EXCL_ACQUIRE(p_osm->sm.p_lock); > + p_port_guid_tbl = &(p_osm->sm.p_subn->port_guid_tbl); > + > + > + No need more than one empty line as separator (osm_indent... :)). 
> + p_next_port = (osm_port_t*)cl_qmap_head(p_port_guid_tbl); > + while (p_next_port != (osm_port_t*)cl_qmap_end(p_port_guid_tbl)) { > + > + p_port = p_next_port; > + p_next_port = (osm_port_t*)cl_qmap_next(&p_next_port->map_item); > + > + for(p_regexp = p_head_regexp;p_regexp!=NULL;p_regexp = p_regexp->next){ > + if(regexec(&(p_regexp->exp),p_port->p_node->print_desc,0,NULL,0) == 0){ > + fprintf(output,"0x%"PRIxLEAST64"\n",cl_ntoh64(p_port->p_physp->port_guid)); > + } > + } > + } > + > +CL_PLOCK_RELEASE(p_osm->sm.p_lock); > + if(output != out) > + fclose(output); > + if(exit_after_run) > + osm_exit_flag = 1; Why this 'exit_after_run'? If you need functionality to exit OpenSM triggered from console (but it is not clear for me why) use another command. > + > +} > + > + > + > + No need more than one empty line as separator (osm_indent... :)). > +static void help_getguid(FILE * out, int detail) > +{ > + fprintf(out, "getguid [exit_after_run|file filename] regexp1 [regexp2 [regexp3 ...]] -- Dump port GUID matching a regexp \n"); > + if (detail) { > + fprintf(out, > + "getguid -- Dump all the port GUID whom node_desc matches one of the provided regexp\n"); > + fprintf(out, > + " [file filename] -- Send the port GUID list to the specified file instead of regular output\n"); > + fprintf(out, > + " [exit_after_run] -- Quit OpenSM once the port GUID have been displayed\n"); > + } > + > +} > + > /* more parse routines go here */ > > static const struct command console_cmds[] = { > @@ -1192,6 +1322,7 @@ static const struct command console_cmds[] = { > #ifdef ENABLE_OSM_PERF_MGR > {"perfmgr", &help_perfmgr, &perfmgr_parse}, > #endif /* ENABLE_OSM_PERF_MGR */ > + {"getguid", &help_getguid, &getguid_parse}, > {NULL, NULL, NULL} /* end of array */ > }; > > From nicolas.morey-chaisemartin at ext.bull.net Mon Feb 9 07:55:46 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Mon, 09 Feb 2009 16:55:46 +0100 Subject: [ofa-general] [PATCH v3] 
opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation Message-ID: <49905202.3050406@ext.bull.net> This patch fixes a bug in port index incrementation in the fat-tree algorithm. The problem happens (at least) with a 4-level fat tree like the one below: L3 L3 ___________________|__|____________________ / / \ \ <= All the L2 are connected to 2 L3 switches L2-1 L2-2 L2-1 L2-2 / / \ \ <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning. L1 L1 L1 L1 /|\ /|\ /|\ /|\ ==Fully mixed to L1== ==Fully mixed to L1== <=== We have multiple sets. In each set, all L0 lead to all L1 of their set. L0 L0 L0 L0 / \ / \ / \ / \ CN CN .. CN CN .... CN CN .. CN CN To detail: We have a number of sets. Each set contains compute nodes, L0 switches and L1 switches; on top of them sits a common layer of L2 and L3 switches. In each set, compute nodes are grouped, and each group is connected to a single L0 switch. In a given set, every L0 is connected to every L1. The Nth L1 of a set is connected to the Nth L2 and only to that one (so through an L2, the Nth L1 can only see the Nth L1 of the other sets). All the L2 switches are connected to two L3 switches. Without the L3 switches, we have a perfectly balanced fat tree with well-balanced routes. When we add the L3 switches, however, a large imbalance appears. Since they are not needed, no route goes through the L3 switches (which is fine). However, a quarter of the L2->L1 links are not used at all, half are half used, and a quarter carry twice the load (compared to the balanced state). This comes from down_port_groups_idx, which is incremented each time the algorithm goes down through a node, whether or not it creates routes to an HCA (port != switch). As a route coming up from an L1 reaches only one L2, the algorithm goes through all the other L2 switches while going down, incrementing their index. Our case is a bit specific, but the problem may appear whenever the L1 switches are not fully connected to all the L2 switches and there is another switch rank above them.
To avoid this problem, the __osm_ftree_fabric_route_upgoing_by_going_down function has been changed to return a value indicating whether routes to an HCA (in fact, to a leaf switch) were created. With this information, we only increase the index when the algorithm has created routes to an HCA. After applying this patch and measuring the link usage, we are perfectly balanced (L2<->L3 links are still not used, but that is to be expected). Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/opensm/osm_ucast_ftree.c | 39 +++++++++++++++++++++++---------------- 1 files changed, 23 insertions(+), 16 deletions(-) Repost of the patch with Yevgeni's comment and a more complete description :) Hope it's good this time. -------------- next part -------------- A non-text attachment was scrubbed... Name: 2f1d358f2bdf67838fe8776438b7757d9dcd6e15.diff Type: text/x-patch Size: 3806 bytes Desc: not available URL: From devel at morey-chaisemartin.com Mon Feb 9 08:04:58 2009 From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Mon, 09 Feb 2009 17:04:58 +0100 Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c : Added getguid function to console to generate a list of guid matching one or more regexps In-Reply-To: <20090209151451.GI26139@sashak.voltaire.com> References: <498FE89C.2020304@ext.bull.net> <20090209151451.GI26139@sashak.voltaire.com> Message-ID: <4990542A.5040907@morey-chaisemartin.com> Sasha Khapyorsky wrote: > Hi Nicolas, > > Some initial comments... > > On 09:26 Mon 09 Feb , Nicolas Morey Chaisemartin wrote: > >> This add a getguid functionnality to openSM console which makes it really >> easy to generate cn_guid_file, root_guid_file and such >> by dumping into a file all port guids whom nodedesc contains at least one >> of the provided regexps >> > > I see that this specific command is about port guids and not node guids. > What is about better name such "dump_portguids"? (Another possibility
(Another possibility > would be implementation of single "dump" command with various parameters > such as "config", "portguids", "nodeguids", etc.). > > Dumping port guid is specially useful to generate config files. I've never had the need to dump nodeguid. If people need it, why not make a global dump. If not, it may be simpler to rename to dump_portguids > > Try opensm/osm_indent (many places in the patch will be affected). > > Last time I tried osm_indent, it introduced a real lot of changes to the code (even the one I didn't edited) so I haven't used it on my patches. I'll fix the indentation. >> + /* Ensure this SM is master (so we have the LFT) */ >> + >> + getguid_wait_init: >> + if(osm_exit_flag) >> + return; >> + cl_spinlock_acquire(&p_osm->sm.state_lock); >> + /* If the subnet struct is not properly initialized, we exit */ >> + if(p_osm->sm.p_subn == NULL){ >> + cl_spinlock_release(&p_osm->sm.state_lock); >> + sleep(1); >> + goto getguid_wait_init; >> + } >> > > The console is initialized after osm_subnet. When will the case > (p_osm->sm.p_subn == NULL) be valid? > > I didn't knew that, I was just checking my pointers to be sure. >> + if(p_osm->sm.p_subn->sm_state != IB_SMINFO_STATE_MASTER){ >> + cl_spinlock_release(&p_osm->sm.state_lock); >> + sleep(1); >> + goto getguid_wait_init; >> + } >> > > This will cause to endless loop when OpenSM is in Standby or Inactive > states. > > This is some code I used for another function that looks at LFT table. In the other case, I need the SM to be master. I'll change it. >> + cl_spinlock_release(&p_osm->sm.state_lock); >> + if(p_osm->sm.p_subn->need_update != 0){ >> + sleep(1); >> + goto getguid_wait_init; >> + } >> > > Subnet discovery/setup could take some time. An user may want to use > console for other things in this time. I don't think that sleeping is > suitable here, better to print "try later" message or like this. 
> > See comment below > >> + p_next_port = (osm_port_t*)cl_qmap_head(p_port_guid_tbl); >> + while (p_next_port != (osm_port_t*)cl_qmap_end(p_port_guid_tbl)) { >> + >> + p_port = p_next_port; >> + p_next_port = (osm_port_t*)cl_qmap_next(&p_next_port->map_item); >> + >> + for(p_regexp = p_head_regexp;p_regexp!=NULL;p_regexp = p_regexp->next){ >> + if(regexec(&(p_regexp->exp),p_port->p_node->print_desc,0,NULL,0) == 0){ >> + fprintf(output,"0x%"PRIxLEAST64"\n",cl_ntoh64(p_port->p_physp->port_guid)); >> + } >> + } >> + } >> + >> +CL_PLOCK_RELEASE(p_osm->sm.p_lock); >> + if(output != out) >> + fclose(output); >> + if(exit_after_run) >> + osm_exit_flag = 1; >> > > Why this 'exit_after_run'? > > If you need functionality to exit OpenSM triggered from console (but it > is not clear for me why) use another command. > > As for the last two comments, the purpose is to make it easy to script the generation of configuration files. We generate netlists here, and it's much easier to just run echo "getguid exit_after_run file $dir/root_guid_file.txt root_sw" | opensm ... Nicolas From yosefe at Voltaire.COM Mon Feb 9 08:49:07 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Mon, 09 Feb 2009 18:49:07 +0200 Subject: [ofa-general] RE: impossibility to bind a device/port with the rdma-cm when the port is down In-Reply-To: References: <49893FAF.3090007@voltaire.com> <7A76E9B9A2E84721A09AA8FB75C49D7A@amr.corp.intel.com> <4989E6D6.5030109@Voltaire.COM> Message-ID: <49905E83.3020508@Voltaire.COM> When rdma_resolve_addr() is called and the relevant port is down, the function fails and the rdma_cm ID is not bound to the device. Therefore, the application has no device handle and cannot wait for the port to become active. The function fails because IPoIB has not joined the multicast group, so the SA has no multicast record to take a qkey from.
The proposed patch makes qkey resolution lazy: cma_set_qkey() sets id_priv->qkey only if it is not already set, and it is called just before the qkey is actually required. Signed-off-by: Yossi Etigin Acked-by: Sean Hefty --- Fix checkpatch.pl error. drivers/infiniband/core/cma.c | 41 +++++++++++++++++++++++++++-------------- 1 file changed, 27 insertions(+), 14 deletions(-) Index: kernel-ib/drivers/infiniband/core/cma.c =================================================================== --- kernel-ib.orig/drivers/infiniband/core/cma.c 2009-02-04 20:40:20.000000000 +0200 +++ kernel-ib/drivers/infiniband/core/cma.c 2009-02-09 18:45:13.000000000 +0200 @@ -296,21 +296,25 @@ static void cma_detach_from_dev(struct r id_priv->cma_dev = NULL; } -static int cma_set_qkey(struct ib_device *device, u8 port_num, - enum rdma_port_space ps, - struct rdma_dev_addr *dev_addr, u32 *qkey) +static int cma_set_qkey(struct rdma_id_private *id_priv) { struct ib_sa_mcmember_rec rec; int ret = 0; - switch (ps) { + if (id_priv->qkey) + return 0; + + switch (id_priv->id.ps) { case RDMA_PS_UDP: - *qkey = RDMA_UDP_QKEY; + id_priv->qkey = RDMA_UDP_QKEY; break; case RDMA_PS_IPOIB: - ib_addr_get_mgid(dev_addr, &rec.mgid); - ret = ib_sa_get_mcmember_rec(device, port_num, &rec.mgid, &rec); - *qkey = be32_to_cpu(rec.qkey); + ib_addr_get_mgid(&id_priv->id.route.addr.dev_addr, &rec.mgid); + ret = ib_sa_get_mcmember_rec(id_priv->id.device, + id_priv->id.port_num, &rec.mgid, + &rec); + if (!ret) + id_priv->qkey = be32_to_cpu(rec.qkey); break; default: break; @@ -340,12 +344,7 @@ static int cma_acquire_dev(struct rdma_i ret = ib_find_cached_gid(cma_dev->device, &gid, &id_priv->id.port_num, NULL); if (!ret) { - ret = cma_set_qkey(cma_dev->device, - id_priv->id.port_num, - id_priv->id.ps, dev_addr, - &id_priv->qkey); - if (!ret) - cma_attach_to_dev(id_priv, cma_dev); + cma_attach_to_dev(id_priv, cma_dev); break; } } @@ -577,6 +576,10 @@ static int cma_ib_init_qp_attr(struct rd *qp_attr_mask = IB_QP_STATE |
IB_QP_PKEY_INDEX | IB_QP_PORT; if (cma_is_ud_ps(id_priv->id.ps)) { + ret = cma_set_qkey(id_priv); + if (ret) + return ret; + qp_attr->qkey = id_priv->qkey; *qp_attr_mask |= IB_QP_QKEY; } else { @@ -2167,6 +2170,12 @@ static int cma_sidr_rep_handler(struct i event.status = ib_event->param.sidr_rep_rcvd.status; break; } + ret = cma_set_qkey(id_priv); + if (ret) { + event.event = RDMA_CM_EVENT_ADDR_ERROR; + event.status = -EINVAL; + break; + } if (id_priv->qkey != rep->qkey) { event.event = RDMA_CM_EVENT_UNREACHABLE; event.status = -EINVAL; @@ -2446,10 +2455,14 @@ static int cma_send_sidr_rep(struct rdma const void *private_data, int private_data_len) { struct ib_cm_sidr_rep_param rep; + int ret; memset(&rep, 0, sizeof rep); rep.status = status; if (status == IB_SIDR_SUCCESS) { + ret = cma_set_qkey(id_priv); + if (ret) + return ret; rep.qp_num = id_priv->qp_num; rep.qkey = id_priv->qkey; } -- --Yossi From bboas at systemfabricworks.com Mon Feb 9 08:50:31 2009 From: bboas at systemfabricworks.com (Bill Boas) Date: Mon, 9 Feb 2009 08:50:31 -0800 Subject: [ofa-general] RE: [ewg] OFED (EWG) meeting agenda for tomorrow (Feb 09) In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01B1CBBA@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD01B1CBBA@mtlexch01.mtl.com> Message-ID: Tziporet, EWG members and OFA general list readers, Attached is the draft agenda as of last Friday morning; there have been a few changes since then. Also attached is HPCwire's reprint of the press release. These are sent as background for today's RWG call and as information for those considering attending the Sonoma Workshop. The MWG of OFA, chaired by Wayne Augsburger, welcomes your feedback, input and comments, and your presence in Sonoma on Mar 22-25. I'll be on the call in 10 minutes. Bill.
Bill Boas Executive Director and Vice Chair OFA VP, Business Development System Fabric Works 510-375-8840 bboas at systemfabricworks.com www.systemfabricworks.com _____ From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren Sent: Sunday, February 08, 2009 7:58 AM To: Tziporet Koren; ewg at lists.openfabrics.org Cc: general at lists.openfabrics.org Subject: [ewg] OFED (EWG) meeting agenda for tomorrow (Feb 09) These are the agenda items for the meeting tomorrow: 1. OFED 1.4.1 release status: * New OSes: * RH 5.3 - done, we still have an issue with Itanium * SLES 11 - schedule is OK. RC3 already available - Any volunteers to prepare the backports? * Open MPI 1.3 - I heard there are some critical bugs. What is the status of 1.3.1? - Jeff S. * RDS with iWARP support - Steve * NFS/RDMA backports - Steve * Critical bug fixes As far as I know these are the critical bugs that should be fixed: 1383 blo P3 jackm at mellanox.co.il Local protection error on transmit from ipoib datagram to... 1471 cri P3 amirv at mellanox.co.il Performance degradation in ofed 1.4 Please send more bugs that are critical for the release 2. Decide on 1.4.1 schedule: Proposal: * RC1 - Mar 3 * RC2 - Mar 17 * RC3 - Mar 31 * GA - Apr 7 3. Sonoma updates (if any) - Bill Boas Tziporet -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Draft Sonoma 2009 agenda for Feb 6 MWG review.xls Type: application/vnd.ms-excel Size: 101888 bytes Desc: not available URL: -------------- next part -------------- An embedded message was scrubbed... 
From: Subject: HPCwire: OFA to Host 5th Annual International Sonoma Workshop Date: Mon, 9 Feb 2009 08:44:40 -0800 Size: 871805 URL: From randy.dunlap at oracle.com Mon Feb 9 08:53:39 2009 From: randy.dunlap at oracle.com (Randy Dunlap) Date: Mon, 09 Feb 2009 08:53:39 -0800 Subject: [ofa-general] Re: linux-next: Tree for February 9 (infiniband) In-Reply-To: <20090209193908.1a448944.sfr@canb.auug.org.au> References: <20090209193908.1a448944.sfr@canb.auug.org.au> Message-ID: <49905F93.300@oracle.com> Stephen Rothwell wrote: > Hi all, > > [I accidentally deleted the merge and quilt-import logs today :-( - I > wonder if any would have noticed :-). The merge summary still appears > below.] > > Changes since 20090206: allyesconfig build on i386 fails with: drivers/built-in.o: In function `iwch_sgl2pbl_map': /usr/builds/linux-next-20090209/drivers/infiniband/hw/cxgb3/iwch_qp.c:237: undefined reference to `__umoddi3' make: *** [.tmp_vmlinux1] Error 1 or allmodconfig on i386 fails with: ERROR: "__umoddi3" [drivers/infiniband/hw/cxgb3/iw_cxgb3.ko] undefined! -- ~Randy From swise at opengridcomputing.com Mon Feb 9 09:00:08 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 09 Feb 2009 11:00:08 -0600 Subject: [ofa-general] Re: linux-next: Tree for February 9 (infiniband) In-Reply-To: <49905F93.300@oracle.com> References: <20090209193908.1a448944.sfr@canb.auug.org.au> <49905F93.300@oracle.com> Message-ID: <49906118.3060801@opengridcomputing.com> Randy Dunlap wrote: > Stephen Rothwell wrote: > >> Hi all, >> >> [I accidentally deleted the merge and quilt-import logs today :-( - I >> wonder if any would have noticed :-). The merge summary still appears >> below.] 
>> >> Changes since 20090206: >> > > > allyesconfig build on i386 fails with: > > drivers/built-in.o: In function `iwch_sgl2pbl_map': > /usr/builds/linux-next-20090209/drivers/infiniband/hw/cxgb3/iwch_qp.c:237: undefined reference to `__umoddi3' > make: *** [.tmp_vmlinux1] Error 1 > > > or allmodconfig on i386 fails with: > > ERROR: "__umoddi3" [drivers/infiniband/hw/cxgb3/iw_cxgb3.ko] undefined! > > Somehow changing offset to a u64 must have caused this. What is __umoddi3? (it can't be good) :) Steve From randy.dunlap at oracle.com Mon Feb 9 09:01:11 2009 From: randy.dunlap at oracle.com (Randy Dunlap) Date: Mon, 09 Feb 2009 09:01:11 -0800 Subject: [ofa-general] Re: linux-next: Tree for February 9 (infiniband) In-Reply-To: <49906118.3060801@opengridcomputing.com> References: <20090209193908.1a448944.sfr@canb.auug.org.au> <49905F93.300@oracle.com> <49906118.3060801@opengridcomputing.com> Message-ID: <49906157.9090707@oracle.com> Steve Wise wrote: > Randy Dunlap wrote: >> Stephen Rothwell wrote: >> >>> Hi all, >>> >>> [I accidentally deleted the merge and quilt-import logs today :-( - I >>> wonder if any would have noticed :-). The merge summary still appears >>> below.] >>> >>> Changes since 20090206: >>> >> >> >> allyesconfig build on i386 fails with: >> >> drivers/built-in.o: In function `iwch_sgl2pbl_map': >> /usr/builds/linux-next-20090209/drivers/infiniband/hw/cxgb3/iwch_qp.c:237: >> undefined reference to `__umoddi3' >> make: *** [.tmp_vmlinux1] Error 1 >> >> >> or allmodconfig on i386 fails with: >> >> ERROR: "__umoddi3" [drivers/infiniband/hw/cxgb3/iw_cxgb3.ko] undefined! >> >> > > Somehow changing offset to a u64 must have caused this. What is > __umoddi3? (it can't be good) :) It's some kind of mod operation, like 64-bit % 32-bit or 64-bit % 64-bit. Should be in a fairly recent change. 
-- ~Randy From weiny2 at llnl.gov Mon Feb 9 09:04:01 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 9 Feb 2009 09:04:01 -0800 Subject: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: References: Message-ID: <20090209090401.3eac78a5.weiny2@llnl.gov> On Fri, 6 Feb 2009 14:47:17 -0500 Hal Rosenstock wrote: > On Fri, Feb 6, 2009 at 2:12 PM, Hal Rosenstock wrote: > > Hi, > > > > I'm looking at adding pkey support into the OpenSM vendor layer. The > > pkey table is a per-port structure and is part of ib_port_attr_t. That > > structure also includes num_pkeys. There is only one related API: > > osm_vendor_get_all_port_attr, which takes several pointers; the second > > one is a pointer to a preallocated array of port attributes (memory > > allocation for that is done by the client). ib_port_attr_t includes a > > pointer to the pkey table. So the only way this can work is if that > > allocation is also done by the client, which makes that a valid > > parameter on input (as well as output). Similarly for num_pkeys, so the > > vendor layer doesn't go past the end of the supplied table. So both > > num_pkeys and p_pkey_table in that struct need to be in/out > > parameters. num_pkeys could always be returned as the total number of > > pkeys for the port when num_pkeys is set to 0 on input. > > > > The same is true for the gid table in ib_port_attr_t. > > > > I'm also not sure which vendor layers are important. I don't see how > > to fix them all (e.g. osm_vendor_al.c is one; there are some others) > > as some of them appear to do a straight memory-to-memory copy of the > > ib_port_attr_t structure (others are OK and fixable). > > > > The only other alternative I see is to change this API and possibly > > this structure, which is way more disruptive and risky (especially with > > the inability to test anything but one of the vendor layers). > > Actually, although more disruptive, it might be cleaner (and safer in > the long run) to add to the vendor API.
There could be additional osm > vendor APIs for pkeys and gids, and these could return some suitable > IB_ error from ib_types in vendor layers where they are unimplemented. > IB_UNSUPPORTED looks good to me. I'm likely to head down this approach > unless I hear otherwise. This sounds more reasonable to me, better to suffer now than later... Ira > > -- Hal > > > Thoughts? > > > > -- Hal > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Mon Feb 9 09:16:08 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 19:16:08 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c : Added getguid function to console to generate a list of guid matching one or more regexps In-Reply-To: <4990542A.5040907@morey-chaisemartin.com> References: <498FE89C.2020304@ext.bull.net> <20090209151451.GI26139@sashak.voltaire.com> <4990542A.5040907@morey-chaisemartin.com> Message-ID: <20090209171608.GJ26139@sashak.voltaire.com> On 17:04 Mon 09 Feb , Nicolas Morey-Chaisemartin wrote: > > > Dumping port GUIDs is especially useful to generate config files. I've > never had the need to dump node GUIDs. If people need it, why not make a > global dump. > If not, it may be simpler to rename it to dump_portguids Sure, we can start this way. > > > > Try opensm/osm_indent (many places in the patch will be affected). > > > > > Last time I tried osm_indent, it introduced a lot of changes to the > code (even to code I didn't edit), so I haven't used it on my patches. You can extract the related changes by editing the diff file. In any case, osm_indent will give you an idea of how the code should be formatted. > I'll fix the indentation.
> >> + /* Ensure this SM is master (so we have the LFT) */ > >> + > >> + getguid_wait_init: > >> + if(osm_exit_flag) > >> + return; > >> + cl_spinlock_acquire(&p_osm->sm.state_lock); > >> + /* If the subnet struct is not properly initialized, we exit */ > >> + if(p_osm->sm.p_subn == NULL){ > >> + cl_spinlock_release(&p_osm->sm.state_lock); > >> + sleep(1); > >> + goto getguid_wait_init; > >> + } > >> > > > > The console is initialized after osm_subnet. When will the case > > (p_osm->sm.p_subn == NULL) be valid? > > > > > I didn't know that; I was just checking my pointers to be sure. > >> + if(p_osm->sm.p_subn->sm_state != IB_SMINFO_STATE_MASTER){ > >> + cl_spinlock_release(&p_osm->sm.state_lock); > >> + sleep(1); > >> + goto getguid_wait_init; > >> + } > >> > > > > This will cause an endless loop when OpenSM is in the Standby or Inactive > > states. > > > > > This is some code I used for another function that looks at the LFT table. It is not in mainline, right? > In that other case, I need the SM to be master. > I'll change it. > >> + cl_spinlock_release(&p_osm->sm.state_lock); > >> + if(p_osm->sm.p_subn->need_update != 0){ > >> + sleep(1); > >> + goto getguid_wait_init; > >> + } > >> > > > > Subnet discovery/setup could take some time. A user may want to use the > > console for other things in the meantime. I don't think that sleeping is > > suitable here; it would be better to print a "try later" message or something like that.
> > > > > See comment below > > > >> + p_next_port = (osm_port_t*)cl_qmap_head(p_port_guid_tbl); > >> + while (p_next_port != (osm_port_t*)cl_qmap_end(p_port_guid_tbl)) { > >> + > >> + p_port = p_next_port; > >> + p_next_port = (osm_port_t*)cl_qmap_next(&p_next_port->map_item); > >> + > >> + for(p_regexp = p_head_regexp;p_regexp!=NULL;p_regexp = p_regexp->next){ > >> + if(regexec(&(p_regexp->exp),p_port->p_node->print_desc,0,NULL,0) == 0){ > >> + fprintf(output,"0x%"PRIxLEAST64"\n",cl_ntoh64(p_port->p_physp->port_guid)); > >> + } > >> + } > >> + } > >> + > >> +CL_PLOCK_RELEASE(p_osm->sm.p_lock); > >> + if(output != out) > >> + fclose(output); > >> + if(exit_after_run) > >> + osm_exit_flag = 1; > >> > > > > Why this 'exit_after_run'? > > > > If you need functionality to exit OpenSM triggered from the console (though it > > is not clear to me why), use another command. > > > > > > For the last 2 comments, the purpose is to be able to easily script the > configuration file generation. We have netlist generation here and it's > much easier to be able to just do > echo "getguid exit_after_run file $dir/root_guid_file.txt root_sw" | > opensm ... Hmm, OpenSM's main purpose is quite different from just generating fabric statistics dumps :). If the only thing you need is a list of port GUIDs, you can parse 'ibnetdiscover' output - it will be much faster and not destructive (you can even find a trivial script in the ibsim tree - 'tests/get_all_ca_port_guids.sh'). And in any case, the "two command" approach can work via a pipe too: ( echo "getguid exit_after_run file $dir/root_guid_file.txt root_sw" ; \ echo "exit_opensm" ) | opensm ...
Sasha From kliteyn at dev.mellanox.co.il Mon Feb 9 10:36:01 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 09 Feb 2009 20:36:01 +0200 Subject: [ofa-general] Re: [PATCH] opensm/qos_config: no invalid option message on default values In-Reply-To: <20090208225412.GA24514@sashak.voltaire.com> References: <20090208225412.GA24514@sashak.voltaire.com> Message-ID: <49907791.7050905@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Don't complain about invalid QoS options when their default values are used. Looks good. This also fixes bug #1451: https://bugs.openfabrics.org/show_bug.cgi?id=1451 -- Yevgeny > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_subnet.c | 18 +++++++++--------- > 1 files changed, 9 insertions(+), 9 deletions(-) > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index 3324af9..69937c1 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -911,9 +911,11 @@ static ib_api_status_t osm_parse_prefix_routes_file(IN osm_subn_t * const p_subn > **********************************************************************/ > static void subn_verify_max_vls(unsigned *max_vls, const char *prefix, unsigned dflt) > { > - if (!(*max_vls) || *max_vls > 15) { > - log_report(" Invalid Cached Option: %s_max_vls=%u: " > - "Using Default = %u\n", prefix, *max_vls, dflt); > + if (!*max_vls || *max_vls > 15) { > + if (*max_vls) > + log_report(" Invalid Cached Option: %s_max_vls=%u: " > + "Using Default = %u\n", > + prefix, *max_vls, dflt); > *max_vls = dflt; > } > } > @@ -921,8 +923,10 @@ static void subn_verify_max_vls(unsigned *max_vls, const char *prefix, unsigned > static void subn_verify_high_limit(int *high_limit, const char *prefix, int dflt) > { > if (*high_limit < 0 || *high_limit > 255) { > - log_report(" Invalid Cached Option: %s_high_limit=%d: " > - "Using Default: %d\n", prefix, *high_limit, dflt); > + if (*high_limit > 255) > + log_report(" Invalid Cached Option:
%s_high_limit=%d: " > + "Using Default: %d\n", > + prefix, *high_limit, dflt); > *high_limit = dflt; > } > } > @@ -934,8 +938,6 @@ static void subn_verify_vlarb(char **vlarb, const char *prefix, > int count = 0; > > if (*vlarb == NULL) { > - log_report(" Invalid Cached Option: %s_vlarb_%s: " > - "Using Default\n", prefix, suffix); > *vlarb = strdup(dflt); > return; > } > @@ -1003,8 +1005,6 @@ static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char *dflt) > int count = 0; > > if (*sl2vl == NULL) { > - log_report(" Invalid Cached Option: %s_sl2vl: Using Default\n", > - prefix); > *sl2vl = strdup(dflt); > return; > } From kliteyn at dev.mellanox.co.il Mon Feb 9 10:43:42 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 09 Feb 2009 20:43:42 +0200 Subject: [ofa-general] Re: [PATCH] opensm/ftree: cleanup ftree_sw_tbl_element_t use In-Reply-To: <20090208230406.GC24514@sashak.voltaire.com> References: <20090208230406.GC24514@sashak.voltaire.com> Message-ID: <4990795E.3060504@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > cl_list() allocates memory needed for storing an object in the list - > no need for additional wrappers like ftree_sw_tbl_element_t. Looks good, thanks.
-- Yevgeny > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_ucast_ftree.c | 17 ++++------------- > 1 files changed, 4 insertions(+), 13 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > index 68900d8..10096c7 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -1418,7 +1418,6 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree) > ftree_tuple_t new_tuple; > uint32_t i; > cl_list_t bfs_list; > - ftree_sw_tbl_element_t *p_sw_tbl_element; > > OSM_LOG_ENTER(&p_ftree->p_osm->log); > > @@ -1465,14 +1464,10 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree) > */ > > cl_list_init(&bfs_list, cl_qmap_count(&p_ftree->sw_tbl)); > - cl_list_insert_tail(&bfs_list, > - &__osm_ftree_sw_tbl_element_create(p_sw)->map_item); > + cl_list_insert_tail(&bfs_list, p_sw); > > while (!cl_is_list_empty(&bfs_list)) { > - p_sw_tbl_element = > - (ftree_sw_tbl_element_t *) cl_list_remove_head(&bfs_list); > - p_sw = p_sw_tbl_element->p_sw; > - __osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element); > + p_sw = (ftree_sw_t *) cl_list_remove_head(&bfs_list); > > /* Discover all the nodes from ports that are pointing down */ > > @@ -1509,9 +1504,7 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree) > new_tuple); > > /* add the newly discovered switch to the BFS queue */ > - cl_list_insert_tail(&bfs_list, > - &__osm_ftree_sw_tbl_element_create > - (p_remote_sw)->map_item); > + cl_list_insert_tail(&bfs_list, p_sw); > } > /* Done assigning indexes to all the remote switches > that are pointed by the downgoing ports. 
> @@ -1547,9 +1540,7 @@ static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree) > p_remote_sw, > new_tuple); > /* add the newly discovered switch to the BFS queue */ > - cl_list_insert_tail(&bfs_list, > - &__osm_ftree_sw_tbl_element_create > - (p_remote_sw)->map_item); > + cl_list_insert_tail(&bfs_list, p_sw); > } > /* Done assigning indexes to all the remote switches > that are pointed by the upgoing ports. From sashak at voltaire.com Mon Feb 9 11:23:26 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 21:23:26 +0200 Subject: [ofa-general] Re: [RFC] OpenSM vendor layer In-Reply-To: <20090209090401.3eac78a5.weiny2@llnl.gov> References: <20090209090401.3eac78a5.weiny2@llnl.gov> Message-ID: <20090209192326.GK26139@sashak.voltaire.com> On 09:04 Mon 09 Feb , Ira Weiny wrote: > > > > Actually, although more disruptive, it might be cleaner (and safer in > > the long run) to add to the vendor API. There could be additional osm > > vendor APIs for pkeys and gids and these could return some suitable > > IB_ error from ib_types in vendor layers where they are unimplemented. > > IB_UNSUPPORTED looks good to me. I'm likely to head down this approach > > unless I hear otherwise. > > This sounds more reasonable to me, better to suffer now than later... I don't see how it is "safer" in the long run than just extending the existing API. Adding new APIs now will require adding them to the other vendor implementations as well (without any actual possibility to test them :( ). Extending osm_vendor_get_all_port_attr() only requires fixing the port_array initializations (I guess there are 3-5 places in total in the opensm and ibutils trees), and the other vendor implementations will automatically work as "unsupported" - no pkey table will be returned. Not to mention that by following this approach we open the way to adding various new "doesn't make sense" API calls for each port/whatever attribute....
:) Sasha From dotanba at gmail.com Mon Feb 9 11:25:32 2009 From: dotanba at gmail.com (Dotan Barak) Date: Mon, 09 Feb 2009 21:25:32 +0200 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** non zero lkey in send(), write() with num_sge > 1? In-Reply-To: <661509.82751.qm@web111205.mail.gq1.yahoo.com> References: <661509.82751.qm@web111205.mail.gq1.yahoo.com> Message-ID: <4990832C.5090204@gmail.com> Ofed User wrote: > Hi, > > Can stack pass num_sge > 1, and lkey !=0 as part of sg_list[] elements, in post_send() call? > What are you trying to achieve? If num_sge > 1, the HCA will read the blocks pointed to by the sg_list one by one and validate that each address + size lies inside a valid Memory Region whose local key is the lkey. So I guess the answer is: Yes. Dotan From sashak at voltaire.com Mon Feb 9 11:44:19 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 9 Feb 2009 21:44:19 +0200 Subject: [ofa-general] Re: [PATCH v3] opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation In-Reply-To: <49905202.3050406@ext.bull.net> References: <49905202.3050406@ext.bull.net> Message-ID: <20090209194419.GL26139@sashak.voltaire.com> On 16:55 Mon 09 Feb , Nicolas Morey Chaisemartin wrote: > This patch fixes a bug in index port incrementation in the fat-tree > algorithm. > The problem happens (at least) with a 4-level fat tree as below: > > > L3 L3 > ___________________|__|____________________ > / / \ \ <= All > the L2 are connected on 2 L3 switches > L2-1 L2-2 L2-1 L2-2 > / / \ \ <== The > Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning. > L1 L1 L1 L1 > /|\ /|\ /|\ /|\ > ==Fully mixed to L1== ==Fully mixed to L1== <=== We have > multiple sets. In each set, all L0 lead to all L1 of their set. > > L0 L0 L0 L0 > / \ / \ / \ / \ > CN CN .. CN CN .... CN CN .. CN CN > > > To detail: > We have a bunch of sets. Each set contains compute nodes, L0 and L1 > switches. > Plus a common top of L2 and L3 switches.
> > In each set, there are groups of compute nodes. Each group is connected to > a single L0 switch. > In a given set, all L0 are connected to all L1. > > The Nth L1 of a set is connected to the Nth L2 and only to this one (so > through an L2, the Nth L1 can only see the Nth L1 of the other sets). > All the L2 are connected to a couple of L3. > > > If we don't add the L3 switches, we have a perfectly balanced fat tree and well > balanced routes. > But when we add the L3, it introduces a huge difference. As it is not > necessary, no route goes through the L3 (which is fine). > However, 1/4 of the L2->L1 routes are not used at all, 1/2 are half used and 1/4 > are used twice as much (compared to the balanced state). > > This comes from down_port_groups_idx, which is incremented each time the > algorithm goes down through a node, whether or not it creates routes to HCAs (port > != switch). As a route coming up from an L1 reaches only one L2, the algorithm > goes through all the other L2 while going down, incrementing their index. > Our case here is a bit specific, but in any case where your L1 doesn't have > full connectivity to all your L2 and there is another switch rank above, the > problem may appear. > > To avoid this problem, the __osm_ftree_fabric_route_upgoing_by_going_down > function has been changed so that it returns a value indicating whether routes to > HCAs (in fact to leaf switches) were created. > With this information, we only increase the index when the algorithm has > created routes to HCAs. > After applying this patch and measuring the link usage, we are perfectly > balanced (L2<->L3 links are still not used, but that is to be expected). > > Signed-off-by: Nicolas Morey-Chaisemartin > Applied. Thanks. Sasha From stan.smith at intel.com Mon Feb 9 13:16:33 2009 From: stan.smith at intel.com (Smith, Stan) Date: Mon, 9 Feb 2009 13:16:33 -0800 Subject: [ofa-general] RE: [ofw] Re: saquery & osm vendor IBAL - ca_names missing from osm_vendor_t ?
In-Reply-To: <498F5E7B.6020208@dev.mellanox.co.il> References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <498F5A8F.2000101@dev.mellanox.co.il> <498F5E7B.6020208@dev.mellanox.co.il> Message-ID: <3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com> Hello all, My initial query, which was unfortunately muddled by my confusion between the mad and umad interfaces, was asking whether it is permissible for the Windows OpenSM vendor-ibal to have a dependence on umad. In order for the OFED saquery to work correctly in the Windows environment, the saquery code expects the osm_vendor_t struct to have the embedded ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN] definition. This definition creates two umad dependencies: 1) #define UMAD_MAX_DEVICES + #define UMAD_CA_NAME_LEN 2) a umad_get_cas_names() call to populate the osm_vendor_t.ca_names array. The current version of OpenSM vendor_umad already has the umad dependency, so it seemed reasonable to introduce this dependency in Windows OpenSM. This OpenSM change is considered temporary until such time as we find a Windows OpenSM maintainer who has cycles to move Windows OpenSM forward to the current OFED OpenSM code base; Ishai has stated you are unavailable due to other project responsibilities. Much of Sean's WinVerbs/WinMAD/libmad/libumad Windows work provides the infrastructure needed to make porting the latest OFED OpenSM much easier. I see there is an svn branch where someone is working on OpenSM? Any ideas as to what's going on there? My proposal for the Windows OpenSM code base is to add ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN] to the OpenSM vendor-ibal definition of osm_vendor_t, plus a call to umad_get_cas_names() to populate the osm_vendor_t.ca_names array. Comments? Thanks, Stan. Yevgeny Kliteynik wrote: > Yevgeny Kliteynik wrote: >> Hi Stan, > > Oops... Looks like I was having a problem with my mail client. > By now my response is partially outdated...
> > -- Yevgeny > >> Adding Sasha (OFED management maintainer) >> and the openib mailing list. >> >> Stan C. Smith wrote: >>> Hello, >>> The Windows OpenSM vendor AL struct 'osm_vendor_t' (defined in >>> opensm\user\include\vendor\osm_vendor_al.h) is missing >>> the entry 'ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN]'. >>> saquery.c expects to find ca_names in osm_vendor_t. >>> >>> A couple of observations: >>> 1) Windows currently supports a much older version of opensm than >>> what OFED 1.4 tools expect. >> >> Correct. Windows OpenSM is a ported pre-OFED 1.2 OpenSM with a couple >> of minor fixes. >> >>> 2) saquery uses OpenSM mad interfaces while it 'could' be using >>> libibmad interfaces. >> >> By "OpenSM mad interfaces", do you mean libosmvendor? >> >>> If libibmad from saquery, then OpenSM would not need libibmad >>> references for Windows. >> >> Not sure what you mean here. Do you mean removing the libibmad dependency >> from saquery? >> >>> 3) How bad is it to create libibmad dependencies for OpenSM? >> >> Pretty bad. I don't think we should add a new dependency unless >> there's a really good reason for it. >> >>> 4) saquery.c is the only diag program (so far) that uses OpenSM MAD >>> interfaces; the rest use libibmad. >>> >>> Most of the OFED diagnostic tools support the cmd-line option '-C >>> ca_name'. This cmd-line input is resolved through >>> libibmad interfaces. >>> saquery is no exception in that it expects to match the '-C ca_name' >>> against osm_vendor_t.ca_names[]. 'ibstat -l' lists >>> CA names. >>> >>> The question becomes how best to resolve the missing ca_names? >>> >>> 1) modify saquery to call a libibmad interface to get CA names; >>> this leaves osm_vendor_t unmodified. So far, saquery is the only diag >>> program that uses OSM mad interfaces, expecting ca_names in >>> osm_vendor_t. >>> >>> 2) Modify the OpenSM vendor AL osm_vendor_t struct to include CA names >>> and populate ca_names from OpenSM code? >> >> I'd say that this option is much better.
>> >>> Creates libibmad dependencies for opensm. >> >> But it doesn't have to. Can IBAL expose some function to get these >> names, so that Win osmvendor will use this function instead of >> libibmad? >> >> Also, Linux osmvendor doesn't have libibmad dependency. >> It uses umad function umad_get_cas_names() to obtain the CA names. >> I know that there is a Windows version of umad, but I don't know >> what is its status. If we *have* to add an additional dependency, >> then it should be libibumad and not libibmad. >> >> At some point in the future we would really want to have the new >> version of OFED OpenSM ported to WinOF. If there will be a match >> between Linux and Windows libraries, then the whole vendor concept >> can be simplified and there won't be a need to have a separate >> vendor for IBAL. The things >> that would be different are platform-dependent issues like threads, >> locks, syslog, but not IB-related code. >> >> -- Yevgeny >> >> >>> Comments? >>> >>> Thanks, >>> >>> Stan. >>> >>> >>> >>> >> >> _______________________________________________ >> ofw mailing list >> ofw at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw From kliteyn at dev.mellanox.co.il Mon Feb 9 14:46:39 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 10 Feb 2009 00:46:39 +0200 Subject: [ofa-general] Re: [PATCH] opensm/ftree: simplify root guids setup. In-Reply-To: <20090208230830.GD24514@sashak.voltaire.com> References: <20090208230830.GD24514@sashak.voltaire.com> Message-ID: <4990B24F.2070804@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Eliminate root_guid_list storage - parse it directly to bfs list. Looks good, thanks. 
-- Yevgeny > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_ucast_ftree.c | 101 +++++++++++++------------------------- > 1 files changed, 35 insertions(+), 66 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > index 10096c7..35f2ea1 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -100,11 +100,6 @@ struct ftree_fabric_t_; > typedef uint8_t ftree_tuple_t[FTREE_TUPLE_LEN]; > typedef uint64_t ftree_tuple_key_t; > > -struct guid_list_item { > - cl_list_item_t list; > - uint64_t guid; > -}; > - > /*************************************************** > ** > ** ftree_sw_table_element_t definition > @@ -203,7 +198,6 @@ typedef struct ftree_fabric_t_ { > cl_qmap_t hca_tbl; > cl_qmap_t sw_tbl; > cl_qmap_t sw_by_tuple_tbl; > - cl_qlist_t root_guid_list; > cl_qmap_t cn_guid_tbl; > unsigned cn_num; > uint8_t leaf_switch_rank; > @@ -886,8 +880,6 @@ static ftree_fabric_t *__osm_ftree_fabric_create() > cl_qmap_init(&p_ftree->sw_by_tuple_tbl); > cl_qmap_init(&p_ftree->cn_guid_tbl); > > - cl_qlist_init(&p_ftree->root_guid_list); > - > return p_ftree; > } > > @@ -953,10 +945,6 @@ static void __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree) > } > cl_qmap_remove_all(&p_ftree->cn_guid_tbl); > > - /* remove all the elements of root_guid_list */ > - while (!cl_is_qlist_empty(&p_ftree->root_guid_list)) > - free(cl_qlist_remove_head(&p_ftree->root_guid_list)); > - > /* free the leaf switches array */ > if ((p_ftree->leaf_switches_num > 0) && (p_ftree->leaf_switches)) > free(p_ftree->leaf_switches); > @@ -3045,16 +3033,41 @@ Exit: > > /*************************************************** > ***************************************************/ > +struct rank_root_cxt { > + ftree_fabric_t *fabric; > + cl_list_t *list; > +}; > + > +static int rank_root_sw_by_guid(void *cxt, uint64_t guid, char *p) > +{ > + struct rank_root_cxt *c = cxt; > + ftree_sw_t *sw; > + > + sw = 
__osm_ftree_fabric_get_sw_by_guid(c->fabric, cl_hton64(guid)); > + if (!sw) { > + /* the specified root guid wasn't found in the fabric */ > + OSM_LOG(&c->fabric->p_osm->log, OSM_LOG_ERROR, "ERR AB24: " > + "Root switch GUID 0x%" PRIx64 " not found\n", guid); > + return 0; > + } > + > + OSM_LOG(&c->fabric->p_osm->log, OSM_LOG_DEBUG, > + "Ranking root switch with GUID 0x%" PRIx64 "\n", guid); > + sw->rank = 0; > + cl_list_insert_tail(c->list, sw); > + > + return 0; > +} > > static int __osm_ftree_fabric_rank_from_roots(IN ftree_fabric_t * p_ftree) > { > + struct rank_root_cxt context; > osm_node_t *p_osm_node; > osm_node_t *p_remote_osm_node; > osm_physp_t *p_osm_physp; > ftree_sw_t *p_sw; > ftree_sw_t *p_remote_sw; > cl_list_t ranking_bfs_list; > - struct guid_list_item *item; > int res = 0; > unsigned num_roots; > unsigned max_rank = 0; > @@ -3064,25 +3077,16 @@ static int __osm_ftree_fabric_rank_from_roots(IN ftree_fabric_t * p_ftree) > cl_list_init(&ranking_bfs_list, 10); > > /* Rank all the roots and add them to list */ > - for (item = (void *)cl_qlist_head(&p_ftree->root_guid_list); > - item != (void *)cl_qlist_end(&p_ftree->root_guid_list); > - item = (void *)cl_qlist_next(&item->list)) { > - p_sw = > - __osm_ftree_fabric_get_sw_by_guid(p_ftree, > - cl_hton64(item->guid)); > - if (!p_sw) { > - /* the specified root guid wasn't found in the fabric */ > - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB24: " > - "Root switch GUID 0x%" PRIx64 " not found\n", > - item->guid); > - continue; > - } > + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > + "Fetching root nodes from file %s\n", > + p_ftree->p_osm->subn.opt.root_guid_file); > > - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > - "Ranking root switch with GUID 0x%" PRIx64 "\n", > - item->guid); > - p_sw->rank = 0; > - cl_list_insert_tail(&ranking_bfs_list, p_sw); > + context.fabric = p_ftree; > + context.list = &ranking_bfs_list; > + if (parse_node_map(p_ftree->p_osm->subn.opt.root_guid_file, > + 
rank_root_sw_by_guid, &context)) { > + res = -1; > + goto Exit; > } > > num_roots = cl_list_count(&ranking_bfs_list); > @@ -3314,21 +3318,6 @@ Exit: > > /*************************************************** > ***************************************************/ > -static int add_guid_item_to_list(void *cxt, uint64_t guid, char *p) > -{ > - cl_qlist_t *list = cxt; > - struct guid_list_item *item; > - > - item = malloc(sizeof(*item)); > - if (!item) > - return -1; > - > - item->guid = guid; > - cl_qlist_insert_tail(list, &item->list); > - > - return 0; > -} > - > static int add_guid_item_to_map(void *cxt, uint64_t guid, char *p) > { > cl_qmap_t *map = cxt; > @@ -3350,26 +3339,6 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree) > > OSM_LOG_ENTER(&p_ftree->p_osm->log); > > - if (__osm_ftree_fabric_roots_provided(p_ftree)) { > - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > - "Fetching root nodes from file %s\n", > - p_ftree->p_osm->subn.opt.root_guid_file); > - > - if (parse_node_map(p_ftree->p_osm->subn.opt.root_guid_file, > - add_guid_item_to_list, > - &p_ftree->root_guid_list)) { > - status = -1; > - goto Exit; > - } > - > - if (!cl_qlist_count(&p_ftree->root_guid_list)) { > - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB22: " > - "Root guids file has no valid guids\n"); > - status = -1; > - goto Exit; > - } > - } > - > if (__osm_ftree_fabric_cns_provided(p_ftree)) { > OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, > "Fetching compute nodes from file %s\n", From sashak at voltaire.com Mon Feb 9 15:54:14 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 10 Feb 2009 01:54:14 +0200 Subject: [ofa-general] Re: [ofw] Re: saquery & osm vendor IBAL - ca_names missing from osm_vendor_t ? 
In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com> References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <498F5A8F.2000101@dev.mellanox.co.il> <498F5E7B.6020208@dev.mellanox.co.il> <3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com> Message-ID: <20090209235414.GM26139@sashak.voltaire.com> Hello Stan, On 13:16 Mon 09 Feb , Smith, Stan wrote: > > My proposal for the Windows OpenSM code base is to add ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN] to OpenSM vendor-ibal definition of osm_vendor_t and a call to umad_get_cas_names() to populate the osm_vendor_t.ca_names struct for. > > Comments? Assuming WinOF already has libibumad implementation with preserved API would it be reasonable to switch from vendor-ibal to vendor-ibumad in WinOF? Sasha From sean.hefty at intel.com Mon Feb 9 15:55:13 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 9 Feb 2009 15:55:13 -0800 Subject: [ofa-general] RE: [ofw] Re: saquery & osm vendor IBAL - ca_names missing from osm_vendor_t ? In-Reply-To: <20090209235414.GM26139@sashak.voltaire.com> References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <498F5A8F.2000101@dev.mellanox.co.il> <498F5E7B.6020208@dev.mellanox.co.il> <3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com> <20090209235414.GM26139@sashak.voltaire.com> Message-ID: <2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com> >Assuming WinOF already has libibumad implementation with preserved API >would it be reasonable to switch from vendor-ibal to vendor-ibumad in >WinOF? WinOF does have a libibumad implementation, plus libibmad ports between the two platforms. The saquery code needs structure definitions for the various attributes, so using libibmad may be a better choice. Changing saquery didn't look that hard to me, but it did look like it would modify a fair portion of the code. 
- Sean From sashak at voltaire.com Mon Feb 9 16:19:42 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 10 Feb 2009 02:19:42 +0200 Subject: [ofa-general] Re: [ofw] Re: saquery & osm vendor IBAL - ca_names missing from osm_vendor_t ? In-Reply-To: <2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com> References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <498F5A8F.2000101@dev.mellanox.co.il> <498F5E7B.6020208@dev.mellanox.co.il> <3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com> <20090209235414.GM26139@sashak.voltaire.com> <2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com> Message-ID: <20090210001935.GP26139@sashak.voltaire.com> On 15:55 Mon 09 Feb , Sean Hefty wrote: > >Assuming WinOF already has libibumad implementation with preserved API > >would it be reasonable to switch from vendor-ibal to vendor-ibumad in > >WinOF? > > WinOF does have a libibumad implementation, plus libibmad ports between the two > platforms. The saquery code needs structure definitions for the various > attributes, so using libibmad may be a better choice. I agree, for "saquery" specific case it is better to cleanup osm_vendor there (as we discussed already). My question above was about OpenSM itself, not for purpose of saquery serving. > Changing saquery didn't > look that hard to me, but it did look like it would modify a fair portion of the > code. I guess so. Sasha From stan.smith at intel.com Mon Feb 9 16:34:28 2009 From: stan.smith at intel.com (Smith, Stan) Date: Mon, 9 Feb 2009 16:34:28 -0800 Subject: [ofa-general] RE: [ofw] Re: saquery & osm vendor IBAL - ca_names missing from osm_vendor_t ? 
In-Reply-To: <20090209235414.GM26139@sashak.voltaire.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <498F5A8F.2000101@dev.mellanox.co.il> <498F5E7B.6020208@dev.mellanox.co.il> <3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com> <20090209235414.GM26139@sashak.voltaire.com>
Message-ID: <3F6F638B8D880340AB536D29CD4C1E1931817F0D@orsmsx501.amr.corp.intel.com>

Sasha Khapyorsky wrote:
> Hello Stan,
>
> On 13:16 Mon 09 Feb , Smith, Stan wrote:
>>
>> My proposal for the Windows OpenSM code base is to add
>> ca_names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN] to OpenSM vendor-ibal
>> definition of osm_vendor_t and a call to umad_get_cas_names() to
>> populate the osm_vendor_t.ca_names struct for.
>>
>> Comments?
>
> Assuming WinOF already has libibumad implementation with preserved API
> would it be reasonable to switch from vendor-ibal to vendor-ibumad in
> WinOF?
>
> Sasha

Hello,

Path-of-least-resistance thinking points toward not switching now, since the move from vendor-ibal to vendor-ibumad would be part of the Windows OpenSM migration to OFED 1.4x OpenSM anyway.

My view is that switching to vendor-ibumad would be a good deal more work just to get saquery working. Not knowing the Windows OpenSM code base, moving only part of it forward seems like a task that could blossom into much more work for the small return of a working saquery.

Frankly, I'd rather see the effort put into getting OFED OpenSM 1.4 working on Windows.

Just my $0.02 worth.

Stan.
From sumeet.lahorani at oracle.com Mon Feb 9 16:41:59 2009
From: sumeet.lahorani at oracle.com (Sumeet Lahorani)
Date: Mon, 09 Feb 2009 16:41:59 -0800
Subject: [ofa-general] Enabling IP_CM warns about multicast packet drops
Message-ID: <4990CD57.3080108@oracle.com>

When we enable IB connected mode and increase the MTU to 65520, we see the following in /var/log/messages:

Feb 6 17:48:32 dadzab01 kernel: ib0: enabling connected mode will cause multicast packet drops
Feb 6 17:48:32 dadzab01 kernel: ib0: mtu > 2044 will cause multicast packet drops.
Feb 6 17:48:32 dadzab01 kernel: ib1: enabling connected mode will cause multicast packet drops
Feb 6 17:48:32 dadzab01 kernel: ib1: mtu > 2044 will cause multicast packet drops.

Should we not be doing this? What kind of multicast packets will be dropped? If we are not using multicast ourselves, do any OFED drivers (bonding, IPoIB, etc.) use multicast internally in a way that will stop them from working correctly in connected mode?

We are using OFED 1.3.1.

- Sumeet

From sean.hefty at intel.com Mon Feb 9 18:55:05 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 9 Feb 2009 18:55:05 -0800
Subject: [ofa-general] RE: svn.1936 commits
In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E1931817FC1@orsmsx501.amr.corp.intel.com>
References: <3F6F638B8D880340AB536D29CD4C1E1931817FC1@orsmsx501.amr.corp.intel.com>
Message-ID:

I don't see that my original post ever went out.

>Ulp\libibmad\include\infiniband\mad.h
>
>Added MAD_EXPORT for xdump & smp_query_via needed by ibtracert & ibroute.

Changes to libibmad need to go through the management.git tree.  The mirror in SVN will be replaced with all upstream code.
>Signed off by stan.smith at intel.com
>
>diff U3 C:/Documents and Settings/scsmith/Local Settings/Temp/mad.h-revBASE.svn000.tmp.h C:/Documents and Settings/scsmith/My Documents/openIB-windows/SVN/gen1/trunk/ulp/libibmad/include/infiniband/mad.h
>--- C:/Documents and Settings/scsmith/Local Settings/Temp/mad.h-revBASE.svn000.tmp.h	Mon Feb 09 16:36:46 2009
>+++ C:/Documents and Settings/scsmith/My Documents/openIB-windows/SVN/gen1/trunk/ulp/libibmad/include/infiniband/mad.h	Mon Feb 09 15:55:43 2009
>@@ -710,7 +710,7 @@
> 			 unsigned mod, unsigned timeout);
> MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid,
> 			 unsigned mod, unsigned timeout);
>-uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
>+MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid,
> 			 unsigned mod, unsigned timeout, const void *srcport);
> uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod,
> 		     unsigned timeout, const void *srcport);
>@@ -837,7 +837,7 @@
> 		exit(-1); \
> 	} while(0)
> 
>-void xdump(FILE * file, char *msg, void *p, int size);
>+MAD_EXPORT void xdump(FILE * file, char *msg, void *p, int size);
> 
> END_C_DECLS
> #endif /* _MAD_H_ */

- Sean

From stan.smith at intel.com Mon Feb 9 19:45:42 2009
From: stan.smith at intel.com (Smith, Stan)
Date: Mon, 9 Feb 2009 19:45:42 -0800
Subject: [ofa-general] RE: svn.1936 commits
In-Reply-To:
References: <3F6F638B8D880340AB536D29CD4C1E1931817FC1@orsmsx501.amr.corp.intel.com>
Message-ID: <3F6F638B8D880340AB536D29CD4C1E1931818024@orsmsx501.amr.corp.intel.com>

Hefty, Sean wrote:
> I don't see that my original post ever went out.
>
>> Ulp\libibmad\include\infiniband\mad.h
>>
>> Added MAD_EXPORT for xdump & smp_query_via needed by ibtracert &
>> ibroute.
>
> Changes to libibmad need to go through the management.git tree.  The
> mirror in SVN will be replaced with all upstream code.

Yes I understand this.
The note was to inform you that changes need to be pushed back to the git tree. > >> Signed off by stan.smith at intel.com >> >> diff U3 C:/Documents and Settings/scsmith/Local Settings/Temp/mad.h- >> revBASE.svn000.tmp.h C:/Documents and Settings/scsmith/My >> Documents/openIB- >> windows/SVN/gen1/trunk/ulp/libibmad/include/infiniband/mad.h --- >> C:/Documents and Settings/scsmith/Local Settings/Temp/mad.h- >> revBASE.svn000.tmp.h Mon Feb 09 16:36:46 2009 +++ C:/Documents >> and Settings/scsmith/My Documents/openIB- >> windows/SVN/gen1/trunk/ulp/libibmad/include/infiniband/mad.h Mon >> Feb 09 15:55:43 2009 @@ -710,7 +710,7 >> @@ unsigned mod, unsigned timeout); >> MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned >> attrid, unsigned mod, unsigned timeout); >> -uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, >> +MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, >> unsigned attrid, unsigned mod, unsigned >> timeout, const void *srcport); >> uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, >> unsigned mod, unsigned timeout, const void >> *srcport); @@ -837,7 +837,7 @@ exit(-1); \ >> } while(0) >> >> -void xdump(FILE * file, char *msg, void *p, int size); >> +MAD_EXPORT void xdump(FILE * file, char *msg, void *p, int size); >> >> END_C_DECLS >> #endif /* _MAD_H_ */ > > - Sean From ofedrnicuser at yahoo.com Mon Feb 9 21:22:40 2009 From: ofedrnicuser at yahoo.com (Bill N) Date: Mon, 9 Feb 2009 21:22:40 -0800 (PST) Subject: ***SPAM*** Re: [ofa-general] non zero lkey in send(), write() with num_sge > 1? In-Reply-To: <4990832C.5090204@gmail.com> Message-ID: <809230.93598.qm@web111213.mail.gq1.yahoo.com> > > Can stack pass num_sge > 1, and lkey !=0 as part of > sg_list[] elements, in post_send() call? > > > What are you trying to achieve? [Bill] I just wanted to confirm, that even when Stag !=0, (a) there can be multiple SGEs in the list with different lkey and TO. 
And (b) HCAs have to validate each of the SGE entry against the lkey. Want to ensure that - As RDMA ULP I can invoke post_send() with multiple lkeys and utilize the allocated MRs, HCAs are designed to handle that. Any example ULP we are aware of that does this? Regards, Bill --- On Mon, 2/9/09, Dotan Barak wrote: > From: Dotan Barak > Subject: Re: [ofa-general] ***SPAM*** non zero lkey in send(), write() with num_sge > 1? > To: "Ofed User" > Cc: "OFED General" > Date: Monday, February 9, 2009, 7:25 PM > Ofed User wrote: > > Hi, > > > If num_sge > 1 => the HCA will try to read the blocks > pointed by the > sg_list one by one and validate that the address + size is > inside a valid > Memory Region which its local key is the lkey. > > Then i guess that the answer is: Yes. > > Dotan From sean.hefty at intel.com Mon Feb 9 23:02:34 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 9 Feb 2009 23:02:34 -0800 Subject: [ofa-general] RE: svn.1936 commits In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E1931818024@orsmsx501.amr.corp.intel.com> References: <3F6F638B8D880340AB536D29CD4C1E1931817FC1@orsmsx501.amr.corp.intel.com> <3F6F638B8D880340AB536D29CD4C1E1931818024@orsmsx501.amr.corp.intel.com> Message-ID: <7507A78ACA634E9A9AE3CB694B629246@amr.corp.intel.com> >> Changes to libibmad need to go through the management.git tree. The >> mirror in SVN will be replaced with all upstream code. > >Yes I understand this. >The note was to inform you that changes need to be pushed back to the git tree. I am asking that all changes be submitted through to the main git tree first, especially for changes that hit the SVN trunk. I do not want to try to keep diverging trees in sync. As for the patch, commits should at least by reviewed by the maintainer before they are committed. WinOF has been very lax about this practice. And the subject should be more detailed than 'svn 1936 commits'. 
- Sean

From nicolas.morey-chaisemartin at ext.bull.net Mon Feb 9 23:49:09 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Tue, 10 Feb 2009 08:49:09 +0100
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Fixed bad init value for down port index
Message-ID: <49913175.609@ext.bull.net>

We have to add the modulus (the number of down port groups) to the index before taking the modulo; otherwise we get a value of -1, which makes OpenSM segfault.

Signed-off-by: Nicolas Morey-Chaisemartin
---

I missed this one in my previous patch. Sorry about that.

 opensm/opensm/osm_ucast_ftree.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: c02f9ea241a7150d1cb1c9846408feeeeb4ef024.diff
Type: text/x-patch
Size: 591 bytes
Desc: not available
URL:

From nicolas.morey-chaisemartin at ext.bull.net Tue Feb 10 00:08:01 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Tue, 10 Feb 2009 09:08:01 +0100
Subject: [ofa-general] [PATCH v2] opensm/osm_console.c : Added dump_portguid function to console to generate a list of port guids matching one or more regexps
Message-ID: <499135E1.1080307@ext.bull.net>

This adds a dump_portguid function to the OpenSM console. It makes it easy to generate cn_guid_file, root_guid_file and similar files by dumping to a file all port GUIDs whose node description matches at least one of the provided regexps.

Signed-off-by: Nicolas Morey-Chaisemartin
---

Reposted without the exit_after_run flag and the active-sleep init loop, and properly indented.

 opensm/opensm/osm_console.c |  105 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 105 insertions(+), 0 deletions(-)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: a72ba8239575ad93b59015c9c4c1a0c8020d0db7.diff
Type: text/x-patch
Size: 3613 bytes
Desc: not available
URL:

From kliteyn at dev.mellanox.co.il Tue Feb 10 00:59:31 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 10 Feb 2009 10:59:31 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Fixed bad init value for down port index
In-Reply-To: <49913175.609@ext.bull.net>
References: <49913175.609@ext.bull.net>
Message-ID: <499141F3.9020001@dev.mellanox.co.il>

Hi Nicolas,

Nicolas Morey Chaisemartin wrote:
> We have to add the module value to the index before actually doing the
> module, or we get a value of -1 which makes OpenSM segfaults
>
> Signed-off-by: Nicolas Morey-Chaisemartin
>
> ---
>
> I missed this one in my previous patch. Sorry for that
>
>  opensm/opensm/osm_ucast_ftree.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
>
>
> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index 4e65c87..c8f5f08 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -1921,7 +1921,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  		return FALSE;
>  
>  	/* foreach down-going port group (in indexing order) */
> -	i = p_sw->down_port_groups_idx;
> +	i = (p_sw->down_port_groups_idx +
> +	     p_sw->down_port_groups_num) % p_sw->down_port_groups_num;

Perhaps it would be simpler just to init the down_port_groups_idx to 0
instead of -1?
Something like this: diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 4e65c87..eae1ed8 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN ftree_fabric_t * p_ftree, /* initialize lft buffer */ memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); - p_sw->down_port_groups_idx = -1; + p_sw->down_port_groups_idx = 0; return p_sw; } /* __osm_ftree_sw_create() */ From nicolas.morey-chaisemartin at ext.bull.net Tue Feb 10 01:03:44 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Tue, 10 Feb 2009 10:03:44 +0100 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Fixed bad init value for down port index In-Reply-To: <499141F3.9020001@dev.mellanox.co.il> References: <49913175.609@ext.bull.net> <499141F3.9020001@dev.mellanox.co.il> Message-ID: <499142F0.8000803@ext.bull.net> Yevgeny Kliteynik wrote: > Hi Nicolas, > > Nicolas Morey Chaisemartin wrote: >> We have to add the module value to the index before actually doing >> the module, or we get a value of -1 which makes OpenSM segfaults >> >> Signed-off-by: Nicolas Morey-Chaisemartin >> >> --- >> >> I missed this one in my previous patch. 
Sorry for that >> >> opensm/opensm/osm_ucast_ftree.c | 3 ++- >> 1 files changed, 2 insertions(+), 1 deletions(-) >> >> >> >> diff --git a/opensm/opensm/osm_ucast_ftree.c >> b/opensm/opensm/osm_ucast_ftree.c >> index 4e65c87..c8f5f08 100644 >> --- a/opensm/opensm/osm_ucast_ftree.c >> +++ b/opensm/opensm/osm_ucast_ftree.c >> @@ -1921,7 +1921,8 @@ >> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * >> p_ftree, >> return FALSE; >> >> /* foreach down-going port group (in indexing order) */ >> - i = p_sw->down_port_groups_idx; >> + i = (p_sw->down_port_groups_idx + >> + p_sw->down_port_groups_num) % p_sw->down_port_groups_num; > > Perhaps it would be simpler just to init the down_port_groups_idx to 0 > instead of -1? > Something like this: > > diff --git a/opensm/opensm/osm_ucast_ftree.c > b/opensm/opensm/osm_ucast_ftree.c > index 4e65c87..eae1ed8 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN > ftree_fabric_t * p_ftree, > /* initialize lft buffer */ > memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); > > - p_sw->down_port_groups_idx = -1; > + p_sw->down_port_groups_idx = 0; > > return p_sw; > } /* __osm_ftree_sw_create() */ > > > > Sure. I wanted to ensure that whatever happens to the index it would always be in the right interval but after checking I doubt anything else than initialization could set it outside its normal interval. Do you want me to make the patch and send it or will you just push yours? 
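The concern about keeping the index "in the right interval" comes down to how C's `%` treats a negative left operand. Since C99, integer division truncates toward zero, so `-1 % n` is -1 rather than `n - 1`; adding the group count first, as the original patch does, maps a cursor of -1 to the last group, while the v2 patch sidesteps the case by starting the cursor at 0. The following is a minimal standalone sketch, not OpenSM code: `struct sw`, `start_index_direct()` and `start_index_normalized()` are hypothetical names that only mirror the `down_port_groups_idx` and `down_port_groups_num` fields seen in the diffs.

```c
#include <assert.h>

/* Hypothetical stand-in for the two ftree_sw_t fields named in the
 * thread; this is not the real OpenSM structure. */
struct sw {
	int down_port_groups_idx; /* round-robin cursor into the group array */
	int down_port_groups_num; /* number of down-going port groups (> 0) */
};

/* Pre-fix start index: the cursor is used unchanged, so an initial
 * value of -1 points one element before the port-group array. */
static int start_index_direct(const struct sw *s)
{
	return s->down_port_groups_idx;
}

/* First patch: add the group count before taking the remainder.
 * C99 truncates toward zero, so -1 % n == -1; adding n first maps
 * any cursor in [-n, n) into [0, n). */
static int start_index_normalized(const struct sw *s)
{
	int n = s->down_port_groups_num;
	int i = (s->down_port_groups_idx + n) % n;

	assert(i >= 0 && i < n); /* now safe to use as an array index */
	return i;
}
```

With four down-going groups, a cursor of -1 gives a direct start index of -1 but a normalized one of 3; with the cursor initialized to 0, as in the v2 patch, both functions return 0. If, as suspected above, only initialization can put the cursor out of range, the simpler v2 fix is sufficient.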
Nicolas From kliteyn at dev.mellanox.co.il Tue Feb 10 01:17:49 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 10 Feb 2009 11:17:49 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Fixed bad init value for down port index In-Reply-To: <499142F0.8000803@ext.bull.net> References: <49913175.609@ext.bull.net> <499141F3.9020001@dev.mellanox.co.il> <499142F0.8000803@ext.bull.net> Message-ID: <4991463D.6030705@dev.mellanox.co.il> Nicolas Morey Chaisemartin wrote: > Yevgeny Kliteynik wrote: >> Hi Nicolas, >> >>> >>> /* foreach down-going port group (in indexing order) */ >>> - i = p_sw->down_port_groups_idx; >>> + i = (p_sw->down_port_groups_idx + >>> + p_sw->down_port_groups_num) % p_sw->down_port_groups_num; >> >> Perhaps it would be simpler just to init the down_port_groups_idx to 0 >> instead of -1? >> Something like this: >> >> diff --git a/opensm/opensm/osm_ucast_ftree.c >> b/opensm/opensm/osm_ucast_ftree.c >> index 4e65c87..eae1ed8 100644 >> --- a/opensm/opensm/osm_ucast_ftree.c >> +++ b/opensm/opensm/osm_ucast_ftree.c >> @@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN >> ftree_fabric_t * p_ftree, >> /* initialize lft buffer */ >> memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); >> >> - p_sw->down_port_groups_idx = -1; >> + p_sw->down_port_groups_idx = 0; >> >> return p_sw; >> } /* __osm_ftree_sw_create() */ > > Sure. I wanted to ensure that whatever happens to the index it would > always be in the right interval but after checking I doubt anything else > than initialization could set it outside its normal interval. > Do you want me to make the patch and send it or will you just push yours? I'm ok with both options. I can send a clean patch to Sasha tomorrow (I'm OOO today), or you can do it today. 
-- Yevgeny > Nicolas > From nicolas.morey-chaisemartin at ext.bull.net Tue Feb 10 01:29:28 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Tue, 10 Feb 2009 10:29:28 +0100 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Fixed bad init value for down port index In-Reply-To: <4991463D.6030705@dev.mellanox.co.il> References: <49913175.609@ext.bull.net> <499141F3.9020001@dev.mellanox.co.il> <499142F0.8000803@ext.bull.net> <4991463D.6030705@dev.mellanox.co.il> Message-ID: <499148F8.3000303@ext.bull.net> Yevgeny Kliteynik wrote: > Nicolas Morey Chaisemartin wrote: >> Yevgeny Kliteynik wrote: >>> Hi Nicolas, >>> >>>> >>>> /* foreach down-going port group (in indexing order) */ >>>> - i = p_sw->down_port_groups_idx; >>>> + i = (p_sw->down_port_groups_idx + >>>> + p_sw->down_port_groups_num) % p_sw->down_port_groups_num; >>> >>> Perhaps it would be simpler just to init the down_port_groups_idx to >>> 0 instead of -1? >>> Something like this: >>> >>> diff --git a/opensm/opensm/osm_ucast_ftree.c >>> b/opensm/opensm/osm_ucast_ftree.c >>> index 4e65c87..eae1ed8 100644 >>> --- a/opensm/opensm/osm_ucast_ftree.c >>> +++ b/opensm/opensm/osm_ucast_ftree.c >>> @@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN >>> ftree_fabric_t * p_ftree, >>> /* initialize lft buffer */ >>> memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); >>> >>> - p_sw->down_port_groups_idx = -1; >>> + p_sw->down_port_groups_idx = 0; >>> >>> return p_sw; >>> } /* __osm_ftree_sw_create() */ >> >> Sure. I wanted to ensure that whatever happens to the index it would >> always be in the right interval but after checking I doubt anything >> else than initialization could set it outside its normal interval. >> Do you want me to make the patch and send it or will you just push >> yours? > > I'm ok with both options. > I can send a clean patch to Sasha tomorrow (I'm OOO today), or you can > do it today. 
> > -- Yevgeny > >> Nicolas >> > > > Yours should be faster and I recheck and I see no reason to enforce a "check" in the function so I prefer your solution. I'll repost the patch today as it's breaking opensm/ftree. Nicolas From nicolas.morey-chaisemartin at ext.bull.net Tue Feb 10 01:53:21 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Tue, 10 Feb 2009 10:53:21 +0100 Subject: [ofa-general] [PATCH v2] opensm/osm_ucast_ftree.c Fixed bad init value for down port index Message-ID: <49914E91.4090305@ext.bull.net> Fixes the init value of down_port_groups_idx to 0 so it's in the port group interval. This way __osm_ftree_fabric_route_upgoing_by_going_down can use the index directly without segfaulting. Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/opensm/osm_ucast_ftree.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 4e65c87..eae1ed8 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN ftree_fabric_t * p_ftree, /* initialize lft buffer */ memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); - p_sw->down_port_groups_idx = -1; + p_sw->down_port_groups_idx = 0; return p_sw; } /* __osm_ftree_sw_create() */ -- 1.6.1 From vlad at lists.openfabrics.org Tue Feb 10 03:13:09 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 10 Feb 2009 03:13:09 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090210-0200 daily build status Message-ID: <20090210111309.DB371E61174@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with 
linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From acceptany at gmail.com Tue Feb 10 03:56:06 2009 From: acceptany at gmail.com (Jordan) Date: Tue, 10 Feb 2009 19:56:06 +0800 Subject: [ofa-general] ***SPAM*** How to add a new routing algorithm in opensm? Message-ID: <91fe68d50902100356w790095cdy158c0f681ef5ceec@mail.gmail.com> How can I add a new routing algorithm in opensm , which files need to be modified? 
If this can be done, is there a simulator to test this new algorithm and dump some results?

From swise at opengridcomputing.com Tue Feb 10 10:44:48 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 10 Feb 2009 12:44:48 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
Message-ID: <20090210184448.22891.31130.stgit@dell3.ogc.int>

From: Steve Wise

Removes the need for special u64 math on i386 systems.

Fixes i386 build break in linux-next introduced by
commit 1e27e8cee0698259ccb1fe6abeaf4b48969c0945.

Signed-off-by: Steve Wise
---

 drivers/infiniband/hw/cxgb3/iwch_qp.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index 2cf6f13..5bb299a 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -232,8 +232,8 @@ static int iwch_sgl2pbl_map(struct iwch_dev *rhp, struct ib_sge *sg_list,
 			return -EINVAL;
 		}
 		offset = sg_list[i].addr - mhp->attr.va_fbo;
-		offset += ((u64) mhp->attr.va_fbo) %
-			  (1UL << (12 + mhp->attr.page_size));
+		offset += mhp->attr.va_fbo &
+			  ((1UL << (12 + mhp->attr.page_size)) - 1);
 		pbl_addr[i] = ((mhp->attr.pbl_addr -
 				rhp->rdev.rnic_info.pbl_base) >> 3) +
 			      (offset >> (12 + mhp->attr.page_size));
@@ -263,8 +263,8 @@ static int build_rdma_recv(struct iwch_qp *qhp, union t3_wr *wqe,
 		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
 
 		/* to in the WQE == the offset into the page */
-		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
-				(1UL << (12 + page_size[i])));
+		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
+				((1UL << (12 + page_size[i]))-1));
 
 		/* pbl_addr is the adapters address in the PBL */
 		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);

From randy.dunlap at oracle.com Tue Feb 10 11:04:55 2009
From: randy.dunlap at oracle.com (Randy Dunlap)
Date: Tue, 10 Feb 2009 11:04:55 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <20090210184448.22891.31130.stgit@dell3.ogc.int>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
Message-ID: <4991CFD7.30503@oracle.com>

Steve Wise wrote:
> From: Steve Wise
>
> Removes the need for special u64 math on i386 systems.
>
> Fixes i386 build break in linux-next introduced by
> commit 1e27e8cee0698259ccb1fe6abeaf4b48969c0945.
>
> Signed-off-by: Steve Wise

Yes, that works, thanks.  But this patch should go into 2.6.29, not
just 2.6.30.

> ---
>
>  drivers/infiniband/hw/cxgb3/iwch_qp.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
> index 2cf6f13..5bb299a 100644
> --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
> +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
> @@ -232,8 +232,8 @@ static int iwch_sgl2pbl_map(struct iwch_dev *rhp, struct ib_sge *sg_list,
>  			return -EINVAL;
>  		}
>  		offset = sg_list[i].addr - mhp->attr.va_fbo;
> -		offset += ((u64) mhp->attr.va_fbo) %
> -			  (1UL << (12 + mhp->attr.page_size));
> +		offset += mhp->attr.va_fbo &
> +			  ((1UL << (12 + mhp->attr.page_size)) - 1);
>  		pbl_addr[i] = ((mhp->attr.pbl_addr -
>  				rhp->rdev.rnic_info.pbl_base) >> 3) +
>  			      (offset >> (12 + mhp->attr.page_size));
> @@ -263,8 +263,8 @@ static int build_rdma_recv(struct iwch_qp *qhp, union t3_wr *wqe,
>  		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
>  
>  		/* to in the WQE == the offset into the page */
> -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> -				(1UL << (12 + page_size[i])));
> +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
> +				((1UL << (12 + page_size[i]))-1));
>  
>  		/* pbl_addr is the adapters address in the PBL */
>  		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);

-- 
~Randy

From swise at opengridcomputing.com Tue Feb 10 11:10:34 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 10 Feb 2009 13:10:34 -0600
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <4991CFD7.30503@oracle.com>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<4991CFD7.30503@oracle.com>
Message-ID: <4991D12A.8090309@opengridcomputing.com>



Randy Dunlap wrote:
> Steve Wise wrote:
>   
>> From: Steve Wise
>>
>> Removes the need for special u64 math on i386 systems.
>>
>> Fixes i386 build break in linux-next introduced by
>> commit 1e27e8cee0698259ccb1fe6abeaf4b48969c0945.
>>
>> Signed-off-by: Steve Wise
>>     
>
> Yes, that works, thanks.  But this patch should go into 2.6.29, not
> just 2.6.30.
>
>
>   

I thought the commit that caused this was:

1e27e8cee0698259ccb1fe6abeaf4b48969c0945

And that was going in 2.6.30.  (I thought).

>> ---
>>
>>  drivers/infiniband/hw/cxgb3/iwch_qp.c |    8 ++++----
>>  1 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
>> index 2cf6f13..5bb299a 100644
>> --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
>> +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
>> @@ -232,8 +232,8 @@ static int iwch_sgl2pbl_map(struct iwch_dev *rhp, struct ib_sge *sg_list,
>>  			return -EINVAL;
>>  		}
>>  		offset = sg_list[i].addr - mhp->attr.va_fbo;
>> -		offset += ((u64) mhp->attr.va_fbo) %
>> -			  (1UL << (12 + mhp->attr.page_size));
>> +		offset += mhp->attr.va_fbo &
>> +			  ((1UL << (12 + mhp->attr.page_size)) - 1);
>>  		pbl_addr[i] = ((mhp->attr.pbl_addr -
>>  				rhp->rdev.rnic_info.pbl_base) >> 3) +
>>  				(offset >> (12 + mhp->attr.page_size));
>> @@ -263,8 +263,8 @@ static int build_rdma_recv(struct iwch_qp *qhp, union t3_wr *wqe,
>>  		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
>>  
>>  		/* to in the WQE == the offset into the page */
>> -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
>> -				(1UL << (12 + page_size[i])));
>> +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
>> +				((1UL << (12 + page_size[i]))-1));
>>  
>>  		/* pbl_addr is the adapters address in the PBL */
>>  		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);
>>
>
>
>

From randy.dunlap at oracle.com Tue Feb 10 11:12:27 2009
From: randy.dunlap at oracle.com (Randy Dunlap)
Date: Tue, 10 Feb 2009 11:12:27 -0800
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <4991D12A.8090309@opengridcomputing.com>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<4991CFD7.30503@oracle.com>
	<4991D12A.8090309@opengridcomputing.com>
Message-ID: <4991D19B.5050307@oracle.com>

Steve Wise wrote:
> 
> Randy Dunlap wrote:
>> Steve Wise wrote:
>>  
>>> From: Steve Wise
>>>
>>> Removes the need for special u64 math on i386 systems.
>>>
>>> Fixes i386 build break in linux-next introduced by commit
>>> 1e27e8cee0698259ccb1fe6abeaf4b48969c0945.
>>>
>>> Signed-off-by: Steve Wise
>>>     
>>
>> Yes, that works, thanks.  But this patch should go into 2.6.29, not
>> just 2.6.30.
>>
>>
>>   
> I thought the commit that caused this was:
> 
> 1e27e8cee0698259ccb1fe6abeaf4b48969c0945
> 
> And that was going in 2.6.30.  (I thought).

Oh, OK.  If that's the case, then you are obviously correct about [2.6.30].
Thanks.
>>> ---
>>>
>>>  drivers/infiniband/hw/cxgb3/iwch_qp.c |    8 ++++----
>>>  1 files changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c
>>> b/drivers/infiniband/hw/cxgb3/iwch_qp.c
>>> index 2cf6f13..5bb299a 100644
>>> --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
>>> +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
>>> @@ -232,8 +232,8 @@ static int iwch_sgl2pbl_map(struct iwch_dev *rhp,
>>> struct ib_sge *sg_list,
>>>  			return -EINVAL;
>>>  		}
>>>  		offset = sg_list[i].addr - mhp->attr.va_fbo;
>>> -		offset += ((u64) mhp->attr.va_fbo) %
>>> -			  (1UL << (12 + mhp->attr.page_size));
>>> +		offset += mhp->attr.va_fbo &
>>> +			  ((1UL << (12 + mhp->attr.page_size)) - 1);
>>>  		pbl_addr[i] = ((mhp->attr.pbl_addr -
>>>  				rhp->rdev.rnic_info.pbl_base) >> 3) +
>>>  				(offset >> (12 + mhp->attr.page_size));
>>> @@ -263,8 +263,8 @@ static int build_rdma_recv(struct iwch_qp *qhp,
>>> union t3_wr *wqe,
>>>  		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
>>>  
>>>  		/* to in the WQE == the offset into the page */
>>> -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
>>> -				(1UL << (12 + page_size[i])));
>>> +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
>>> +				((1UL << (12 + page_size[i]))-1));
>>>  
>>>  		/* pbl_addr is the adapters address in the PBL */
>>>  		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);


-- 
~Randy

From rdreier at cisco.com Tue Feb 10 16:38:03 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Feb 2009 16:38:03 -0800
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <20090210184448.22891.31130.stgit@dell3.ogc.int> (Steve Wise's message of "Tue, 10 Feb 2009 12:44:48 -0600")
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
Message-ID:

I'll roll this into the offending patch (that is in -next).
But:

> -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> -				(1UL << (12 + page_size[i])));
> +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
> +				((1UL << (12 + page_size[i]))-1));

Is this required?  Strength reduction optimization should do this
automatically (and the code has been there for quite a while, so
obviously it isn't causing problems)

 - R.

From swise at opengridcomputing.com Tue Feb 10 17:03:52 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 10 Feb 2009 19:03:52 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To:
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
Message-ID: <499223F8.1010204@opengridcomputing.com>



Roland Dreier wrote:
> I'll roll this into the offending patch (that is in -next).
>
> But:
>
>  > -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
>  > -				(1UL << (12 + page_size[i])));
>  > +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
>  > +				((1UL << (12 + page_size[i]))-1));
>
> Is this required?  Strength reduction optimization should do this
> automatically (and the code has been there for quite a while, so
> obviously it isn't causing problems)
>
>  - R.
>   
Ok.

From davem at davemloft.net Tue Feb 10 17:07:40 2009
From: davem at davemloft.net (David Miller)
Date: Tue, 10 Feb 2009 17:07:40 -0800 (PST)
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <499223F8.1010204@opengridcomputing.com>
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<499223F8.1010204@opengridcomputing.com>
Message-ID: <20090210.170740.208470781.davem@davemloft.net>

From: Steve Wise
Date: Tue, 10 Feb 2009 19:03:52 -0600

> Roland Dreier wrote:
> > I'll roll this into the offending patch (that is in -next).
> > 
> > But:
> > 
> >  > -		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> >  > -				(1UL << (12 + page_size[i])));
> >  > +		wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
> >  > +				((1UL << (12 + page_size[i]))-1));
> > 
> > Is this required?  Strength reduction optimization should do this
> > automatically (and the code has been there for quite a while, so
> > obviously it isn't causing problems)
> > 
> >  - R.
> >   
> Ok.

GCC won't optimize that modulus the way you expect, try for yourself
and look at the assembler if you don't believe me. :-)

From rdreier at cisco.com Tue Feb 10 17:18:49 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Feb 2009 17:18:49 -0800
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <20090210.170740.208470781.davem@davemloft.net> (David Miller's message of "Tue, 10 Feb 2009 17:07:40 -0800 (PST)")
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
	<499223F8.1010204@opengridcomputing.com>
	<20090210.170740.208470781.davem@davemloft.net>
Message-ID:

 > > Is this required?  Strength reduction optimization should do this
 > > automatically (and the code has been there for quite a while, so
 > > obviously it isn't causing problems)

 > GCC won't optimize that modulus the way you expect, try for yourself
 > and look at the assembler if you don't believe me. :-)

Are you thinking of the case when there are signed integers involved and
so "% modulus" might produce a different result than "& (modulus - 1)"
(because the compiler can't know that things are never negative)?
Because in this case the compiler seems to do what I thought it would;
the relevant part of the i386 assembly for

		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
						  (1UL << (12 + page_size[i])));

is

	movl	%eax, 28(%edi,%ebx)	# .length, .len
	movzbl	28(%esp,%esi), %ecx	# page_size, tmp89
	movl	$1, %eax	#, tmp92
	addl	$12, %ecx	#, tmp90
	sall	%cl, %eax	# tmp90, tmp92
	movl	(%esp), %ecx	# wr,
	decl	%eax	# tmp93
	movl	12(%ecx), %edx	# .sg_list, .sg_list
	andl	(%edx,%ebx), %eax	# .addr, tmp93

ie the compiler computes the modulus, then does decl to compute
modulus-1 and then &s with it.

Or am I misunderstanding your point?

 - R.

From acceptany at gmail.com Tue Feb 10 17:23:50 2009
From: acceptany at gmail.com (Jordan)
Date: Wed, 11 Feb 2009 09:23:50 +0800
Subject: [ofa-general] ***SPAM*** How to add a new routing algorithm in opensm?
In-Reply-To: <91fe68d50902100356w790095cdy158c0f681ef5ceec@mail.gmail.com>
References: <91fe68d50902100356w790095cdy158c0f681ef5ceec@mail.gmail.com>
Message-ID: <91fe68d50902101723q2ca64b8cl4c4fe03fc2f9fbb@mail.gmail.com>

How can I add a new routing algorithm to opensm, and which files need to be modified?
If this can be done, is there a simulator to test this new algorithm and dump some results?

From davem at davemloft.net Tue Feb 10 17:23:47 2009
From: davem at davemloft.net (David Miller)
Date: Tue, 10 Feb 2009 17:23:47 -0800 (PST)
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To:
References: <499223F8.1010204@opengridcomputing.com>
	<20090210.170740.208470781.davem@davemloft.net>
Message-ID: <20090210.172347.189515015.davem@davemloft.net>

From: Roland Dreier
Date: Tue, 10 Feb 2009 17:18:49 -0800

>  > > Is this required?
Strength reduction optimization should do this
>  > > automatically (and the code has been there for quite a while, so
>  > > obviously it isn't causing problems)
> 
>  > GCC won't optimize that modulus the way you expect, try for yourself
>  > and look at the assembler if you don't believe me. :-)
> 
> Are you thinking of the case when there are signed integers involved and
> so "% modulus" might produce a different result than "& (modulus - 1)"
> (because the compiler can't know that things are never negative)?
> Because in this case the compiler seems to do what I thought it would;
> the relevant part of the i386 assembly for
> 
> 		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> 						  (1UL << (12 + page_size[i])));
> 
> is
> 
> 	movl	%eax, 28(%edi,%ebx)	# .length,
> .len
> 	movzbl	28(%esp,%esi), %ecx	# page_size, tmp89
> 	movl	$1, %eax	#, tmp92
> 	addl	$12, %ecx	#, tmp90
> 	sall	%cl, %eax	# tmp90, tmp92
> 	movl	(%esp), %ecx	# wr,
> 	decl	%eax	# tmp93
> 	movl	12(%ecx), %edx	# .sg_list, .sg_list
> 	andl	(%edx,%ebx), %eax	# .addr, tmp93
> 
> ie the compiler computes the modulus, then does decl to compute
> modulus-1 and then &s with it.
> 
> Or am I misunderstanding your point?
Must be compiler and platform specific because with gcc-4.1.3 on
sparc with -O2, for the test program:

unsigned long page_size[4];

int main(int argc)
{
	unsigned long long x = argc;

	return x % (1UL << (12 + page_size[argc]));
}

I get a call to __umoddi3:

main:
	save	%sp, -112, %sp
	sethi	%hi(page_size), %g1
	sll	%i0, 2, %g3
	or	%g1, %lo(page_size), %g1
	mov	1, %o2
	ld	[%g1+%g3], %g2
	add	%g2, 12, %g2
	sll	%o2, %g2, %o2
	mov	%i0, %o1
	mov	%o2, %o3
	sra	%i0, 31, %o0
	call	__umoddi3, 0
	 mov	0, %o2
	jmp	%i7+8
	 restore %g0, %o1, %o0

I get the same with gcc-4.3.0 and -O2 on 32-bit x86:

main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	movl	$1, %eax
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ecx
	subl	$20, %esp
	movl	(%ecx), %edx
	movl	page_size(,%edx,4), %ecx
	movl	$0, 12(%esp)
	movl	%edx, (%esp)
	addl	$12, %ecx
	sall	%cl, %eax
	movl	%eax, 8(%esp)
	movl	%edx, %eax
	sarl	$31, %eax
	movl	%eax, 4(%esp)
	call	__umoddi3
	addl	$20, %esp
	popl	%ecx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret

From sashak at voltaire.com Tue Feb 10 17:34:41 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 03:34:41 +0200
Subject: [ofa-general] [PATCH] infiniband-diags/saquery: remove osm vendor layer
In-Reply-To: <2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
	<20090209235414.GM26139@sashak.voltaire.com>
	<2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com>
Message-ID: <20090211013441.GR26139@sashak.voltaire.com>


Replace OSM Vendor layer by libibumad and libibmad (rpc) calls.

This patch is done following "minimum changes" rule to demonstrate osm
vendor replacement. Many subsequent improvements and simplification can
be done. All current saquery functionality is preserved.

Signed-off-by: Sasha Khapyorsky
---

On 15:55 Mon 09 Feb , Sean Hefty wrote:
> Changing saquery didn't
> look that hard to me, but it did look like it would modify a fair portion of the
> code.

Cannot resist... :)

Sasha

 infiniband-diags/configure.in  |    4 -
 infiniband-diags/src/saquery.c |  266 +++++++++++++++++++---------------------
 2 files changed, 127 insertions(+), 143 deletions(-)

diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in
index 58eea0a..7d277b2 100644
--- a/infiniband-diags/configure.in
+++ b/infiniband-diags/configure.in
@@ -40,10 +40,6 @@ AC_CHECK_LIB(ibmad, port_performance_ext_query, [],
 	AC_MSG_ERROR([port_performance_ext_query() not found. diags require more recent libibmad.]))
 AC_CHECK_LIB(osmcomp, cl_thread_init, [],
 	AC_MSG_ERROR([cl_thread_init() not found. diags require libosmcomp.]))
-AC_CHECK_LIB(osmvendor, osmv_query_sa, [],
-	AC_MSG_ERROR([osmv_query_sa() not found. diags require libosmvendor.]), [-lopensm])
-AC_CHECK_LIB(opensm, osm_log_init_v2, [],
-	AC_MSG_ERROR([osm_log_init_v2() not found. diags require libopensm.]))
 fi
 
 dnl Checks for header files.
diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 5361184..0a997cf 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -42,20 +42,33 @@
 #include
 #include
 #include
+#include
 #define _GNU_SOURCE
 #include
+#include
 #include
-#include
-#include
-#include
-#include
+#include
 #include
 #include
 #include "ibdiag_common.h"
 
+struct sa_bind_handle {
+	int fd, agent;
+	ib_portid_t dport;
+};
+
+struct sa_result {
+	int status;
+	unsigned result_cnt;
+	void *p_result_madw;
+};
+
+#define osmv_query_res_t struct sa_result
+#define osm_bind_handle_t struct sa_bind_handle *
+
 struct query_params {
 	ib_gid_t sgid, dgid, gid, mgid;
 	uint16_t slid, dlid, mlid;
@@ -82,7 +95,7 @@ struct query_cmd {
 
 static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;
-static ib_net64_t smkey = OSM_DEFAULT_SA_KEY;
+static ib_net64_t smkey = CL_HTON64(1);
 
 /**
  * Declare some globals because I don't want this to be too complex.
@@ -90,11 +103,6 @@ static ib_net64_t smkey = OSM_DEFAULT_SA_KEY;
 #define MAX_PORTS (8)
 #define DEFAULT_SA_TIMEOUT_MS (1000)
 osmv_query_res_t result;
-osm_log_t log_osm;
-osm_mad_pool_t mad_pool;
-osm_vendor_t *vendor = NULL;
-char *sa_hca_name = NULL;
-uint32_t sa_port_num = 0;
 
 enum {
 	ALL,
@@ -112,6 +120,81 @@ int requested_lid_flag = 0;
 ib_net64_t requested_guid = 0;
 int requested_guid_flag = 0;
 
+static int sa_query(struct sa_bind_handle *h, uint8_t method,
+		    ib_net16_t attr, ib_net32_t mod, ib_net64_t comp_mask,
+		    ib_net64_t sm_key, void *data)
+{
+	ib_rpc_t rpc;
+	void *umad, *mad;
+	int ret, offset, len = 256;
+
+	memset(&rpc, 0, sizeof(rpc));
+	rpc.mgtclass = IB_SA_CLASS;
+	rpc.method = method;
+	rpc.attr.id = cl_ntoh16(attr);
+	rpc.attr.mod = cl_ntoh32(mod);
+	rpc.mask = cl_ntoh64(comp_mask);
+	rpc.datasz = IB_SA_DATA_SIZE;
+	rpc.dataoffs = IB_SA_DATA_OFFS;
+
+	umad = calloc(1, len + umad_size());
+	if (!umad)
+		IBPANIC("cannot alloc mem for umad: %s\n", strerror(errno));
+
+	mad_build_pkt(umad, &rpc, &h->dport, NULL, data);
+
+	/* SA SM_Key (36/8) - temporary done using IB_MAD_MKEY_F */
+	mad_set_field64(umad_get_mad(umad), 12, IB_MAD_MKEY_F, cl_hton64(sm_key));
+
+	if (ibdebug > 1)
+		xdump(stdout, "SA Request:\n", umad_get_mad(umad), len);
+
+	ret = umad_send(h->fd, h->agent, umad, len, ibd_timeout, 0);
+	if (ret < 0)
+		IBPANIC("umad_send failed: attr %u: %s\n",
+			attr, strerror(errno));
+
+recv_mad:
+	ret = umad_recv(h->fd, umad, &len, ibd_timeout);
+	if (ret < 0) {
+		if (errno == ENOSPC) {
+			umad = realloc(umad, umad_size() + len);
+			goto recv_mad;
+		}
+		IBPANIC("umad_recv failed: attr %u: %s\n", attr,
+			strerror(errno));
+	}
+
+	if ((ret = umad_status(umad)))
+		return ret;
+
+	mad = umad_get_mad(umad);
+
+	if (ibdebug > 1)
+		xdump(stdout, "SA Response:\n", mad, len);
+
+	method = mad_get_field(mad, 0, IB_MAD_METHOD_F);
+	offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
+	result.status = mad_get_field(mad, 0, IB_MAD_STATUS_F);
+	result.p_result_madw = mad;
+	if (result.status || !offset)
+		result.result_cnt = 0;
+	else if (method != IB_MAD_METHOD_GET_TABLE)
+		result.result_cnt = 1;
+	else
+		result.result_cnt = (len - IB_SA_DATA_OFFS) / (offset << 3);
+
+	return 0;
+}
+
+static void *osmv_get_query_result(void *mad, unsigned i)
+{
+	int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
+	return mad + IB_SA_DATA_OFFS + i * (offset << 3);
+}
+
+#define osmv_get_query_node_rec(mad, i) osmv_get_query_result(mad, i)
+
 static unsigned valid_gid(ib_gid_t *gid)
 {
 	ib_gid_t zero_gid = { };
@@ -132,14 +215,6 @@ static void format_buf(char *in, char *out, unsigned size)
 	*out = '\0';
 }
 
-/**
- * Call back for the various record requests.
- */
-static void query_res_cb(osmv_query_res_t * res)
-{
-	result = *res;
-}
-
 static void print_node_desc(ib_node_record_t * node_record)
 {
 	ib_node_info_t *p_ni = &(node_record->node_info);
@@ -683,6 +758,7 @@ static void dump_one_mft_record(void *data)
 			       cl_ntoh16(mftr->mft[i]));
 	printf("\n");
 }
+
 static void dump_results(osmv_query_res_t * r, void (*dump_func) (void *))
 {
 	int i;
@@ -694,11 +770,8 @@ static void dump_results(osmv_query_res_t * r, void (*dump_func) (void *))
 
 static void return_mad(void)
 {
-	/*
-	 * Return the IB query MAD to the pool as necessary.
-	 */
-	if (result.p_result_madw != NULL) {
-		osm_mad_pool_put(&mad_pool, result.p_result_madw);
+	if (result.p_result_madw) {
+		free(result.p_result_madw - umad_size());
 		result.p_result_madw = NULL;
 	}
 }
@@ -711,32 +784,11 @@ get_any_records(osm_bind_handle_t h,
 		ib_net16_t attr_id, ib_net32_t attr_mod, ib_net64_t comp_mask,
 		void *attr, ib_net16_t attr_offset, ib_net64_t sm_key)
 {
-	ib_api_status_t status;
-	osmv_query_req_t req;
-	osmv_user_query_t user;
-
-	memset(&req, 0, sizeof(req));
-	memset(&user, 0, sizeof(user));
-
-	user.attr_id = attr_id;
-	user.attr_offset = attr_offset;
-	user.attr_mod = attr_mod;
-	user.comp_mask = comp_mask;
-	user.p_attr = attr;
-
-	req.query_type = OSMV_QUERY_USER_DEFINED;
-	req.timeout_ms = ibd_timeout;
-	req.retry_cnt = 1;
-	req.flags = OSM_SA_FLAGS_SYNC;
-	req.query_context = NULL;
-	req.pfn_query_cb = query_res_cb;
-	req.p_query_input = &user;
-	req.sm_key = sm_key;
-
-	if ((status = osmv_query_sa(h, &req)) != IB_SUCCESS) {
-		fprintf(stderr, "Query SA failed: %s\n",
-			ib_get_err_str(status));
-		return status;
+	int ret = sa_query(h, IB_MAD_METHOD_GET_TABLE, attr_id, attr_mod,
+			   comp_mask, sm_key, attr);
+	if (ret) {
+		fprintf(stderr, "Query SA failed: %s\n", ib_get_err_str(ret));
+		return ret;
 	}
 
 	if (result.status != IB_SUCCESS) {
@@ -745,7 +797,7 @@ get_any_records(osm_bind_handle_t h,
 		return result.status;
 	}
 
-	return status;
+	return ret;
 }
 
 /**
@@ -928,34 +980,21 @@
static ib_api_status_t print_node_records(osm_bind_handle_t h)
 
 static ib_api_status_t get_print_class_port_info(osm_bind_handle_t h)
 {
-	osmv_query_req_t req;
-	ib_api_status_t status;
-
-	memset(&req, 0, sizeof(req));
-
-	req.query_type = OSMV_QUERY_CLASS_PORT_INFO;
-	req.timeout_ms = ibd_timeout;
-	req.retry_cnt = 1;
-	req.flags = OSM_SA_FLAGS_SYNC;
-	req.query_context = NULL;
-	req.pfn_query_cb = query_res_cb;
-	req.p_query_input = NULL;
-	req.sm_key = 0;
-
-	if ((status = osmv_query_sa(h, &req)) != IB_SUCCESS) {
+	int ret = sa_query(h, IB_MAD_METHOD_GET, IB_MAD_ATTR_CLASS_PORT_INFO,
+			   0, 0, 0, NULL);
+	if (ret) {
 		fprintf(stderr, "ERROR: Query SA failed: %s\n",
-			ib_get_err_str(status));
-		return (status);
+			ib_get_err_str(ret));
+		return ret;
 	}
 	if (result.status != IB_SUCCESS) {
 		fprintf(stderr, "ERROR: Query result returned: %s\n",
 			ib_get_err_str(result.status));
 		return (result.status);
 	}
-	status = result.status;
 	dump_results(&result, dump_class_port_info);
 	return_mad();
-	return (status);
+	return ret;
 }
 
 static int query_path_records(const struct query_cmd *q, osm_bind_handle_t h,
@@ -1046,11 +1085,8 @@ static ib_api_status_t print_multicast_member_records(osm_bind_handle_t h)
 	return_mad();
 
 return_mc:
-	/* return_mad for the mc_group_result */
-	if (mc_group_result.p_result_madw != NULL) {
-		osm_mad_pool_put(&mad_pool, mc_group_result.p_result_madw);
-		mc_group_result.p_result_madw = NULL;
-	}
+	if (mc_group_result.p_result_madw)
+		free(mc_group_result.p_result_madw - umad_size());
 
 	return (status);
 }
@@ -1366,78 +1402,30 @@ static int query_mft_records(const struct query_cmd *q, osm_bind_handle_t h,
 
 static osm_bind_handle_t get_bind_handle(void)
 {
-	uint32_t i = 0;
-	uint64_t port_guid = (uint64_t) - 1;
-	osm_bind_handle_t h;
-	ib_api_status_t status;
-	ib_port_attr_t attr_array[MAX_PORTS];
-	uint32_t num_ports = MAX_PORTS;
-	uint32_t ca_name_index = 0;
-
-	complib_init();
-
-	osm_log_construct(&log_osm);
-	if ((status = osm_log_init_v2(&log_osm, TRUE,
				      0x0001, NULL,
-				      0, TRUE)) != IB_SUCCESS) {
-		fprintf(stderr, "Failed to init osm_log: %s\n",
-			ib_get_err_str(status));
-		exit(-1);
-	}
-	osm_log_set_level(&log_osm, OSM_LOG_NONE);
-	if (ibdebug)
-		osm_log_set_level(&log_osm, OSM_LOG_DEFAULT_LEVEL);
-
-	vendor = osm_vendor_new(&log_osm, ibd_timeout);
-	osm_mad_pool_construct(&mad_pool);
-	if ((status = osm_mad_pool_init(&mad_pool)) != IB_SUCCESS) {
-		fprintf(stderr, "Failed to init mad pool: %s\n",
-			ib_get_err_str(status));
-		exit(-1);
-	}
+	static struct sa_bind_handle handle;
+	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
 
-	if ((status =
-	     osm_vendor_get_all_port_attr(vendor, attr_array,
-					  &num_ports)) != IB_SUCCESS) {
-		fprintf(stderr, "Failed to get port attributes: %s\n",
-			ib_get_err_str(status));
-		exit(-1);
-	}
+	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2);
 
-	for (i = 0; i < num_ports; i++) {
-		if (i > 1 && cl_ntoh64(attr_array[i].port_guid)
-		    != (cl_ntoh64(attr_array[i - 1].port_guid) + 1))
-			ca_name_index++;
-		if (sa_port_num && sa_port_num != attr_array[i].port_num)
-			continue;
-		if (sa_hca_name
-		    && strcmp(sa_hca_name,
-			      vendor->ca_names[ca_name_index]) != 0)
-			continue;
-		if (attr_array[i].link_state == IB_LINK_ACTIVE) {
-			port_guid = attr_array[i].port_guid;
-			break;
-		}
-	}
+	ib_resolve_smlid(&handle.dport, ibd_timeout);
+	if (!handle.dport.lid)
+		IBPANIC("No SM found.");
 
-	if (port_guid == (uint64_t) - 1) {
-		fprintf(stderr,
-			"Failed to find active port, check port status with \"ibstat\"\n");
-		exit(-1);
-	}
+	handle.dport.qp = 1;
+	if (!handle.dport.qkey)
+		handle.dport.qkey = IB_DEFAULT_QP1_QKEY;
 
-	h = osmv_bind_sa(vendor, &mad_pool, port_guid);
+	handle.fd = madrpc_portid();
+	handle.agent = umad_register(handle.fd, IB_SA_CLASS, 2, 1, NULL);
 
-	if (h == OSM_BIND_INVALID_HANDLE) {
-		fprintf(stderr, "Failed to bind to SA\n");
-		exit(-1);
-	}
-	return h;
+	return &handle;
 }
 
-static void clean_up(void)
+static void clean_up(struct sa_bind_handle *h)
 {
-	osm_mad_pool_destroy(&mad_pool);
-	osm_vendor_delete(&vendor);
+	umad_unregister(h->fd, h->agent);
+	umad_close_port(h->fd);
+	umad_done();
 }
 
 static const struct query_cmd query_cmds[] = {
@@ -1847,7 +1835,7 @@ int main(int argc, char **argv)
 	if (src_lid)
 		free(src_lid);
-	clean_up();
+	clean_up(h);
 	close_node_name_map(node_name_map);
 	return (status);
 }
-- 
1.6.1.2.319.gbd9e

From sean.hefty at intel.com Tue Feb 10 17:38:02 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 10 Feb 2009 17:38:02 -0800
Subject: [ofa-general] RE: [PATCH] infiniband-diags/saquery: remove osm vendor layer
In-Reply-To: <20090211013441.GR26139@sashak.voltaire.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
	<20090209235414.GM26139@sashak.voltaire.com>
	<2AB4681E1AED47B7A0904E032023F326@amr.corp.intel.com>
	<20090211013441.GR26139@sashak.voltaire.com>
Message-ID:

>Replace OSM Vendor layer by libibumad and libibmad (rpc) calls.
>
>This patch is done following "minimum changes" rule to demonstrate osm
>vendor replacement. Many subsequent improvements and simplification can
>be done. All current saquery functionality is preserved.
>
>Signed-off-by: Sasha Khapyorsky
>---
>
>On 15:55 Mon 09 Feb , Sean Hefty wrote:
>> Changing saquery didn't
>> look that hard to me, but it did look like it would modify a fair portion of
>the
>> code.
>
>Cannot resist... :)

Excellent! - thanks Sasha!  It even reduced the codebase too.

- Sean

From sashak at voltaire.com Tue Feb 10 17:46:35 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 03:46:35 +0200
Subject: [ofa-general] Re: [ofw] Re: saquery & osm vendor IBAL - ca_names missing from osm_vendor_t ?
In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E1931817F0D@orsmsx501.amr.corp.intel.com>
References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com>
	<498F5A8F.2000101@dev.mellanox.co.il>
	<498F5E7B.6020208@dev.mellanox.co.il>
	<3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com>
	<20090209235414.GM26139@sashak.voltaire.com>
	<3F6F638B8D880340AB536D29CD4C1E1931817F0D@orsmsx501.amr.corp.intel.com>
Message-ID: <20090211014635.GS26139@sashak.voltaire.com>

On 16:34 Mon 09 Feb , Smith, Stan wrote:
> 
> Path of least resistance thinking would point towards not doing a switch as the vendor-ibal to vendor-ibumad would be part of the Windows OpenSM migration to OFED 1.4x OpenSM.
> My thinking is that making a switch to vendor-ibumad would be a good deal more work/involved just to get saquery working.

For saquery alone it would be overkill. (BTW, I posted a patch that removes the osm vendor calls from saquery; hopefully the problem of extending vendor-ibal will be eliminated soon.)

I was thinking about vendor switching in the context of OpenSM itself, in order to unify the OpenSM/umad access layer across different systems (and eventually to clean up all that osm vendor mess).

> Not knowing the Windows OpenSM code base, moving part of it forward seems like a task 'which' could blossom into a good deal more work for the small return of saquery working?
> Frankly, I'd rather see work put into getting OFED OpenSM 1.4 working on Windows.

Sure, it could be done as part of the WinOF OpenSM upgrade process (doing this just for fun against an outdated OpenSM codebase doesn't buy much).

Sasha

From Minoru.Hamakawa at Sun.COM Tue Feb 10 18:51:09 2009
From: Minoru.Hamakawa at Sun.COM (Minoru Hamakawa)
Date: Wed, 11 Feb 2009 11:51:09 +0900
Subject: [ofa-general] Unable to handle kernel NULL pointer dereference
Message-ID: <49923D1D.4090202@Sun.COM>

Hi experts,

Does anyone recognize the following panic?
--
Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
 [] kref_get+0x1/0x3d
...
--
It occurs when we remove the IB cable from the HCA and then reinsert it.

The HCA is X4217A-Z (Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)).
OFED is 1.3.1.
The kernel is 2.6.18-92.1.10.el5_lustre.1.6.6.20081218100335smp (a Lustre-patched kernel).

Thank you in advance for your kind attention.
Should you have any questions, please feel free to contact me.
I would appreciate hearing from you at your earliest convenience.
I'm not subscribed to this list; please reply directly to me.

Best regards,
Minoru Hamakawa

ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
 [] kref_get+0x1/0x3d
PGD 40a158067 PUD 40a364067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:1c.0/0000:0b:00.1/irq
CPU 0
Pid: 3752, comm: ib_mad1 Tainted: GF 2.6.18-92.1.10.el5_lustre.1.6.6.20081218100335smp #1
RIP: 0010:[]  [] kref_get+0x1/0x3d
RSP: 0018:ffff8104184f5cf0  EFLAGS: 00010002
RAX: ffff81040dcf3000 RBX: ffff81040dcf3000 RCX: 0000000000000000
RDX: 0000000000000100 RSI: ffff8104189f4dc0 RDI: 0000000000000008
RBP: ffff81040dcf3130 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff810416036828
R13: ffff8104189f4c18 R14: ffff8104189f4c00 R15: ffff8104184f7280
FS:  0000000000000000(0000) GS:ffffffff803ea000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000040a2f1000 CR4: 00000000000006e0
Process ib_mad1 (pid: 3752, threadinfo ffff8104184f4000, task ffff81041b993100)
Stack:  ffff81040dcf3000 ffffffff88585668 040000001d420301 032801001dbe4000
 ae64ffff88432100 0000c0fe00007fff 00ba030001000000 0000001ac9cc0001
 4580a0d000000000 000000d000000000 fc89ef3000000000 0000000000002ad7
Call Trace:
 []
:ib_sa:notice_handler+0xaf/0x10b
 [] :ib_mad:ib_mad_completion_handler+0x433/0x5e0
 [] :ib_mad:ib_mad_completion_handler+0x0/0x5e0
 [] run_workqueue+0x94/0xe4
 [] worker_thread+0x0/0x122
 [] keventd_create_kthread+0x0/0xc4
 [] worker_thread+0xf0/0x122
 [] default_wake_function+0x0/0xe
 [] keventd_create_kthread+0x0/0xc4
 [] keventd_create_kthread+0x0/0xc4
 [] kthread+0xfe/0x132
 [] keventd_create_kthread+0x0/0xc4
 [] child_rip+0xa/0x11
 [] keventd_create_kthread+0x0/0xc4
 [] kthread+0x0/0x132
 [] child_rip+0x0/0x11

From sashak at voltaire.com Tue Feb 10 19:13:38 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 05:13:38 +0200
Subject: [ofa-general] [PATCH] libibmad/mad.h: define more SA attributed
Message-ID: <20090211031338.GT26139@sashak.voltaire.com>


Define some more SA attributes.

Signed-off-by: Sasha Khapyorsky
---
 libibmad/include/infiniband/mad.h |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 3095f34..bd62ec7 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -127,12 +127,23 @@ enum SMI_ATTR_ID {
 
 enum SA_ATTR_ID {
 	IB_SA_ATTR_NOTICE = 0x02,
 	IB_SA_ATTR_INFORMINFO = 0x03,
+	IB_SA_ATTR_NODERECORD = 0x11,
 	IB_SA_ATTR_PORTINFORECORD = 0x12,
+	IB_SA_ATTR_SL2VLTABLERECORD = 0x13,
+	IB_SA_ATTR_SWITCHINFORECORD = 0x14,
+	IB_SA_ATTR_LFTRECORD = 0x15,
+	IB_SA_ATTR_RFTRECORD = 0x16,
+	IB_SA_ATTR_MFTRECORD = 0x17,
+	IB_SA_ATTR_SMINFORECORD = 0x18,
 	IB_SA_ATTR_LINKRECORD = 0x20,
+	IB_SA_ATTR_GUIDINFORECORD = 0x30,
 	IB_SA_ATTR_SERVICERECORD = 0x31,
+	IB_SA_ATTR_PKEYTABLERECORD = 0x33,
 	IB_SA_ATTR_PATHRECORD = 0x35,
+	IB_SA_ATTR_VLARBTABLERECORD = 0x36,
 	IB_SA_ATTR_MCRECORD = 0x38,
 	IB_SA_ATTR_MULTIPATH = 0x3a,
+	IB_SA_ATTR_INFORMINFORECORD = 0xf3,
 
 	IB_SA_ATTR_LAST
 };
-- 
1.6.1.2.319.gbd9e

From sashak at voltaire.com Tue Feb 10 19:14:13 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 05:14:13 +0200
Subject: [ofa-general] [PATCH] libibmad/fields.c: define SA SM_Key field details
Message-ID: <20090211031413.GU26139@sashak.voltaire.com>


Define SA SM_Key field details (offset, length, name, dump_function).

Signed-off-by: Sasha Khapyorsky
---
 libibmad/src/fields.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
index 08d0ccb..89581dc 100644
--- a/libibmad/src/fields.c
+++ b/libibmad/src/fields.c
@@ -95,7 +95,7 @@ static const ib_field_t ib_mad_f[] = {
 	{BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex},
 
 	/* word 10,11 (36-43 bytes) */
-	{0, 0},			/* IB_SA_MKEY_F - reserved as invalid */
+	{288, 64, "SaSMkey", mad_dump_hex},
 
 	/* word 12 (44-47 bytes) */
 	{BE_OFFS(46 * 8, 16), "SaAttrOffs", mad_dump_uint},
-- 
1.6.1.2.319.gbd9e

From Jie.Cai at cs.anu.edu.au Tue Feb 10 23:12:19 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Wed, 11 Feb 2009 18:12:19 +1100
Subject: [ofa-general] uDAPL multi-rail (multi-IAs) sample program??
Message-ID: <49927A53.1020403@cs.anu.edu.au>

Is there any sample program that uses multi-rail for RDMA communication?

At each node, multiple IAs are opened, one for each HCA port, and RDMA writes are then issued from one side to the other through both rails.

If anyone has experience with this or has some sample code, please let me know.

Big thanks.

-- 
Mr. Jie Cai

From rdreier at cisco.com Tue Feb 10 23:20:39 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Feb 2009 23:20:39 -0800
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <20090210.172347.189515015.davem@davemloft.net> (David Miller's message of "Tue, 10 Feb 2009 17:23:47 -0800 (PST)")
References: <499223F8.1010204@opengridcomputing.com> <20090210.170740.208470781.davem@davemloft.net> <20090210.172347.189515015.davem@davemloft.net>
Message-ID:

> Must be compiler and platform specific because with gcc-4.1.3 on
> sparc with -O2, for the test program:
>
> unsigned long page_size[4];
>
> int main(int argc)
> {
> 	unsigned long long x = argc;
>
> 	return x % (1UL << (12 + page_size[argc]));
> }
>
> I get a call to __umoddi3:

You're not testing the same thing. The original code was:

	wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
		(1UL << (12 + page_size[i])));

and it's not that easy to see with all the parentheses, but the
expression being done is (u32) % (unsigned long). So rather than
unsigned long long in your program, you should have just done unsigned
(u32 is unsigned int on all Linux architectures). In that case gcc does
not generate a call to any library function in all the versions I have
handy, although gcc 4.1 does do a div instead of an and. (And I don't
think any 32-bit architectures require a library function for (unsigned)
% (unsigned), so the code should be OK)

Your example shows that gcc is missing a strength reduction opportunity
in not handling (u64) % (unsigned long) on 32 bit architectures, but I
guess it is a more difficult optimization to do, since gcc has to know
that it can simply zero the top 32 bits.

 - R.

From davem at davemloft.net Wed Feb 11 00:00:49 2009
From: davem at davemloft.net (David Miller)
Date: Wed, 11 Feb 2009 00:00:49 -0800 (PST)
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To:
References: <20090210.172347.189515015.davem@davemloft.net>
Message-ID: <20090211.000049.193727089.davem@davemloft.net>

From: Roland Dreier
Date: Tue, 10 Feb 2009 23:20:39 -0800

> > unsigned long page_size[4];
> >
> > int main(int argc)
> > {
> > 	unsigned long long x = argc;
> >
> > 	return x % (1UL << (12 + page_size[argc]));
> > }
> >
> > I get a call to __umoddi3:
>
> You're not testing the same thing. The original code was:
>
> 	wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> 		(1UL << (12 + page_size[i])));
>
> and it's not that easy to see with all the parentheses, but the
> expression being done is (u32) % (unsigned long). So rather than
> unsigned long long in your program, you should have just done unsigned
> (u32 is unsigned int on all Linux architectures). In that case gcc does
> not generate a call to any library function in all the versions I have
> handy, although gcc 4.1 does do a div instead of an and. (And I don't
> think any 32-bit architectures require a library function for (unsigned)
> % (unsigned), so the code should be OK)
>
> Your example shows that gcc is missing a strength reduction opportunity
> in not handling (u64) % (unsigned long) on 32 bit architectures, but I
> guess it is a more difficult optimization to do, since gcc has to know
> that it can simply zero the top 32 bits.

Indeed, I get the divide if I use "unsigned int" for "x".

I still think you should make this change, as many systems out there are getting the expensive divide.
main:
        sethi   %hi(page_size), %g1
        or      %g1, %lo(page_size), %g1
        mov     %o0, %g3
        sll     %o0, 2, %g4
        ld      [%g1+%g4], %g2
        mov     1, %g1
        add     %g2, 12, %g2
        sll     %g1, %g2, %g1
        wr      %g0, %g0, %y
        nop
        nop
        nop
        udiv    %o0, %g1, %o0
        smul    %o0, %g1, %o0
        jmp     %o7+8
         sub    %g3, %o0, %o0

From dorfman.eli at gmail.com Wed Feb 11 01:22:24 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Wed, 11 Feb 2009 11:22:24 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet configuration after SIGHUP
In-Reply-To: <20090209141732.GF26139@sashak.voltaire.com>
References: <20090203124407.GE11874@sashak.voltaire.com> <49884962.5070601@gmail.com> <20090203134831.GI11874@sashak.voltaire.com> <498850A2.8090701@gmail.com> <20090205000323.GN11874@sashak.voltaire.com> <498A9888.5010003@gmail.com> <20090205121634.GQ11874@sashak.voltaire.com> <694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com> <20090208213826.GA24254@sashak.voltaire.com> <4990340A.10004@gmail.com> <20090209141732.GF26139@sashak.voltaire.com>
Message-ID: <499298D0.5060804@gmail.com>

Sasha Khapyorsky wrote:
> On 15:47 Mon 09 Feb , Eli Dorfman (Voltaire) wrote:
>> Sasha Khapyorsky wrote:
>>> Hi Eli,
>>>
>>> On 21:23 Sun 08 Feb , Eli Dorfman wrote:
>>>> yes, but wouldn't it be better to separate between heavy sweep and
>>>> config rescan (due to SIGHUP).
>>> SIGHUP main purpose always was to trigger heavy sweep.
>>>
>>>> I think that user should know when configuration is updated and not
>>>> wait for heavy sweep.
>>> I'm not following - SIGHUP will cause heavy sweep and config update,
>>> where is a waiting?
>>>
>> i meant that if the user is changing config file and there is a heavy sweep then
>> config may be updated,
>
> Are you about race between file reading (by OpenSM) and writing (by
> user)? Using write lock on reading would solve an issue.
>
>> while using specific flag for config rescan will avoid this case.
>
> What do you mean by "specific flag"? Using separate signal? Assuming so,
Assuming so, > this will not prevent read/write race. > At the moment force_heavy_sweep is set in many places and also after SIGHUP. opensm rescans the configuration file when this flag is set, so if there is link change in the subnet while the user is modifying the file, the opensm may update the configuration even if the user didn't finish updating it. Using another flag (e.g. rescan_config_file) that will be set only after SIGHUP will assure that opensm updates subnet configuration when user finished updating the file. From nicolas.morey-chaisemartin at ext.bull.net Wed Feb 11 01:25:26 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Wed, 11 Feb 2009 10:25:26 +0100 Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes In-Reply-To: <20090207202319.GE27757@sashak.voltaire.com> References: <494A5339.9030304@ext.bull.net> <20090207185551.GD27757@sashak.voltaire.com> <498DE57D.4030501@morey-chaisemartin.com> <20090207202319.GE27757@sashak.voltaire.com> Message-ID: <49929986.40106@ext.bull.net> Sasha Khapyorsky wrote: > On 20:48 Sat 07 Feb , Nicolas Morey-Chaisemartin wrote: > >>> "IO" is specific for your setup. Could we find more generic name for such >>> nodes? >>> >>> >>> >> Sure. Any ideas? >> > > No, I didn't think about it. > > I've rebased and fix the patches against master. I just need a name for the configuration. What about high nodes (HN) as it concerns only nodes which are not at the bottom of the fat tree? Nicolas From acceptany at gmail.com Wed Feb 11 03:03:04 2009 From: acceptany at gmail.com (Jordan) Date: Wed, 11 Feb 2009 19:03:04 +0800 Subject: [ofa-general] ***SPAM*** problem about adding a new routing algorithm in opensm Message-ID: <91fe68d50902110303r2b1dcf27n865bd8b39c9bea76@mail.gmail.com> I want to add a new routing algorithm in opensm , can this idea be supported by opensm ? -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From vlad at lists.openfabrics.org Wed Feb 11 03:14:15 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 11 Feb 2009 03:14:15 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090211-0200 daily build status
Message-ID: <20090211111415.C22DBE60E5A@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From sashak at voltaire.com Wed Feb 11 03:43:47 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 13:43:47 +0200
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes
In-Reply-To: <49929986.40106@ext.bull.net>
References: <494A5339.9030304@ext.bull.net> <20090207185551.GD27757@sashak.voltaire.com> <498DE57D.4030501@morey-chaisemartin.com> <20090207202319.GE27757@sashak.voltaire.com> <49929986.40106@ext.bull.net>
Message-ID: <20090211114347.GA27920@sashak.voltaire.com>

On 10:25 Wed 11 Feb , Nicolas Morey Chaisemartin wrote:
> What about high nodes (HN) as it concerns only nodes which are not at the
> bottom of the fat tree?

Could be fine. Let's ask Yevgeny too... :)

Yevgeny! Any idea about io_nodes more generic name?

Sasha

From sashak at voltaire.com Wed Feb 11 03:52:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 13:52:55 +0200
Subject: [ofa-general] [PATCH 2/4 v2] opensm/osm_state_mgr.c rescan subnet configuration after SIGHUP
In-Reply-To: <499298D0.5060804@gmail.com>
References: <20090203134831.GI11874@sashak.voltaire.com> <498850A2.8090701@gmail.com> <20090205000323.GN11874@sashak.voltaire.com> <498A9888.5010003@gmail.com> <20090205121634.GQ11874@sashak.voltaire.com> <694d48600902081123y7ddf63adk5c6562f919173241@mail.gmail.com> <20090208213826.GA24254@sashak.voltaire.com> <4990340A.10004@gmail.com> <20090209141732.GF26139@sashak.voltaire.com> <499298D0.5060804@gmail.com>
Message-ID: <20090211115247.GB27920@sashak.voltaire.com>

On 11:22 Wed 11 Feb , Eli Dorfman (Voltaire) wrote:
>
> At the moment force_heavy_sweep is set in many places and also after SIGHUP.
> opensm rescans the configuration file when this flag is set, so if there is link change
> in the subnet while the user is modifying the file, the opensm may update the configuration
> even if the user didn't finish updating it.

So what is your concern here? That OpenSM rescans unmodified file or that file is potentially broken?

> Using another flag (e.g. rescan_config_file) that will be set only after SIGHUP will
> assure that opensm updates subnet configuration when user finished updating the file.

Send SIGHUP. OpenSM will rescan config again and will do heavy sweep. Where is the problem?

Sasha

From sashak at voltaire.com Wed Feb 11 04:39:34 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 14:39:34 +0200
Subject: [ofa-general] ***SPAM*** problem about adding a new routing algorithm in opensm
In-Reply-To: <91fe68d50902110303r2b1dcf27n865bd8b39c9bea76@mail.gmail.com>
References: <91fe68d50902110303r2b1dcf27n865bd8b39c9bea76@mail.gmail.com>
Message-ID: <20090211123926.GE27920@sashak.voltaire.com>

On 19:03 Wed 11 Feb , Jordan wrote:
> I want to add a new routing algorithm in opensm ,

What is this algorithm and how is it different from existing ones?

> can this idea be supported
> by opensm ?

This idea is already supported by OpenSM - look at 'struct osm_routing_engine' (in osm_opensm.h) and how it is used.

Sasha

From sashak at voltaire.com Wed Feb 11 04:40:43 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 14:40:43 +0200
Subject: [ofa-general] ***SPAM*** How to add a new routing algorithm in opensm?
In-Reply-To: <91fe68d50902101723q2ca64b8cl4c4fe03fc2f9fbb@mail.gmail.com>
References: <91fe68d50902100356w790095cdy158c0f681ef5ceec@mail.gmail.com> <91fe68d50902101723q2ca64b8cl4c4fe03fc2f9fbb@mail.gmail.com>
Message-ID: <20090211124043.GF27920@sashak.voltaire.com>

On 09:23 Wed 11 Feb , Jordan wrote:
> If this can be done , is there a simulator to test this new
> algorithm and dump some results?

Yes, ibsim.
Sasha

From kliteyn at dev.mellanox.co.il Wed Feb 11 05:26:31 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 11 Feb 2009 15:26:31 +0200
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes
In-Reply-To: <20090211114347.GA27920@sashak.voltaire.com>
References: <494A5339.9030304@ext.bull.net> <20090207185551.GD27757@sashak.voltaire.com> <498DE57D.4030501@morey-chaisemartin.com> <20090207202319.GE27757@sashak.voltaire.com> <49929986.40106@ext.bull.net> <20090211114347.GA27920@sashak.voltaire.com>
Message-ID: <4992D207.6010701@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 10:25 Wed 11 Feb , Nicolas Morey Chaisemartin wrote:
>> What about high nodes (HN) as it concerns only nodes which are not at the
>> bottom of the fat tree?
>
> Could be fine. Let's ask Yevgeny too... :)
>
> Yevgeny! Any idea about io_nodes more generic name?

Ugh...

"IO nodes":
Pros: the name is closer to the reality, since in most cases
the nodes that would need special treatment are indeed IO nodes.
Cons: the name is not "general"...

"High nodes"
Pros: general name with kinda "hint" to the special treatment.
Cons: the "hint" is rather vague...

Bottom line - I'm OK with both options (slightly leaning toward IO),
as long as it is described well enough in the help message and in man :)

-- Yevgeny

> Sasha
>

From sashak at voltaire.com Wed Feb 11 06:04:13 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 16:04:13 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm/osm_ucast_ftree.c Fixed bad init value for down port index
In-Reply-To: <49914E91.4090305@ext.bull.net>
References: <49914E91.4090305@ext.bull.net>
Message-ID: <20090211140413.GJ27920@sashak.voltaire.com>

On 10:53 Tue 10 Feb , Nicolas Morey Chaisemartin wrote:
> Fixes the init value of down_port_groups_idx to 0 so it's in the port group
> interval.
> This way __osm_ftree_fabric_route_upgoing_by_going_down can use the index
> directly without segfaulting.
>
> Signed-off-by: Nicolas Morey-Chaisemartin
>
> > Signed-off-by: Nicolas Morey-Chaisemartin > Applied. Thanks. > --- > opensm/opensm/osm_ucast_ftree.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_ftree.c > b/opensm/opensm/osm_ucast_ftree.c > index 4e65c87..eae1ed8 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -563,7 +563,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN > ftree_fabric_t * p_ftree, > /* initialize lft buffer */ > memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); > > - p_sw->down_port_groups_idx = -1; > + p_sw->down_port_groups_idx = 0; I make it 'unsigned int' (instead of 'int') after all. Sasha > > return p_sw; > } /* __osm_ftree_sw_create() */ > -- > 1.6.1 > From sashak at voltaire.com Wed Feb 11 06:07:09 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 11 Feb 2009 16:07:09 +0200 Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes In-Reply-To: <4992D207.6010701@dev.mellanox.co.il> References: <494A5339.9030304@ext.bull.net> <20090207185551.GD27757@sashak.voltaire.com> <498DE57D.4030501@morey-chaisemartin.com> <20090207202319.GE27757@sashak.voltaire.com> <49929986.40106@ext.bull.net> <20090211114347.GA27920@sashak.voltaire.com> <4992D207.6010701@dev.mellanox.co.il> Message-ID: <20090211140703.GK27920@sashak.voltaire.com> On 15:26 Wed 11 Feb , Yevgeny Kliteynik wrote: > > Bottom line - I'm OK with both options (slightly leaning toward IO), > as long as it is described well enough in the help message and in man :) Ok, no clear opinions. 
Nicolas, it is your decision about name :)

Sasha

From subbukl at gmail.com Wed Feb 11 06:18:11 2009
From: subbukl at gmail.com (subbu kl)
Date: Wed, 11 Feb 2009 19:48:11 +0530
Subject: ***SPAM*** Re: [ofa-general] Fwd: pciback module not working
In-Reply-To: <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com>
Message-ID:

I am getting the same QUERY_FW failed on RHEL5.2 with xen paravirtualized guest with pciback module. No one seems to have tried answering this question on the list, let me ping xen-devel and ofed people again.

after executing in dom0
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot
echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind

#dmesg
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
tap tap-1-51712: 2 getting info
tap tap-2-51712: 2 getting info
pciback 0000:0e:00.0: seizing device
PCI: Enabling device 0000:0e:00.0 (0140 -> 0142)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI interrupt for device 0000:0e:00.0 disabled

#xm create -c rhel52_64_3
PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

GUEST dmesg:
ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

in dom0:
Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info
Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode
Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not ready
Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device:
Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs
Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 disabled

some more details -
[root at p128 ~]# rpm -qa | grep xen
kernel-xen-2.6.18-92.1.22.el5
xen-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9
xen-libs-3.0.3-64.el5_2.9
[root at p128 ~]# ibv_devinfo
hca_id: mthca0
	fw_ver:			5.3.0
	node_guid:		0002:c902:0022:cd48
	sys_image_guid:		0002:c902:0022:cd4b
	vendor_id:		0x02c9
	vendor_part_id:		25218
	hw_ver:			0x20
	board_id:		MT_0370130002
	phys_port_cnt:		2
		port:	1
			state:		PORT_INIT (2)
			max_mtu:	2048 (4)
			active_mtu:	512 (2)
			sm_lid:		0
			port_lid:	0
			port_lmc:	0x00

		port:	2
			state:		PORT_DOWN (1)
			max_mtu:	2048 (4)
			active_mtu:	512 (2)
			sm_lid:		0
			port_lid:	0
			port_lmc:	0x00

any help greatly appreciated.
~subbu

On Sat, Oct 18, 2008 at 4:54 AM, David Brown wrote:
> Okay so my question to the openfabrics guys is, why would the OFED
> drivers fail to read the firmware?
>
> Any thoughts?
>
> Thanks,
> - David Brown
>
>
> ---------- Forwarded message ----------
> From: David Brown
> Date: Thu, Sep 11, 2008 at 2:24 PM
> Subject: pciback module not working
> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com
>
>
> This issue was brought up about a year and a half ago. So I'll bring
> it up again and see if anything happens.
>
> I've got an infiniband network and am attempting to pass the
> infiniband card through the host and give it to the guest.
> I'm working with standard CentOS 5.2 on both guest and host with their
> provided xen (3.0.3 ish). I've also attempted to install the newest
> Xen 3.3 and use their standard host kernel and that did the same
> thing. The guest dmesg output in the guest is similar on both
> permissive and normal mode.
>
> I'm getting issues with detecting the firmware on the card for some
> reason...
>
> Any help would be appreciated.
>
> Thanks,
> - David Brown
>
> === GUEST dmesg output ===
> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
> ib_mthca: Initializing 0000:00:00.0
> PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
> PCI: Setting latency timer of device 0000:00:00.0 to 64
> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
> ib_mthca: probe of 0000:00:00.0 failed with error -11
> =======================
>
> === Host modprobe.conf ===
> alias eth0 bnx2
> alias eth1 bnx2
> alias scsi_hostadapter cciss
> options pciback hide=(41:00.0)
> =====================
>
> === Host lspci output ===
> # lspci -vs 41:00.0
> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
> HCA] (rev 20)
> 	Subsystem: Hewlett-Packard Company Unknown device 170a
> 	Flags: fast devsel, IRQ 16
> 	Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
> 	Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
> 	Capabilities: [40] Power Management version 2
> 	Capabilities: [48] Vital Product Data
> 	Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5
> Enable-
> 	Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
> 	Capabilities: [60] Express Endpoint IRQ 0
> =====================
>
> This makes sure it get loaded first off before anything else.
> === Host mkinitrd cmd ===
> # mkinitrd -f --with=pciback --preload pciback
> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen
> ====================
>
> === Host pciback dmesg ===
> pciback 0000:41:00.0: Driver tried to write to a read-only
> configuration space field at offset 0x44, size 2. This may be
> harmless, but if you have problems with your device:
> 1) see permissive attribute in sysfs
> 2) report problems to the xen-devel mailing list along with details of
> your device obtained from lspci.
> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
> PCI: Setting latency timer of device 0000:41:00.0 to 64
> ACPI: PCI interrupt for device 0000:41:00.0 disabled
> ======================
>
> === Host pciback dmesg (after setting it permissive) ===
> pciback 0000:41:00.0: enabling permissive mode configuration space
> accesses!
> pciback 0000:41:00.0: permissive mode is potentially unsafe!
> pciback: vpci: 0000:41:00.0: assign to virtual slot 0
> device vif1.0 entered promiscuous mode
> ADDRCONF(NETDEV_UP): vif1.0: link is not ready
> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi)
> PCI: Enabling device 0000:41:00.0 (0000 -> 0002)
> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16
> PCI: Setting latency timer of device 0000:41:00.0 to 64
> ACPI: PCI interrupt for device 0000:41:00.0 disabled
> =========================================
>
> === Guest lspci output ===
> # lspci -v
> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
> HCA] (rev 20)
> 	Subsystem: Hewlett-Packard Company Unknown device 170a
> 	Flags: fast devsel, IRQ 16
> 	Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
> 	Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]
> 	Capabilities: [40] Power Management version 2
> 	Capabilities: [48] Vital Product Data
> 	Capabilities: [90] Message Signalled Interrupts: 64bit+
> Queue=0/5 Enable-
> 	Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
> 	Capabilities: [60] Express Endpoint IRQ 0
> =====================
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

-- 
. . . s u b b u

"You've got to be original, because if you're like someone else, what do they need you for?"
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From olga.shern at gmail.com Wed Feb 11 06:34:32 2009
From: olga.shern at gmail.com (Olga Shern (Voltaire))
Date: Wed, 11 Feb 2009 16:34:32 +0200
Subject: [ofa-general] Enabling IP_CM warns about multicast packet drops
In-Reply-To: <4990CD57.3080108@oracle.com>
References: <4990CD57.3080108@oracle.com>
Message-ID:

Hi Summet,

You can read from the ipoib release notes:

"If IPoIB connected mode is enabled, it uses a large MTU for connected mode messages and a small MTU for datagram (in particular, multicast) messages, and relies on path MTU discovery to adjust MTU appropriately. Packets sent in the window before MTU discovery automatically reduces the MTU for a specific destination will be dropped, producing the following message in the system log:
"packet len (> ) too long to send, dropping"
To warn about this, a message is produced in the system log each time MTU is set to a value higher than 2K."

Olga

On Tue, Feb 10, 2009 at 2:41 AM, Sumeet Lahorani wrote:
> When we enable IB connected mode and increase MTU to 65520, we see the
> following in /var/log/messages
>
> Feb 6 17:48:32 dadzab01 kernel: ib0: enabling connected mode will cause
> multicast packet drops
> Feb 6 17:48:32 dadzab01 kernel: ib0: mtu > 2044 will cause multicast packet
> drops.
> Feb 6 17:48:32 dadzab01 kernel: ib1: enabling connected mode will cause
> multicast packet drops
> Feb 6 17:48:32 dadzab01 kernel: ib1: mtu > 2044 will cause multicast packet
> drops.
>
> Should we not be doing this? What kind of multicast packets will be dropped?
>
> If we are not using multicast, do any OFED drivers (bonding, ipoib etc) internally use multicast in a way that will cause them to not work correctly
> in connected mode?
>
> We are using OFED 1.3.1.
>
> - Sumeet
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From tziporet at mellanox.co.il Wed Feb 11 07:09:35 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 11 Feb 2009 17:09:35 +0200
Subject: [ofa-general] OFED (EWG) Feb 9, 2009 meeting minutes
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01BDAC2D@mtlexch01.mtl.com>

These are the OFED (EWG) meeting minutes for Feb 09 on OFED 1.4.1 release

Meeting Summary:
==============
1. Agreed on 1.4.1 release schedule - GA is planned for April 7
2. Reviewed 1.4.1 status
3. Reviewed Sonoma agenda

Details:
======
1. OFED 1.4.1 schedule:
* RC1 - Mar 3
* RC2 - Mar 17
* RC3 - Mar 31
* GA - Apr 7

2. OFED 1.4.1 release status:
> * New OSes:
> * RH 5.3 - done
> * SLES 11 - schedule is OK. RC3 already available, need to create
> backports
Tziporet to check with Novell if we can place the sources on the OFA server
* Open MPI - we will take 1.3.1
> * RDS with iWARP support - good progress
> * NFS/RDMA backports - RHEL 5.2 should be ready in 2 weeks
> * Critical bug fixes
> As far as I know these are the critical bugs that should be fixed:
> 1383 blo jackm at mellanox.co.il Local protection
> error on transmit from ipoib datagram to... - on work
1287 maj jackm at mellanox.co.il IPoIB datagram mode initial packet loss - we will check if we can fix this
* Need to add 1.4.1 to bugzilla
> 3. Sonoma updates from Bill Boas:
> Bill sent the agenda - and we reviewed it in the meeting
Comments and suggestions should be sent to Bill.
There is a need for more PR - if companies are willing to put the Press release on their web site
> Tziporet
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ogerlitz at Voltaire.com Wed Feb 11 07:11:54 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Wed, 11 Feb 2009 17:11:54 +0200
Subject: [ofa-general] Enabling IP_CM warns about multicast packet drops
In-Reply-To: <4990CD57.3080108@oracle.com>
References: <4990CD57.3080108@oracle.com>
Message-ID: <4992EABA.9090605@Voltaire.com>

Sumeet Lahorani wrote:
> When we enable IB connected mode and increase MTU to 65520, we see the following
> kernel: ib0: enabling connected mode will cause multicast packet drops
> kernel: ib0: mtu > 2044 will cause multicast packet drops.
> Should we not be doing this? What kind of multicast packets will be dropped?
> If we are not using multicast, do any drivers (bonding, ipoib etc) internally use
> multicast in a way that will cause them to not work correctly in connected mode?

Connected mode is supported only for unicast traffic where multicast traffic keeps going over the IB UD QP whose MTU is much lower (e.g. 2-4K).

To close the gap between the MTU seen by the network stack to the MTU used by the UD QP, IPoIB emulates receiving an icmp packet that tells the os stack to use a different path mtu for this multicast neighbour, see

ipoib_start_xmit --> ipoib_send --> ipoib_cm_skb_too_long(mcast_mtu) --> skb->dst->ops->update_pmtu(skb->dst, mtu)

When IP multicast is not used, multicast is used by the network stack and bonding just for the sake of sending ARPs on the broadcast group, and IGMP where the size of both is way below the IB mtu.

Or.

From swise at opengridcomputing.com Wed Feb 11 07:44:42 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 11 Feb 2009 09:44:42 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To:
References: <20090210184448.22891.31130.stgit@dell3.ogc.int>
Message-ID: <4992F26A.4030800@opengridcomputing.com>

Roland Dreier wrote:
> I'll roll this into the offending patch (that is in -next).
>
> But:
>
> > - wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> > - (1UL << (12 + page_size[i])));
> > + wqe->recv.sgl[i].to = cpu_to_be64(((u64) wr->sg_list[i].addr) &
> > + ((1UL << (12 + page_size[i]))-1));
>
> Is this required? Strength reduction optimization should do this
> automatically (and the code has been there for quite a while, so
> obviously it isn't causing problems)
>
> - R.
>

Note that wr->sg_list[i].addr was being cast to a u32. That was wrong.

From hal.rosenstock at gmail.com Wed Feb 11 08:16:25 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 11 Feb 2009 11:16:25 -0500
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes
In-Reply-To: <4992D207.6010701@dev.mellanox.co.il>
References: <494A5339.9030304@ext.bull.net> <20090207185551.GD27757@sashak.voltaire.com> <498DE57D.4030501@morey-chaisemartin.com> <20090207202319.GE27757@sashak.voltaire.com> <49929986.40106@ext.bull.net> <20090211114347.GA27920@sashak.voltaire.com> <4992D207.6010701@dev.mellanox.co.il>
Message-ID:

On Wed, Feb 11, 2009 at 8:26 AM, Yevgeny Kliteynik wrote:
> Sasha Khapyorsky wrote:
>>
>> On 10:25 Wed 11 Feb , Nicolas Morey Chaisemartin wrote:
>>>
>>> What about high nodes (HN) as it concerns only nodes which are not at the
>>> bottom of the fat tree?
>>
>> Could be fine. Let's ask Yevgeny too... :)
>>
>> Yevgeny! Any idea about io_nodes more generic name?
>
> Ugh...
>
> "IO nodes":
> Pros: the name is closer to the reality, since in most cases
> the nodes that would need special treatment are indeed IO nodes.
> Cons: the name is not "general"...
>
> "High nodes"
> Pros: general name with kinda "hint" to the special treatment.
> Cons: the "hint" is rather vague...
>
> Bottom line - I'm OK with both options (slightly leaning toward IO),
> as long as it is described well enough in the help message and in man :)

Maybe consistency is the hobgoblin of small minds, but don't we now have "high nodes", which is a topology-based name, and "compute nodes", which is a function-based name? Is it worth making them consistent?

-- Hal

> -- Yevgeny
>
>> Sasha
>>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From jean-vincent.ficet at bull.net Wed Feb 11 08:21:13 2009
From: jean-vincent.ficet at bull.net (Vincent Ficet)
Date: Wed, 11 Feb 2009 17:21:13 +0100
Subject: [ofa-general] 2.6.16.46-0.12-SLERT-10-15: scheduling while atomic ?
Message-ID: <4992FAF9.9070305@bull.net>

Hello,

On a SUSE real-time kernel (2.6.16.46-0.12-SLERT-10-15), we get the following kernel stack trace while running SDP traffic:

scheduling while atomic: ib_cm/4/0x00000001/18293

Call Trace:
{__sched_text_start+125} {lock_timer_base+27} {_spin_unlock_irqrestore+53}
{__mod_timer+439} {schedule_timeout+208} {process_timeout+0}
{_spin_unlock_irq+52} {wait_for_completion_timeout+127} {default_wake_function+0}
{:mlx4_core:__mlx4_cmd+318} {:mlx4_core:mlx4_mr_free+73} {:mlx4_ib:mlx4_ib_dereg_mr+23}
{:ib_core:ib_dereg_mr+26} {:ib_sdp:sdp_destroy_qp+161} {:ib_sdp:sdp_reset_sk+276}
{:ib_sdp:sdp_cma_handler+2008} {:ib_cm:cm_work_handler+0} {:rdma_cm:cma_modify_qp_err+72}
{__wake_up_common+62} {_spin_unlock_irqrestore+53} {:ib_cm:cm_work_handler+0}
{:rdma_cm:cma_ib_handler+369} {:ib_cm:cm_process_work+26} {:ib_cm:cm_work_handler+986}
{:ib_cm:cm_work_handler+0} {run_workqueue+154} {__sched_text_start+6}
{worker_thread+0} {keventd_create_kthread+0} {worker_thread+252}
{default_wake_function+0} {keventd_create_kthread+0} {kthread+212}
{hracct_exit_syscall+22} {child_rip+8}
{keventd_create_kthread+0} {kthread+0} {child_rip+0}

The OFA kernel package in place is:

git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
commit 88ab7955605c5e769e760f6bec980e0c2e72aa5c

Looking for the "scheduling while atomic" message in the latest kernel, we see that it was printed out by __schedule_bug in this function:

/*
 * Various schedule()-time debugging checks and statistics:
 */
static inline void schedule_debug(struct task_struct *prev)
{
	/*
	 * Test if we are atomic. Since do_exit() needs to call into
	 * schedule() atomically, we ignore that path for now.
	 * Otherwise, whine if we are scheduling when we should not be.
	 */
	if (unlikely(in_atomic_preempt_off() && !prev->exit_state))
		__schedule_bug(prev);

	profile_hit(SCHED_PROFILING, __builtin_return_address(0));

	schedstat_inc(this_rq(), sched_count);
#ifdef CONFIG_SCHEDSTATS
	if (unlikely(prev->lock_depth >= 0)) {
		schedstat_inc(this_rq(), bkl_count);
		schedstat_inc(prev, sched_info.bkl_count);
	}
#endif
}

Any idea as to what is going wrong here?
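Below is a simplified user-space model of the check schedule_debug() performs. It is not kernel code, and every toy_* name is hypothetical. A plain counter stands in for the preempt count. Taking a spinlock raises the counter. Any blocking wait made while the counter is nonzero is what the kernel reports as "scheduling while atomic".

As far as we know, the 0x00000001 field in the message is the preempt count at the time of the call. In the trace above, the blocking call is the wait_for_completion_timeout() reached from ib_dereg_mr() through the mlx4 command interface. Whether the SDP path actually holds a lock or has preemption disabled at that point is an assumption; the thread does not establish it. The model also ignores PREEMPT_RT specifics, where many spinlocks become sleeping locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's per-task preempt count. */
static int toy_preempt_count;

/* Taking a spinlock disables preemption: the count goes up. */
static void toy_spin_lock(void)
{
	toy_preempt_count++;
}

/* Releasing it re-enables preemption: the count goes down. */
static void toy_spin_unlock(void)
{
	assert(toy_preempt_count > 0);
	toy_preempt_count--;
}

/* Like in_atomic(): a nonzero count means sleeping is not allowed. */
static bool toy_in_atomic(void)
{
	return toy_preempt_count != 0;
}

/*
 * Stand-in for a blocking wait (e.g. waiting for a firmware command to
 * complete). Before "sleeping" it performs the same kind of check as
 * schedule_debug() and reports whether sleeping would be legal.
 */
static bool toy_blocking_wait(void)
{
	if (toy_in_atomic()) {
		fprintf(stderr, "scheduling while atomic: preempt count %d\n",
			toy_preempt_count);
		return false;	/* the kernel would have warned here */
	}
	return true;		/* safe to sleep */
}
```

Calling toy_blocking_wait() between toy_spin_lock() and toy_spin_unlock() prints the warning and returns false. The usual remedy for the real bug is to release the lock before making the blocking call. Whether that fits the SDP code is not determined here.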
Thanks for your help,

Vincent

From devel at morey-chaisemartin.com Wed Feb 11 09:31:35 2009
From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Wed, 11 Feb 2009 18:31:35 +0100
Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes
In-Reply-To: <20090211140703.GK27920@sashak.voltaire.com>
References: <494A5339.9030304@ext.bull.net> <20090207185551.GD27757@sashak.voltaire.com> <498DE57D.4030501@morey-chaisemartin.com> <20090207202319.GE27757@sashak.voltaire.com> <49929986.40106@ext.bull.net> <20090211114347.GA27920@sashak.voltaire.com> <4992D207.6010701@dev.mellanox.co.il> <20090211140703.GK27920@sashak.voltaire.com>
Message-ID: <49930B77.5020803@morey-chaisemartin.com>

Sasha Khapyorsky wrote:
> On 15:26 Wed 11 Feb , Yevgeny Kliteynik wrote:
>
>> Bottom line - I'm OK with both options (slightly leaning toward IO),
>> as long as it is described well enough in the help message and in man :)
>>
>
> Ok, no clear opinions. Nicolas, it is your decision about name :)
>
> Sasha
>
My laziness says IO is better, because that's what is already written in my code and documentation, but it's not much work to change anyway.
If I don't get clearer opinions by 10am tomorrow (GMT+1), I'll repost the patches with io_guid_file.
Feel free to suggest other ideas before then.

Nicolas

From Jeffrey.C.Becker at nasa.gov Wed Feb 11 09:58:12 2009
From: Jeffrey.C.Becker at nasa.gov (Jeff Becker)
Date: Wed, 11 Feb 2009 09:58:12 -0800
Subject: [ofa-general] OFED (EWG) Feb 9, 2009 meeting minutes
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01BDAC2D@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD01BDAC2D@mtlexch01.mtl.com>
Message-ID: <499311B4.4090607@nasa.gov>

Hi Tziporet

Tziporet Koren wrote:
>
> These are the OFED (EWG) meeting minutes for Feb 09 on OFED 1.4.1 release
>
> Meeting Summary:
>
> ==============
>
> 1. Agreed on 1.4.1 release schedule - GA is planned for April 7
>
> 2. Reviewed 1.4.1 status
>
> 3.
Reviewed Sonoma agenda
>
> Details:
>
> ======
>
> 1. OFED 1.4.1 schedule:
>
>   o RC1 - Mar 3
>   o RC2 - Mar 17
>   o RC3 - Mar 31
>   o GA - Apr 7
>
> 2. OFED 1.4.1 release status:
>
>   o New OSes:
>       + RH 5.3 - done
>       + SLES 11 - schedule is OK. RC3 already available,
>         need to create backports
>         Tziporet to check with Novell if we can place the
>         sources on the OFA server
>
Thanks to NASA's developing relationship with Novell, I got access to the SLES11 RC3 ISOs. I'm downloading them now, and will start on the backports when that's done.

-jeff

>   o Open MPI - we will take 1.3.1
>   o RDS with iWARP support - good progress
>   o NFS/RDMA backports - RHEL 5.2 should be ready in 2 weeks
>   o Critical bug fixes
>     As far as I know these are the critical bugs that should
>     be fixed:
>
>     1383 blo jackm at mellanox.co.il Local
>     protection error on transmit from ipoib datagram to… - on work
>
>     1287 maj jackm at mellanox.co.il IPoIB datagram
>     mode initial packet loss - we will check if we can fix this
>
>   o Need to add 1.4.1 to bugzilla
>
> 3. Sonoma updates from Bill Boas:
>
> Bill sent the agenda - and we reviewed it in the meeting
>
> Comments and suggestions should be sent to Bill.
>
> There is a need for more PR - if companies are willing to put the
> Press release on their web site
>
>
> Tziporet
>

From rdreier at cisco.com Wed Feb 11 10:12:09 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Feb 2009 10:12:09 -0800
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <4992F26A.4030800@opengridcomputing.com> (Steve Wise's message of "Wed, 11 Feb 2009 09:44:42 -0600")
References: <20090210184448.22891.31130.stgit@dell3.ogc.int> <4992F26A.4030800@opengridcomputing.com>
Message-ID:

> Note that wr->sg_list[i].addr was being cast to a u32. That was wrong.

Is it possible for the page to be bigger than 4GB? If so then yes you might be chopping off high-order bits or something.
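The arithmetic behind this exchange can be checked in user space. The sketch below is plain C, not driver code. uint64_t stands in for the kernel's u64, and offset_mod_u32() and offset_mask_u64() are hypothetical names for the old and new expressions.

For a power-of-two page size, the remainder and the mask give the same result. The u32 cast only changes the result once 12 + page_size exceeds 32, that is, for pages larger than 4GB. The shift uses a 64-bit constant, which sidesteps the 32-bit 1UL overflow raised elsewhere in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Old expression: truncate the address to 32 bits, then take the
 * remainder modulo the page size, 2^(12 + page_size) bytes. */
static uint64_t offset_mod_u32(uint64_t addr, unsigned int page_size)
{
	unsigned int shift = 12 + page_size;

	assert(shift < 64);	/* keep the 64-bit shift well defined */
	return (uint64_t)(uint32_t)addr % (UINT64_C(1) << shift);
}

/* New expression: keep all 64 bits and mask off everything above the
 * page offset. For a power-of-two divisor this equals the remainder. */
static uint64_t offset_mask_u64(uint64_t addr, unsigned int page_size)
{
	unsigned int shift = 12 + page_size;

	assert(shift < 64);
	return addr & ((UINT64_C(1) << shift) - 1);
}
```

For example, with addr = 0xffffffff00001234 both functions return 0x234 for page_size 0, and they still agree at page_size 20. They first disagree at page_size 21, returning 0x1234 and 0x100001234 respectively.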
Anyway please send me this change as a separate patch with a changelog explaining that you're avoiding the div etc.... I don't want to roll it in with the other unrelated fix (which changes code that was never upstream anyway).

From swise at opengridcomputing.com Wed Feb 11 10:32:47 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 11 Feb 2009 12:32:47 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To:
References: <20090210184448.22891.31130.stgit@dell3.ogc.int> <4992F26A.4030800@opengridcomputing.com>
Message-ID: <499319CF.6050204@opengridcomputing.com>

Roland Dreier wrote:
> > Note that wr->sg_list[i].addr was being cast to a u32. That was wrong.
>
> Is it possible for the page to be bigger than 4GB? If so then yes you
> might be chopping off high-order bits or something.
>
Yes, it is possible.

A MR can be created with an iov_base of, say, 0xffffffff00000000.

Then any sge.addr entries would be the iov_base + any offset.

> Anyway please send me this change as a separate patch with a changelog
> explaining that you're avoiding the div etc.... I don't want to roll it
> in with the other unrelated fix (which changes code that was never
> upstream anyway).
>
Will do.

So you are handling the offset patch that will make it u64 and remove the mod usage, correct?

I will post a new patch with just this send change.

Steve.

From rdreier at cisco.com Wed Feb 11 10:36:01 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Feb 2009 10:36:01 -0800
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To: <499319CF.6050204@opengridcomputing.com> (Steve Wise's message of "Wed, 11 Feb 2009 12:32:47 -0600")
References: <20090210184448.22891.31130.stgit@dell3.ogc.int> <4992F26A.4030800@opengridcomputing.com> <499319CF.6050204@opengridcomputing.com>
Message-ID:

> > Is it possible for the page to be bigger than 4GB? If so then yes you
> > might be chopping off high-order bits or something.
> Yes it is possible.
>
> A MR can be created with an iov_base of say 0xffffffff00000000.
>
> Then any sge.addr entries would be the iov_base + any offset.

But the code we're talking about is:

	/* to in the WQE == the offset into the page */
	wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
			(1UL << (12 + page_size[i])));

so it seems the top address bits don't matter unless page_size[i] is at least 20 -- in which case using 1UL to shift overflows on 32 bits anyway...

> So you are handling the offset patch that will make it u64 and remove
> the mod usage, correct?

Yeah, I rolled the fix into the "offset needs to be u64" patch; it should be in linux-next by now (or at least in my for-next branch).

 - R.

From sashak at voltaire.com Wed Feb 11 10:47:17 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 20:47:17 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm/osm_console.c : Added dump_portguid function to console to generate a list of port guids matching one or more regexps
In-Reply-To: <499135E1.1080307@ext.bull.net>
References: <499135E1.1080307@ext.bull.net>
Message-ID: <20090211184717.GO5910@sashak.voltaire.com>

Hi Nicolas,

On 09:08 Tue 10 Feb , Nicolas Morey Chaisemartin wrote:
> This adds a dump_portguid functionality to the OpenSM console, which makes it
> really easy to generate cn_guid_file, root_guid_file and such
> by dumping into a file all port GUIDs whose nodedesc matches at least one
> of the provided regexps
>
> Signed-off-by: Nicolas Morey-Chaisemartin
>
> ---
>
> Repost without exit_after_run flag, active sleep init loop and indented.
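The matching step of dump_portguid comes down to running POSIX regcomp() and regexec() over each node description. Below is a minimal standalone sketch that uses the same REG_EXTENDED | REG_NOSUB flags. struct pattern_set, MAX_PATTERNS and the pattern_set_* helpers are hypothetical and are not part of OpenSM.

One difference from the patch is deliberate. As quoted in the patch, an expression is still linked into the list when regcomp() fails, even though the message says it is being skipped. This sketch drops such patterns instead, and it releases compiled patterns with regfree() when done.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PATTERNS 8	/* hypothetical fixed limit for this sketch */

/* Successfully compiled patterns only. */
struct pattern_set {
	regex_t re[MAX_PATTERNS];
	size_t count;
};

/*
 * Compile each pattern with the same flags as the patch. A pattern that
 * fails to compile is skipped: its slot is reused by the next pattern.
 * Returns the number of patterns kept.
 */
static size_t pattern_set_init(struct pattern_set *set,
			       const char *const *patterns, size_t n)
{
	set->count = 0;
	for (size_t i = 0; i < n && set->count < MAX_PATTERNS; i++)
		if (regcomp(&set->re[set->count], patterns[i],
			    REG_EXTENDED | REG_NOSUB) == 0)
			set->count++;
	return set->count;
}

/* True if desc matches at least one pattern (unanchored search). */
static bool pattern_set_match(const struct pattern_set *set, const char *desc)
{
	for (size_t i = 0; i < set->count; i++)
		if (regexec(&set->re[i], desc, 0, NULL, 0) == 0)
			return true;
	return false;
}

/* Release every compiled pattern. */
static void pattern_set_free(struct pattern_set *set)
{
	for (size_t i = 0; i < set->count; i++)
		regfree(&set->re[i]);
	set->count = 0;
}
```

As an example, take the patterns "^compute[0-9]+", "[abc" and "io". The invalid second pattern is skipped. "compute12 HCA-1" and "lustre-io3 HCA-1" then match, while "login1 HCA-1" does not. All sample names are made up.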
> > opensm/opensm/osm_console.c | 105 > +++++++++++++++++++++++++++++++++++++++++++ > 1 files changed, 105 insertions(+), 0 deletions(-) > > > diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c > index c6e8e59..5fbcd43 100644 > --- a/opensm/opensm/osm_console.c > +++ b/opensm/opensm/osm_console.c > @@ -42,6 +42,7 @@ > #include > #include > #include > +#include > #ifdef ENABLE_OSM_CONSOLE_SOCKET > #include > #endif > @@ -1173,6 +1174,109 @@ static void version_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > } > > /* more parse routines go here */ > +typedef struct _regexp_list { > + regex_t exp; > + struct _regexp_list *next; > +} regexp_list_t; > + > +static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > +{ > + cl_qmap_t *p_port_guid_tbl; > + osm_port_t *p_port; > + osm_port_t *p_next_port; > + > + regexp_list_t *p_head_regexp = NULL; > + regexp_list_t *p_regexp; > + > + /* Option variables */ > + char *p_cmd = NULL; > + FILE *output = out; > + > + /* Read commande line */ > + > + while (1) { > + p_cmd = next_token(p_last); > + if (p_cmd) { > + if (strcmp(p_cmd, "file") == 0) { > + p_cmd = next_token(p_last); > + if (p_cmd) { > + output = fopen(p_cmd, "w+"); > + if (output == NULL) { > + fprintf(out, > + "Could not open file %s: %s\n", > + p_cmd, strerror(errno)); > + output = out; > + } > + } else > + fprintf(out, "No file name passed\n"); > + } else { > + p_regexp = malloc(sizeof(*p_regexp)); > + if (regcomp > + (&(p_regexp->exp), p_cmd, > + REG_NOSUB | REG_EXTENDED) != 0) { > + fprintf(out, > + "Couldn't parse regular expression %s. Skipping it.\n", > + p_cmd); > + } > + p_regexp->next = p_head_regexp; > + p_head_regexp = p_regexp; > + } > + } else > + break; /* No more tokens */ > + > + } > + > + /* Check we have at least one expression to match */ > + if (p_head_regexp == NULL) { > + fprintf(out, "No valid expression provided. 
Aborting\n"); > + return; > + } > + > + cl_spinlock_release(&p_osm->sm.state_lock); What is this cl_spinlock_release()? Typo? > + if (p_osm->sm.p_subn->need_update != 0) { > + fprintf(out, "Subnet is not ready yet. Try again later.\n"); > + return; > + } > + > + /* Subnet doesn't need to be updated so we can carry on */ > + > + CL_PLOCK_EXCL_ACQUIRE(p_osm->sm.p_lock); > + p_port_guid_tbl = &(p_osm->sm.p_subn->port_guid_tbl); Do we really need exclusive locking here? port_guid_table content is rad-only, I guess "read-only" lock (CL_PLOCK_ACQUIRE()) should be enough. The rest looks fine for me. Sasha > + > + p_next_port = (osm_port_t *) cl_qmap_head(p_port_guid_tbl); > + while (p_next_port != (osm_port_t *) cl_qmap_end(p_port_guid_tbl)) { > + > + p_port = p_next_port; > + p_next_port = > + (osm_port_t *) cl_qmap_next(&p_next_port->map_item); > + > + for (p_regexp = p_head_regexp; p_regexp != NULL; > + p_regexp = p_regexp->next) > + if (regexec > + (&(p_regexp->exp), p_port->p_node->print_desc, 0, > + NULL, 0) == 0) > + fprintf(output, "0x%" PRIxLEAST64 "\n", > + cl_ntoh64(p_port->p_physp->port_guid)); > + } > + > + CL_PLOCK_RELEASE(p_osm->sm.p_lock); > + if (output != out) > + fclose(output); > + > +} > + > +static void help_dump_portguid(FILE * out, int detail) > +{ > + fprintf(out, > + "dump_portguid [file filename] regexp1 [regexp2 [regexp3 ...]] -- Dump port GUID matching a regexp \n"); > + if (detail) { > + fprintf(out, > + "getguidgetguid -- Dump all the port GUID whom node_desc matches one of the provided regexp\n"); > + fprintf(out, > + " [file filename] -- Send the port GUID list to the specified file instead of regular output\n"); > + } > + > +} > > static const struct command console_cmds[] = { > {"help", &help_command, &help_parse}, > @@ -1192,6 +1296,7 @@ static const struct command console_cmds[] = { > #ifdef ENABLE_OSM_PERF_MGR > {"perfmgr", &help_perfmgr, &perfmgr_parse}, > #endif /* ENABLE_OSM_PERF_MGR */ > + {"dump_portguid", &help_dump_portguid, 
&dump_portguid_parse}, > {NULL, NULL, NULL} /* end of array */ > }; > >

From swise at opengridcomputing.com Wed Feb 11 10:44:45 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 11 Feb 2009 12:44:45 -0600
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Remove modulo math.
In-Reply-To:
References: <20090210184448.22891.31130.stgit@dell3.ogc.int> <4992F26A.4030800@opengridcomputing.com> <499319CF.6050204@opengridcomputing.com>
Message-ID: <49931C9D.2090604@opengridcomputing.com>

Roland Dreier wrote:
> > > Is it possible for the page to be bigger than 4GB? If so then yes you
> > > might be chopping off high-order bits or something.
>
> > Yes it is possible.
> >
> > A MR can be created with an iov_base of say 0xffffffff00000000.
> >
> > Then any sge.addr entries would be the iov_base + any offset.
>
> But the code we're talking about is:
>
> 	/* to in the WQE == the offset into the page */
> 	wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
> 			(1UL << (12 + page_size[i])));
>
> so it seems the top address bits don't matter unless page_size[i] is at
> least 20 -- in which case using 1UL to shift overflows on 32 bits anyway...
>
Yes, yes... you're right. This code is really just saving the offset in a page.

I'll send a new patch.

From sashak at voltaire.com Wed Feb 11 11:54:42 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 11 Feb 2009 21:54:42 +0200
Subject: [ofa-general] [PATCH] infiniband-diags/saquery: fix types and some cleanup
Message-ID: <20090211195442.GP5910@sashak.voltaire.com>

Fix types - mostly ib_net*_t -> uint*_t conversion. Use host byte order SA attributes from mad.h (instead of ib_types.h). Fix function prototypes and return value types. Remove osm* stubs. Remove unused 'offset' argument in get_any_records() and get_all_records() functions.
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 388 ++++++++++++++++++---------------------- 1 files changed, 171 insertions(+), 217 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 5b66f93..a94a015 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -50,24 +50,22 @@ #include #include #include -#include #include #include "ibdiag_common.h" -struct sa_bind_handle { +struct bind_handle { int fd, agent; ib_portid_t dport; }; -struct sa_result { +struct query_res { int status; unsigned result_cnt; void *p_result_madw; }; -#define osmv_query_res_t struct sa_result -#define osm_bind_handle_t struct sa_bind_handle * +typedef struct bind_handle * bind_handle_t; struct query_params { ib_gid_t sgid, dgid, gid, mgid; @@ -87,22 +85,22 @@ struct query_params { struct query_cmd { const char *name, *alias; - ib_net16_t query_type; + uint16_t query_type; const char *usage; - int (*handler) (const struct query_cmd * q, osm_bind_handle_t h, + int (*handler) (const struct query_cmd * q, bind_handle_t h, struct query_params *p, int argc, char *argv[]); }; static char *node_name_map_file = NULL; static nn_map_t *node_name_map = NULL; -static ib_net64_t smkey = CL_HTON64(1); +static uint64_t smkey = 1; /** * Declare some globals because I don't want this to be too complex. 
*/ #define MAX_PORTS (8) #define DEFAULT_SA_TIMEOUT_MS (1000) -osmv_query_res_t result; +static struct query_res result; enum { ALL, @@ -115,14 +113,14 @@ enum { } node_print_desc = ALL; char *requested_name = NULL; -ib_net16_t requested_lid = 0; +uint16_t requested_lid = 0; int requested_lid_flag = 0; -ib_net64_t requested_guid = 0; +uint64_t requested_guid = 0; int requested_guid_flag = 0; -static int sa_query(struct sa_bind_handle *h, uint8_t method, - ib_net16_t attr, ib_net32_t mod, ib_net64_t comp_mask, - ib_net64_t sm_key, void *data) +static int sa_query(struct bind_handle *h, uint8_t method, + uint16_t attr, uint32_t mod, uint64_t comp_mask, + uint64_t sm_key, void *data) { ib_rpc_t rpc; void *umad, *mad; @@ -131,9 +129,9 @@ static int sa_query(struct sa_bind_handle *h, uint8_t method, memset(&rpc, 0, sizeof(rpc)); rpc.mgtclass = IB_SA_CLASS; rpc.method = method; - rpc.attr.id = cl_ntoh16(attr); - rpc.attr.mod = cl_ntoh32(mod); - rpc.mask = cl_ntoh64(comp_mask); + rpc.attr.id = attr; + rpc.attr.mod = mod; + rpc.mask = comp_mask; rpc.datasz = IB_SA_DATA_SIZE; rpc.dataoffs = IB_SA_DATA_OFFS; @@ -143,8 +141,7 @@ static int sa_query(struct sa_bind_handle *h, uint8_t method, mad_build_pkt(umad, &rpc, &h->dport, NULL, data); - /* SA SM_Key (36/8) - temporary done using IB_MAD_MKEY_F */ - mad_set_field64(umad_get_mad(umad), 12, IB_MAD_MKEY_F, cl_hton64(sm_key)); + mad_set_field64(umad_get_mad(umad), 0, IB_SA_MKEY_F, sm_key); if (ibdebug > 1) xdump(stdout, "SA Request:\n", umad_get_mad(umad), len); @@ -189,14 +186,12 @@ recv_mad: return 0; } -static void *osmv_get_query_result(void *mad, unsigned i) +static void *get_query_rec(void *mad, unsigned i) { int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F); return mad + IB_SA_DATA_OFFS + i * (offset << 3); } -#define osmv_get_query_node_rec(mad, i) osmv_get_query_result(mad, i) - static unsigned valid_gid(ib_gid_t *gid) { ib_gid_t zero_gid = { }; @@ -456,7 +451,7 @@ static void dump_multicast_member_record(void 
*data) */ for (i = 0; i < result.result_cnt; i++) { ib_node_record_t *nr = - osmv_get_query_node_rec(result.p_result_madw, i); + get_query_rec(result.p_result_madw, i); if (nr->node_info.port_guid == p_mcmr->port_gid.unicast.interface_id) { node_name = @@ -761,11 +756,11 @@ static void dump_one_mft_record(void *data) printf("\n"); } -static void dump_results(osmv_query_res_t * r, void (*dump_func) (void *)) +static void dump_results(struct query_res *r, void (*dump_func) (void *)) { int i; for (i = 0; i < r->result_cnt; i++) { - void *data = osmv_get_query_result(r->p_result_madw, i); + void *data = get_query_rec(r->p_result_madw, i); dump_func(data); } } @@ -781,13 +776,12 @@ static void return_mad(void) /** * Get any record(s) */ -static ib_api_status_t -get_any_records(osm_bind_handle_t h, - ib_net16_t attr_id, ib_net32_t attr_mod, ib_net64_t comp_mask, - void *attr, ib_net16_t attr_offset, ib_net64_t sm_key) +static int get_any_records(bind_handle_t h, + uint16_t attr_id, uint32_t attr_mod, + ib_net64_t comp_mask, void *attr, uint64_t sm_key) { int ret = sa_query(h, IB_MAD_METHOD_GET_TABLE, attr_id, attr_mod, - comp_mask, sm_key, attr); + cl_ntoh64(comp_mask), sm_key, attr); if (ret) { fprintf(stderr, "Query SA failed: %s\n", ib_get_err_str(ret)); return ret; @@ -805,30 +799,27 @@ get_any_records(osm_bind_handle_t h, /** * Get all the records available for requested query type. */ -static ib_api_status_t get_all_records(osm_bind_handle_t h, ib_net16_t query_id, ib_net16_t attr_offset, int trusted) +static int get_all_records(bind_handle_t h, uint16_t attr_id, int trusted) { - return get_any_records(h, query_id, 0, 0, NULL, attr_offset, - trusted ? smkey : 0); + return get_any_records(h, attr_id, 0, 0, NULL, trusted ? 
smkey : 0); } /** * return the lid from the node descriptor (name) supplied */ -static ib_api_status_t -get_lid_from_name(osm_bind_handle_t h, const char *name, ib_net16_t * lid) +static int +get_lid_from_name(bind_handle_t h, const char *name, uint16_t* lid) { - int i = 0; ib_node_record_t *node_record = NULL; ib_node_info_t *p_ni = NULL; - ib_net16_t attr_offset = ib_get_attr_offset(sizeof(*node_record)); - ib_api_status_t status; + int i = 0, ret; - status = get_all_records(h, IB_MAD_ATTR_NODE_RECORD, attr_offset, 0); - if (status != IB_SUCCESS) - return (status); + ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0); + if (ret) + return ret; for (i = 0; i < result.result_cnt; i++) { - node_record = osmv_get_query_node_rec(result.p_result_madw, i); + node_record = get_query_rec(result.p_result_madw, i); p_ni = &(node_record->node_info); if (name && strncmp(name, (char *)node_record->node_desc.description, @@ -839,25 +830,25 @@ get_lid_from_name(osm_bind_handle_t h, const char *name, ib_net16_t * lid) } } return_mad(); - return (status); + return 0; } -static ib_net16_t get_lid(osm_bind_handle_t h, const char *name) +static uint16_t get_lid(bind_handle_t h, const char *name) { - ib_net16_t rc_lid = 0; + uint16_t rc_lid = 0; if (!name) - return (0); + return 0; if (isalpha(name[0])) assert(get_lid_from_name(h, name, &rc_lid) == IB_SUCCESS); else rc_lid = atoi(name); if (rc_lid == 0) fprintf(stderr, "Failed to find lid for \"%s\"\n", name); - return (rc_lid); + return rc_lid; } -static int parse_lid_and_ports(osm_bind_handle_t h, +static int parse_lid_and_ports(bind_handle_t h, char *str, int *lid, int *port1, int *port2) { char *p, *e; @@ -920,38 +911,32 @@ static int parse_lid_and_ports(osm_bind_handle_t h, /* * Get the portinfo records available with IsSM or IsSMdisabled CapabilityMask bit on. 
*/ -static ib_api_status_t get_issm_records(osm_bind_handle_t h, - ib_net32_t capability_mask) +static int get_issm_records(bind_handle_t h, ib_net32_t capability_mask) { ib_portinfo_record_t attr; memset(&attr, 0, sizeof(attr)); attr.port_info.capability_mask = capability_mask; - return get_any_records(h, IB_MAD_ATTR_PORTINFO_RECORD, - cl_hton32(1 << 31), IB_PIR_COMPMASK_CAPMASK, - &attr, - ib_get_attr_offset(sizeof(ib_portinfo_record_t)), - 0); + return get_any_records(h, IB_SA_ATTR_PORTINFORECORD, 1 << 31, + IB_PIR_COMPMASK_CAPMASK, &attr, 0); } -static ib_api_status_t print_node_records(osm_bind_handle_t h) +static int print_node_records(bind_handle_t h) { - int i = 0; - ib_node_record_t *node_record = NULL; - ib_net16_t attr_offset = ib_get_attr_offset(sizeof(*node_record)); - ib_api_status_t status; + int i = 0, ret; - status = get_all_records(h, IB_MAD_ATTR_NODE_RECORD, attr_offset, 0); - if (status != IB_SUCCESS) - return (status); + ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0); + if (ret) + return ret; if (node_print_desc == ALL_DESC) { printf(" LID \"name\"\n"); printf("================\n"); } for (i = 0; i < result.result_cnt; i++) { - node_record = osmv_get_query_node_rec(result.p_result_madw, i); + ib_node_record_t *node_record; + node_record = get_query_rec(result.p_result_madw, i); if (node_print_desc == ALL_DESC) { print_node_desc(node_record); } else if (node_print_desc == NAME_OF_LID) { @@ -977,13 +962,13 @@ static ib_api_status_t print_node_records(osm_bind_handle_t h) } } return_mad(); - return (status); + return ret; } -static ib_api_status_t get_print_class_port_info(osm_bind_handle_t h) +static int get_print_class_port_info(bind_handle_t h) { - int ret = sa_query(h, IB_MAD_METHOD_GET, IB_MAD_ATTR_CLASS_PORT_INFO, - 0, 0, 0, NULL); + int ret = sa_query(h, IB_MAD_METHOD_GET, CLASS_PORT_INFO, 0, 0, + 0, NULL); if (ret) { fprintf(stderr, "ERROR: Query SA failed: %s\n", ib_get_err_str(ret)); @@ -999,12 +984,12 @@ static ib_api_status_t 
get_print_class_port_info(osm_bind_handle_t h) return ret; } -static int query_path_records(const struct query_cmd *q, osm_bind_handle_t h, +static int query_path_records(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) { ib_path_rec_t pr; ib_net64_t comp_mask = 0; - ib_api_status_t status; + int ret; uint32_t flow = 0; uint16_t qos_class = 0; uint8_t reversible = 0; @@ -1029,17 +1014,16 @@ static int query_path_records(const struct query_cmd *q, osm_bind_handle_t h, CHECK_AND_SET_VAL_AND_SEL(p->rate, pr.rate, PR, RATE, SELEC); CHECK_AND_SET_VAL_AND_SEL(p->pkt_life, pr.pkt_life, PR, PKTLIFETIME, SELEC); - status = get_any_records(h, IB_MAD_ATTR_PATH_RECORD, 0, comp_mask, - &pr, ib_get_attr_offset(sizeof(pr)), 0); - if (status != IB_SUCCESS) - return (status); + ret = get_any_records(h, IB_SA_ATTR_PATHRECORD, 0, comp_mask, &pr, 0); + if (ret) + return ret; dump_results(&result, dump_path_record); return_mad(); - return (status); + return ret; } -static ib_api_status_t print_issm_records(osm_bind_handle_t h) +static ib_api_status_t print_issm_records(bind_handle_t h) { ib_api_status_t status; @@ -1064,23 +1048,19 @@ static ib_api_status_t print_issm_records(osm_bind_handle_t h) return (status); } -static ib_api_status_t print_multicast_member_records(osm_bind_handle_t h) +static int print_multicast_member_records(bind_handle_t h) { - osmv_query_res_t mc_group_result; - ib_api_status_t status; + struct query_res mc_group_result; + int ret; - status = get_all_records(h, IB_MAD_ATTR_MCMEMBER_RECORD, - ib_get_attr_offset(sizeof(ib_member_rec_t)), - 1); - if (status != IB_SUCCESS) - return (status); + ret = get_all_records(h, IB_SA_ATTR_MCRECORD, 1); + if (ret) + return ret; mc_group_result = result; - status = get_all_records(h, IB_MAD_ATTR_NODE_RECORD, - ib_get_attr_offset(sizeof(ib_node_record_t)), - 0); - if (status != IB_SUCCESS) + ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0); + if (ret) goto return_mc; 
dump_results(&mc_group_result, dump_multicast_member_record); @@ -1090,37 +1070,32 @@ return_mc: if (mc_group_result.p_result_madw) free(mc_group_result.p_result_madw - umad_size()); - return (status); + return ret; } -static ib_api_status_t print_multicast_group_records(osm_bind_handle_t h) +static int print_multicast_group_records(bind_handle_t h) { - ib_api_status_t status; - - status = get_all_records(h, IB_MAD_ATTR_MCMEMBER_RECORD, - ib_get_attr_offset(sizeof(ib_member_rec_t)), - 0); - if (status != IB_SUCCESS) - return (status); + int ret = get_all_records(h, IB_SA_ATTR_MCRECORD, 0); + if (ret) + return ret; dump_results(&result, dump_multicast_group_record); return_mad(); - return (status); + return ret; } -static int query_class_port_info(const struct query_cmd *q, osm_bind_handle_t h, +static int query_class_port_info(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) { return get_print_class_port_info(h); } -static int query_node_records(const struct query_cmd *q, osm_bind_handle_t h, +static int query_node_records(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) { ib_node_record_t nr; ib_net64_t comp_mask = 0; - int lid = 0; - ib_api_status_t status; + int lid = 0, ret; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, NULL, NULL); @@ -1128,10 +1103,9 @@ static int query_node_records(const struct query_cmd *q, osm_bind_handle_t h, memset(&nr, 0, sizeof(nr)); CHECK_AND_SET_VAL(lid, 16, 0, nr.lid, NR, LID); - status = get_any_records(h, IB_MAD_ATTR_NODE_RECORD, 0, comp_mask, - &nr, ib_get_attr_offset(sizeof(nr)), 0); - if (status != IB_SUCCESS) - return status; + ret = get_any_records(h, IB_SA_ATTR_NODERECORD, 0, comp_mask, &nr, 0); + if (ret) + return ret; dump_results(&result, dump_node_record); return_mad(); @@ -1140,13 +1114,12 @@ static int query_node_records(const struct query_cmd *q, osm_bind_handle_t h, } static int query_portinfo_records(const struct query_cmd 
*q, - osm_bind_handle_t h, struct query_params *p, + bind_handle_t h, struct query_params *p, int argc, char *argv[]) { ib_portinfo_record_t pir; ib_net64_t comp_mask = 0; - int lid = 0, port = -1; - ib_api_status_t status; + int lid = 0, port = -1, ret; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &port, NULL); @@ -1155,10 +1128,10 @@ static int query_portinfo_records(const struct query_cmd *q, CHECK_AND_SET_VAL(lid, 16, 0, pir.lid, PIR, LID); CHECK_AND_SET_VAL(port, 8, -1, pir.port_num, PIR, PORTNUM); - status = get_any_records(h, IB_MAD_ATTR_PORTINFO_RECORD, 0, comp_mask, - &pir, ib_get_attr_offset(sizeof(pir)), 0); - if (status != IB_SUCCESS) - return status; + ret = get_any_records(h, IB_SA_ATTR_PORTINFORECORD, 0, comp_mask, + &pir, 0); + if (ret) + return ret; dump_results(&result, dump_one_portinfo_record); return_mad(); @@ -1167,12 +1140,12 @@ static int query_portinfo_records(const struct query_cmd *q, } static int query_mcmember_records(const struct query_cmd *q, - osm_bind_handle_t h, struct query_params *p, + bind_handle_t h, struct query_params *p, int argc, char *argv[]) { ib_member_rec_t mr; ib_net64_t comp_mask = 0; - ib_api_status_t status; + int ret; uint32_t flow = 0; uint8_t sl = 0, hop = 0, scope = 0; @@ -1195,57 +1168,46 @@ static int query_mcmember_records(const struct query_cmd *q, mr.scope_state |= scope << 4; CHECK_AND_SET_VAL(p->proxy_join, 8, -1, mr.proxy_join, MCR, PROXY); - status = get_any_records(h, IB_MAD_ATTR_MCMEMBER_RECORD, 0, comp_mask, - &mr, ib_get_attr_offset(sizeof(mr)), smkey); - if (status != IB_SUCCESS) - return status; + ret = get_any_records(h, IB_SA_ATTR_MCRECORD, 0, comp_mask, &mr, smkey); + if (ret) + return ret; dump_results(&result, dump_one_mcmember_record); return_mad(); - return status; + return 0; } -static int query_service_records(const struct query_cmd *q, osm_bind_handle_t h, +static int query_service_records(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) 
{ - ib_net16_t attr_offset = - ib_get_attr_offset(sizeof(ib_service_record_t)); - ib_api_status_t status; - - status = get_all_records(h, IB_MAD_ATTR_SERVICE_RECORD, attr_offset, 0); - if (status != IB_SUCCESS) - return (status); + int ret = get_all_records(h, IB_SA_ATTR_SERVICERECORD, 0); + if (ret) + return ret; dump_results(&result, dump_service_record); return_mad(); - return (status); + return 0; } static int query_informinfo_records(const struct query_cmd *q, - osm_bind_handle_t h, struct query_params *p, + bind_handle_t h, struct query_params *p, int argc, char *argv[]) { - ib_net16_t attr_offset = - ib_get_attr_offset(sizeof(ib_inform_info_record_t)); - ib_api_status_t status; - - status = - get_all_records(h, IB_MAD_ATTR_INFORM_INFO_RECORD, attr_offset, 0); - if (status != IB_SUCCESS) - return (status); + int ret = get_all_records(h, IB_SA_ATTR_INFORMINFORECORD, 0); + if (ret) + return ret; dump_results(&result, dump_inform_info_record); return_mad(); - return (status); + return 0; } -static int query_link_records(const struct query_cmd *q, osm_bind_handle_t h, +static int query_link_records(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) { ib_link_record_t lr; ib_net64_t comp_mask = 0; - int from_lid = 0, to_lid = 0, from_port = -1, to_port = -1; - ib_api_status_t status; + int from_lid = 0, to_lid = 0, from_port = -1, to_port = -1, ret; if (argc > 0) parse_lid_and_ports(h, argv[0], &from_lid, &from_port, NULL); @@ -1259,23 +1221,21 @@ static int query_link_records(const struct query_cmd *q, osm_bind_handle_t h, CHECK_AND_SET_VAL(to_lid, 16, 0, lr.to_lid, LR, TO_LID); CHECK_AND_SET_VAL(to_port, 8, -1, lr.to_port_num, LR, TO_PORT); - status = get_any_records(h, IB_MAD_ATTR_LINK_RECORD, 0, comp_mask, - &lr, ib_get_attr_offset(sizeof(lr)), 0); - if (status != IB_SUCCESS) - return status; + ret = get_any_records(h, IB_SA_ATTR_LINKRECORD, 0, comp_mask, &lr, 0); + if (ret) + return ret; dump_results(&result, 
dump_one_link_record); return_mad(); - return status; + return 0; } -static int query_sl2vl_records(const struct query_cmd *q, osm_bind_handle_t h, +static int query_sl2vl_records(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) { ib_slvl_table_record_t slvl; ib_net64_t comp_mask = 0; - int lid = 0, in_port = -1, out_port = -1; - ib_api_status_t status; + int lid = 0, in_port = -1, out_port = -1, ret; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &in_port, &out_port); @@ -1285,23 +1245,22 @@ static int query_sl2vl_records(const struct query_cmd *q, osm_bind_handle_t h, CHECK_AND_SET_VAL(in_port, 8, -1, slvl.in_port_num, SLVL, IN_PORT); CHECK_AND_SET_VAL(out_port, 8, -1, slvl.out_port_num, SLVL, OUT_PORT); - status = get_any_records(h, IB_MAD_ATTR_SLVL_RECORD, 0, comp_mask, - &slvl, ib_get_attr_offset(sizeof(slvl)), 0); - if (status != IB_SUCCESS) - return status; + ret = get_any_records(h, IB_SA_ATTR_SL2VLTABLERECORD, 0, comp_mask, + &slvl, 0); + if (ret) + return ret; dump_results(&result, dump_one_slvl_record); return_mad(); - return status; + return 0; } -static int query_vlarb_records(const struct query_cmd *q, osm_bind_handle_t h, +static int query_vlarb_records(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) { ib_vl_arb_table_record_t vlarb; ib_net64_t comp_mask = 0; - int lid = 0, port = -1, block = -1; - ib_api_status_t status; + int lid = 0, port = -1, block = -1, ret; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &port, &block); @@ -1311,24 +1270,23 @@ static int query_vlarb_records(const struct query_cmd *q, osm_bind_handle_t h, CHECK_AND_SET_VAL(port, 8, -1, vlarb.port_num, VLA, OUT_PORT); CHECK_AND_SET_VAL(block, 8, -1, vlarb.block_num, VLA, BLOCK); - status = get_any_records(h, IB_MAD_ATTR_VLARB_RECORD, 0, comp_mask, - &vlarb, ib_get_attr_offset(sizeof(vlarb)), 0); - if (status != IB_SUCCESS) - return status; + ret = get_any_records(h, 
IB_SA_ATTR_VLARBTABLERECORD, 0, comp_mask, + &vlarb, 0); + if (ret) + return ret; dump_results(&result, dump_one_vlarb_record); return_mad(); - return status; + return 0; } static int query_pkey_tbl_records(const struct query_cmd *q, - osm_bind_handle_t h, struct query_params *p, + bind_handle_t h, struct query_params *p, int argc, char *argv[]) { ib_pkey_table_record_t pktr; ib_net64_t comp_mask = 0; - int lid = 0, port = -1, block = -1; - ib_api_status_t status; + int lid = 0, port = -1, block = -1, ret; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &port, &block); @@ -1338,23 +1296,22 @@ static int query_pkey_tbl_records(const struct query_cmd *q, CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT); CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK); - status = get_any_records(h, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, comp_mask, - &pktr, ib_get_attr_offset(sizeof(pktr)), smkey); - if (status != IB_SUCCESS) - return status; + ret = get_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0, comp_mask, + &pktr, smkey); + if (ret) + return ret; dump_results(&result, dump_one_pkey_tbl_record); return_mad(); - return status; + return 0; } -static int query_lft_records(const struct query_cmd *q, osm_bind_handle_t h, +static int query_lft_records(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) { ib_lft_record_t lftr; ib_net64_t comp_mask = 0; - int lid = 0, block = -1; - ib_api_status_t status; + int lid = 0, block = -1, ret; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &block, NULL); @@ -1363,24 +1320,22 @@ static int query_lft_records(const struct query_cmd *q, osm_bind_handle_t h, CHECK_AND_SET_VAL(lid, 16, 0, lftr.lid, LFTR, LID); CHECK_AND_SET_VAL(block, 16, -1, lftr.block_num, LFTR, BLOCK); - status = get_any_records(h, IB_MAD_ATTR_LFT_RECORD, 0, comp_mask, - &lftr, ib_get_attr_offset(sizeof(lftr)), 0); - if (status != IB_SUCCESS) - return status; + ret = get_any_records(h, IB_SA_ATTR_LFTRECORD, 0, 
comp_mask, &lftr, 0); + if (ret) + return ret; dump_results(&result, dump_one_lft_record); return_mad(); - return status; + return 0; } -static int query_mft_records(const struct query_cmd *q, osm_bind_handle_t h, +static int query_mft_records(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) { ib_mft_record_t mftr; ib_net64_t comp_mask = 0; - int lid = 0, block = -1, position = -1; + int lid = 0, block = -1, position = -1, ret; uint16_t pos = 0; - ib_api_status_t status; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &position, &block); @@ -1392,19 +1347,18 @@ static int query_mft_records(const struct query_cmd *q, osm_bind_handle_t h, CHECK_AND_SET_VAL(position, 8, -1, pos, MFTR, POSITION); mftr.position_block_num |= cl_hton16(pos << 12); - status = get_any_records(h, IB_MAD_ATTR_MFT_RECORD, 0, comp_mask, - &mftr, ib_get_attr_offset(sizeof(mftr)), 0); - if (status != IB_SUCCESS) - return status; + ret = get_any_records(h, IB_SA_ATTR_MFTRECORD, 0, comp_mask, &mftr, 0); + if (ret) + return ret; dump_results(&result, dump_one_mft_record); return_mad(); - return status; + return 0; } -static osm_bind_handle_t get_bind_handle(void) +static bind_handle_t get_bind_handle(void) { - static struct sa_bind_handle handle; + static struct bind_handle handle; int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS }; madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2); @@ -1423,7 +1377,7 @@ static osm_bind_handle_t get_bind_handle(void) return &handle; } -static void clean_up(struct sa_bind_handle *h) +static void clean_up(struct bind_handle *h) { umad_unregister(h->fd, h->agent); umad_close_port(h->fd); @@ -1431,31 +1385,31 @@ static void clean_up(struct sa_bind_handle *h) } static const struct query_cmd query_cmds[] = { - {"ClassPortInfo", "CPI", IB_MAD_ATTR_CLASS_PORT_INFO, + {"ClassPortInfo", "CPI", CLASS_PORT_INFO, NULL, query_class_port_info}, - {"NodeRecord", "NR", IB_MAD_ATTR_NODE_RECORD, + {"NodeRecord", "NR", 
IB_SA_ATTR_NODERECORD, "[lid]", query_node_records}, - {"PortInfoRecord", "PIR", IB_MAD_ATTR_PORTINFO_RECORD, + {"PortInfoRecord", "PIR", IB_SA_ATTR_PORTINFORECORD, "[[lid]/[port]]", query_portinfo_records}, - {"SL2VLTableRecord", "SL2VL", IB_MAD_ATTR_SLVL_RECORD, + {"SL2VLTableRecord", "SL2VL", IB_SA_ATTR_SL2VLTABLERECORD, "[[lid]/[in_port]/[out_port]]", query_sl2vl_records}, - {"PKeyTableRecord", "PKTR", IB_MAD_ATTR_PKEY_TBL_RECORD, + {"PKeyTableRecord", "PKTR", IB_SA_ATTR_PKEYTABLERECORD, "[[lid]/[port]/[block]]", query_pkey_tbl_records}, - {"VLArbitrationTableRecord", "VLAR", IB_MAD_ATTR_VLARB_RECORD, + {"VLArbitrationTableRecord", "VLAR", IB_SA_ATTR_VLARBTABLERECORD, "[[lid]/[port]/[block]]", query_vlarb_records}, - {"InformInfoRecord", "IIR", IB_MAD_ATTR_INFORM_INFO_RECORD, + {"InformInfoRecord", "IIR", IB_SA_ATTR_INFORMINFORECORD, NULL, query_informinfo_records}, - {"LinkRecord", "LR", IB_MAD_ATTR_LINK_RECORD, + {"LinkRecord", "LR", IB_SA_ATTR_LINKRECORD, "[[from_lid]/[from_port]] [[to_lid]/[to_port]]", query_link_records}, - {"ServiceRecord", "SR", IB_MAD_ATTR_SERVICE_RECORD, + {"ServiceRecord", "SR", IB_SA_ATTR_SERVICERECORD, NULL, query_service_records}, - {"PathRecord", "PR", IB_MAD_ATTR_PATH_RECORD, + {"PathRecord", "PR", IB_SA_ATTR_PATHRECORD, NULL, query_path_records}, - {"MCMemberRecord", "MCMR", IB_MAD_ATTR_MCMEMBER_RECORD, + {"MCMemberRecord", "MCMR", IB_SA_ATTR_MCRECORD, NULL, query_mcmember_records}, - {"LFTRecord", "LFTR", IB_MAD_ATTR_LFT_RECORD, + {"LFTRecord", "LFTR", IB_SA_ATTR_LFTRECORD, "[[lid]/[block]]", query_lft_records}, - {"MFTRecord", "MFTR", IB_MAD_ATTR_MFT_RECORD, + {"MFTRecord", "MFTR", IB_SA_ATTR_MFTRECORD, "[[mlid]/[position]/[block]]", query_mft_records}, {0} }; @@ -1473,7 +1427,7 @@ static const struct query_cmd *find_query(const char *name) return NULL; } -static const struct query_cmd *find_query_by_type(ib_net16_t type) +static const struct query_cmd *find_query_by_type(uint16_t type) { const struct query_cmd *q; @@ -1494,7 
+1448,7 @@ enum saquery_command { }; static enum saquery_command command = SAQUERY_CMD_QUERY; -static ib_net16_t query_type; +static uint16_t query_type; static char *src_lid, *dst_lid; static int process_opt(void *context, int ch, char *optarg) @@ -1511,7 +1465,7 @@ static int process_opt(void *context, int ch, char *optarg) *dst_lid++ = '\0'; } p->numb_path = 0x7f; - query_type = IB_MAD_ATTR_PATH_RECORD; + query_type = IB_SA_ATTR_PATHRECORD; break; case 2: { @@ -1527,7 +1481,7 @@ static int process_opt(void *context, int ch, char *optarg) free(src_addr); } p->numb_path = 0x7f; - query_type = IB_MAD_ATTR_PATH_RECORD; + query_type = IB_SA_ATTR_PATHRECORD; break; case 3: node_name_map_file = strdup(optarg); @@ -1538,22 +1492,22 @@ static int process_opt(void *context, int ch, char *optarg) fprintf(stderr, "cannot get SM_Key\n"); ibdiag_show_usage(); } - smkey = cl_hton64(strtoull(optarg, NULL, 0)); + smkey = strtoull(optarg, NULL, 0); break; case 'p': - query_type = IB_MAD_ATTR_PATH_RECORD; + query_type = IB_SA_ATTR_PATHRECORD; break; case 'D': node_print_desc = ALL_DESC; break; case 'c': - command = SAQUERY_CMD_CLASS_PORT_INFO; + command = CLASS_PORT_INFO; break; case 'S': - query_type = IB_MAD_ATTR_SERVICE_RECORD; + query_type = IB_SA_ATTR_SERVICERECORD; break; case 'I': - query_type = IB_MAD_ATTR_INFORM_INFO_RECORD; + query_type = IB_SA_ATTR_INFORMINFORECORD; break; case 'N': command = SAQUERY_CMD_NODE_RECORD; @@ -1588,7 +1542,7 @@ static int process_opt(void *context, int ch, char *optarg) command = SAQUERY_CMD_MCMEMBERS; break; case 'x': - query_type = IB_MAD_ATTR_LINK_RECORD; + query_type = IB_SA_ATTR_LINKRECORD; break; case 5: p->slid = strtoul(optarg, NULL, 0); @@ -1669,7 +1623,7 @@ static int process_opt(void *context, int ch, char *optarg) int main(int argc, char **argv) { char usage_args[1024]; - osm_bind_handle_t h; + bind_handle_t h; struct query_params params = { .hop_limit = -1, .reversible = -1, @@ -1758,7 +1712,7 @@ int main(int argc, char **argv) 
if (!query_type && command == SAQUERY_CMD_QUERY) { if (!argc || !(q = find_query(argv[0]))) - query_type = IB_MAD_ATTR_NODE_RECORD; + query_type = IB_SA_ATTR_NODERECORD; else { query_type = q->query_type; argc--; @@ -1768,10 +1722,10 @@ int main(int argc, char **argv) if (argc) { if (node_print_desc == NAME_OF_LID) { - requested_lid = (ib_net16_t) strtoul(argv[0], NULL, 0); + requested_lid = strtoul(argv[0], NULL, 0); requested_lid_flag++; } else if (node_print_desc == NAME_OF_GUID) { - requested_guid = (ib_net64_t) strtoul(argv[0], NULL, 0); + requested_guid = strtoul(argv[0], NULL, 0); requested_guid_flag++; } else requested_name = argv[0]; -- 1.6.1.rc1.45.g123ed From sashak at voltaire.com Wed Feb 11 11:55:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 11 Feb 2009 21:55:25 +0200 Subject: [ofa-general] [PATCH] infiniband-diags: some code consolidation In-Reply-To: <20090211195442.GP5910@sashak.voltaire.com> References: <20090211195442.GP5910@sashak.voltaire.com> Message-ID: <20090211195525.GQ5910@sashak.voltaire.com> Consolidate repeated code using helper functions get_and_dump_any_records() and get_and_dump_all_records(). Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 172 +++++++++++++++------------------------- 1 files changed, 65 insertions(+), 107 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index a94a015..9726d22 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -796,6 +796,21 @@ static int get_any_records(bind_handle_t h, return ret; } +static int get_and_dump_any_records(bind_handle_t h, uint16_t attr_id, + uint32_t attr_mod, ib_net64_t comp_mask, + void *attr, uint64_t sm_key, + void (*dump_func) (void *)) +{ + int ret = get_any_records(h, attr_id, attr_mod, comp_mask, attr, + sm_key); + if (ret) + return ret; + + dump_results(&result, dump_func); + + return 0; +} + /** * Get all the records available for requested query type. 
*/ @@ -804,6 +819,18 @@ static int get_all_records(bind_handle_t h, uint16_t attr_id, int trusted) return get_any_records(h, attr_id, 0, 0, NULL, trusted ? smkey : 0); } +static int get_and_dump_all_records(bind_handle_t h, uint16_t attr_id, + int trusted, void (*dump_func) (void *)) +{ + int ret = get_all_records(h, attr_id, 0); + if (ret) + return ret; + + dump_results(&result, dump_func); + return_mad(); + return ret; +} + /** * return the lid from the node descriptor (name) supplied */ @@ -989,7 +1016,6 @@ static int query_path_records(const struct query_cmd *q, bind_handle_t h, { ib_path_rec_t pr; ib_net64_t comp_mask = 0; - int ret; uint32_t flow = 0; uint16_t qos_class = 0; uint8_t reversible = 0; @@ -1014,13 +1040,8 @@ static int query_path_records(const struct query_cmd *q, bind_handle_t h, CHECK_AND_SET_VAL_AND_SEL(p->rate, pr.rate, PR, RATE, SELEC); CHECK_AND_SET_VAL_AND_SEL(p->pkt_life, pr.pkt_life, PR, PKTLIFETIME, SELEC); - ret = get_any_records(h, IB_SA_ATTR_PATHRECORD, 0, comp_mask, &pr, 0); - if (ret) - return ret; - - dump_results(&result, dump_path_record); - return_mad(); - return ret; + return get_and_dump_any_records(h, IB_SA_ATTR_PATHRECORD, 0, comp_mask, + &pr, 0, dump_path_record); } static ib_api_status_t print_issm_records(bind_handle_t h) @@ -1075,13 +1096,8 @@ return_mc: static int print_multicast_group_records(bind_handle_t h) { - int ret = get_all_records(h, IB_SA_ATTR_MCRECORD, 0); - if (ret) - return ret; - - dump_results(&result, dump_multicast_group_record); - return_mad(); - return ret; + return get_and_dump_all_records(h, IB_SA_ATTR_MCRECORD, 0, + dump_multicast_group_record); } static int query_class_port_info(const struct query_cmd *q, bind_handle_t h, @@ -1095,7 +1111,7 @@ static int query_node_records(const struct query_cmd *q, bind_handle_t h, { ib_node_record_t nr; ib_net64_t comp_mask = 0; - int lid = 0, ret; + int lid = 0; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, NULL, NULL); @@ -1103,14 +1119,8 @@ static int 
query_node_records(const struct query_cmd *q, bind_handle_t h, memset(&nr, 0, sizeof(nr)); CHECK_AND_SET_VAL(lid, 16, 0, nr.lid, NR, LID); - ret = get_any_records(h, IB_SA_ATTR_NODERECORD, 0, comp_mask, &nr, 0); - if (ret) - return ret; - - dump_results(&result, dump_node_record); - return_mad(); - - return 0; + return get_and_dump_any_records(h, IB_SA_ATTR_NODERECORD, 0, comp_mask, + &nr, 0, dump_node_record); } static int query_portinfo_records(const struct query_cmd *q, @@ -1119,7 +1129,7 @@ static int query_portinfo_records(const struct query_cmd *q, { ib_portinfo_record_t pir; ib_net64_t comp_mask = 0; - int lid = 0, port = -1, ret; + int lid = 0, port = -1; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &port, NULL); @@ -1128,15 +1138,9 @@ static int query_portinfo_records(const struct query_cmd *q, CHECK_AND_SET_VAL(lid, 16, 0, pir.lid, PIR, LID); CHECK_AND_SET_VAL(port, 8, -1, pir.port_num, PIR, PORTNUM); - ret = get_any_records(h, IB_SA_ATTR_PORTINFORECORD, 0, comp_mask, - &pir, 0); - if (ret) - return ret; - - dump_results(&result, dump_one_portinfo_record); - return_mad(); - - return 0; + return get_and_dump_any_records(h, IB_SA_ATTR_PORTINFORECORD, 0, + comp_mask, &pir, 0, + dump_one_portinfo_record); } static int query_mcmember_records(const struct query_cmd *q, @@ -1145,7 +1149,6 @@ static int query_mcmember_records(const struct query_cmd *q, { ib_member_rec_t mr; ib_net64_t comp_mask = 0; - int ret; uint32_t flow = 0; uint8_t sl = 0, hop = 0, scope = 0; @@ -1168,38 +1171,23 @@ static int query_mcmember_records(const struct query_cmd *q, mr.scope_state |= scope << 4; CHECK_AND_SET_VAL(p->proxy_join, 8, -1, mr.proxy_join, MCR, PROXY); - ret = get_any_records(h, IB_SA_ATTR_MCRECORD, 0, comp_mask, &mr, smkey); - if (ret) - return ret; - - dump_results(&result, dump_one_mcmember_record); - return_mad(); - return 0; + return get_and_dump_any_records(h, IB_SA_ATTR_MCRECORD, 0, comp_mask, + &mr, smkey, dump_one_mcmember_record); } static int 
query_service_records(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) { - int ret = get_all_records(h, IB_SA_ATTR_SERVICERECORD, 0); - if (ret) - return ret; - - dump_results(&result, dump_service_record); - return_mad(); - return 0; + return get_and_dump_all_records(h, IB_SA_ATTR_SERVICERECORD, 0, + dump_service_record); } static int query_informinfo_records(const struct query_cmd *q, bind_handle_t h, struct query_params *p, int argc, char *argv[]) { - int ret = get_all_records(h, IB_SA_ATTR_INFORMINFORECORD, 0); - if (ret) - return ret; - - dump_results(&result, dump_inform_info_record); - return_mad(); - return 0; + return get_and_dump_all_records(h, IB_SA_ATTR_INFORMINFORECORD, 0, + dump_inform_info_record); } static int query_link_records(const struct query_cmd *q, bind_handle_t h, @@ -1207,7 +1195,7 @@ static int query_link_records(const struct query_cmd *q, bind_handle_t h, { ib_link_record_t lr; ib_net64_t comp_mask = 0; - int from_lid = 0, to_lid = 0, from_port = -1, to_port = -1, ret; + int from_lid = 0, to_lid = 0, from_port = -1, to_port = -1; if (argc > 0) parse_lid_and_ports(h, argv[0], &from_lid, &from_port, NULL); @@ -1221,13 +1209,8 @@ static int query_link_records(const struct query_cmd *q, bind_handle_t h, CHECK_AND_SET_VAL(to_lid, 16, 0, lr.to_lid, LR, TO_LID); CHECK_AND_SET_VAL(to_port, 8, -1, lr.to_port_num, LR, TO_PORT); - ret = get_any_records(h, IB_SA_ATTR_LINKRECORD, 0, comp_mask, &lr, 0); - if (ret) - return ret; - - dump_results(&result, dump_one_link_record); - return_mad(); - return 0; + return get_and_dump_any_records(h, IB_SA_ATTR_LINKRECORD, 0, comp_mask, + &lr, 0, dump_one_link_record); } static int query_sl2vl_records(const struct query_cmd *q, bind_handle_t h, @@ -1235,7 +1218,7 @@ static int query_sl2vl_records(const struct query_cmd *q, bind_handle_t h, { ib_slvl_table_record_t slvl; ib_net64_t comp_mask = 0; - int lid = 0, in_port = -1, out_port = -1, ret; + int lid = 0, in_port = 
-1, out_port = -1; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &in_port, &out_port); @@ -1245,14 +1228,9 @@ static int query_sl2vl_records(const struct query_cmd *q, bind_handle_t h, CHECK_AND_SET_VAL(in_port, 8, -1, slvl.in_port_num, SLVL, IN_PORT); CHECK_AND_SET_VAL(out_port, 8, -1, slvl.out_port_num, SLVL, OUT_PORT); - ret = get_any_records(h, IB_SA_ATTR_SL2VLTABLERECORD, 0, comp_mask, - &slvl, 0); - if (ret) - return ret; - - dump_results(&result, dump_one_slvl_record); - return_mad(); - return 0; + return get_and_dump_any_records(h, IB_SA_ATTR_SL2VLTABLERECORD, 0, + comp_mask, &slvl, 0, + dump_one_slvl_record); } static int query_vlarb_records(const struct query_cmd *q, bind_handle_t h, @@ -1260,7 +1238,7 @@ static int query_vlarb_records(const struct query_cmd *q, bind_handle_t h, { ib_vl_arb_table_record_t vlarb; ib_net64_t comp_mask = 0; - int lid = 0, port = -1, block = -1, ret; + int lid = 0, port = -1, block = -1; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &port, &block); @@ -1270,14 +1248,9 @@ static int query_vlarb_records(const struct query_cmd *q, bind_handle_t h, CHECK_AND_SET_VAL(port, 8, -1, vlarb.port_num, VLA, OUT_PORT); CHECK_AND_SET_VAL(block, 8, -1, vlarb.block_num, VLA, BLOCK); - ret = get_any_records(h, IB_SA_ATTR_VLARBTABLERECORD, 0, comp_mask, - &vlarb, 0); - if (ret) - return ret; - - dump_results(&result, dump_one_vlarb_record); - return_mad(); - return 0; + return get_and_dump_any_records(h, IB_SA_ATTR_VLARBTABLERECORD, 0, + comp_mask, &vlarb, 0, + dump_one_vlarb_record); } static int query_pkey_tbl_records(const struct query_cmd *q, @@ -1286,7 +1259,7 @@ static int query_pkey_tbl_records(const struct query_cmd *q, { ib_pkey_table_record_t pktr; ib_net64_t comp_mask = 0; - int lid = 0, port = -1, block = -1, ret; + int lid = 0, port = -1, block = -1; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &port, &block); @@ -1296,14 +1269,9 @@ static int query_pkey_tbl_records(const struct query_cmd *q, 
CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT); CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK); - ret = get_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0, comp_mask, - &pktr, smkey); - if (ret) - return ret; - - dump_results(&result, dump_one_pkey_tbl_record); - return_mad(); - return 0; + return get_and_dump_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0, + comp_mask, &pktr, smkey, + dump_one_pkey_tbl_record); } static int query_lft_records(const struct query_cmd *q, bind_handle_t h, @@ -1311,7 +1279,7 @@ static int query_lft_records(const struct query_cmd *q, bind_handle_t h, { ib_lft_record_t lftr; ib_net64_t comp_mask = 0; - int lid = 0, block = -1, ret; + int lid = 0, block = -1; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &block, NULL); @@ -1320,13 +1288,8 @@ static int query_lft_records(const struct query_cmd *q, bind_handle_t h, CHECK_AND_SET_VAL(lid, 16, 0, lftr.lid, LFTR, LID); CHECK_AND_SET_VAL(block, 16, -1, lftr.block_num, LFTR, BLOCK); - ret = get_any_records(h, IB_SA_ATTR_LFTRECORD, 0, comp_mask, &lftr, 0); - if (ret) - return ret; - - dump_results(&result, dump_one_lft_record); - return_mad(); - return 0; + return get_and_dump_any_records(h, IB_SA_ATTR_LFTRECORD, 0, comp_mask, + &lftr, 0, dump_one_lft_record); } static int query_mft_records(const struct query_cmd *q, bind_handle_t h, @@ -1334,7 +1297,7 @@ static int query_mft_records(const struct query_cmd *q, bind_handle_t h, { ib_mft_record_t mftr; ib_net64_t comp_mask = 0; - int lid = 0, block = -1, position = -1, ret; + int lid = 0, block = -1, position = -1; uint16_t pos = 0; if (argc > 0) @@ -1347,13 +1310,8 @@ static int query_mft_records(const struct query_cmd *q, bind_handle_t h, CHECK_AND_SET_VAL(position, 8, -1, pos, MFTR, POSITION); mftr.position_block_num |= cl_hton16(pos << 12); - ret = get_any_records(h, IB_SA_ATTR_MFTRECORD, 0, comp_mask, &mftr, 0); - if (ret) - return ret; - - dump_results(&result, dump_one_mft_record); - return_mad(); - return 0; + 
return get_and_dump_any_records(h, IB_SA_ATTR_MFTRECORD, 0, comp_mask, + &mftr, 0, dump_one_mft_record); } static bind_handle_t get_bind_handle(void) -- 1.6.1.rc1.45.g123ed From devel at morey-chaisemartin.com Wed Feb 11 12:05:53 2009 From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Wed, 11 Feb 2009 21:05:53 +0100 Subject: [ofa-general] Re: [PATCH v2] opensm/osm_console.c : Added dump_portguid function to console to generate a list of port guids matching one or more regexps In-Reply-To: <20090211184717.GO5910@sashak.voltaire.com> References: <499135E1.1080307@ext.bull.net> <20090211184717.GO5910@sashak.voltaire.com> Message-ID: <49932FA1.9050500@morey-chaisemartin.com> Sasha Khapyorsky wrote: > Hi Nicolas, > > On 09:08 Tue 10 Feb , Nicolas Morey Chaisemartin wrote: > >> This adds a dump_portguid functionality to the OpenSM console, which makes it >> really easy to generate cn_guid_file, root_guid_file and such >> by dumping into a file all port guids whose nodedesc contains at least one >> of the provided regexps >> >> Signed-off-by: Nicolas Morey-Chaisemartin >> >> --- >> >> Reposted without the exit_after_run flag and the active sleep init loop, and properly indented.
>> >> opensm/opensm/osm_console.c | 105 >> +++++++++++++++++++++++++++++++++++++++++++ >> 1 files changed, 105 insertions(+), 0 deletions(-) >> >> >> > > >> diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c >> index c6e8e59..5fbcd43 100644 >> --- a/opensm/opensm/osm_console.c >> +++ b/opensm/opensm/osm_console.c >> @@ -42,6 +42,7 @@ >> #include >> #include >> #include >> +#include >> #ifdef ENABLE_OSM_CONSOLE_SOCKET >> #include >> #endif >> @@ -1173,6 +1174,109 @@ static void version_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) >> } >> >> /* more parse routines go here */ >> +typedef struct _regexp_list { >> + regex_t exp; >> + struct _regexp_list *next; >> +} regexp_list_t; >> + >> +static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) >> +{ >> + cl_qmap_t *p_port_guid_tbl; >> + osm_port_t *p_port; >> + osm_port_t *p_next_port; >> + >> + regexp_list_t *p_head_regexp = NULL; >> + regexp_list_t *p_regexp; >> + >> + /* Option variables */ >> + char *p_cmd = NULL; >> + FILE *output = out; >> + >> + /* Read commande line */ >> + >> + while (1) { >> + p_cmd = next_token(p_last); >> + if (p_cmd) { >> + if (strcmp(p_cmd, "file") == 0) { >> + p_cmd = next_token(p_last); >> + if (p_cmd) { >> + output = fopen(p_cmd, "w+"); >> + if (output == NULL) { >> + fprintf(out, >> + "Could not open file %s: %s\n", >> + p_cmd, strerror(errno)); >> + output = out; >> + } >> + } else >> + fprintf(out, "No file name passed\n"); >> + } else { >> + p_regexp = malloc(sizeof(*p_regexp)); >> + if (regcomp >> + (&(p_regexp->exp), p_cmd, >> + REG_NOSUB | REG_EXTENDED) != 0) { >> + fprintf(out, >> + "Couldn't parse regular expression %s. 
Skipping it.\n", >> + p_cmd); >> + } >> + p_regexp->next = p_head_regexp; >> + p_head_regexp = p_regexp; >> + } >> + } else >> + break; /* No more tokens */ >> + >> + } >> + >> + /* Check we have at least one expression to match */ >> + if (p_head_regexp == NULL) { >> + fprintf(out, "No valid expression provided. Aborting\n"); >> + return; >> + } >> + >> + cl_spinlock_release(&p_osm->sm.state_lock); >> > > What is this cl_spinlock_release()? Typo? > > >> + if (p_osm->sm.p_subn->need_update != 0) { >> + fprintf(out, "Subnet is not ready yet. Try again later.\n"); >> + return; >> + } >> + >> + /* Subnet doesn't need to be updated so we can carry on */ >> + >> + CL_PLOCK_EXCL_ACQUIRE(p_osm->sm.p_lock); >> + p_port_guid_tbl = &(p_osm->sm.p_subn->port_guid_tbl); >> > > Do we really need exclusive locking here? port_guid_table content is > read-only, I guess a "read-only" lock (CL_PLOCK_ACQUIRE()) should be enough. > > The rest looks fine to me. > > Sasha > > Read-only is fine. I didn't know complib provided different kinds of locks. Nicolas From kliteyn at dev.mellanox.co.il Wed Feb 11 12:13:43 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 11 Feb 2009 22:13:43 +0200 Subject: [ofa-general] Re: [PATCH OpenSM 0/3] Fat Tree - Routing between non-CN nodes In-Reply-To: References: <494A5339.9030304@ext.bull.net> <20090207185551.GD27757@sashak.voltaire.com> <498DE57D.4030501@morey-chaisemartin.com> <20090207202319.GE27757@sashak.voltaire.com> <49929986.40106@ext.bull.net> <20090211114347.GA27920@sashak.voltaire.com> <4992D207.6010701@dev.mellanox.co.il> Message-ID: <49933177.8010206@dev.mellanox.co.il> Hal Rosenstock wrote: > On Wed, Feb 11, 2009 at 8:26 AM, Yevgeny Kliteynik > wrote: >> Sasha Khapyorsky wrote: >>> On 10:25 Wed 11 Feb , Nicolas Morey Chaisemartin wrote: >>>> What about high nodes (HN) as it concerns only nodes which are not at the >>>> bottom of the fat tree? >>> Could be fine. Let's ask Yevgeny too... :) >>> >>> Yevgeny!
Any idea about a more generic name for io_nodes? >> Ugh... >> >> "IO nodes": >> Pros: the name is closer to the reality, since in most cases >> the nodes that would need special treatment are indeed IO nodes. >> Cons: the name is not "general"... >> >> "High nodes" >> Pros: general name with kinda "hint" to the special treatment. >> Cons: the "hint" is rather vague... >> >> Bottom line - I'm OK with both options (slightly leaning toward IO), >> as long as it is described well enough in the help message and in man :) > > Maybe consistency is the hobgoblin of small minds, but don't we now have: > > high nodes, which is a topology-based name, > and > compute nodes, which is a function-based name? > > Is it worth having them consistent? Good point. IO nodes will be consistent with CNs. -- Yevgeny > -- Hal > >> -- Yevgeny >> >>> Sasha >>> From frankose at ifi.uio.no Wed Feb 11 12:17:55 2009 From: frankose at ifi.uio.no (Frank Olaf Sem-Jacobsen) Date: Wed, 11 Feb 2009 21:17:55 +0100 Subject: [ofa-general] fat-tree CN nodes? Message-ID: <49933273.1010504@ifi.uio.no> Hi, I have been looking into the fat tree code, and I was wondering about the definition of a compute node (CN). Are these part of the leaf switches at the bottom of the fat tree, or are they extra switches that are connected to the fat tree, e.g. the switch in a rack of blades which is again connected to the fat tree? Appreciate the help, -- Frank Olaf Sem-Jacobsen From cameron at harr.org Wed Feb 11 12:25:41 2009 From: cameron at harr.org (Cameron Harr) Date: Wed, 11 Feb 2009 13:25:41 -0700 Subject: [ofa-general] fat-tree CN nodes?
In-Reply-To: <49933273.1010504@ifi.uio.no> References: <49933273.1010504@ifi.uio.no> Message-ID: <49933445.5030203@harr.org> Hi Frank, A compute node is a computer/server that is generally dedicated to doing computational work in a cluster or group of computers. Cameron Frank Olaf Sem-Jacobsen wrote: > Hi, > > I have been looking into the fat tree code, and I was wondering about > the definition of a compute node (CN). Are these part of the leaf > switches at the bottom of the fat tree, or are they extra switches > that are connected to the fat tree, e.g. the switch in a rack of > blades which is again connected to the fat tree? > > Appreciate the help, From frankose at ifi.uio.no Wed Feb 11 12:31:04 2009 From: frankose at ifi.uio.no (Frank Olaf Sem-Jacobsen) Date: Wed, 11 Feb 2009 21:31:04 +0100 Subject: [ofa-general] fat-tree CN nodes? In-Reply-To: <49933445.5030203@harr.org> References: <49933273.1010504@ifi.uio.no> <49933445.5030203@harr.org> Message-ID: <49933588.6050607@ifi.uio.no> Right, so it has no connection with any topological properties of the fat tree? Which again means that the definition of compute nodes is only necessary for the ability to balance these separately in the tree? Thanks for your answer, Cameron Harr wrote: > Hi Frank, > A compute node is a computer/server that is generally dedicated to doing > computational work in a cluster or group of computers. > Cameron > > Frank Olaf Sem-Jacobsen wrote: >> Hi, >> >> I have been looking into the fat tree code, and I was wondering about >> the definition of a compute node (CN). Are these part of the leaf >> switches at the bottom of the fat tree, or are they extra switches >> that are connected to the fat tree, e.g. the switch in a rack of >> blades which is again connected to the fat tree?
>> >> Appreciate the help, -- Frank Olaf Sem-Jacobsen From cameron at harr.org Wed Feb 11 12:47:13 2009 From: cameron at harr.org (Cameron Harr) Date: Wed, 11 Feb 2009 13:47:13 -0700 Subject: [ofa-general] fat-tree CN nodes? In-Reply-To: <49933588.6050607@ifi.uio.no> References: <49933273.1010504@ifi.uio.no> <49933445.5030203@harr.org> <49933588.6050607@ifi.uio.no> Message-ID: <49933951.2030504@harr.org> Frank, I'm going to step out of the discussion because I'm no authority on the code. My understanding is that the CN is there as a GUID (from the HCA) on the very bottom of the fabric - connected to the leaf switch. Someone who knows the code will have to give you a real answer. Sorry. Cameron Frank Olaf Sem-Jacobsen wrote: > Right, so it has no connection with any topological properties of the > fat tree? Which again means that the definition of compute nodes is > only necessary for the ability to balance these separately in the tree? > > Thanks for your answer, > > Cameron Harr wrote: >> Hi Frank, >> A compute node is a computer/server that is generally dedicated to >> doing computational work in a cluster or group of computers. >> Cameron >> >> Frank Olaf Sem-Jacobsen wrote: >>> Hi, >>> >>> I have been looking into the fat tree code, and I was wondering >>> about the definition of a compute node (CN). Are these part of the >>> leaf switches at the bottom of the fat tree, or are they extra >>> switches that are connected to the fat tree, e.g. the switch in a >>> rack of blades which is again connected to the fat tree?
>>> >>> Appreciate the help, > > From or.gerlitz at gmail.com Wed Feb 11 12:52:26 2009 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 11 Feb 2009 22:52:26 +0200 Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for bind In-Reply-To: References: Message-ID: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com> On Thu, Feb 5, 2009 at 1:44 PM, Or Gerlitz wrote: > It seems that even when the rdma-cm consumer binds to a specific address, > the rdma-cm address resolution code follows the order of the devices/rules > in the routing table. So the user can't really dictate an outgoing interface > based on the src address provided to rdma_resolve_addr. Hi Sean, Did you have the chance to look into that? Or. From sean.hefty at intel.com Wed Feb 11 13:14:46 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 11 Feb 2009 13:14:46 -0800 Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for bind In-Reply-To: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com> References: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com> Message-ID: <798E955ACF6F4EBDBA311DE3C54C9B9E@amr.corp.intel.com> >Did you have the chance to look into that? Not yet, but I should be able to look into it by the end of the week. From what Jason said, it sounds like ip_dev_find() doesn't behave as I was expecting. - Sean From swise at opengridcomputing.com Wed Feb 11 14:29:15 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 11 Feb 2009 16:29:15 -0600 Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: remove modulo math from build_rdma_recv().
Message-ID: <20090211222915.19520.22647.stgit@dell3.ogc.int>

From: Steve Wise

- remove modulo usage

Signed-off-by: Steve Wise
---

 drivers/infiniband/hw/cxgb3/iwch_qp.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index c2b3cf7..bf549ed 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -263,8 +263,8 @@ static int build_rdma_recv(struct iwch_qp *qhp, union t3_wr *wqe,
 		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
 
 		/* to in the WQE == the offset into the page */
-		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
-				(1UL << (12 + page_size[i])));
+		wqe->recv.sgl[i].to = cpu_to_be64(((u32)wr->sg_list[i].addr) &
+				((1UL << (12 + page_size[i]) - 1)));
 
 		/* pbl_addr is the adapters address in the PBL */
 		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);

From davem at davemloft.net Wed Feb 11 15:00:53 2009 From: davem at davemloft.net (David Miller) Date: Wed, 11 Feb 2009 15:00:53 -0800 (PST) Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: remove modulo math from build_rdma_recv(). In-Reply-To: <20090211222915.19520.22647.stgit@dell3.ogc.int> References: <20090211222915.19520.22647.stgit@dell3.ogc.int> Message-ID: <20090211.150053.02539000.davem@davemloft.net> From: Steve Wise Date: Wed, 11 Feb 2009 16:29:15 -0600 > From: Steve Wise > > - remove modulo usage > > Signed-off-by: Steve Wise Acked-by: David S. Miller From Jie.Cai at cs.anu.edu.au Wed Feb 11 18:30:19 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Thu, 12 Feb 2009 13:30:19 +1100 Subject: [ofa-general] Question on dat_ep_post_rdma_write with DAT_COMPLETION_SUPPRESS_FLAG. In-Reply-To: <49927A53.1020403@cs.anu.edu.au> References: <49927A53.1020403@cs.anu.edu.au> Message-ID: <499389BB.6060806@cs.anu.edu.au> I am a bit confused by the description of DAT_COMPLETION_SUPPRESS_FLAG.
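On the cxgb3 patch above: replacing `%` with `&` is valid because the divisor `1UL << (12 + page_size[i])` is a power of two, and for an unsigned `x`, `x % 2^n == x & (2^n - 1)`. One C detail is worth checking when writing the mask. Because `-` binds more tightly than `<<`, the mask has to be grouped as `(1UL << n) - 1`. As posted, the inner expression on the `+` line, `(1UL << (12 + page_size[i]) - 1)`, parses as `1UL << ((12 + page_size[i]) - 1)`, which is a single bit rather than a mask. The standalone sketch below illustrates both points. The helper names are made up for illustration and do not come from iwch_qp.c, and `shift` stands in for `12 + page_size[i]`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Offset within a 2^shift-byte page using modulo (the form the patch removes). */
static uint64_t page_offset_mod(uint32_t addr, unsigned int shift)
{
	return addr % (1UL << shift);
}

/* Same offset using a mask (the form the patch introduces).
 * The shift is parenthesized before 1 is subtracted. */
static uint64_t page_offset_mask(uint32_t addr, unsigned int shift)
{
	return addr & ((1UL << shift) - 1);
}

/* What "1UL << shift - 1" means in C: '-' binds tighter than '<<',
 * so only bit (shift - 1) survives. The grouping is written out
 * explicitly here to avoid a -Wparentheses warning. */
static uint64_t page_offset_misgrouped(uint32_t addr, unsigned int shift)
{
	return addr & (1UL << (shift - 1));
}

/* True if the modulo and mask forms agree for sample addresses and
 * shifts 12..31 (4 KB pages and up). */
static bool mask_matches_modulo(void)
{
	static const uint32_t samples[] = {
		0x0u, 0x1u, 0xfffu, 0x1000u, 0x12345e78u, 0xdeadbeefu, 0xffffffffu,
	};

	for (unsigned int shift = 12; shift < 32; shift++)
		for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
			if (page_offset_mod(samples[i], shift) !=
			    page_offset_mask(samples[i], shift))
				return false;
	return true;
}

/* Worked example for a 4 KB page (shift = 12). */
static void self_check(void)
{
	assert(page_offset_mod(0x12345e78u, 12) == 0xe78);
	assert(page_offset_mask(0x12345e78u, 12) == 0xe78);
	assert(page_offset_misgrouped(0x12345e78u, 12) == 0x800); /* bit 11 only */
	assert(mask_matches_modulo());
}
```

Calling `self_check()` from a test harness runs the 4 KB example, where address `0x12345e78` has page offset `0xe78`. `mask_matches_modulo()` is the general check that the two forms agree whenever the divisor is a power of two.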
It looks like it suppresses notification after DTO operations. Is that always true? I have found that when I use dat_ep_post_rdma_write to transfer more than 128k of data (within 1 iov), an event is generated on the server side (verified with the cookie), and on the client side an event with Invalid_DAT_EVENT_NUMBER is received. What's the problem? Thanks -- Mr. Jie Cai From yunhong.jiang at intel.com Wed Feb 11 17:18:21 2009 From: yunhong.jiang at intel.com (Jiang, Yunhong) Date: Thu, 12 Feb 2009 09:18:21 +0800 Subject: RE: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com> Message-ID: It seems the PCI frontend tries to write to some configuration space field that pciback has no config_field entry to support. I think you can first try what dom0's dmesg suggested: "see permissive attribute in sysfs" (it should be "set permissive attribute...", I think). BTW, where did you get the following log? It seems to suggest that no config space access function was found. PCI: Fatal: No PCI config space access function found rtc: IRQ 8 is not free. i8042.c: No controller found." -- Yunhong Jiang ________________________________ From: xen-devel-bounces at lists.xensource.com [mailto:xen-devel-bounces at lists.xensource.com] On Behalf Of subbu kl Sent: February 11, 2009 22:18 To: David Brown Cc: xen-devel at lists.xensource.com; general at lists.openfabrics.org Subject: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working I am getting the same QUERY_FW failure on RHEL5.2 with a Xen paravirtualized guest using the pciback module. No one seems to have answered this question on the list, so let me ping the xen-devel and OFED people again.
after executing in dom0 echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind #dmesg ACPI: PCI interrupt for device 0000:0e:00.0 disabled tap tap-1-51712: 2 getting info tap tap-2-51712: 2 getting info pciback 0000:0e:00.0: seizing device PCI: Enabling device 0000:0e:00.0 (0140 -> 0142) ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 ACPI: PCI interrupt for device 0000:0e:00.0 disabled #xm create -c rhel52_64_3 PCI: Fatal: No PCI config space access function found rtc: IRQ 8 is not free. i8042.c: No controller found. GUEST dmesg: ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) ib_mthca: Initializing 0000:00:00.0 PCI: Enabling device 0000:00:00.0 (0000 -> 0002) PCI: Setting latency timer of device 0000:00:00.0 to 64 ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. ib_mthca: probe of 0000:00:00.0 failed with error -11 in dom0: Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to virtual slot 0 Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not ready Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi) Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device: Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci. 
Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> 0002) Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 disabled Some more details: [root at p128 ~]# rpm -qa | grep xen kernel-xen-2.6.18-92.1.22.el5 xen-3.0.3-64.el5_2.9 xen-libs-3.0.3-64.el5_2.9 xen-libs-3.0.3-64.el5_2.9 [root at p128 ~]# ibv_devinfo hca_id: mthca0 fw_ver: 5.3.0 node_guid: 0002:c902:0022:cd48 sys_image_guid: 0002:c902:0022:cd4b vendor_id: 0x02c9 vendor_part_id: 25218 hw_ver: 0x20 board_id: MT_0370130002 phys_port_cnt: 2 port: 1 state: PORT_INIT (2) max_mtu: 2048 (4) active_mtu: 512 (2) sm_lid: 0 port_lid: 0 port_lmc: 0x00 port: 2 state: PORT_DOWN (1) max_mtu: 2048 (4) active_mtu: 512 (2) sm_lid: 0 port_lid: 0 port_lmc: 0x00 Any help is greatly appreciated. ~subbu On Sat, Oct 18, 2008 at 4:54 AM, David Brown > wrote: Okay, so my question to the OpenFabrics guys is: why would the OFED drivers fail to read the firmware? Any thoughts? Thanks, - David Brown ---------- Forwarded message ---------- From: David Brown > Date: Thu, Sep 11, 2008 at 2:24 PM Subject: pciback module not working To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com This issue was brought up about a year and a half ago, so I'll bring it up again and see if anything happens. I've got an InfiniBand network and am attempting to pass the InfiniBand card through the host and give it to the guest. I'm working with standard CentOS 5.2 on both guest and host, with the Xen they provide (3.0.3-ish). I've also tried installing the newest Xen 3.3 with its standard host kernel, and that did the same thing. The dmesg output in the guest is similar in both permissive and normal mode. I'm having trouble detecting the firmware on the card for some reason... Any help would be appreciated.
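One detail in these logs can be decoded mechanically: the pciback warning about a write at config offset 0x44, size 2. The host lspci output in this thread lists "Capabilities: [40] Power Management version 2", and the next capability starts at [48]. Assuming the standard PCI Power Management capability layout (ID at +0, next pointer at +1, PMC at +2, PMCSR at +4), that 2-byte write lands on the PMCSR, the control/status register that holds the device's power state. The C sketch below performs that lookup from the lspci offsets. It only decodes the log line and is not a diagnosis anyone in the thread confirmed. The tables and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Capability start offsets copied from the host lspci output in this
 * thread (MT25204 HCA). Assumption: a capability is treated as the
 * one with the largest start offset at or below the accessed offset. */
struct cap {
	unsigned int start;
	const char *name;
};

static const struct cap hca_caps[] = {
	{ 0x40, "Power Management" },
	{ 0x48, "Vital Product Data" },
	{ 0x60, "PCI Express" },
	{ 0x84, "MSI-X" },
	{ 0x90, "MSI" },
};

/* Register layout of the standard PCI Power Management capability,
 * as offsets relative to the capability start. */
struct cap_field {
	unsigned int rel;
	unsigned int size;
	const char *name;
};

static const struct cap_field pm_fields[] = {
	{ 0, 1, "Capability ID" },
	{ 1, 1, "Next Capability Pointer" },
	{ 2, 2, "PMC" },        /* Power Management Capabilities */
	{ 4, 2, "PMCSR" },      /* Power Management Control/Status */
	{ 6, 1, "PMCSR_BSE" },  /* Bridge Support Extensions */
	{ 7, 1, "Data" },
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Capability with the largest start offset <= off, or NULL. */
static const struct cap *find_cap(unsigned int off)
{
	const struct cap *best = NULL;

	for (size_t i = 0; i < ARRAY_LEN(hca_caps); i++)
		if (hca_caps[i].start <= off &&
		    (best == NULL || hca_caps[i].start > best->start))
			best = &hca_caps[i];
	return best;
}

/* Name of the PM register exactly covered by a `size`-byte access at
 * config offset `off`, or NULL if the access is not one whole PM field. */
static const char *pm_field_at(unsigned int off, unsigned int size)
{
	const struct cap *c = find_cap(off);

	if (c == NULL || strcmp(c->name, "Power Management") != 0)
		return NULL;
	for (size_t i = 0; i < ARRAY_LEN(pm_fields); i++)
		if (c->start + pm_fields[i].rel == off && pm_fields[i].size == size)
			return pm_fields[i].name;
	return NULL;
}

static void self_check(void)
{
	/* "offset 0x44, size 2" from the pciback warning. */
	assert(find_cap(0x44)->start == 0x40);
	assert(strcmp(pm_field_at(0x44, 2), "PMCSR") == 0);
}
```

Other offsets from a pciback warning can be decoded the same way with `pm_field_at(offset, size)`. A NULL result means the access falls outside the PM capability or does not cover exactly one PM register.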
Thanks, - David Brown === GUEST dmesg output === ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008) ib_mthca: Initializing 0000:00:00.0 PCI: Enabling device 0000:00:00.0 (0000 -> 0002) PCI: Setting latency timer of device 0000:00:00.0 to 64 ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. ib_mthca: probe of 0000:00:00.0 failed with error -11 ======================= === Host modprobe.conf === alias eth0 bnx2 alias eth1 bnx2 alias scsi_hostadapter cciss options pciback hide=(41:00.0) ===================== === Host lspci output === # lspci -vs 41:00.0 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Subsystem: Hewlett-Packard Company Unknown device 170a Flags: fast devsel, IRQ 16 Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] Capabilities: [40] Power Management version 2 Capabilities: [48] Vital Product Data Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 Capabilities: [60] Express Endpoint IRQ 0 ===================== This makes sure it gets loaded first, before anything else. === Host mkinitrd cmd === # mkinitrd -f --with=pciback --preload pciback /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen ==================== === Host pciback dmesg === pciback 0000:41:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device: 1) see permissive attribute in sysfs 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
PCI: Enabling device 0000:41:00.0 (0000 -> 0002) ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 PCI: Setting latency timer of device 0000:41:00.0 to 64 ACPI: PCI interrupt for device 0000:41:00.0 disabled ====================== === Host pciback dmesg (after setting it permissive) === pciback 0000:41:00.0: enabling permissive mode configuration space accesses! pciback 0000:41:00.0: permissive mode is potentially unsafe! pciback: vpci: 0000:41:00.0: assign to virtual slot 0 device vif1.0 entered promiscuous mode ADDRCONF(NETDEV_UP): vif1.0: link is not ready blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi) PCI: Enabling device 0000:41:00.0 (0000 -> 0002) ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 PCI: Setting latency timer of device 0000:41:00.0 to 64 ACPI: PCI interrupt for device 0000:41:00.0 disabled ========================================= === Guest lspci output === # lspci -v 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Subsystem: Hewlett-Packard Company Unknown device 170a Flags: fast devsel, IRQ 16 Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] Capabilities: [40] Power Management version 2 Capabilities: [48] Vital Product Data Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 Capabilities: [60] Express Endpoint IRQ 0 ===================== _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- . . . s u b b u "You've got to be original, because if you're like someone else, what do they need you for?"
From subbukl at gmail.com Wed Feb 11 21:52:25 2009 From: subbukl at gmail.com (subbu kl) Date: Thu, 12 Feb 2009 11:22:25 +0530 Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com> Message-ID: No luck! dmesg in the Xen PV guest shows: ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) ib_mthca: Initializing 0000:00:00.0 PCI: Enabling device 0000:00:00.0 (0000 -> 0002) PCI: Setting latency timer of device 0000:00:00.0 to 64 ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. ib_mthca: probe of 0000:00:00.0 failed with error -11 even after executing the following in dom0: #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive I am getting the following messages on the console as part of the guest's initial boot messages: Started domain rhel52_64_3 PCI: Fatal: No PCI config space access function found rtc: IRQ 8 is not free. i8042.c: No controller found. after executing the following in dom0: #xm create -c rhel52_64_3 So the problem persists. ~subbu 2009/2/12 Jiang, Yunhong > It seems the PCI frontend tries to write to some configuration space field > that pciback has no config_field entry to support. > I think you can first try what dom0's dmesg suggested: "see > permissive attribute in sysfs" (it should be "set permissive attribute...", > I think). > > BTW, where did you get the following log? It seems to suggest that no config space > access function was found. > > PCI: Fatal: No PCI config space access function found > rtc: IRQ 8 is not free. > i8042.c: No controller found."
-- . . . s u b b u "You've got to be original, because if you're like someone else, what do they need you for?"
From yunhong.jiang at intel.com Wed Feb 11 22:20:55 2009 From: yunhong.jiang at intel.com (Jiang, Yunhong) Date: Thu, 12 Feb 2009 14:20:55 +0800 Subject: RE: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com> Message-ID: So any changes in dom0's dmesg? ________________________________ From: subbu kl [mailto:subbukl at gmail.com] Sent: February 12, 2009 13:52 To: Jiang, Yunhong Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working No luck! dmesg in the Xen PV guest shows: ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) ib_mthca: Initializing 0000:00:00.0 PCI: Enabling device 0000:00:00.0 (0000 -> 0002) PCI: Setting latency timer of device 0000:00:00.0 to 64 ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. ib_mthca: probe of 0000:00:00.0 failed with error -11 even after executing the following in dom0: #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive I am getting the following messages on the console as part of the guest's initial boot messages: Started domain rhel52_64_3 PCI: Fatal: No PCI config space access function found rtc: IRQ 8 is not free. i8042.c: No controller found. after executing the following in dom0: #xm create -c rhel52_64_3 So the problem persists. ~subbu 2009/2/12 Jiang, Yunhong > It seems the PCI frontend tries to write to some configuration space field that pciback has no config_field entry to support. I think you can first try what dom0's dmesg suggested: "see permissive attribute in sysfs" (it should be "set permissive attribute...", I think). BTW, where did you get the following log? It seems to suggest that no config space access function was found. PCI: Fatal: No PCI config space access function found rtc: IRQ 8 is not free.
i8042.c: No controller found." -- Yunhong Jiang -- . . . s u b b u "You've got to be original, because if you're like someone else, what do they need you for?"
From Sumeet.Lahorani at oracle.com Wed Feb 11 22:31:42 2009 From: Sumeet.Lahorani at oracle.com (Sumeet Lahorani) Date: Wed, 11 Feb 2009 22:31:42 -0800 Subject: [ofa-general] Enabling IP_CM warns about multicast packet drops In-Reply-To: <4992EABA.9090605@Voltaire.com> References: <4990CD57.3080108@oracle.com> <4992EABA.9090605@Voltaire.com> Message-ID: <4993C24E.504@oracle.com> Olga, Or, Thanks for the pointers. Does this packet drop always occur at the host, or could it also occur in the switches (Voltaire ISR 9024)? Also, besides the "packet len too long ..." message, is the "dropped" statistic in ifconfig ib0 a good way to find out whether such packet drops are happening? - Sumeet Or Gerlitz wrote: > Sumeet Lahorani wrote: > >> When we enable IB connected mode and increase MTU to 65520, we see the following >> kernel: ib0: enabling connected mode will cause multicast packet drops >> kernel: ib0: mtu > 2044 will cause multicast packet drops. >> > > >> Should we not be doing this? What kind of multicast packets will be dropped? >> If we are not using multicast, do any drivers (bonding, ipoib etc) internally use >> multicast in a way that will cause them to not work correctly in connected mode? >> > > Connected mode is supported only for unicast traffic; multicast traffic keeps going over the IB UD QP, whose MTU is much lower (e.g. 2-4K). To close the gap between the MTU seen by the network stack and the MTU used by the UD QP, IPoIB emulates receiving an ICMP packet that tells the OS stack to use a different path MTU for this multicast neighbour, see > > ipoib_start_xmit --> > ipoib_send --> > ipoib_cm_skb_too_long(mcast_mtu) --> > skb->dst->ops->update_pmtu(skb->dst, mtu) > > When IP multicast is not used, the network stack and bonding use multicast only for sending ARPs on the broadcast group and for IGMP, and both of those are well below the IB MTU. > > Or.
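Or's explanation comes down to a size comparison. IPoIB prepends a 4-byte encapsulation header (RFC 4391), so on a UD QP whose IB MTU is 2048 bytes the largest IP packet is 2048 - 4 = 2044 bytes. That matches the "mtu > 2044" warning Sumeet quotes, which suggests (our inference) that the multicast group here uses a 2048-byte IB MTU. The C sketch below models the check with hypothetical function names. It is not the kernel's ipoib code; Or describes that code path as emulating an ICMP message through update_pmtu.

```c
#include <assert.h>
#include <stdbool.h>

/* IPoIB encapsulation header: 2-byte type + 2 reserved bytes (RFC 4391). */
#define IPOIB_ENCAP_LEN 4u

/* Largest IP packet a UD QP can carry for a given IB MTU in bytes. */
static unsigned int ipoib_ud_ip_mtu(unsigned int ib_mtu_bytes)
{
	return ib_mtu_bytes - IPOIB_ENCAP_LEN;
}

/* Hypothetical model of the warning in this thread: multicast always
 * uses the UD QP, so an interface MTU above the UD limit means large
 * multicast packets cannot be sent unmodified. */
static bool mtu_warns_mcast_drops(unsigned int dev_mtu,
				  unsigned int mcast_ib_mtu_bytes)
{
	return dev_mtu > ipoib_ud_ip_mtu(mcast_ib_mtu_bytes);
}

/* Whether one packet fits: in connected mode unicast may use the full
 * interface MTU, while multicast is also capped at the UD limit. */
static bool packet_fits(unsigned int pkt_len, bool is_multicast,
			unsigned int dev_mtu, unsigned int mcast_ib_mtu_bytes)
{
	unsigned int limit = dev_mtu;
	unsigned int ud = ipoib_ud_ip_mtu(mcast_ib_mtu_bytes);

	if (is_multicast && ud < limit)
		limit = ud;
	return pkt_len <= limit;
}

static void self_check(void)
{
	/* Values from this thread: CM MTU 65520, warning threshold 2044. */
	assert(ipoib_ud_ip_mtu(2048) == 2044);
	assert(mtu_warns_mcast_drops(65520, 2048));
	assert(!mtu_warns_mcast_drops(2044, 2048));
	/* A small multicast packet fits; a 4000-byte one fits only as unicast. */
	assert(packet_fits(60, true, 65520, 2048));
	assert(packet_fits(4000, false, 65520, 2048));
	assert(!packet_fits(4000, true, 65520, 2048));
}
```

With the values from this thread, an interface MTU of 65520 triggers the warning while small packets still fit. That is the case for the ARP and IGMP traffic Or mentions, so the warning matters mainly for large IP multicast.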
> From subbukl at gmail.com Wed Feb 11 22:42:47 2009 From: subbukl at gmail.com (subbu kl) Date: Thu, 12 Feb 2009 12:12:47 +0530 Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com> Message-ID: Oops, I missed it. Now I don't see the "enabling permissive mode" message. Here are the messages I got in dom0 while booting domU: tap tap-1-51712: 2 getting info pciback: vpci: 0000:0e:00.0: assign to virtual slot 0 device vif1.0 entered promiscuous mode ADDRCONF(NETDEV_UP): vif1.0: link is not ready blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi) PCI: Enabling device 0000:0e:00.0 (0000 -> 0002) ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 PCI: Setting latency timer of device 0000:0e:00.0 to 64 ACPI: PCI interrupt for device 0000:0e:00.0 disabled ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready xenbr0: topology change detected, propagating xenbr0: port 3(vif1.0) entering forwarding state Do any of these messages look suspicious? Also, any idea why I get the following messages when domU boots? PCI: Fatal: No PCI config space access function found rtc: IRQ 8 is not free. ~subbu On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong wrote: > So any changes in dom0's dmesg? > > > ------------------------------ > *From:* subbu kl [mailto:subbukl at gmail.com] > *Sent:* February 12, 2009 13:52 > *To:* Jiang, Yunhong > *Cc:* David Brown; xen-devel at lists.xensource.com; > general at lists.openfabrics.org > *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not > working > > no luck ! > dmesg in XEN PV guest shows : > > ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) > ib_mthca: Initializing 0000:00:00.0 > PCI: Enabling device 0000:00:00.0 (0000 -> 0002) > PCI: Setting latency timer of device 0000:00:00.0 to 64 > ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
> ib_mthca: probe of 0000:00:00.0 failed with error -11 > > even after executing the following in dom0: > > #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive > > I am getting the following messages on the console as part of the initial > bootup messages of the guest: > > Started domain rhel52_64_3 > PCI: Fatal: No PCI config space access function found > rtc: IRQ 8 is not free. > i8042.c: No controller found. > > after executing the following in dom0 : > #xm create -c rhel52_64_3 > > > so, problem persists, > > ~subbu > > > 2009/2/12 Jiang, Yunhong > >> Seems it is because PCI frontend try to write some configuration space >> that PCIback has no config_field entry to support it. >> I think you can firstly try to do as dom0's dmesg suggested: "see >> permissive attribute in sysfs" (it should be "set permissive attribute...", >> I think). >> >> BTW, where you got following log? That seems suggest config space function >> not found. >> >> PCI: Fatal: No PCI config space access function found >> rtc: IRQ 8 is not free. >> i8042.c: No controller found." >> >> -- Yunhong Jiang >> >> ------------------------------ >> *From:* xen-devel-bounces at lists.xensource.com [mailto: >> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl >> *Sent:* February 11, 2009 22:18 >> *To:* David Brown >> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org >> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not working >> >> I am getting the same QUERY_FW failed on RHEL5.2 with Xen >> paravirtualized guest with pciback module. >> >> No one seems to have tried answering this question on the list, let me >> ping xen-devel and ofed people again.
>> >> after executing in dom0 >> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind >> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot >> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind >> >> #dmesg >> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >> tap tap-1-51712: 2 getting info >> tap tap-2-51712: 2 getting info >> pciback 0000:0e:00.0: seizing device >> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142) >> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >> >> #xm create -c rhel52_64_3 >> >> PCI: Fatal: No PCI config space access function found >> rtc: IRQ 8 is not free. >> i8042.c: No controller found. >> >> >> GUEST dmesg: >> >> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) >> ib_mthca: Initializing 0000:00:00.0 >> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >> PCI: Setting latency timer of device 0000:00:00.0 to 64 >> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. >> ib_mthca: probe of 0000:00:00.0 failed with error -11 >> >> in dom0: >> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info >> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to >> virtual slot 0 >> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode >> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not >> ready >> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol >> 1 (x86_64-abi) >> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write >> to a read-only configuration space field at offset 0x44, size 2. This may be >> harmless, but if you have problems with your device: >> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs >> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing >> list along with details of your device obtained from lspci. 
>> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> >> 0002) >> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 >> (level, low) -> IRQ 16 >> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 >> disabled >> >> >> >> some more details - [root at p128 ~]# rpm -qa | grep xen >> kernel-xen-2.6.18-92.1.22.el5 >> xen-3.0.3-64.el5_2.9 >> xen-libs-3.0.3-64.el5_2.9 >> xen-libs-3.0.3-64.el5_2.9 >> >> [root at p128 ~]# ibv_devinfo >> hca_id: mthca0 >> fw_ver: 5.3.0 >> node_guid: 0002:c902:0022:cd48 >> sys_image_guid: 0002:c902:0022:cd4b >> vendor_id: 0x02c9 >> vendor_part_id: 25218 >> hw_ver: 0x20 >> board_id: MT_0370130002 >> phys_port_cnt: 2 >> port: 1 >> state: PORT_INIT (2) >> max_mtu: 2048 (4) >> active_mtu: 512 (2) >> sm_lid: 0 >> port_lid: 0 >> port_lmc: 0x00 >> >> port: 2 >> state: PORT_DOWN (1) >> max_mtu: 2048 (4) >> active_mtu: 512 (2) >> sm_lid: 0 >> port_lid: 0 >> port_lmc: 0x00 >> >> >> any help greatly appreciated. >> >> ~subbu >> >> On Sat, Oct 18, 2008 at 4:54 AM, David Brown wrote: >> >>> Okay so my question to the openfabrics guys is, why would the OFED >>> drivers fail to read the firmware? >>> >>> Any thoughts? >>> >>> Thanks, >>> - David Brown >>> >>> >>> ---------- Forwarded message ---------- >>> From: David Brown >>> Date: Thu, Sep 11, 2008 at 2:24 PM >>> Subject: pciback module not working >>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com >>> >>> >>> This issue was brought up about a year and a half ago. So I'll bring >>> it up again and see if anything happens. >>> >>> I've got an infiniband network and am attempting to pass the >>> infiniband card through the host and give it to the guest. >>> I'm working with standard CentOS 5.2 on both guest and host with their >>> provided xen (3.0.3 ish). I've also attempted to install the newest >>> Xen 3.3 and use their standard host kernel and that did the same >>> thing. 
The guest dmesg output in the guest is similar on both >>> permissive and normal mode. >>> >>> I'm getting issues with detecting the firmware on the card for some >>> reason... >>> >>> Any help would be appreciated. >>> >>> Thanks, >>> - David Brown >>> >>> === GUEST dmesg output === >>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008) >>> ib_mthca: Initializing 0000:00:00.0 >>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. >>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>> ======================= >>> >>> === Host modprobe.conf === >>> alias eth0 bnx2 >>> alias eth1 bnx2 >>> alias scsi_hostadapter cciss >>> options pciback hide=(41:00.0) >>> ===================== >>> >>> === Host lspci output === >>> # lspci -vs 41:00.0 >>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx >>> HCA] (rev 20) >>> Subsystem: Hewlett-Packard Company Unknown device 170a >>> Flags: fast devsel, IRQ 16 >>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] >>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>> Capabilities: [40] Power Management version 2 >>> Capabilities: [48] Vital Product Data >>> Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 >>> Enable- >>> Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 >>> Capabilities: [60] Express Endpoint IRQ 0 >>> ===================== >>> >>> This makes sure it get loaded first off before anything else. >>> === Host mkinitrd cmd === >>> # mkinitrd -f --with=pciback --preload pciback >>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen >>> ==================== >>> >>> === Host pciback dmesg === >>> pciback 0000:41:00.0: Driver tried to write to a read-only >>> configuration space field at offset 0x44, size 2. 
This may be >>> harmless, but if you have problems with your device: >>> 1) see permissive attribute in sysfs >>> 2) report problems to the xen-devel mailing list along with details of >>> your device obtained from lspci. >>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002) >>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>> PCI: Setting latency timer of device 0000:41:00.0 to 64 >>> ACPI: PCI interrupt for device 0000:41:00.0 disabled >>> ====================== >>> >>> === Host pciback dmesg (after setting it permissive) === >>> pciback 0000:41:00.0: enabling permissive mode configuration space >>> accesses! >>> pciback 0000:41:00.0: permissive mode is potentially unsafe! >>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0 >>> device vif1.0 entered promiscuous mode >>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready >>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi) >>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002) >>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>> PCI: Setting latency timer of device 0000:41:00.0 to 64 >>> ACPI: PCI interrupt for device 0000:41:00.0 disabled >>> ========================================= >>> >>> === Guest lspci output === >>> # lspci -v >>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx >>> HCA] (rev 20) >>> Subsystem: Hewlett-Packard Company Unknown device 170a >>> Flags: fast devsel, IRQ 16 >>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] >>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>> Capabilities: [40] Power Management version 2 >>> Capabilities: [48] Vital Product Data >>> Capabilities: [90] Message Signalled Interrupts: 64bit+ >>> Queue=0/5 Enable- >>> Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 >>> Capabilities: [60] Express Endpoint IRQ 0 >>> ===================== >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org 
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >>> >> >> >> >> -- >> . . . s u b b u >> "You've got to be original, because if you're like someone else, what do >> they need you for?" >> >> > > > -- > . . . s u b b u > "You've got to be original, because if you're like someone else, what do > they need you for?" > > -- . . . s u b b u "You've got to be original, because if you're like someone else, what do they need you for?" From nicolas.morey-chaisemartin at ext.bull.net Wed Feb 11 22:46:27 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Thu, 12 Feb 2009 07:46:27 +0100 Subject: [ofa-general] [PATCH v3] opensm/osm_console.c : Added dump_portguid function to console to generate a list of port guids matching one or more regexps Message-ID: <4993C5C3.6020700@ext.bull.net> This adds a dump_portguid function to the OpenSM console, which makes it easy to generate cn_guid_file, root_guid_file and similar files by dumping to a file all port GUIDs whose node description matches at least one of the provided regexps. Signed-off-by: Nicolas Morey-Chaisemartin --- Diff from v2: - Changed lock to read-only instead of exclusive - Removed useless cl_spinlock_release (leftover from the first patch) opensm/opensm/osm_console.c | 104 +++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 104 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index c6e8e59..5bc1079 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -42,6 +42,7 @@ #include #include #include +#include #ifdef ENABLE_OSM_CONSOLE_SOCKET #include #endif @@ -1173,6 +1174,108 @@ static void version_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) } /* more parse routines go here */ +typedef struct _regexp_list { +	regex_t exp;
+ struct _regexp_list *next; +} regexp_list_t; + +static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +{ + cl_qmap_t *p_port_guid_tbl; + osm_port_t *p_port; + osm_port_t *p_next_port; + + regexp_list_t *p_head_regexp = NULL; + regexp_list_t *p_regexp; + + /* Option variables */ + char *p_cmd = NULL; + FILE *output = out; + + /* Read commande line */ + + while (1) { + p_cmd = next_token(p_last); + if (p_cmd) { + if (strcmp(p_cmd, "file") == 0) { + p_cmd = next_token(p_last); + if (p_cmd) { + output = fopen(p_cmd, "w+"); + if (output == NULL) { + fprintf(out, + "Could not open file %s: %s\n", + p_cmd, strerror(errno)); + output = out; + } + } else + fprintf(out, "No file name passed\n"); + } else { + p_regexp = malloc(sizeof(*p_regexp)); + if (regcomp + (&(p_regexp->exp), p_cmd, + REG_NOSUB | REG_EXTENDED) != 0) { + fprintf(out, + "Couldn't parse regular expression %s. Skipping it.\n", + p_cmd); + } + p_regexp->next = p_head_regexp; + p_head_regexp = p_regexp; + } + } else + break; /* No more tokens */ + + } + + /* Check we have at least one expression to match */ + if (p_head_regexp == NULL) { + fprintf(out, "No valid expression provided. Aborting\n"); + return; + } + + if (p_osm->sm.p_subn->need_update != 0) { + fprintf(out, "Subnet is not ready yet. 
Try again later.\n"); + return; + } + + /* Subnet doesn't need to be updated so we can carry on */ + + CL_PLOCK_ACQUIRE(p_osm->sm.p_lock); + p_port_guid_tbl = &(p_osm->sm.p_subn->port_guid_tbl); + + p_next_port = (osm_port_t *) cl_qmap_head(p_port_guid_tbl); + while (p_next_port != (osm_port_t *) cl_qmap_end(p_port_guid_tbl)) { + + p_port = p_next_port; + p_next_port = + (osm_port_t *) cl_qmap_next(&p_next_port->map_item); + + for (p_regexp = p_head_regexp; p_regexp != NULL; + p_regexp = p_regexp->next) + if (regexec + (&(p_regexp->exp), p_port->p_node->print_desc, 0, + NULL, 0) == 0) + fprintf(output, "0x%" PRIxLEAST64 "\n", + cl_ntoh64(p_port->p_physp->port_guid)); + } + + CL_PLOCK_RELEASE(p_osm->sm.p_lock); + if (output != out) + fclose(output); + +} + +static void help_dump_portguid(FILE * out, int detail) +{ + fprintf(out, + "dump_portguid [file filename] regexp1 [regexp2 [regexp3 ...]] -- Dump port GUID matching a regexp \n"); + if (detail) { + fprintf(out, + "dump_portguid -- Dump all the port GUIDs whose node_desc matches one of the provided regexps\n"); + fprintf(out, + " [file filename] -- Send the port GUID list to the specified file instead of regular output\n"); + } + +} static const struct command console_cmds[] = { {"help", &help_command, &help_parse}, @@ -1192,6 +1295,7 @@ static const struct command console_cmds[] = { #ifdef ENABLE_OSM_PERF_MGR {"perfmgr", &help_perfmgr, &perfmgr_parse}, #endif /* ENABLE_OSM_PERF_MGR */ + {"dump_portguid", &help_dump_portguid, &dump_portguid_parse}, {NULL, NULL, NULL} /* end of array */ }; -- 1.6.1 From yunhong.jiang at intel.com Wed Feb 11 22:56:11 2009 From: yunhong.jiang at intel.com (Jiang, Yunhong) Date: Thu, 12 Feb 2009 14:56:11 +0800 Subject: RE: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com> Message-ID: Sorry, it seems the original mail
has already tried permissive mode :$ So how does the card execute the QUERY_FW command? Through config space or through MMIO? The following information is strange: why are all the MMIO ranges disabled? Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] As for the following message, I think it should be harmless, since domU has no config space access method. PCI: Fatal: No PCI config space access function found Thanks Yunhong Jiang ________________________________ From: subbu kl [mailto:subbukl at gmail.com] Sent: February 12, 2009 14:43 To: Jiang, Yunhong Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working oops missed it, well now I dont see that enable permissive...message. here goes the messages what I got in dom0 while booting domU tap tap-1-51712: 2 getting info pciback: vpci: 0000:0e:00.0: assign to virtual slot 0 device vif1.0 entered promiscuous mode ADDRCONF(NETDEV_UP): vif1.0: link is not ready blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi) PCI: Enabling device 0000:0e:00.0 (0000 -> 0002) ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 PCI: Setting latency timer of device 0000:0e:00.0 to 64 ACPI: PCI interrupt for device 0000:0e:00.0 disabled ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready xenbr0: topology change detected, propagating xenbr0: port 3(vif1.0) entering forwarding state any suspicious message ? any Idea why I get that : PCI: Fatal: No PCI config space access function found rtc: IRQ 8 is not free. message in domU bootup message ? ~subbu On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong > wrote: So any changes in dom0's dmesg?
________________________________ From: subbu kl [mailto:subbukl at gmail.com] Sent: February 12, 2009 13:52 To: Jiang, Yunhong Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working no luck ! dmesg in XEN PV guest shows : ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) ib_mthca: Initializing 0000:00:00.0 PCI: Enabling device 0000:00:00.0 (0000 -> 0002) PCI: Setting latency timer of device 0000:00:00.0 to 64 ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. ib_mthca: probe of 0000:00:00.0 failed with error -11 even after executing the following in dom0: #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive I am getting the following messages on the console as part of the initial bootup messages of the guest: Started domain rhel52_64_3 PCI: Fatal: No PCI config space access function found rtc: IRQ 8 is not free. i8042.c: No controller found. after executing the following in dom0 : #xm create -c rhel52_64_3 so, problem persists, ~subbu 2009/2/12 Jiang, Yunhong > Seems it is because PCI frontend try to write some configuration space that PCIback has no config_field entry to support it. I think you can firstly try to do as dom0's dmesg suggested: "see permissive attribute in sysfs" (it should be "set permissive attribute...", I think). BTW, where you got following log? That seems suggest config space function not found. PCI: Fatal: No PCI config space access function found rtc: IRQ 8 is not free. i8042.c: No controller found."
-- Yunhong Jiang ________________________________ From: xen-devel-bounces at lists.xensource.com [mailto:xen-devel-bounces at lists.xensource.com] On Behalf Of subbu kl Sent: February 11, 2009 22:18 To: David Brown Cc: xen-devel at lists.xensource.com; general at lists.openfabrics.org Subject: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working I am getting the same QUERY_FW failed on RHEL5.2 with Xen paravirtualized guest with pciback module. No one seems to have tried answering this question on the list, let me ping xen-devel and ofed people again. after executing in dom0 echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind #dmesg ACPI: PCI interrupt for device 0000:0e:00.0 disabled tap tap-1-51712: 2 getting info tap tap-2-51712: 2 getting info pciback 0000:0e:00.0: seizing device PCI: Enabling device 0000:0e:00.0 (0140 -> 0142) ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 ACPI: PCI interrupt for device 0000:0e:00.0 disabled #xm create -c rhel52_64_3 PCI: Fatal: No PCI config space access function found rtc: IRQ 8 is not free. i8042.c: No controller found. GUEST dmesg: ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) ib_mthca: Initializing 0000:00:00.0 PCI: Enabling device 0000:00:00.0 (0000 -> 0002) PCI: Setting latency timer of device 0000:00:00.0 to 64 ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11 in dom0: Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to virtual slot 0 Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not ready Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi) Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device: Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci. Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> 0002) Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 disabled some more details - [root at p128 ~]# rpm -qa | grep xen kernel-xen-2.6.18-92.1.22.el5 xen-3.0.3-64.el5_2.9 xen-libs-3.0.3-64.el5_2.9 xen-libs-3.0.3-64.el5_2.9 [root at p128 ~]# ibv_devinfo hca_id: mthca0 fw_ver: 5.3.0 node_guid: 0002:c902:0022:cd48 sys_image_guid: 0002:c902:0022:cd4b vendor_id: 0x02c9 vendor_part_id: 25218 hw_ver: 0x20 board_id: MT_0370130002 phys_port_cnt: 2 port: 1 state: PORT_INIT (2) max_mtu: 2048 (4) active_mtu: 512 (2) sm_lid: 0 port_lid: 0 port_lmc: 0x00 port: 2 state: PORT_DOWN (1) max_mtu: 2048 (4) active_mtu: 512 (2) sm_lid: 0 port_lid: 0 port_lmc: 0x00 any help greatly appreciated. ~subbu On Sat, Oct 18, 2008 at 4:54 AM, David Brown > wrote: Okay so my question to the openfabrics guys is, why would the OFED drivers fail to read the firmware? Any thoughts? 
Thanks, - David Brown ---------- Forwarded message ---------- From: David Brown > Date: Thu, Sep 11, 2008 at 2:24 PM Subject: pciback module not working To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com This issue was brought up about a year and a half ago. So I'll bring it up again and see if anything happens. I've got an infiniband network and am attempting to pass the infiniband card through the host and give it to the guest. I'm working with standard CentOS 5.2 on both guest and host with their provided xen (3.0.3 ish). I've also attempted to install the newest Xen 3.3 and use their standard host kernel and that did the same thing. The guest dmesg output in the guest is similar on both permissive and normal mode. I'm getting issues with detecting the firmware on the card for some reason... Any help would be appreciated. Thanks, - David Brown === GUEST dmesg output === ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008) ib_mthca: Initializing 0000:00:00.0 PCI: Enabling device 0000:00:00.0 (0000 -> 0002) PCI: Setting latency timer of device 0000:00:00.0 to 64 ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. 
ib_mthca: probe of 0000:00:00.0 failed with error -11 ======================= === Host modprobe.conf === alias eth0 bnx2 alias eth1 bnx2 alias scsi_hostadapter cciss options pciback hide=(41:00.0) ===================== === Host lspci output === # lspci -vs 41:00.0 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Subsystem: Hewlett-Packard Company Unknown device 170a Flags: fast devsel, IRQ 16 Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] Capabilities: [40] Power Management version 2 Capabilities: [48] Vital Product Data Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 Capabilities: [60] Express Endpoint IRQ 0 ===================== This makes sure it get loaded first off before anything else. === Host mkinitrd cmd === # mkinitrd -f --with=pciback --preload pciback /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen ==================== === Host pciback dmesg === pciback 0000:41:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device: 1) see permissive attribute in sysfs 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci. PCI: Enabling device 0000:41:00.0 (0000 -> 0002) ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 PCI: Setting latency timer of device 0000:41:00.0 to 64 ACPI: PCI interrupt for device 0000:41:00.0 disabled ====================== === Host pciback dmesg (after setting it permissive) === pciback 0000:41:00.0: enabling permissive mode configuration space accesses! pciback 0000:41:00.0: permissive mode is potentially unsafe! 
pciback: vpci: 0000:41:00.0: assign to virtual slot 0 device vif1.0 entered promiscuous mode ADDRCONF(NETDEV_UP): vif1.0: link is not ready blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi) PCI: Enabling device 0000:41:00.0 (0000 -> 0002) ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 PCI: Setting latency timer of device 0000:41:00.0 to 64 ACPI: PCI interrupt for device 0000:41:00.0 disabled ========================================= === Guest lspci output === # lspci -v 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Subsystem: Hewlett-Packard Company Unknown device 170a Flags: fast devsel, IRQ 16 Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] Capabilities: [40] Power Management version 2 Capabilities: [48] Vital Product Data Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 Capabilities: [60] Express Endpoint IRQ 0 ===================== _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- . . . s u b b u "You've got to be original, because if you're like someone else, what do they need you for?" -- . . . s u b b u "You've got to be original, because if you're like someone else, what do they need you for?" -- . . . s u b b u "You've got to be original, because if you're like someone else, what do they need you for?" -------------- next part -------------- An HTML attachment was scrubbed... 
From subbukl at gmail.com Wed Feb 11 22:58:59 2009 From: subbukl at gmail.com (subbu kl) Date: Thu, 12 Feb 2009 12:28:59 +0530 Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com> Message-ID: So will getting PCI config space access in domU solve the problem? If so, how can I achieve that? ~subbu On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong wrote: > Sorry that seems the original mail has tried the permissive already :$ > How will So how will the card do the QUERY_FW command?Through config space > or through MMIO? Following information is something strange, why all the > MMIO range is disabled? > > Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] > Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] > > As for the following information, I think it should be harmless since domU > has no method of config space access method. > PCI: Fatal: No PCI config space access function found > > Thanks > Yunhong Jiang > > ------------------------------ > *From:* subbu kl [mailto:subbukl at gmail.com] > *Sent:* February 12, 2009 14:43 > > *To:* Jiang, Yunhong > *Cc:* David Brown; xen-devel at lists.xensource.com; > general at lists.openfabrics.org > *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not > working > > oops missed it, > > well now I dont see that enable permissive...message.
here goes the > messages what I got in dom0 while booting domU > > tap tap-1-51712: 2 getting info > pciback: vpci: 0000:0e:00.0: assign to virtual slot 0 > device vif1.0 entered promiscuous mode > ADDRCONF(NETDEV_UP): vif1.0: link is not ready > blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi) > PCI: Enabling device 0000:0e:00.0 (0000 -> 0002) > ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 > PCI: Setting latency timer of device 0000:0e:00.0 to 64 > ACPI: PCI interrupt for device 0000:0e:00.0 disabled > ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready > xenbr0: topology change detected, propagating > xenbr0: port 3(vif1.0) entering forwarding state > > any suspicious message ? > any Idea why I get that : > PCI: Fatal: No PCI config space access function found > rtc: IRQ 8 is not free. > > message in domU bootup message ? > > ~subbu > > On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong wrote: > >> So any changes in dom0's dmesg? >> >> >> ------------------------------ >> *From:* subbu kl [mailto:subbukl at gmail.com] >> *Sent:* February 12, 2009 13:52 >> *To:* Jiang, Yunhong >> *Cc:* David Brown; xen-devel at lists.xensource.com; >> general at lists.openfabrics.org >> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not >> working >> >> no luck ! >> dmesg in XEN PV guest shows : >> >> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) >> ib_mthca: Initializing 0000:00:00.0 >> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >> PCI: Setting latency timer of device 0000:00:00.0 to 64 >> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >> >> even after executing the following in dom0: >> >> #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive >> >> I am getting the following messages on the console as part of the initial >> bootup messages of the guest: >> >> Started domain rhel52_64_3 >> PCI: Fatal: No PCI config space access function found >> rtc: IRQ 8 is not free. >> i8042.c: No controller found. >> >> after executing the following in dom0 : >> #xm create -c rhel52_64_3 >> >> >> so, problem persists, >> >> ~subbu >> >> >> 2009/2/12 Jiang, Yunhong >> >>> Seems it is because PCI frontend try to write some configuration space >>> that PCIback has no config_field entry to support it. >>> I think you can firstly try to do as dom0's dmesg suggested: "see >>> permissive attribute in sysfs" (it should be "set permissive attribute...", >>> I think). >>> >>> BTW, where you got following log? That seems suggest config space >>> function not found. >>> >>> PCI: Fatal: No PCI config space access function found >>> rtc: IRQ 8 is not free. >>> i8042.c: No controller found." >>> >>> -- Yunhong Jiang >>> >>> ------------------------------ >>> *From:* xen-devel-bounces at lists.xensource.com [mailto: >>> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl >>> *Sent:* February 11, 2009 22:18 >>> *To:* David Brown >>> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org >>> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not working >>> >>> I am getting the same QUERY_FW failed on RHEL5.2 with Xen >>> paravirtualized guest with pciback module. >>> >>> No one seems to have tried answering this question on the list, let me >>> ping xen-devel and ofed people again.
>>> >>> after executing in dom0 >>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind >>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot >>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind >>> >>> #dmesg >>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >>> tap tap-1-51712: 2 getting info >>> tap tap-2-51712: 2 getting info >>> pciback 0000:0e:00.0: seizing device >>> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142) >>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >>> >>> #xm create -c rhel52_64_3 >>> >>> PCI: Fatal: No PCI config space access function found >>> rtc: IRQ 8 is not free. >>> i8042.c: No controller found. >>> >>> >>> GUEST dmesg: >>> >>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) >>> ib_mthca: Initializing 0000:00:00.0 >>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. >>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>> >>> in dom0: >>> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info >>> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to >>> virtual slot 0 >>> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode >>> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not >>> ready >>> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, >>> protocol 1 (x86_64-abi) >>> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write >>> to a read-only configuration space field at offset 0x44, size 2. This may be >>> harmless, but if you have problems with your device: >>> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs >>> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing >>> list along with details of your device obtained from lspci. 
>>> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> >>> 0002) >>> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI >>> 16 (level, low) -> IRQ 16 >>> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 >>> disabled >>> >>> >>> >>> some more details - [root at p128 ~]# rpm -qa | grep xen >>> kernel-xen-2.6.18-92.1.22.el5 >>> xen-3.0.3-64.el5_2.9 >>> xen-libs-3.0.3-64.el5_2.9 >>> xen-libs-3.0.3-64.el5_2.9 >>> >>> [root at p128 ~]# ibv_devinfo >>> hca_id: mthca0 >>> fw_ver: 5.3.0 >>> node_guid: 0002:c902:0022:cd48 >>> sys_image_guid: 0002:c902:0022:cd4b >>> vendor_id: 0x02c9 >>> vendor_part_id: 25218 >>> hw_ver: 0x20 >>> board_id: MT_0370130002 >>> phys_port_cnt: 2 >>> port: 1 >>> state: PORT_INIT (2) >>> max_mtu: 2048 (4) >>> active_mtu: 512 (2) >>> sm_lid: 0 >>> port_lid: 0 >>> port_lmc: 0x00 >>> >>> port: 2 >>> state: PORT_DOWN (1) >>> max_mtu: 2048 (4) >>> active_mtu: 512 (2) >>> sm_lid: 0 >>> port_lid: 0 >>> port_lmc: 0x00 >>> >>> >>> any help greatly appreciated. >>> >>> ~subbu >>> >>> On Sat, Oct 18, 2008 at 4:54 AM, David Brown wrote: >>> >>>> Okay so my question to the openfabrics guys is, why would the OFED >>>> drivers fail to read the firmware? >>>> >>>> Any thoughts? >>>> >>>> Thanks, >>>> - David Brown >>>> >>>> >>>> ---------- Forwarded message ---------- >>>> From: David Brown >>>> Date: Thu, Sep 11, 2008 at 2:24 PM >>>> Subject: pciback module not working >>>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com >>>> >>>> >>>> This issue was brought up about a year and a half ago. So I'll bring >>>> it up again and see if anything happens. >>>> >>>> I've got an infiniband network and am attempting to pass the >>>> infiniband card through the host and give it to the guest. >>>> I'm working with standard CentOS 5.2 on both guest and host with their >>>> provided xen (3.0.3 ish). 
>>>> I've also attempted to install the newest >>>> Xen 3.3 and use their standard host kernel and that did the same >>>> thing. The guest dmesg output in the guest is similar on both >>>> permissive and normal mode. >>>> >>>> I'm getting issues with detecting the firmware on the card for some >>>> reason... >>>> >>>> Any help would be appreciated. >>>> >>>> Thanks, >>>> - David Brown >>>> >>>> === GUEST dmesg output === >>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008) >>>> ib_mthca: Initializing 0000:00:00.0 >>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. >>>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>>> ======================= >>>> >>>> === Host modprobe.conf === >>>> alias eth0 bnx2 >>>> alias eth1 bnx2 >>>> alias scsi_hostadapter cciss >>>> options pciback hide=(41:00.0) >>>> ===================== >>>> >>>> === Host lspci output === >>>> # lspci -vs 41:00.0 >>>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx >>>> HCA] (rev 20) >>>> Subsystem: Hewlett-Packard Company Unknown device 170a >>>> Flags: fast devsel, IRQ 16 >>>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] >>>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>>> Capabilities: [40] Power Management version 2 >>>> Capabilities: [48] Vital Product Data >>>> Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 >>>> Enable- >>>> Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 >>>> Capabilities: [60] Express Endpoint IRQ 0 >>>> ===================== >>>> >>>> This makes sure it gets loaded first, before anything else.
>>>> === Host mkinitrd cmd === >>>> # mkinitrd -f --with=pciback --preload pciback >>>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen >>>> ==================== >>>> >>>> === Host pciback dmesg === >>>> pciback 0000:41:00.0: Driver tried to write to a read-only >>>> configuration space field at offset 0x44, size 2. This may be >>>> harmless, but if you have problems with your device: >>>> 1) see permissive attribute in sysfs >>>> 2) report problems to the xen-devel mailing list along with details of >>>> your device obtained from lspci. >>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002) >>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>> PCI: Setting latency timer of device 0000:41:00.0 to 64 >>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled >>>> ====================== >>>> >>>> === Host pciback dmesg (after setting it permissive) === >>>> pciback 0000:41:00.0: enabling permissive mode configuration space >>>> accesses! >>>> pciback 0000:41:00.0: permissive mode is potentially unsafe! 
>>>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0 >>>> device vif1.0 entered promiscuous mode >>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready >>>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi) >>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002) >>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>> PCI: Setting latency timer of device 0000:41:00.0 to 64 >>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled >>>> ========================================= >>>> >>>> === Guest lspci output === >>>> # lspci -v >>>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx >>>> HCA] (rev 20) >>>> Subsystem: Hewlett-Packard Company Unknown device 170a >>>> Flags: fast devsel, IRQ 16 >>>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] >>>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>>> Capabilities: [40] Power Management version 2 >>>> Capabilities: [48] Vital Product Data >>>> Capabilities: [90] Message Signalled Interrupts: 64bit+ >>>> Queue=0/5 Enable- >>>> Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 >>>> Capabilities: [60] Express Endpoint IRQ 0 >>>> ===================== >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit >>>> http://openib.org/mailman/listinfo/openib-general >>>> >>> >>> >>> >>> -- >>> . . . s u b b u >>> "You've got to be original, because if you're like someone else, what do >>> they need you for?" >>> >>> >> >> >> -- >> . . . s u b b u >> "You've got to be original, because if you're like someone else, what do >> they need you for?" >> >> > > > -- > . . . s u b b u > "You've got to be original, because if you're like someone else, what do > they need you for?" > > -- . . . 
s u b b u "You've got to be original, because if you're like someone else, what do they need you for?" From yunhong.jiang at intel.com Wed Feb 11 23:00:31 2009 From: yunhong.jiang at intel.com (Jiang, Yunhong) Date: Thu, 12 Feb 2009 15:00:31 +0800 Subject: RE: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com> Message-ID: DomU accesses config space through pcibackend, so that message is OK. ________________________________ From: subbu kl [mailto:subbukl at gmail.com] Sent: February 12, 2009 14:59 To: Jiang, Yunhong Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working So getting PCI config space access in domU will solve the problem? If so, how can I achieve that? ~subbu On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong > wrote: Sorry, it seems the original mail has tried permissive mode already :$ So how will the card do the QUERY_FW command? Through config space or through MMIO? The following information is strange: why are all the MMIO ranges disabled? Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] As for the following information, I think it should be harmless since domU has no config space access method.
PCI: Fatal: No PCI config space access function found Thanks Yunhong Jiang From ogerlitz at voltaire.com Wed Feb 11 23:16:07 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 12 Feb 2009 09:16:07 +0200 Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for bind In-Reply-To: <798E955ACF6F4EBDBA311DE3C54C9B9E@amr.corp.intel.com> References: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com> <798E955ACF6F4EBDBA311DE3C54C9B9E@amr.corp.intel.com> Message-ID: <4993CCB7.6070203@voltaire.com> Sean Hefty wrote: > Not yet - but should be able to look into it by the end of the week. From what > Jason said, it sounds like ip_dev_find() doesn't behave like I was expecting. > OK, thanks for the update. Or. From wangwhao at cn.ibm.com Wed Feb 11 23:37:17 2009 From: wangwhao at cn.ibm.com (Wen Hao Wang) Date: Thu, 12 Feb 2009 15:37:17 +0800 Subject: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 Message-ID: Hi all: I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped in the RHEL5.3 image) with "yum groupinstall". Then I loaded some drivers and wrote the network interface configuration file ifcfg-ib0; ifup ib0 also succeeded. But the IB utilities reported "Connection timed out": [root at xblade06 network-scripts]# sminfo ibwarn: [32593] _do_madrpc: recv failed: Connection timed out ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) sminfo: iberror: failed: query I had to reboot the blade and rerun "openibd start"; after that, sminfo reported the correct contents. I would not expect this reboot to be required. Did I miss a configuration step? Moreover, "openibd start" reports a warning message about hwconf. Does anyone have comments about this?
[root at xblade07 ~]# /etc/init.d/openibd start Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such file or directory [ OK ] Thanks a lot! Wen Hao Wang Email: wangwhao at cn.ibm.com From kliteyn at dev.mellanox.co.il Wed Feb 11 23:42:39 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 12 Feb 2009 09:42:39 +0200 Subject: [ofa-general] fat-tree CN nodes? In-Reply-To: <49933588.6050607@ifi.uio.no> References: <49933273.1010504@ifi.uio.no> <49933445.5030203@harr.org> <49933588.6050607@ifi.uio.no> Message-ID: <4993D2EF.8030009@dev.mellanox.co.il> Hi Frank, Frank Olaf Sem-Jacobsen wrote: > Right, so it has no connection with any topological properties of the fat > tree? You're right, the term "Compute Node" by itself has no connection to the topological properties of the fat tree. However, fat-tree routing has some constraints on the topology, and one of these constraints is that all the compute nodes are required to be located at the same topological level of the tree (same rank). > Which again means that the definition of compute nodes is only > necessary for the ability to balance these separately in the tree? Right again, this is what the fat-tree routing does. -- Yevgeny > Thanks for your answer, > > Cameron Harr wrote: >> Hi Frank, >> A compute node is a computer/server that is generally dedicated to >> doing computational work in a cluster or group of computers. >> Cameron >> >> Frank Olaf Sem-Jacobsen wrote: >>> Hi, >>> >>> I have been looking into the fat tree code, and I was wondering about >>> the definition of a compute node (CN). Are these part of the leaf >>> switches at the bottom of the fat tree, or are they extra switches >>> that are connected to the fat tree, e.g. the switch in a rack of >>> blades which is again connected to the fat tree?
>>> >>> Appreciate the help, > > From subbukl at gmail.com Wed Feb 11 23:45:57 2009 From: subbukl at gmail.com (subbu kl) Date: Thu, 12 Feb 2009 13:15:57 +0530 Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com> Message-ID: So, back to square one? Why would QUERY_FW fail in domU? ~subbu On Thu, Feb 12, 2009 at 12:30 PM, Jiang, Yunhong wrote: > DomU accesses config space through pcibackend, so that message is OK. > > ------------------------------ > *From:* subbu kl [mailto:subbukl at gmail.com] > *Sent:* February 12, 2009 14:59 > > *To:* Jiang, Yunhong > *Cc:* David Brown; xen-devel at lists.xensource.com; > general at lists.openfabrics.org > *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not > working > > So getting PCI config space access in domU will solve the problem? If so, > how can I achieve that? > > ~subbu > > On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong wrote: > >> Sorry, it seems the original mail has tried permissive mode already :$ >> So how will the card do the QUERY_FW command? Through config >> space or through MMIO? The following information is strange: why are all >> the MMIO ranges disabled? >> >> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] >> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >> >> As for the following information, I think it should be harmless since domU >> has no config space access method.
>> PCI: Fatal: No PCI config space access function found >> >> Thanks >> Yunhong Jiang >> >> ------------------------------ >> *From:* subbu kl [mailto:subbukl at gmail.com] >> *Sent:* February 12, 2009 14:43 >> >> *To:* Jiang, Yunhong >> *Cc:* David Brown; xen-devel at lists.xensource.com; >> general at lists.openfabrics.org >> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not >> working >> >> oops missed it, >> >> well now I don't see that enable permissive...message. here goes the >> messages what I got in dom0 while booting domU >> >> tap tap-1-51712: 2 getting info >> pciback: vpci: 0000:0e:00.0: assign to virtual slot 0 >> device vif1.0 entered promiscuous mode >> ADDRCONF(NETDEV_UP): vif1.0: link is not ready >> blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi) >> PCI: Enabling device 0000:0e:00.0 (0000 -> 0002) >> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >> PCI: Setting latency timer of device 0000:0e:00.0 to 64 >> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >> ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready >> xenbr0: topology change detected, propagating >> xenbr0: port 3(vif1.0) entering forwarding state >> >> any suspicious message ? >> any Idea why I get that : >> PCI: Fatal: No PCI config space access function found >> rtc: IRQ 8 is not free. >> >> message in domU bootup message ? >> >> ~subbu >> >> On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong > > wrote: >> >>> So any changes in dom0's dmesg? >>> >>> >>> ------------------------------ >>> *From:* subbu kl [mailto:subbukl at gmail.com] >>> *Sent:* February 12, 2009 13:52 >>> *To:* Jiang, Yunhong >>> *Cc:* David Brown; xen-devel at lists.xensource.com; >>> general at lists.openfabrics.org >>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not >>> working >>> >>> no luck !
>>> dmesg in XEN PV guest shows : >>> >>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) >>> ib_mthca: Initializing 0000:00:00.0 >>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. >>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>> >>> even after executing the following in dom0: >>> >>> #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive >>> >>> I am getting the following messages on the console as part of the initial >>> bootup messages of the guest: >>> >>> Started domain rhel52_64_3 >>> PCI: Fatal: No PCI config space access function found >>> rtc: IRQ 8 is not free. >>> i8042.c: No controller found. >>> >>> after executing the following in dom0 : >>> #xm create -c rhel52_64_3 >>> >>> >>> so, problem persists, >>> >>> ~subbu >>> >>> >>> 2009/2/12 Jiang, Yunhong >>> >>>> Seems it is because PCI frontend try to write some configuration space >>>> that PCIback has no config_field entry to support it. >>>> I think you can firstly try to do as dom0's dmesg suggested: "see >>>> permissive attribute in sysfs" (it should be "set permissive attribute...", >>>> I think). >>>> >>>> BTW, where you got following log? That seems suggest config space >>>> function not found. >>>> >>>> PCI: Fatal: No PCI config space access function found >>>> rtc: IRQ 8 is not free. >>>> i8042.c: No controller found."
>>>> >>>> -- Yunhong Jiang >>>> >>>> ------------------------------ >>>> *From:* xen-devel-bounces at lists.xensource.com [mailto: >>>> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl >>>> *Sent:* February 11, 2009 22:18 >>>> *To:* David Brown >>>> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org >>>> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not >>>> working >>>> >>>> I am getting the same QUERY_FW failed on RHEL5.2 with a Xen >>>> paravirtualized guest with pciback module. >>>> >>>> No one seems to have tried answering this question on the list, let me >>>> ping xen-devel and ofed people again. >>>> >>>> after executing in dom0 >>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind >>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot >>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind >>>> >>>> #dmesg >>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >>>> tap tap-1-51712: 2 getting info >>>> tap tap-2-51712: 2 getting info >>>> pciback 0000:0e:00.0: seizing device >>>> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142) >>>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >>>> >>>> #xm create -c rhel52_64_3 >>>> >>>> PCI: Fatal: No PCI config space access function found >>>> rtc: IRQ 8 is not free. >>>> i8042.c: No controller found. >>>> >>>> >>>> GUEST dmesg: >>>> >>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) >>>> ib_mthca: Initializing 0000:00:00.0 >>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>>> >>>> in dom0: >>>> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info >>>> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to >>>> virtual slot 0 >>>> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode >>>> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not >>>> ready >>>> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, >>>> protocol 1 (x86_64-abi) >>>> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to write >>>> to a read-only configuration space field at offset 0x44, size 2. This may be >>>> harmless, but if you have problems with your device: >>>> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs >>>> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel mailing >>>> list along with details of your device obtained from lspci. >>>> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> >>>> 0002) >>>> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI >>>> 16 (level, low) -> IRQ 16 >>>> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device 0000:0e:00.0 >>>> disabled >>>> >>>> >>>> >>>> some more details - [root at p128 ~]# rpm -qa | grep xen >>>> kernel-xen-2.6.18-92.1.22.el5 >>>> xen-3.0.3-64.el5_2.9 >>>> xen-libs-3.0.3-64.el5_2.9 >>>> xen-libs-3.0.3-64.el5_2.9 >>>> >>>> [root at p128 ~]# ibv_devinfo >>>> hca_id: mthca0 >>>> fw_ver: 5.3.0 >>>> node_guid: 0002:c902:0022:cd48 >>>> sys_image_guid: 0002:c902:0022:cd4b >>>> vendor_id: 0x02c9 >>>> vendor_part_id: 25218 >>>> hw_ver: 0x20 >>>> board_id: MT_0370130002 >>>> phys_port_cnt: 2 >>>> port: 1 >>>> state: PORT_INIT (2) >>>> max_mtu: 2048 (4) >>>> active_mtu: 512 (2) >>>> sm_lid: 0 >>>> port_lid: 0 >>>> port_lmc: 0x00 >>>> >>>> port: 2 >>>> state: PORT_DOWN (1) >>>> max_mtu: 2048 (4) >>>> active_mtu: 512 (2) >>>> sm_lid: 0 >>>> port_lid: 0 >>>> port_lmc: 0x00 >>>> >>>> >>>> any help 
greatly appreciated. >>>> >>>> ~subbu >>>> >>>> On Sat, Oct 18, 2008 at 4:54 AM, David Brown wrote: >>>> >>>>> Okay so my question to the openfabrics guys is, why would the OFED >>>>> drivers fail to read the firmware? >>>>> >>>>> Any thoughts? >>>>> >>>>> Thanks, >>>>> - David Brown >>>>> >>>>> >>>>> ---------- Forwarded message ---------- >>>>> From: David Brown >>>>> Date: Thu, Sep 11, 2008 at 2:24 PM >>>>> Subject: pciback module not working >>>>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com >>>>> >>>>> >>>>> This issue was brought up about a year and a half ago. So I'll bring >>>>> it up again and see if anything happens. >>>>> >>>>> I've got an infiniband network and am attempting to pass the >>>>> infiniband card through the host and give it to the guest. >>>>> I'm working with standard CentOS 5.2 on both guest and host with their >>>>> provided xen (3.0.3 ish). I've also attempted to install the newest >>>>> Xen 3.3 and use their standard host kernel and that did the same >>>>> thing. The guest dmesg output in the guest is similar on both >>>>> permissive and normal mode. >>>>> >>>>> I'm getting issues with detecting the firmware on the card for some >>>>> reason... >>>>> >>>>> Any help would be appreciated. >>>>> >>>>> Thanks, >>>>> - David Brown >>>>> >>>>> === GUEST dmesg output === >>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008) >>>>> ib_mthca: Initializing 0000:00:00.0 >>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. 
>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>>>> ======================= >>>>> >>>>> === Host modprobe.conf === >>>>> alias eth0 bnx2 >>>>> alias eth1 bnx2 >>>>> alias scsi_hostadapter cciss >>>>> options pciback hide=(41:00.0) >>>>> ===================== >>>>> >>>>> === Host lspci output === >>>>> # lspci -vs 41:00.0 >>>>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx >>>>> HCA] (rev 20) >>>>> Subsystem: Hewlett-Packard Company Unknown device 170a >>>>> Flags: fast devsel, IRQ 16 >>>>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] >>>>> [size=1M] >>>>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>>>> Capabilities: [40] Power Management version 2 >>>>> Capabilities: [48] Vital Product Data >>>>> Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 >>>>> Enable- >>>>> Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 >>>>> Capabilities: [60] Express Endpoint IRQ 0 >>>>> ===================== >>>>> >>>>> This makes sure it get loaded first off before anything else. >>>>> === Host mkinitrd cmd === >>>>> # mkinitrd -f --with=pciback --preload pciback >>>>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen >>>>> ==================== >>>>> >>>>> === Host pciback dmesg === >>>>> pciback 0000:41:00.0: Driver tried to write to a read-only >>>>> configuration space field at offset 0x44, size 2. This may be >>>>> harmless, but if you have problems with your device: >>>>> 1) see permissive attribute in sysfs >>>>> 2) report problems to the xen-devel mailing list along with details of >>>>> your device obtained from lspci. 
>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002) >>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64 >>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled >>>>> ====================== >>>>> >>>>> === Host pciback dmesg (after setting it permissive) === >>>>> pciback 0000:41:00.0: enabling permissive mode configuration space >>>>> accesses! >>>>> pciback 0000:41:00.0: permissive mode is potentially unsafe! >>>>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0 >>>>> device vif1.0 entered promiscuous mode >>>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready >>>>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi) >>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002) >>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64 >>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled >>>>> ========================================= >>>>> >>>>> === Guest lspci output === >>>>> # lspci -v >>>>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx >>>>> HCA] (rev 20) >>>>> Subsystem: Hewlett-Packard Company Unknown device 170a >>>>> Flags: fast devsel, IRQ 16 >>>>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] >>>>> [size=1M] >>>>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>>>> Capabilities: [40] Power Management version 2 >>>>> Capabilities: [48] Vital Product Data >>>>> Capabilities: [90] Message Signalled Interrupts: 64bit+ >>>>> Queue=0/5 Enable- >>>>> Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 >>>>> Capabilities: [60] Express Endpoint IRQ 0 >>>>> ===================== >>>>> _______________________________________________ >>>>> general mailing list >>>>> general at lists.openfabrics.org >>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>> >>>>> To unsubscribe, please visit >>>>> 
http://openib.org/mailman/listinfo/openib-general

-- . . . s u b b u "You've got to be original, because if you're like someone else, what do they need you for?"

From yunhong.jiang at intel.com Thu Feb 12 00:01:27 2009 From: yunhong.jiang at intel.com (Jiang, Yunhong) Date: Thu, 12 Feb 2009 16:01:27 +0800 Subject: RE: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com> Message-ID:

Can you please share more information on how ib_mthca performs QUERY_FW? Through config space access, or through MMIO access? More information would be helpful. The only thing that seems strange to me is that, judging from "Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]", the MMIO range seems to be disabled.

Thanks
Yunhong Jiang

________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 15:46
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

So, back to square one? Why should QUERY_FW fail in domU?
~subbu

On Thu, Feb 12, 2009 at 12:30 PM, Jiang, Yunhong wrote:
DomU accesses config space through the PCI backend, so that message is OK.

________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 14:59
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

So will getting PCI config space access in domU solve the problem? If so, how can I achieve that?

~subbu

On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong wrote:
Sorry, it seems the original mail already tried permissive mode. So how does the card perform the QUERY_FW command? Through config space or through MMIO? The following information looks strange: why are all the MMIO ranges disabled?

Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]
Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M]

As for the following message, I think it should be harmless, since domU has no native config space access method.
PCI: Fatal: No PCI config space access function found

Thanks
Yunhong Jiang

________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 14:43
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

Oops, missed it. Well, now I don't see that "enabling permissive ..." message. Here are the messages I got in dom0 while booting domU:

tap tap-1-51712: 2 getting info
pciback: vpci: 0000:0e:00.0: assign to virtual slot 0
device vif1.0 entered promiscuous mode
ADDRCONF(NETDEV_UP): vif1.0: link is not ready
blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi)
PCI: Enabling device 0000:0e:00.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:0e:00.0 to 64
ACPI: PCI interrupt for device 0000:0e:00.0 disabled
ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
xenbr0: topology change detected, propagating
xenbr0: port 3(vif1.0) entering forwarding state

Any suspicious messages? Any idea why I get these messages during domU boot?

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.

~subbu

On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong wrote:
So, any changes in dom0's dmesg?

________________________________
From: subbu kl [mailto:subbukl at gmail.com]
Sent: February 12, 2009 13:52
To: Jiang, Yunhong
Cc: David Brown; xen-devel at lists.xensource.com; general at lists.openfabrics.org
Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working

No luck! dmesg in the Xen PV guest shows:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:00:00.0
PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
PCI: Setting latency timer of device 0000:00:00.0 to 64
ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
ib_mthca: probe of 0000:00:00.0 failed with error -11

even after executing the following in dom0:

#echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive

After running "#xm create -c rhel52_64_3" in dom0, I get the following messages on the console as part of the guest's initial boot output:

Started domain rhel52_64_3
PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

So, the problem persists.

~subbu

2009/2/12 Jiang, Yunhong
It seems this is because the PCI frontend tries to write a configuration space field for which pciback has no config_field entry. I think you can first try what dom0's dmesg suggested: "see permissive attribute in sysfs" (it should be "set permissive attribute...", I think).

BTW, where did you get the following log? It seems to suggest that the config space access function was not found.

PCI: Fatal: No PCI config space access function found
rtc: IRQ 8 is not free.
i8042.c: No controller found.

-- Yunhong Jiang
From subbukl at gmail.com Thu Feb 12 00:20:13 2009 From: subbukl at gmail.com (subbu kl) Date: Thu, 12 Feb 2009 13:50:13 +0530 Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> Message-ID:

I did a quick search, and I believe it's MMIO. In http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/drivers/infiniband/hw/mthca/mthca_main.c, the mthca_QUERY_FW() call leads to mthca_QUERY_FW() in http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/drivers/infiniband/hw/mthca/mthca_cmd.c, which in turn calls mthca_cmd_post_dbell()/mthca_cmd_post_hcr(), which in turn end up in:

__raw_writel((__force u32) cpu_to_be32(in_param >> 32), ptr + offs[0]);

The OFED people should be able to help more here and point out if I have missed something. Roland, any clue?

~subbu

On Thu, Feb 12, 2009 at 1:31 PM, Jiang, Yunhong wrote:
> Can you please share more information how will the ib_mthca do QUERY_FW?
> Through config space access? Through MMIO access? I think more information
> will be helpful. The only thing seems strange to me is, from "Memory at
> fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]" , seems the MMIO
> is disabled?
>> >> ------------------------------ >> *From:* subbu kl [mailto:subbukl at gmail.com] >> *Sent:* 2009年2月12日 14:59 >> >> *To:* Jiang, Yunhong >> *Cc:* David Brown; xen-devel at lists.xensource.com; >> general at lists.openfabrics.org >> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not >> working >> >> So getting PCI config space access in domU will solve the problem ? if >> so how can I achieve that ? >> >> ~subbu >> >> On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong > > wrote: >> >>> Sorry that seems the original mail has tried the permissive already :$ >>> How will So how will the card do the QEUREY_FW command?Through config >>> space or through MMIO? Following information is something strange, why all >>> the MMIO range is disabled? >>> >>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] >>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>> >>> As for the following information, I think it should be harmless since >>> domU has no method of config spacess access method. >>> PCI: Fatal: No PCI config space access function found >>> >>> Thanks >>> Yunhong Jiang >>> >>> ------------------------------ >>> *From:* subbu kl [mailto:subbukl at gmail.com] >>> *Sent:* 2009年2月12日 14:43 >>> >>> *To:* Jiang, Yunhong >>> *Cc:* David Brown; xen-devel at lists.xensource.com; >>> general at lists.openfabrics.org >>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not >>> working >>> >>> oops missed it, >>> >>> well now I dont see that enable permissive...message. 
here goes the >>> messages what I got in dom0 while booting domU >>> >>> tap tap-1-51712: 2 getting info >>> pciback: vpci: 0000:0e:00.0: assign to virtual slot 0 >>> device vif1.0 entered promiscuous mode >>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready >>> blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi) >>> PCI: Enabling device 0000:0e:00.0 (0000 -> 0002) >>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>> PCI: Setting latency timer of device 0000:0e:00.0 to 64 >>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >>> ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready >>> xenbr0: topology change detected, propagating >>> xenbr0: port 3(vif1.0) entering forwarding state >>> >>> any suspicious message ? >>> any Idea why I get that : >>> PCI: Fatal: No PCI config space access function found >>> rtc: IRQ 8 is not free. >>> >>> message in domU bootup message ? >>> >>> ~subbu >>> >>> On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong < >>> yunhong.jiang at intel.com> wrote: >>> >>>> So any changes in dom0's dmesg? >>>> >>>> >>>> ------------------------------ >>>> *From:* subbu kl [mailto:subbukl at gmail.com] >>>> *Sent:* 2009年2月12日 13:52 >>>> *To:* Jiang, Yunhong >>>> *Cc:* David Brown; xen-devel at lists.xensource.com; >>>> general at lists.openfabrics.org >>>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not >>>> working >>>> >>>> no luck ! >>>> dmesg in XEN PV guest shows : >>>> >>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) >>>> ib_mthca: Initializing 0000:00:00.0 >>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. 
>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>>> >>>> even after executingh the following in dom0: >>>> >>>> #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive >>>> >>>> I am getting the follwing messages on the console as part of the initial >>>> bootup messages of the guest: >>>> >>>> Started domain rhel52_64_3 >>>> PCI: Fatal: No PCI config space access function found >>>> rtc: IRQ 8 is not free. >>>> i8042.c: No controller found. >>>> >>>> after executing the following in dom0 : >>>> #xm create -c rhel52_64_3 >>>> >>>> >>>> so, problem persisits, >>>> >>>> ~subbu >>>> >>>> >>>> 2009/2/12 Jiang, Yunhong >>>> >>>>> Seems it is because PCI frontend try to write some configuration >>>>> space that PCIback has no config_field entry to support it. >>>>> I think you can firstly try to do as dom0's dmesg suggested: "see >>>>> permissive attribute in sysfs" (it should be "set permissive attribute...", >>>>> I think). >>>>> >>>>> BTW, where you got following log? That seems suggest config space >>>>> function not found. >>>>> >>>>> PCI: Fatal: No PCI config space access function found >>>>> rtc: IRQ 8 is not free. >>>>> i8042.c: No controller found." >>>>> >>>>> -- Yunhong Jiang >>>>> >>>>> ------------------------------ >>>>> *From:* xen-devel-bounces at lists.xensource.com [mailto: >>>>> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl >>>>> *Sent:* 2009年2月11日 22:18 >>>>> *To:* David Brown >>>>> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org >>>>> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not >>>>> working >>>>> >>>>> I am getting the same QUERY_FW failed on RHEL5.2 with xenxen >>>>> paravirtualized guest with pciback module. >>>>> >>>>> No one seems to have tried answering this question on the list, let me >>>>> ping xen-devel and ofed people again. 
>>>>> >>>>> after executing in dom0 >>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind >>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot >>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind >>>>> >>>>> #dmesg >>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >>>>> tap tap-1-51712: 2 getting info >>>>> tap tap-2-51712: 2 getting info >>>>> pciback 0000:0e:00.0: seizing device >>>>> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142) >>>>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >>>>> >>>>> #xm create -c rhel52_64_3 >>>>> >>>>> PCI: Fatal: No PCI config space access function found >>>>> rtc: IRQ 8 is not free. >>>>> i8042.c: No controller found. >>>>> >>>>> >>>>> GUEST dmesg: >>>>> >>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) >>>>> ib_mthca: Initializing 0000:00:00.0 >>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. >>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>>>> >>>>> in dom0: >>>>> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info >>>>> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to >>>>> virtual slot 0 >>>>> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode >>>>> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not >>>>> ready >>>>> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, >>>>> protocol 1 (x86_64-abi) >>>>> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to >>>>> write to a read-only configuration space field at offset 0x44, size 2. 
This >>>>> may be harmless, but if you have problems with your device: >>>>> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs >>>>> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel >>>>> mailing list along with details of your device obtained from lspci. >>>>> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 -> >>>>> 0002) >>>>> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI >>>>> 16 (level, low) -> IRQ 16 >>>>> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device >>>>> 0000:0e:00.0 disabled >>>>> >>>>> >>>>> >>>>> some more details - [root at p128 ~]# rpm -qa | grep xen >>>>> kernel-xen-2.6.18-92.1.22.el5 >>>>> xen-3.0.3-64.el5_2.9 >>>>> xen-libs-3.0.3-64.el5_2.9 >>>>> xen-libs-3.0.3-64.el5_2.9 >>>>> >>>>> [root at p128 ~]# ibv_devinfo >>>>> hca_id: mthca0 >>>>> fw_ver: 5.3.0 >>>>> node_guid: 0002:c902:0022:cd48 >>>>> sys_image_guid: 0002:c902:0022:cd4b >>>>> vendor_id: 0x02c9 >>>>> vendor_part_id: 25218 >>>>> hw_ver: 0x20 >>>>> board_id: MT_0370130002 >>>>> phys_port_cnt: 2 >>>>> port: 1 >>>>> state: PORT_INIT (2) >>>>> max_mtu: 2048 (4) >>>>> active_mtu: 512 (2) >>>>> sm_lid: 0 >>>>> port_lid: 0 >>>>> port_lmc: 0x00 >>>>> >>>>> port: 2 >>>>> state: PORT_DOWN (1) >>>>> max_mtu: 2048 (4) >>>>> active_mtu: 512 (2) >>>>> sm_lid: 0 >>>>> port_lid: 0 >>>>> port_lmc: 0x00 >>>>> >>>>> >>>>> any help greatly appreciated. >>>>> >>>>> ~subbu >>>>> >>>>> On Sat, Oct 18, 2008 at 4:54 AM, David Brown wrote: >>>>> >>>>>> Okay so my question to the openfabrics guys is, why would the OFED >>>>>> drivers fail to read the firmware? >>>>>> >>>>>> Any thoughts? 
>>>>>> >>>>>> Thanks, >>>>>> - David Brown >>>>>> >>>>>> >>>>>> ---------- Forwarded message ---------- >>>>>> From: David Brown >>>>>> Date: Thu, Sep 11, 2008 at 2:24 PM >>>>>> Subject: pciback module not working >>>>>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com >>>>>> >>>>>> >>>>>> This issue was brought up about a year and a half ago. So I'll bring >>>>>> it up again and see if anything happens. >>>>>> >>>>>> I've got an infiniband network and am attempting to pass the >>>>>> infiniband card through the host and give it to the guest. >>>>>> I'm working with standard CentOS 5.2 on both guest and host with their >>>>>> provided xen (3.0.3 ish). I've also attempted to install the newest >>>>>> Xen 3.3 and use their standard host kernel and that did the same >>>>>> thing. The guest dmesg output in the guest is similar on both >>>>>> permissive and normal mode. >>>>>> >>>>>> I'm getting issues with detecting the firmware on the card for some >>>>>> reason... >>>>>> >>>>>> Any help would be appreciated. >>>>>> >>>>>> Thanks, >>>>>> - David Brown >>>>>> >>>>>> === GUEST dmesg output === >>>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008) >>>>>> ib_mthca: Initializing 0000:00:00.0 >>>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. 
>>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>>>>> ======================= >>>>>> >>>>>> === Host modprobe.conf === >>>>>> alias eth0 bnx2 >>>>>> alias eth1 bnx2 >>>>>> alias scsi_hostadapter cciss >>>>>> options pciback hide=(41:00.0) >>>>>> ===================== >>>>>> >>>>>> === Host lspci output === >>>>>> # lspci -vs 41:00.0 >>>>>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx >>>>>> HCA] (rev 20) >>>>>> Subsystem: Hewlett-Packard Company Unknown device 170a >>>>>> Flags: fast devsel, IRQ 16 >>>>>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] >>>>>> [size=1M] >>>>>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>>>>> Capabilities: [40] Power Management version 2 >>>>>> Capabilities: [48] Vital Product Data >>>>>> Capabilities: [90] Message Signalled Interrupts: 64bit+ >>>>>> Queue=0/5 Enable- >>>>>> Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 >>>>>> Capabilities: [60] Express Endpoint IRQ 0 >>>>>> ===================== >>>>>> >>>>>> This makes sure it get loaded first off before anything else. >>>>>> === Host mkinitrd cmd === >>>>>> # mkinitrd -f --with=pciback --preload pciback >>>>>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen >>>>>> ==================== >>>>>> >>>>>> === Host pciback dmesg === >>>>>> pciback 0000:41:00.0: Driver tried to write to a read-only >>>>>> configuration space field at offset 0x44, size 2. This may be >>>>>> harmless, but if you have problems with your device: >>>>>> 1) see permissive attribute in sysfs >>>>>> 2) report problems to the xen-devel mailing list along with details of >>>>>> your device obtained from lspci. 
>>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002) >>>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64 >>>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled >>>>>> ====================== >>>>>> >>>>>> === Host pciback dmesg (after setting it permissive) === >>>>>> pciback 0000:41:00.0: enabling permissive mode configuration space >>>>>> accesses! >>>>>> pciback 0000:41:00.0: permissive mode is potentially unsafe! >>>>>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0 >>>>>> device vif1.0 entered promiscuous mode >>>>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready >>>>>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi) >>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002) >>>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64 >>>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled >>>>>> ========================================= >>>>>> >>>>>> === Guest lspci output === >>>>>> # lspci -v >>>>>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx >>>>>> HCA] (rev 20) >>>>>> Subsystem: Hewlett-Packard Company Unknown device 170a >>>>>> Flags: fast devsel, IRQ 16 >>>>>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] >>>>>> [size=1M] >>>>>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>>>>> Capabilities: [40] Power Management version 2 >>>>>> Capabilities: [48] Vital Product Data >>>>>> Capabilities: [90] Message Signalled Interrupts: 64bit+ >>>>>> Queue=0/5 Enable- >>>>>> Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 >>>>>> Capabilities: [60] Express Endpoint IRQ 0 >>>>>> ===================== >>>>>> _______________________________________________ >>>>>> general mailing list >>>>>> general at lists.openfabrics.org >>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>>> >>>>>> To unsubscribe, 
please visit >>>>>> http://openib.org/mailman/listinfo/openib-general >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> . . . s u b b u >>>>> "You've got to be original, because if you're like someone else, what >>>>> do they need you for?" >>>>> >>>>> From ogerlitz at Voltaire.com Thu Feb 12 00:25:59 2009 From: ogerlitz at Voltaire.com (Or Gerlitz) Date: Thu, 12 Feb 2009 10:25:59 +0200 Subject: [ofa-general] Enabling IP_CM warns about multicast packet drops In-Reply-To: <4993C24E.504@oracle.com> References: <4990CD57.3080108@oracle.com> <4992EABA.9090605@Voltaire.com> <4993C24E.504@oracle.com> Message-ID: <4993DD17.4020205@Voltaire.com> Sumeet Lahorani wrote: > Does this packet drop always occur at the host or could it also occur in > the switches (Voltaire ISR 9024)? The drop happens at the host; here's the relevant ipoib code snippet from drivers/infiniband/ulp/ipoib/ipoib_ib.c :: ipoib_send() > if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { > ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", > skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN); > ++dev->stats.tx_dropped; > ++dev->stats.tx_errors; > ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu); > return; > } > Also, besides the "packet len too long ..."
message, is the "dropped" > statistic in ifconfig ib0 a good way to find out if such packet drops > are happening? Yes, see the code above. Or. From nicolas.morey-chaisemartin at ext.bull.net Thu Feb 12 01:11:24 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Thu, 12 Feb 2009 10:11:24 +0100 Subject: [ofa-general] [PATCH 0/3] Fat Tree - Routing between non-CN nodes Message-ID: <4993E7BC.4000105@ext.bull.net> Repost of the previous set of patches: Hi, We are currently working on an Ftree topology where IO nodes are connected to spine switches. Using the cn_guid_file and root_guid_file works great: it is possible to route the whole tree as a fat tree, and all the CNs are connected to the other CNs and to the IO nodes. However, we are missing some connectivity between IO nodes. This is the expected behavior, as a route between two IO nodes would have to go down from one spine switch and back up to another. However, we need at least some connectivity between those nodes. There won't be any real traffic, just some "ping" for HA purposes. Therefore, I have implemented two new options for OpenSM: io_guid_file and max_reverse_hops. The io_guid_file provides the list of all the IO node GUIDs (it may differ from the list of non-CN nodes). The max_reverse_hops option gives the number of times IO nodes (described by io_guid_file) are allowed to use a switch backward. According to my tests, this has absolutely no effect on regular routing and manages to connect the IO nodes together if max_reverse_hops is big enough. Regards Nicolas Morey-Chaisemartin ____ Nicolas Morey-Chaisemartin (3): opensm: Added io_guid_file and max_reverse_hops options opensm/osm_ucast_ftree.c: Added possible reverse hops for Ftree algorithm.
Added documentation for io_guid_file and max_reverse_hop feature opensm/doc/current-routing.txt | 32 +++++ opensm/include/opensm/osm_subnet.h | 6 + opensm/man/opensm.8.in | 27 ++++ opensm/opensm/main.c | 26 ++++- opensm/opensm/osm_subnet.c | 12 ++ opensm/opensm/osm_ucast_ftree.c | 244 +++++++++++++++++++++++++++++------- 6 files changed, 303 insertions(+), 44 deletions(-) From nicolas.morey-chaisemartin at ext.bull.net Thu Feb 12 01:11:34 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Thu, 12 Feb 2009 10:11:34 +0100 Subject: [ofa-general] [PATCH 1/3] opensm: Added io_guid_file and max_reverse_hops options In-Reply-To: References: Message-ID: <4993E7C6.8020501@ext.bull.net> Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/include/opensm/osm_subnet.h | 6 ++++++ opensm/opensm/main.c | 26 +++++++++++++++++++++++++- opensm/opensm/osm_subnet.c | 12 ++++++++++++ 3 files changed, 43 insertions(+), 1 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 8863e47..671b51f 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -190,6 +190,8 @@ typedef struct osm_subn_opt { char *lfts_file; char *root_guid_file; char *cn_guid_file; + char *io_guid_file; + uint16_t max_reverse_hops; char *ids_guid_file; char *guid_routing_order_file; char *sa_db_file; @@ -383,6 +385,10 @@ typedef struct osm_subn_opt { * Name of the file that contains list of compute node guids that * will be used by fat-tree routing (provided by User) * +* io_guid_file +* Name of the file that contains list of I/O node guids that +* will be used by fat-tree routing (provided by User) +* * ids_guid_file * Name of the file that contains list of ids which should be * used by Up/Down algorithm instead of node GUIDs diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index a8dc9e6..b5e3337 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -212,6 +212,12 @@ static void 
show_usage(void) printf("--cn_guid_file, -u \n" " Set the compute nodes for the Fat-Tree routing algorithm\n" " to the guids provided in the given file (one to a line)\n\n"); + printf("--io_guid_file, -G \n" + " Set the I/O nodes for the Fat-Tree routing algorithm\n" + " to the guids provided in the given file (one to a line)\n\n"); + printf("--max_reverse_hops, -H \n" + " Set the max number of hops the wrong way around\n" + " an I/O node is allowed to make (connectivity for I/O nodes on top switches)\n\n"); printf("--ids_guid_file, -m \n" " Name of the map file with set of the IDs which will be used\n" " by Up/Down routing algorithm instead of node GUIDs\n" @@ -526,7 +532,7 @@ int main(int argc, char *argv[]) uint32_t val; unsigned config_file_done = 0; const char *const short_option = - "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:"; + "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:G:H:"; /* In the array below, the 2nd parameter specifies the number @@ -570,6 +576,8 @@ int main(int argc, char *argv[]) {"sadb_file", 1, NULL, 'S'}, {"root_guid_file", 1, NULL, 'a'}, {"cn_guid_file", 1, NULL, 'u'}, + {"io_guid_file", 1, NULL, 'G'}, + {"max_reverse_hops", 1, NULL, 'H'}, {"ids_guid_file", 1, NULL, 'm'}, {"guid_routing_order_file", 1, NULL, 'X'}, {"stay_on_fatal", 0, NULL, 'y'}, @@ -880,6 +888,22 @@ int main(int argc, char *argv[]) opt.cn_guid_file); break; + case 'G': + /* + Specifies I/O node guids file + */ + opt.io_guid_file = optarg; + printf(" I/O Node Guid File: %s\n", + opt.io_guid_file); + break; + case 'H': + /* + Specifies I/O max reverse hops + */ + opt.max_reverse_hops = atoi(optarg); + printf(" Max Reverse Hops: %d\n", + opt.max_reverse_hops); + break; case 'm': /* Specifies ids guid file */ SET_STR_OPT(opt.ids_guid_file, optarg); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 69937c1..b356d33 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -578,6 +578,8 @@
void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->lfts_file = NULL; p_opt->root_guid_file = NULL; p_opt->cn_guid_file = NULL; + p_opt->io_guid_file = NULL; + p_opt->max_reverse_hops = 0; p_opt->ids_guid_file = NULL; p_opt->guid_routing_order_file = NULL; p_opt->sa_db_file = NULL; @@ -1393,6 +1395,16 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) p_opts->cn_guid_file ? p_opts->cn_guid_file : null_str); fprintf(out, + "# The file holding the fat-tree I/O node guids\n" + "# One guid in each line\nio_guid_file %s\n\n", + p_opts->io_guid_file ? p_opts->io_guid_file : null_str); + + fprintf(out, + "# Number of reverse hops allowed for I/O nodes \n" + "# Used for connectivity between I/O nodes connected to Top Switches\nmax_reverse_hops %d\n\n", + p_opts->max_reverse_hops); + + fprintf(out, "# The file holding the node ids which will be used by" " Up/Down algorithm instead\n# of GUIDs (one guid and" " id in each line)\nids_guid_file %s\n\n", -- 1.6.1 From nicolas.morey-chaisemartin at ext.bull.net Thu Feb 12 01:11:38 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Thu, 12 Feb 2009 10:11:38 +0100 Subject: [ofa-general] [PATCH 2/3] opensm/osm_ucast_ftree.c: Added possible reverse hops for Ftree algorithm. In-Reply-To: References: Message-ID: <4993E7CA.60103@ext.bull.net> This provides connectivity between nodes declared in the io_guid_file that have none with the regular algorithm, provided a route can be found using at most max_reverse_hops reverse hops in the tree. This is meant to be used for I/O and service nodes connected to the top switches of a fat tree that need connectivity but no real bandwidth.
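The hop-count bookkeeping behind reverse hops can be sketched as follows. This is a minimal, hypothetical C sketch, not code from the patch: the names ftree_hops_to_target(), initial_reverse_credit(), struct reverse_hop_state and take_reverse_hop() are invented for illustration. It assumes the convention visible in the patch's changes to __osm_ftree_fabric_route_downgoing_by_going_up(), where the hop value written for a target is target_rank - remote_rank + 2 * reverse_hops (each reverse hop adds one link down and one link back up), and where only ports listed in io_guid_file start with max_reverse_hops credits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an OpenSM function): hop count stored on a
 * switch for a target, mirroring the patch's expression
 * target_rank - remote_rank + 2 * reverse_hops. */
static int ftree_hops_to_target(uint8_t target_rank, uint8_t remote_rank,
                                uint16_t reverse_hops)
{
    /* Signed arithmetic: after several reverse hops the remote switch can
     * sit deeper in the tree than the target, so the first term may be
     * zero or negative while the total stays positive. */
    int hops = (int)target_rank - (int)remote_rank + 2 * (int)reverse_hops;

    assert(hops > 0);
    return hops;
}

/* Hypothetical helper: reverse-hop credit a route starts with. Only ports
 * listed in io_guid_file get max_reverse_hops; every other route starts
 * with 0 and therefore follows the regular fat-tree up/down rules. */
static uint16_t initial_reverse_credit(int is_io, uint16_t max_reverse_hops)
{
    return is_io ? max_reverse_hops : 0;
}

/* Hypothetical recursion state: credits left and reverse hops already
 * taken. Taking a reverse hop moves one unit from credit to done, so
 * credit + done stays equal to the starting credit. */
struct reverse_hop_state {
    uint16_t credit;
    uint16_t done;
};

static struct reverse_hop_state take_reverse_hop(struct reverse_hop_state s)
{
    assert(s.credit > 0); /* the patch only steps backward while credit > 0 */
    s.credit--;
    s.done++;
    return s;
}
```

With the topology drawn in the documentation patch (PATCH 3/3), N2 hangs off Spine2 (rank 0), so its own rank is 1. A route from Spine1 runs Spine1 -> Switch -> Spine2 -> N2, which uses one reverse hop, and ftree_hops_to_target(1, 0, 1) returns 3, matching the three links on that path.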
Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/opensm/osm_ucast_ftree.c | 244 ++++++++++++++++++++++++++++++++------- 1 files changed, 201 insertions(+), 43 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 53218d1..d92265b 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -150,6 +150,7 @@ typedef struct ftree_port_group_t_ { ftree_hca_or_sw remote_hca_or_sw; /* pointer to remote hca/switch */ cl_ptr_vector_t ports; /* vector of ports to the same lid */ boolean_t is_cn; /* whether this port is a compute node */ + boolean_t is_io; /* whether this port is an I/O node */ uint32_t counter_down; /* number of allocated routs downwards */ } ftree_port_group_t; @@ -199,6 +200,7 @@ typedef struct ftree_fabric_t_ { cl_qmap_t sw_tbl; cl_qmap_t sw_by_tuple_tbl; cl_qmap_t cn_guid_tbl; + cl_qmap_t io_guid_tbl; unsigned cn_num; uint8_t leaf_switch_rank; uint8_t max_switch_rank; @@ -386,7 +388,8 @@ __osm_ftree_port_group_create(IN ib_net16_t base_lid, IN ib_net64_t remote_node_guid, IN uint8_t remote_node_type, IN void *p_remote_hca_or_sw, - IN boolean_t is_cn) + IN boolean_t is_cn, + IN boolean_t is_io) { ftree_port_group_t *p_group = (ftree_port_group_t *) malloc(sizeof(ftree_port_group_t)); @@ -434,6 +437,7 @@ __osm_ftree_port_group_create(IN ib_net16_t base_lid, cl_ptr_vector_init(&p_group->ports, 0, /* min size */ 8); /* grow size */ p_group->is_cn = is_cn; + p_group->is_io = is_io; return p_group; } /* __osm_ftree_port_group_create() */ @@ -699,7 +703,7 @@ __osm_ftree_sw_add_port(IN ftree_sw_t * p_sw, remote_node_guid, remote_node_type, p_remote_hca_or_sw, - FALSE); + FALSE, FALSE); CL_ASSERT(p_group); if (direction == FTREE_DIRECTION_UP) @@ -830,7 +834,8 @@ __osm_ftree_hca_add_port(IN ftree_hca_t * p_hca, IN ib_net64_t remote_port_guid, IN ib_net64_t remote_node_guid, IN uint8_t remote_node_type, - IN void *p_remote_hca_or_sw, IN boolean_t is_cn) + IN void *p_remote_hca_or_sw, IN 
boolean_t is_cn, + IN boolean_t is_io) { ftree_port_group_t *p_group; @@ -853,7 +858,7 @@ __osm_ftree_hca_add_port(IN ftree_hca_t * p_hca, remote_node_guid, remote_node_type, p_remote_hca_or_sw, - is_cn); + is_cn, is_io); p_hca->up_port_groups[p_hca->up_port_groups_num++] = p_group; } __osm_ftree_port_group_add_port(p_group, port_num, remote_port_num); @@ -879,6 +884,7 @@ static ftree_fabric_t *__osm_ftree_fabric_create() cl_qmap_init(&p_ftree->sw_tbl); cl_qmap_init(&p_ftree->sw_by_tuple_tbl); cl_qmap_init(&p_ftree->cn_guid_tbl); + cl_qmap_init(&p_ftree->io_guid_tbl); return p_ftree; } @@ -945,6 +951,18 @@ static void __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree) } cl_qmap_remove_all(&p_ftree->cn_guid_tbl); + /* remove all the elements of io_guid_tbl */ + p_next_guid_element = + (name_map_item_t *) cl_qmap_head(&p_ftree->io_guid_tbl); + while (p_next_guid_element != + (name_map_item_t *) cl_qmap_end(&p_ftree->io_guid_tbl)) { + p_guid_element = p_next_guid_element; + p_next_guid_element = + (name_map_item_t *) cl_qmap_next(&p_guid_element->item); + free(p_guid_element); + } + cl_qmap_remove_all(&p_ftree->io_guid_tbl); + /* free the leaf switches array */ if ((p_ftree->leaf_switches_num > 0) && (p_ftree->leaf_switches)) free(p_ftree->leaf_switches); @@ -1335,6 +1353,14 @@ static inline boolean_t __osm_ftree_fabric_cns_provided(IN ftree_fabric_t * /***************************************************/ +static inline boolean_t __osm_ftree_fabric_ios_provided(IN ftree_fabric_t * + p_ftree) +{ + return (p_ftree->p_osm->subn.opt.io_guid_file != NULL); +} + +/***************************************************/ + static int __osm_ftree_fabric_mark_leaf_switches(IN ftree_fabric_t * p_ftree) { ftree_sw_t *p_sw; @@ -1901,7 +1927,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, IN uint8_t target_rank, IN boolean_t is_real_lid, IN boolean_t is_main_path, - IN uint8_t highest_rank_in_route) + IN uint8_t highest_rank_in_route, + IN uint16_t 
reverse_hops) { ftree_sw_t *p_remote_sw; uint16_t ports_num; @@ -2008,13 +2035,14 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, /* second case: skip the port group if the remote (lower) switch has been already configured for this target LID */ if (is_real_lid && !is_main_path && - p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != OSM_NO_PATH) + p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != + OSM_NO_PATH) continue; /* setting fwd tbl port only if this is real LID */ if (is_real_lid) { p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] = - p_min_port->remote_port_num; + p_min_port->remote_port_num; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to CA LID %u through port %u\n", __osm_ftree_tuple_to_str(p_remote_sw->tuple), @@ -2034,7 +2062,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, ((target_rank - highest_rank_in_route) + (p_remote_sw->rank - - highest_rank_in_route))); + highest_rank_in_route) + + reverse_hops * 2)); } } @@ -2049,15 +2078,13 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, /* Recursion step: Assign upgoing ports by stepping down, starting on REMOTE switch */ - created_route |= - __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, - p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ - NULL, /* prev. position - NULL to mark that we went down and not up */ - target_lid, /* LID that we're routing to */ - target_rank, /* rank of the LID that we're routing to */ - is_real_lid, /* whether the target LID is real or dummy */ - is_main_path, /* whether this is path to HCA that should by tracked by counters */ - highest_rank_in_route); /* highest visited point in the tree before going down */ + created_route |= __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ + NULL, /* prev. 
position - NULL to mark that we went down and not up */ + target_lid, /* LID that we're routing to */ + target_rank, /* rank of the LID that we're routing to */ + is_real_lid, /* whether the target LID is real or dummy */ + is_main_path, /* whether this is path to HCA that should by tracked by counters */ + highest_rank_in_route, reverse_hops); /* highest visited point in the tree before going down */ } /* done scanning all the down-going port groups */ @@ -2066,7 +2093,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, going through all the downgoing groups */ if (created_route) p_sw->down_port_groups_idx = - (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; + (p_sw->down_port_groups_idx + + 1) % p_sw->down_port_groups_num; return created_route; } /* __osm_ftree_fabric_route_upgoing_by_going_down() */ @@ -2091,7 +2119,9 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, IN ib_net16_t target_lid, IN uint8_t target_rank, IN boolean_t is_real_lid, - IN boolean_t is_main_path) + IN boolean_t is_main_path, + IN uint16_t reverse_hop_credit, + IN uint16_t reverse_hops) { ftree_sw_t *p_remote_sw; uint16_t ports_num; @@ -2112,11 +2142,42 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, target_rank, /* rank of the LID that we're routing to */ is_real_lid, /* whether this target LID is real or dummy */ is_main_path, /* whether this path to HCA should by tracked by counters */ - p_sw->rank); /* the highest visited point in the tree before going down */ + p_sw->rank, /* the highest visited point in the tree before going down */ + reverse_hops); /* Number of reverse_hops done up to this point */ /* recursion stop condition - if it's a root switch, */ - if (p_sw->rank == 0) + if (p_sw->rank == 0) { + if (reverse_hop_credit > 0) { + /* We go up by going down as we have some reverse_hop_credit left */ + /* We use the index to scatter a bit the reverse up routes */ + 
p_sw->down_port_groups_idx = + (p_sw->down_port_groups_idx + + 1) % p_sw->down_port_groups_num; + i = p_sw->down_port_groups_idx; + for (j = 0; j < p_sw->down_port_groups_num; j++) { + + p_group = p_sw->down_port_groups[i]; + i = (i + 1) % p_sw->down_port_groups_num; + + /* Skip this port group unless it points to a switch */ + if (p_group->remote_node_type != + IB_NODE_TYPE_SWITCH) + continue; + p_remote_sw = p_group->remote_hca_or_sw.p_sw; + + __osm_ftree_fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */ + p_sw, /* this switch - prev. position switch for the function */ + target_lid, /* LID that we're routing to */ + target_rank, /* rank of the LID that we're routing to */ + is_real_lid, /* whether this target LID is real or dummy */ + is_main_path, /* whether this is path to HCA that should by tracked by counters */ + reverse_hop_credit - 1, /* Remaining reverse_hops allowed */ + reverse_hops + 1); /* Number of reverse_hops done up to this point */ + } + + } return; + } /* Find the least loaded upgoing port group */ p_min_group = NULL; @@ -2202,14 +2263,20 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, p_min_group->counter_down++; p_min_port->counter_down++; if (is_real_lid) { - p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] = - p_min_port->remote_port_num; - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, - "Switch %s: set path to CA LID %u through port %u\n", - __osm_ftree_tuple_to_str(p_remote_sw->tuple), - cl_ntoh16(target_lid), - p_min_port->remote_port_num); - + /* This LID may already be in the LFT if the reverse_hop feature is used */ + /* We update the LFT only if this LID isn't already present.
*/ + if (p_remote_sw->p_osm_sw-> + new_lft[cl_ntoh16(target_lid)] == OSM_NO_PATH) { + p_remote_sw->p_osm_sw-> + new_lft[cl_ntoh16(target_lid)] = + p_min_port->remote_port_num; + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "Switch %s: set path to CA LID %u through port %u\n", + __osm_ftree_tuple_to_str(p_remote_sw-> + tuple), + cl_ntoh16(target_lid), + p_min_port->remote_port_num); + } /* On the remote switch that is pointed by the min_group, set hops for ALL the ports in the remote group. */ @@ -2223,7 +2290,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, cl_ntoh16(target_lid), p_port->remote_port_num, target_rank - - p_remote_sw->rank); + p_remote_sw->rank + + 2 * reverse_hops); } } @@ -2234,7 +2302,9 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, target_lid, /* LID that we're routing to */ target_rank, /* rank of the LID that we're routing to */ is_real_lid, /* whether this target LID is real or dummy */ - is_main_path); /* whether this is path to HCA that should by tracked by counters */ + is_main_path, /* whether this is path to HCA that should by tracked by counters */ + reverse_hop_credit, /* Remaining reverse_hops allowed */ + reverse_hops); /* Number of reverse_hops done up to this point */ } /* we're done for the third case */ @@ -2278,7 +2348,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, p_remote_sw = p_group->remote_hca_or_sw.p_sw; /* skip if target lid has been already set on remote switch fwd tbl */ - if (p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != OSM_NO_PATH) + if (p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != + OSM_NO_PATH) continue; if (p_sw->is_leaf) { @@ -2297,7 +2368,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, cl_ptr_vector_at(&p_group->ports, 0, (void *)&p_port); p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] = - p_port->remote_port_num; + p_port->remote_port_num; /* On the 
remote switch that is pointed by the p_group, set hops for ALL the ports in the remote group. */ @@ -2310,7 +2381,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, cl_ntoh16(target_lid), p_port->remote_port_num, target_rank - - p_remote_sw->rank); + p_remote_sw->rank + + 2 * reverse_hops); } /* Recursion step: @@ -2320,7 +2392,37 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, target_lid, /* LID that we're routing to */ target_rank, /* rank of the LID that we're routing to */ TRUE, /* whether the target LID is real or dummy */ - FALSE); /* whether this is path to HCA that should by tracked by counters */ + FALSE, /* whether this is path to HCA that should by tracked by counters */ + reverse_hop_credit, /* Remaining reverse_hops allowed */ + reverse_hops); /* Number of reverse_hops done up to this point */ + } + + /* If we don't have any reverse hop credits, we are done */ + if (reverse_hop_credit == 0) + return; + + /* We explore all the down group ports */ + /* We try to reverse jump for each of them */ + /* They already have a route to us from the upgoing_by_going_down started earlier */ + /* This is only so it'll continue exploring up, after this step backwards */ + for (i = 0; i < p_sw->down_port_groups_num; i++) { + p_group = p_sw->down_port_groups[i]; + p_remote_sw = p_group->remote_hca_or_sw.p_sw; + + /* Skip this port group unless it points to a switch */ + if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) + continue; + + /* Recursion step: + Assign downgoing ports by stepping up, after doing one step down starting on REMOTE switch. */ + __osm_ftree_fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */ + p_sw, /* this switch - prev.
position switch for the function */ + target_lid, /* LID that we're routing to */ + target_rank, /* rank of the LID that we're routing to */ + TRUE, /* whether the target LID is real or dummy */ + TRUE, /* whether this is path to HCA that should by tracked by counters */ + reverse_hop_credit - 1, /* Remaining reverse_hops allowed */ + reverse_hops + 1); /* Number of reverse_hops done up to this point */ } } /* ftree_fabric_route_downgoing_by_going_up() */ @@ -2408,7 +2510,9 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree) hca_lid, /* LID that we're routing to */ p_sw->rank + 1, /* rank of the LID that we're routing to */ TRUE, /* whether this HCA LID is real or dummy */ - TRUE); /* whether this path to HCA should by tracked by counters */ + TRUE, /* whether this path to HCA should by tracked by counters */ + 0, /* Number of reverse hops allowed */ + 0); /* Number of reverse hops done yet */ /* count how many real targets have been routed from this leaf switch */ routed_targets_on_leaf++; @@ -2433,7 +2537,9 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree) 0, /* LID that we're routing to - ignored for dummy HCA */ 0, /* rank of the LID that we're routing to - ignored for dummy HCA */ FALSE, /* whether this HCA LID is real or dummy */ - TRUE); /* whether this path to HCA should by tracked by counters */ + TRUE, /* whether this path to HCA should by tracked by counters */ + 0, /* Number of reverse hops allowed */ + 0); /* Number of reverse hops done yet */ } } } @@ -2518,7 +2624,9 @@ static void __osm_ftree_fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree) hca_lid, /* LID that we're routing to */ p_sw->rank + 1, /* rank of the LID that we're routing to */ TRUE, /* whether this HCA LID is real or dummy */ - TRUE); /* whether this path to HCA should by tracked by counters */ + TRUE, /* whether this path to HCA should by tracked by counters */ + p_hca_port_group->is_io ? 
p_ftree->p_osm->subn.opt.max_reverse_hops : 0, /* Number of reverse hops allowed */ + 0); /* Number of reverse hops done yet */ } /* done with all the port groups of this HCA - go to next HCA */ } @@ -2570,7 +2678,9 @@ static void __osm_ftree_fabric_route_to_switches(IN ftree_fabric_t * p_ftree) p_sw->base_lid, /* LID that we're routing to */ p_sw->rank, /* rank of the LID that we're routing to */ TRUE, /* whether the target LID is a real or dummy */ - FALSE); /* whether this path should by tracked by counters */ + FALSE, /* whether this path to HCA should by tracked by counters */ + 0, /* Number of reverse hops allowed */ + 0); /* Number of reverse hops done yet */ } OSM_LOG_EXIT(&p_ftree->p_osm->log); @@ -2802,6 +2912,7 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree, uint8_t i; uint8_t remote_port_num; boolean_t is_cn = FALSE; + boolean_t is_io = FALSE; int res = 0; for (i = 0; i < osm_node_get_num_physp(p_node); i++) { @@ -2879,9 +2990,31 @@ __osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree, "Marking CN port GUID 0x%016" PRIx64 "\n", cl_ntoh64(osm_physp_get_port_guid(p_osm_port))); } else { - OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, - "Marking non-CN port GUID 0x%016" PRIx64 "\n", - cl_ntoh64(osm_physp_get_port_guid(p_osm_port))); + if (__osm_ftree_fabric_ios_provided(p_ftree)) { + name_map_item_t *p_elem = + (name_map_item_t *) + cl_qmap_get(&p_ftree->io_guid_tbl, + cl_ntoh64 + (osm_physp_get_port_guid + (p_osm_port))); + if (p_elem != + (name_map_item_t *) + cl_qmap_end(&p_ftree->io_guid_tbl)) + is_io = TRUE; + + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "Marking I/O port GUID 0x%016" PRIx64 + "\n", + cl_ntoh64(osm_physp_get_port_guid + (p_osm_port))); + + } else { + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "Marking non-CN port GUID 0x%016" PRIx64 + "\n", + cl_ntoh64(osm_physp_get_port_guid + (p_osm_port))); + } } __osm_ftree_hca_add_port(p_hca, /* local ftree_hca object */ @@ -2894,7 +3027,7 @@
__osm_ftree_fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree, remote_node_guid, /* remote node guid */ remote_node_type, /* remote node type */ (void *)p_remote_sw, /* remote ftree_hca/sw object */ - is_cn); /* whether this port is compute node */ + is_cn, is_io); /* whether this port is compute node */ } Exit: @@ -3354,6 +3487,8 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree) if (parse_node_map(p_ftree->p_osm->subn.opt.cn_guid_file, add_guid_item_to_map, &p_ftree->cn_guid_tbl)) { + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB23: " + "Problem parsing CN guid file\n"); status = -1; goto Exit; } @@ -3366,6 +3501,29 @@ static int __osm_ftree_fabric_read_guid_files(IN ftree_fabric_t * p_ftree) } } + + if (__osm_ftree_fabric_ios_provided(p_ftree)) { + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "Fetching I/O nodes from file %s\n", + p_ftree->p_osm->subn.opt.io_guid_file); + + if (parse_node_map(p_ftree->p_osm->subn.opt.io_guid_file, + add_guid_item_to_map, + &p_ftree->io_guid_tbl)) { + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "ERR AB23: " "Problem parsing I/O guid file\n"); + status = -1; + goto Exit; + } + + if (!cl_qmap_count(&p_ftree->io_guid_tbl)) { + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "ERR AB23: " + "I/O node guids file has no valid guids\n"); + status = -1; + goto Exit; + } + } Exit: OSM_LOG_EXIT(&p_ftree->p_osm->log); return status; -- 1.6.1 From nicolas.morey-chaisemartin at ext.bull.net Thu Feb 12 01:11:42 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Thu, 12 Feb 2009 10:11:42 +0100 Subject: [ofa-general] [PATCH 3/3] Added documentation for io_guid_file and max_reverse_hop feature In-Reply-To: References: Message-ID: <4993E7CE.3090908@ext.bull.net> Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/doc/current-routing.txt | 32 ++++++++++++++++++++++++++++++++ opensm/man/opensm.8.in | 27 +++++++++++++++++++++++++++ 2 files changed, 59 insertions(+), 0
deletions(-) diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt index 0034d0e..1302860 100644 --- a/opensm/doc/current-routing.txt +++ b/opensm/doc/current-routing.txt @@ -237,6 +237,38 @@ in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create efficient communication pattern, that will match the routing tables. +Routing between non-CN nodes + + +The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. +In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. +In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them, +there will not necessarily be a route between N1, N2 and N3. +Such routes would require using at least one of the switches the wrong way around +(that is, leaving one of the top switches through a downgoing port while the route is supposed to go up). + + Spine1 Spine2 Spine3 + / \ / | \ / \ + / \ / | \ / \ + N1 Switch N2 Switch N3 + /|\ /|\ + / | \ / | \ + Going down to compute nodes + +To solve this problem, a list of non-CN nodes can be specified with the \'-G\' or \'--io_guid_file\' option. +These nodes will be allowed to use switches the wrong way around a specific number of times (specified by \'-H\' or \'--max_reverse_hops\'). +With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree. + +In the scheme above, with a max_reverse_hops value of 1, routes will be instantiated between N1<->N2 and N2<->N3. +With a max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them. + +Please note that using max_reverse_hops creates routes that use switches in a counter-stream way. +This option should never be used to connect nodes with high-bandwidth traffic between them! It should only be used +to allow connectivity for HA purposes or similar.
+Also having routes the other way around can in theory cause credit loops. + +Use these options with extreme care ! + Usage: diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in index 7690980..ce14c02 100644 --- a/opensm/man/opensm.8.in +++ b/opensm/man/opensm.8.in @@ -22,6 +22,8 @@ opensm \- InfiniBand subnet manager and administration (SM/SA) [\-S | \-\-sadb_file ] [\-a | \-\-root_guid_file ] [\-u | \-\-cn_guid_file ] +[\-G | \-\-io_guid_file ] +[\-H | \-\-max_reverse_hops ] [\-X | \-\-guid_routing_order_file ] [\-m | \-\-ids_guid_file ] [\-o(nce)] @@ -183,6 +185,16 @@ algorithm to the guids provided in the given file (one to a line). Set the compute nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line). .TP +\fB\-G\fR, \fB\-\-io_guid_file\fR +Set the I/O nodes for the Fat-Tree routing algorithm +to the guids provided in the given file (one to a line). +I/O nodes are non-CN nodes allowed to use up to max_reverse_hops switches +the wrong way around to improve connectivity. +.TP +\fB\-H\fR, \fB\-\-max_reverse_hops\fR +Set the maximum number of reverse hops an I/O node is allowed +to make. A reverse hop is the use of a switch the wrong way around. +.TP \fB\-m\fR, \fB\-\-ids_guid_file\fR Name of the map file with set of the IDs which will be used by Up/Down routing algorithm instead of node GUIDs @@ -800,6 +812,21 @@ in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create efficient communication pattern, that will match the routing tables. +Routing between non-CN nodes + +The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. +In such case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. +To solve this problem, a list of non-CN nodes can be specified by \'-G\' or \'--io_guid_file\' option. 
+Theses nodes will be allowed to use switches the wrong way round a specific number of times (specified by \'-H\' or \'--max_reverse_hops\'. +With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree. + +Please note that using max_reverse_hops creates routes that use the switch in a counter-stream way. +This option should never be used to connect nodes with high bandwidth traffic between them ! It should only be used +to allow connectivity for HA purposes or similar. +Also having routes the other way around can in theory cause credit loops. + +Use these options with extreme care ! + Activation through OpenSM Use '-R ftree' option to activate the fat-tree algorithm. -- 1.6.1 From vlad at lists.openfabrics.org Thu Feb 12 03:19:50 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 12 Feb 2009 03:19:50 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090212-0200 daily build status Message-ID: <20090212111950.91876E60888@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 
Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From tziporet at dev.mellanox.co.il Thu Feb 12 03:20:44 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 12 Feb 2009 13:20:44 +0200 Subject: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: References: Message-ID: <4994060C.4050001@mellanox.co.il> Wen Hao Wang wrote: > > Hi all: > > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped > in RHEL5.3 image) by "yum groupisntall". Then I load some drivers and > wrote network interface configuration file ifcfg-ib0. ifup ib0 also > succeeded. But IB utilites report Connetion timed out. > > > [root at xblade06 network-scripts]# sminfo > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > sminfo: iberror: failed: query > > I had to reboot the blade and rerun "openibd start". Then sminfo > reported correct contents. I do not suppose this reboot is required. 
> Did I miss any configuration step? > > Moreover, "openibd start" report one warning message about hwconf. > Anyone has comments about this? > > [root at xblade07 ~]# /etc/init.d/openibd start > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such > file or directory > [ OK ] > > Thanks a lot! > > Wen Hao Wang > Email: wangwhao at cn.ibm.com > > Doug?? Tziporet From tziporet at dev.mellanox.co.il Thu Feb 12 03:28:29 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 12 Feb 2009 13:28:29 +0200 Subject: [ewg] Re: [ofa-general] OFED (EWG) Feb 9, 2009 meeting minutes In-Reply-To: <499311B4.4090607@nasa.gov> References: <5D49E7A8952DC44FB38C38FA0D758EAD01BDAC2D@mtlexch01.mtl.com> <499311B4.4090607@nasa.gov> Message-ID: <499407DD.4070307@mellanox.co.il> Jeff Becker wrote: > > Thanks to NASA's developing relationship with Novell, I got access to > SLES11 rc3 iso's. I'm downloading them now, and will start on the > backports when that's done. > Thanks a lot Tziporet From ruffing at motama.com Thu Feb 12 03:48:19 2009 From: ruffing at motama.com (Jan Ruffing) Date: Thu, 12 Feb 2009 12:48:19 +0100 Subject: [ofa-general] Drop in TCP performance when using OFED? Message-ID: <49940C83.5020909@motama.com> Hallo, After I installed the OFED (1.4 beta), I noticed a drop in TCP performance via Infiniband: from 10 GBit/s to less than 8 GBit/s. Is that "expected behaviour"? Is there a way to avoid this performance loss? The HCA used in both test machines is a Mellanox Infinihost III Lx DDR HCA. Both machines run OpenSuse 11 with a 2.6.25.16 Kernel. 
Performance with Open Suse 11 "out of the box", using Open Suse 11 Infiniband packages:

tamara iperf-2.0.4/src> ./iperf -c 192.168.2.2 -l 3M
------------------------------------------------------------
Client connecting to 192.168.2.2, TCP port 5001
TCP window size: 515 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 47730 connected with 192.168.2.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 11.6 GBytes 10.0 Gbits/sec


Performance after the installation of OFED 1.4 beta:

tamara iperf-2.0.4/src> ./iperf -c 192.168.2.2 -l 3M
------------------------------------------------------------
Client connecting to 192.168.2.2, TCP port 5001
TCP window size: 902 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 38864 connected with 192.168.2.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 8.43 GBytes 7.24 Gbits/sec


Thanks in advance,
Jan

-- 
Jan Ruffing
Software Developer

Motama GmbH
Lortzingstraße 10 · 66111 Saarbrücken · Germany
tel +49 681 940 85 50 · fax +49 681 940 85 49
ruffing at motama.com · www.motama.com

Companies register · district council Saarbrücken · HRB 15249
CEOs · Dr.-Ing. Marco Lohse, Michael Repplinger

This e-mail may contain confidential and/or privileged information. If you
are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.
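The two iperf runs above can be cross-checked with a little arithmetic. The sketch below assumes that iperf 2 reports "GBytes" as 2^30 bytes and "Gbits/sec" as 10^9 bits per second, which is consistent with the figures in the output. The helper names are invented for illustration and are not part of iperf or OFED.

```c
#include <assert.h>

/* Assumed iperf 2 conventions (consistent with the output above):
 * "GBytes" are 2^30 bytes, "Gbits/sec" are 10^9 bits per second. */
#define IPERF_BYTES_PER_GBYTE (1024.0 * 1024.0 * 1024.0)
#define IPERF_BITS_PER_GBIT   1e9

/* Bandwidth in Gbits/sec for 'gbytes' GBytes moved in 'seconds' seconds. */
double iperf_gbits_per_sec(double gbytes, double seconds)
{
    assert(seconds > 0.0);
    return gbytes * IPERF_BYTES_PER_GBYTE * 8.0 / seconds / IPERF_BITS_PER_GBIT;
}

/* Relative drop from 'before' to 'after', in percent. */
double percent_drop(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

Using the figures above, iperf_gbits_per_sec(11.6, 10.0) comes out at about 9.96, which iperf rounds to 10.0. iperf_gbits_per_sec(8.43, 10.0) comes out at about 7.24. The percent_drop() of the two is roughly 27%.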
From hal.rosenstock at gmail.com Thu Feb 12 04:04:44 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 12 Feb 2009 07:04:44 -0500 Subject: ***SPAM*** Re: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: References: Message-ID: On Thu, Feb 12, 2009 at 2:37 AM, Wen Hao Wang wrote: > Hi all: > > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped in > RHEL5.3 image) by "yum groupisntall". Then I load some drivers and wrote > network interface configuration file ifcfg-ib0. ifup ib0 also succeeded. But > IB utilites report Connetion timed out. > > > [root at xblade06 network-scripts]# sminfo > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > sminfo: iberror: failed: query It looks like the SM found the blade and at least configured the SMLID but somehow LID routing did not work between the blade and the SM (at LID 9). Was this problem persistent (without rebooting the blade) ? Was the blade IB port active ? -- Hal > I had to reboot the blade and rerun "openibd start". Then sminfo reported > correct contents. I do not suppose this reboot is required. Did I miss any > configuration step? > > Moreover, "openibd start" report one warning message about hwconf. Anyone > has comments about this? > > [root at xblade07 ~]# /etc/init.d/openibd start > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such file or > directory > [ OK ] > > Thanks a lot! 
> > Wen Hao Wang > Email: wangwhao at cn.ibm.com > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From nicolas.morey-chaisemartin at ext.bull.net Thu Feb 12 04:20:36 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Thu, 12 Feb 2009 13:20:36 +0100 Subject: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: References: Message-ID: <49941414.2050400@ext.bull.net> Wen Hao Wang wrote: > > Hi all: > > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped > in RHEL5.3 image) by "yum groupisntall". Then I load some drivers and > wrote network interface configuration file ifcfg-ib0. ifup ib0 also > succeeded. But IB utilites report Connetion timed out. > > > [root at xblade06 network-scripts]# sminfo > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > sminfo: iberror: failed: query > > I had to reboot the blade and rerun "openibd start". Then sminfo > reported correct contents. I do not suppose this reboot is required. > Did I miss any configuration step? > > Moreover, "openibd start" report one warning message about hwconf. > Anyone has comments about this? > > [root at xblade07 ~]# /etc/init.d/openibd start > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such > file or directory > [ OK ] > > Thanks a lot! 
>
> Wen Hao Wang
> Email: wangwhao at cn.ibm.com
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

It sounds to me as if you don't have any Subnet Manager (OpenSM or managed switch) running.

$ sminfo
sminfo: sm lid 2 sm guid 0x8f1040041254a, activity count 751941 priority 3 state 3 SMINFO_MASTER
$ sminfo -P 2
ibwarn: [17975] mad_rpc: _do_madrpc failed; dport (Lid 3945)
sminfo: iberror: failed: query

(we don't have any SM on the subnet connected to port 2)

Your reboot might have started OpenSM, which would explain why it then worked.

Nicolas

From tziporet at dev.mellanox.co.il Thu Feb 12 04:21:30 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 12 Feb 2009 14:21:30 +0200
Subject: [ofa-general] Drop in TCP performance when using OFED?
In-Reply-To: <49940C83.5020909@motama.com>
References: <49940C83.5020909@motama.com>
Message-ID: <4994144A.8010102@mellanox.co.il>

Jan Ruffing wrote:
> Hallo,
>
> After I installed the OFED (1.4 beta), I noticed a drop in TCP
> performance via Infiniband: from 10 GBit/s to less than 8 GBit/s.
> Is that "expected behaviour"? Is there a way to avoid this performance loss?
>
> The HCA used in both test machines is a Mellanox Infinihost III Lx DDR
> HCA. Both machines run OpenSuse 11 with a 2.6.25.16 Kernel.
>
Is it SDP or IPoIB?
What is the FW version you use?
> > Performance with Open Suse 11 "out of the box", using Open Suse 11 > Infiniband packages: > > tamara iperf-2.0.4/src> ./iperf -c 192.168.2.2 -l 3M > ------------------------------------------------------------ > Client connecting to 192.168.2.2, TCP port 5001 > TCP window size: 515 KByte (default) > ------------------------------------------------------------ > [ 3] local 192.168.2.1 port 47730 connected with 192.168.2.2 port 5001 > [ ID] Interval Transfer Bandwidth > [ 3] 0.0-10.0 sec 11.6 GBytes 10.0 Gbits/sec > > > Performance after the Installation of ODED 1.4. beta: > > tamara iperf-2.0.4/src> ./iperf -c 192.168.2.2 -l 3M > ------------------------------------------------------------ > Client connecting to 192.168.2.2, TCP port 5001 > TCP window size: 902 KByte (default) > ------------------------------------------------------------ > [ 3] local 192.168.2.1 port 38864 connected with 192.168.2.2 port 5001 > [ ID] Interval Transfer Bandwidth > [ 3] 0.0-10.0 sec 8.43 GBytes 7.24 Gbits/sec > > > Thanks in advance, > Jan > > From hal.rosenstock at gmail.com Thu Feb 12 04:41:28 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 12 Feb 2009 07:41:28 -0500 Subject: [ofa-general] [RFC] OpenSM vendor layer In-Reply-To: <20090207123355.GP17713@sashak.voltaire.com> References: <20090207123355.GP17713@sashak.voltaire.com> Message-ID: Sasha, On 2/7/09, Sasha Khapyorsky wrote: > On 14:12 Fri 06 Feb , Hal Rosenstock wrote: >> >> I'm looking at adding pkey support into the OpenSM vendor layer. The >> pkey table is a per port structure and is part of ib_port_attr_t. That >> structure also include num_pkeys. There is only related API: >> osm_vendor_get_all_port_attr which takes several pointers, the second >> one is a pointer to a preallocated array of port attributes (memory >> allocation for that is done by the client). ib_port_attr_t includes a >> pointer to the pkey table. 
>> So the only way this can work is if that
>> allocation is also done by the client which makes that a valid
>> parameter on input (as well as output).
>
> This could be a client choice: if pkey table pointer is initialized as
> NULL osm_vendor_get_all_port_attr() allocates memory and initialize the
> table and its size, otherwise it fills up only provided by client pkey
> table entries.

That's what I originally thought too but I'm not so sure looking at
the other vendor layers. For example, osm_vendor_al.c (which I think
is used in Windows currently) has the following code in
osm_vendor_get_all_port_attr (and other vendor layers except umad are
similar):

        for (port_num = 0; port_num < num_ports; port_num++) {
                p_attr_array[port_count] =
                    *__osm_ca_info_get_port_attr_ptr(p_ca_info, port_num);
                port_count++;
        }

and

static ib_port_attr_t *__osm_ca_info_get_port_attr_ptr(IN const osm_ca_info_t *
                                                       const p_ca_info,
                                                       IN const uint8_t index)
{
        return (&p_ca_info->p_attr->p_port_attr[index]);
}

so I'm thinking the tables need to be supplied by the underlying
vendor library (al, umad, ...). Do you concur ?

-- Hal

> Sasha
>

From hal.rosenstock at gmail.com Thu Feb 12 05:13:47 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 12 Feb 2009 08:13:47 -0500
Subject: ***SPAM*** Re: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: 
References: <20090207123355.GP17713@sashak.voltaire.com>
Message-ID: 

On Thu, Feb 12, 2009 at 7:41 AM, Hal Rosenstock wrote:
> so I'm thinking the tables need to be supplied by the underlying
> vendor library (al, umad, ...). Do you concur ?

If so, this can be supported as part of umad or better yet as part of
OpenSM umad vendor with no umad changes.
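Sasha's client-choice convention in the quoted text works as follows. A NULL pkey table pointer means the callee allocates and sizes the table. A non-NULL pointer means only the entries the caller provided are filled. The sketch below illustrates that convention, and everything in it is hypothetical. struct port_attr_sketch and fill_pkey_table() are stand-ins, not the real ib_port_attr_t or osm_vendor_get_all_port_attr(). The source array represents whatever table the vendor library (for example libibumad) already holds, and byte order is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the pkey fields of a port attribute record.
 * This is NOT the real ib_port_attr_t layout; byte order is ignored. */
struct port_attr_sketch {
    uint16_t num_pkeys;     /* in: capacity of p_pkey_table (if non-NULL)
                               out: number of entries filled */
    uint16_t *p_pkey_table; /* NULL on input: callee allocates it and the
                               caller must free() it afterwards */
};

/* Fill 'attr' from a pkey table owned by the vendor library.
 * Returns 0 on success, -1 if allocation fails. */
int fill_pkey_table(struct port_attr_sketch *attr,
                    const uint16_t *src, uint16_t src_count)
{
    uint16_t n;

    assert(attr != NULL && (src != NULL || src_count == 0));

    if (attr->p_pkey_table == NULL) {
        /* Client passed no table: allocate one sized to the source. */
        if (src_count > 0) {
            attr->p_pkey_table = malloc(src_count * sizeof(*src));
            if (attr->p_pkey_table == NULL)
                return -1;
        }
        n = src_count;
    } else {
        /* Client passed a table: fill only as many entries as it holds. */
        n = attr->num_pkeys < src_count ? attr->num_pkeys : src_count;
    }

    if (n > 0)
        memcpy(attr->p_pkey_table, src, n * sizeof(*src));
    attr->num_pkeys = n;
    return 0;
}
```

Under this convention the caller ends up owning the table memory in both cases; what differs is who allocates it. That is the crux of Hal's concern. In the osm_vendor_al.c snippet he quotes, whole port attribute structures are copied, so any table pointer inside them would still refer to memory held by the vendor layer.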
-- Hal From dledford at redhat.com Thu Feb 12 05:20:30 2009 From: dledford at redhat.com (Doug Ledford) Date: Thu, 12 Feb 2009 08:20:30 -0500 Subject: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: <4994060C.4050001@mellanox.co.il> References: <4994060C.4050001@mellanox.co.il> Message-ID: <1234444830.10037.313.camel@firewall.xsintricity.com> On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote: > Wen Hao Wang wrote: > > > > Hi all: > > > > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped > > in RHEL5.3 image) by "yum groupisntall". Then I load some drivers and > > wrote network interface configuration file ifcfg-ib0. ifup ib0 also > > succeeded. But IB utilites report Connetion timed out. > > > > > > [root at xblade06 network-scripts]# sminfo > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > > sminfo: iberror: failed: query > > > > I had to reboot the blade and rerun "openibd start". Then sminfo > > reported correct contents. I do not suppose this reboot is required. > > Did I miss any configuration step? There was an unintentional bug in the rhel5.2 openibd init script in that it automatically turned itself on during install (generally, most init scripts should default to *not* turning themselves on during install of the package, nor should they start themselves during install of the package...this is for security reasons, imagine if you installed the bind name server on your box and it automatically started up before you had a chance to configure it). In rhel5.3 we fixed that bug. So, you may need to 'chkconfig --level 2345 openibd on' to make sure openibd starts up each time. The error you list above is consistent with not all of the kernel modules being loaded when you tried to use the sminfo program. > > Moreover, "openibd start" report one warning message about hwconf. > > Anyone has comments about this? 
> > > > [root at xblade07 ~]# /etc/init.d/openibd start > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such > > file or directory > > [ OK ] Can you see if the kudzu package is installed on your machine? The openib package uses this config file written by kudzu to determine what hardware drivers to load. I suppose I should put a specific requires in the rpm for that. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From ruffing at motama.com Thu Feb 12 05:35:49 2009 From: ruffing at motama.com (Jan Ruffing) Date: Thu, 12 Feb 2009 14:35:49 +0100 Subject: [ofa-general] Drop in TCP performance when using OFED? In-Reply-To: <4994144A.8010102@mellanox.co.il> References: <49940C83.5020909@motama.com> <4994144A.8010102@mellanox.co.il> Message-ID: <499425B5.2050000@motama.com> Tziporet Koren wrote: > Jan Ruffing wrote: >> After I installed the OFED (1.4 beta), I noticed a drop in TCP >> performance via Infiniband: from 10 GBit/s to less than 8 GBit/s. >> Is that "expected behaviour"? Is there a way to avoid this >> performance loss? >> >> The HCA used in both test machines is a Mellanox Infinihost III Lx DDR >> HCA. Both machines run OpenSuse 11 with a 2.6.25.16 Kernel. >> > > Is it SDP or IPoIB? > What is the FW version you use? That's using IPoIB. The FW version is 1.2.0 (according to ibv_devinfo). -- Jan Ruffing Software Developer Motama GmbH Lortzingstraße 10 · 66111 Saarbrücken · Germany tel +49 681 940 85 50 · fax +49 681 940 85 49 ruffing at motama.com · www.motama.com Companies register · district council Saarbrücken · HRB 15249 CEOs · Dr.-Ing. Marco Lohse, Michael Repplinger This e-mail may contain confidential and/or privileged information. 
If you are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.

From kliteyn at dev.mellanox.co.il Thu Feb 12 06:55:39 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 12 Feb 2009 16:55:39 +0200
Subject: [ofa-general] [PATCH] opensm/osm_sa.c: fixing SA MAD dump
Message-ID: <4994386B.1040703@dev.mellanox.co.il>

Hi Sasha,

osm_sa_send() returns the MAD to the pool after sending it,
so dumping the MAD after sending it is wrong - fixing.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/opensm/osm_sa.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c
index 185557f..416d44a 100644
--- a/opensm/opensm/osm_sa.c
+++ b/opensm/opensm/osm_sa.c
@@ -498,9 +498,9 @@ void osm_sa_respond(osm_sa_t *sa, osm_madw_t *madw, size_t attr_size,
                 free(item);
         }
 
+        osm_dump_sa_mad(sa->p_log, resp_sa_mad, OSM_LOG_FRAMES);
         osm_sa_send(sa, resp_madw, FALSE);
 
-        osm_dump_sa_mad(sa->p_log, resp_sa_mad, OSM_LOG_FRAMES);
 Exit:
         /* need to set the mem free ... */
         item = cl_qlist_remove_head(list);
-- 
1.5.1.4

From kliteyn at dev.mellanox.co.il Thu Feb 12 07:01:22 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 12 Feb 2009 17:01:22 +0200
Subject: [ofa-general] [PATCH] opensm/osm_state_mgr.c: small bug in scanning lid table
Message-ID: <499439C2.40206@dev.mellanox.co.il>

Hi Sasha,

ref_size and curr_size return the size of the array,
which counts LIDs from 0, so max_lid will be out of
actual LIDs that are used.
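The off-by-one described above shows up clearly with a plain array. A table's size counts slot 0, which is never a valid LID. The valid LIDs are therefore 1 to size - 1, and the scan must use lid < size rather than lid <= size. The sketch below is a hypothetical helper over a simple pointer array, not OpenSM's cl_ptr_vector_t code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count occupied entries of a LID-indexed table.
 * 'size' counts slots from index 0, and LID 0 is not a valid LID, so
 * the valid LIDs are 1 .. size - 1. */
size_t count_used_lids(void *const *tbl, size_t size)
{
    size_t lid, used = 0;

    assert(tbl != NULL || size == 0);

    /* 'lid < size', not 'lid <= size': tbl[size] is one past the end. */
    for (lid = 1; lid < size; lid++)
        if (tbl[lid] != NULL)
            used++;
    return used;
}
```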
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_state_mgr.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index f5d3837..0a27044 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -932,7 +932,7 @@ static void __osm_state_mgr_check_tbl_consistency(IN osm_sm_t * sm) /* They should be the same, but compare it anyway */ max_lid = (ref_size > curr_size) ? ref_size : curr_size; - for (lid = 1; lid <= max_lid; lid++) { + for (lid = 1; lid < max_lid; lid++) { p_port_ref = NULL; p_port_stored = NULL; cl_ptr_vector_at(p_port_lid_tbl, lid, (void *)&p_port_stored); @@ -1006,7 +1006,7 @@ static void cleanup_switch(cl_map_item_t *item, void *log) if (!sw->new_lft) return; - + if (memcmp(sw->lft, sw->new_lft, IB_LID_UCAST_END_HO + 1)) osm_log(log, OSM_LOG_ERROR, "ERR 331D: " "LFT of switch 0x%016" PRIx64 " is not up to date.\n", -- 1.5.1.4 From purdy at sgi.com Thu Feb 12 07:14:36 2009 From: purdy at sgi.com (Dale Purdy) Date: Thu, 12 Feb 2009 09:14:36 -0600 Subject: [ofa-general] [PATCH] Fix credit loop checking Message-ID: <20090212151436.GA17309@sgi.com> ibdiagnet/ibdiagui and ibdmchk assume that up/down routing is being used if it is able to find roots, whether the root are correct or not. If it finds roots it does its credit loop validation based on whether the up/down rules are followed instead of doing a rigorous credit loop check. If the roots aren't correct, this can lead to determination of credit loop problems on topologies that don't have problems. ibdmchk has an option to supply one's own root_guids file to override this if you actually are using up/down and have your own roots that were used to set up the routing, but ibdiagnet/ibdiagui doesn't. In any case there shouldn't be assumptions about what the routing algorithm is, or what the roots are when checking for credit loops. Add a -u option to ibdiagnet/ibdiagui. 
Change ibdiagnet/ibdiagui and ibdmchk to only do the simple up/down rule checking against roots when the -u option is used. Otherwise the full credit loop check is done. Signed-off-by: Dale Purdy --- ibdiag/doc/ibdiagnet.pod | 11 ++++++++++- ibdiag/doc/ibdiagui.pod | 2 +- ibdiag/src/ibdebug.tcl | 11 +++++++---- ibdiag/src/ibdebug_if.tcl | 10 +++++++--- ibdm/src/osm_check.cpp | 22 +++++++++------------- 5 files changed, 34 insertions(+), 22 deletions(-) diff --git a/ibdiag/doc/ibdiagnet.pod b/ibdiag/doc/ibdiagnet.pod index d2cf460..cdc78ed 100644 --- a/ibdiag/doc/ibdiagnet.pod +++ b/ibdiag/doc/ibdiagnet.pod @@ -4,7 +4,7 @@ B =head1 SYNOPSYS -ibdiagnet [-c ] [-v] [-r] [-o ] +ibdiagnet [-c ] [-v] [-r] [-u] [-o ] [-t ] [-s ] [-i ] [-p ] [-wt] [-pm] [-pc] [-P <=>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] @@ -135,6 +135,15 @@ Provides a report of the fabric qualities =back +=item B<-u> : + +=over + +=item +Credit loop check based on UpDown rules + +=back + =item B<-t > : =over diff --git a/ibdiag/doc/ibdiagui.pod b/ibdiag/doc/ibdiagui.pod index 4e0250f..86a2df9 100644 --- a/ibdiag/doc/ibdiagui.pod +++ b/ibdiag/doc/ibdiagui.pod @@ -4,7 +4,7 @@ B =head1 SYNOPSYS -ibdiagui [-c ] [-v] [-r] [-o ] +ibdiagui [-c ] [-v] [-r] [-u] [-o ] [-t ] [-s ] [-i ] [-p ] [-pm] [-pc] [-P =] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] diff --git a/ibdiag/src/ibdebug.tcl b/ibdiag/src/ibdebug.tcl index 3a464f2..04a8566 100644 --- a/ibdiag/src/ibdebug.tcl +++ b/ibdiag/src/ibdebug.tcl @@ -4391,10 +4391,13 @@ proc DumpFabQualities {} { inform "-I-ibdiagnet:check.credit.loops.header" # report credit loops - ibdmCalcMinHopTables $fabric - set roots [ibdmFindRootNodesByMinHop $fabric] - # just flush out any logs - set report [ibdmGetAndClearInternalLog] + set roots "" + if { [info exists G(argv:updown)] } { + ibdmCalcMinHopTables $fabric + set roots [ibdmFindRootNodesByMinHop $fabric] + # just flush out any logs + set report [ibdmGetAndClearInternalLog] + } if {[llength $roots]} { inform "-I-reporting:found.roots" 
$roots
                 ibdmReportNonUpDownCa2CaPaths $fabric $roots
diff --git a/ibdiag/src/ibdebug_if.tcl b/ibdiag/src/ibdebug_if.tcl
index 21afc45..cf1b571 100644
--- a/ibdiag/src/ibdebug_if.tcl
+++ b/ibdiag/src/ibdebug_if.tcl
@@ -163,6 +163,10 @@ proc SetInfoArgv {} {
         -t,param "topo-file"
         -t,desc "Specifies the topology file name"
 
+        -u,name "updown"
+        -u,desc "Indicates that UpDown rule checking should be done against automatically determined roots"
+        -u,arglen 0
+
         -v,name "verbose"
         -v,desc "Instructs the tool to run in verbose mode"
         -v,arglen 0
@@ -322,8 +326,8 @@ proc SetToolsFlags {} {
     array set TOOLS_FLAGS {
         ibping      "(n|l|d) . c w v o . t s i p "
         ibdiagpath  "(n|l|d) . c v o smp . t s i p . pm pc P . lw ls sl ."
-        ibdiagui    " c v r o . t s i p . pm pc P . lw ls ."
-        ibdiagnet   " c v r o . t s i p wt . pm pc P . lw ls . skip load_db csv"
+        ibdiagui    " c v r u o . t s i p . pm pc P . lw ls ."
+        ibdiagnet   " c v r u o . t s i p wt . pm pc P . lw ls . skip load_db csv"
         ibcfg       "(n|l|d) (c|q) . t s i p o"
         ibmad       "(m) (a) (n|l|d) . t s i p o ; (q) a"
         ibsac       "(m) (a) k . t s i p o ; (q) a"
@@ -2535,7 +2539,7 @@ proc showHelpPage { args } {
   Hop-count information:
     maximal hop-count, an example path, and a hop-count histogram
   All CA-to-CA paths traced
-  Credit loop report
+  Credit loop report (based on UpDown if -u option is provided)
   mgid-mlid-HCAs matching table
   Note: In case the IB fabric includes only one CA,
   then CA-to-CA paths are not reported.
diff --git a/ibdm/src/osm_check.cpp b/ibdm/src/osm_check.cpp index 1c18c1c..09a3637 100644 --- a/ibdm/src/osm_check.cpp +++ b/ibdm/src/osm_check.cpp @@ -552,21 +552,17 @@ int main (int argc, char **argv) { list rootNodes; int anyErr = 0; - if (RootsFileName.size()) - { - if (TopoFile.size()) - { - rootNodes = ParseRootNodeNamesFile(&fabric, RootsFileName); - } - else - { - rootNodes = ParseRootNodeGuidsFile(&fabric, RootsFileName); - } - } - else - { + if (UseUpDown) { + if (RootsFileName.size()) { + if (TopoFile.size()) { + rootNodes = ParseRootNodeNamesFile(&fabric, RootsFileName); + } else { + rootNodes = ParseRootNodeGuidsFile(&fabric, RootsFileName); + } + } else { rootNodes = SubnMgtFindRootNodesByMinHop(&fabric); } + } if (!rootNodes.empty()) { cout << "-I- Recognized " << rootNodes.size() << " root nodes:" << endl; -- 1.5.6.5 From stan.smith at intel.com Thu Feb 12 08:51:55 2009 From: stan.smith at intel.com (Smith, Stan) Date: Thu, 12 Feb 2009 08:51:55 -0800 Subject: [ofa-general] RE: [ofw] Re: saquery & osm vendor IBAL - ca_names missing from osm_vendor_t ? In-Reply-To: <20090211014635.GS26139@sashak.voltaire.com> References: <000001c9857e$018d49e0$ca97070a@amr.corp.intel.com> <498F5A8F.2000101@dev.mellanox.co.il> <498F5E7B.6020208@dev.mellanox.co.il> <3F6F638B8D880340AB536D29CD4C1E1931817BA0@orsmsx501.amr.corp.intel.com> <20090209235414.GM26139@sashak.voltaire.com> <3F6F638B8D880340AB536D29CD4C1E1931817F0D@orsmsx501.amr.corp.intel.com> <20090211014635.GS26139@sashak.voltaire.com> Message-ID: <3F6F638B8D880340AB536D29CD4C1E19319BF21F@orsmsx501.amr.corp.intel.com> Sasha Khapyorsky wrote: > On 16:34 Mon 09 Feb , Smith, Stan wrote: >> >> Path of least resistance thinking would point towards not doing a >> switch as the vendor-ibal to vendor-ibumad would be part of the >> Windows OpenSM migration to OFED 1.4x OpenSM. My thinking is that >> making a switch to vendor-ibumad would be a good deal more >> work/involved just to get saquery working. 
> > For just saquery it would be overkill. (BTW I posted patch which > cleans osm vendor calls from saquery - hope the problem of vendor-ibal > extending will be eliminated soon). Thank you very much! Yes your new saquery patches will eliminate the vendor-ibal issues and any proposed vendor-ibal mods. Stan. > > I was thinking about vendor switching in context of OpenSM itself - in > order to unify OpenSM/umad access layer between different systems (and > eventually to cleanup all those osm vendor mess). > >> Not knowing the Windows OpenSM code base, moving part of it forward >> seems like a task 'which' could blossom into a good deal more work >> for the small return of saquery working? Frankly, I'd rather see >> work put into getting OFED OpenSM 1.4 working on Windows. > > Sure, it could be done as part of WinOF OpenSM upgrade process (doing > this just for fun against outdated OpenSM codebase doesn't buy a > much). > > Sasha From sashak at voltaire.com Thu Feb 12 12:00:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 12 Feb 2009 22:00:25 +0200 Subject: [ofa-general] [RFC] OpenSM vendor layer In-Reply-To: References: <20090207123355.GP17713@sashak.voltaire.com> Message-ID: <20090212200025.GC14416@sashak.voltaire.com> Hi Hal, On 07:41 Thu 12 Feb , Hal Rosenstock wrote: > > That's what I originally thought too but I'm not so sure looking at > the other vendor layers. 
For example, osm_vendor_al.c (which I think > is used in Windows currently) has the following code in > osm_vendor_get_all_port_attr (and other vendor layers except umad are > similar): > > for (port_num = 0; port_num < num_ports; port_num++) { > p_attr_array[port_count] = > *__osm_ca_info_get_port_attr_ptr(p_ca_info, > port_num); > port_count++; > } > > and > > static ib_port_attr_t *__osm_ca_info_get_port_attr_ptr(IN const osm_ca_info_t * > const p_ca_info, > IN const uint8_t index) > { > return (&p_ca_info->p_attr->p_port_attr[index]); > } > > so I'm thinking the tables need to be supplied by the underlying > vendor library (al, umad, ...). Do you concur ? It is already supplied by libibumad - by umad_get_ca() (ca.ports[i]->pkeys). I think you just need to copy this to ib_port_attr_t structure. Sasha From sashak at voltaire.com Thu Feb 12 12:12:02 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 12 Feb 2009 22:12:02 +0200 Subject: [ofa-general] Re: [PATCH v3] opensm/osm_console.c : Added dump_portguid function to console to generate a list of port guids matching one or more regexps In-Reply-To: <4993C5C3.6020700@ext.bull.net> References: <4993C5C3.6020700@ext.bull.net> Message-ID: <20090212201202.GD14416@sashak.voltaire.com> On 07:46 Thu 12 Feb , Nicolas Morey Chaisemartin wrote: > This add a dump_portguid functionnality to openSM console which makes it > really easy to generate cn_guid_file, root_guid_file and such > by dumping into a file all port guids whom nodedesc contains at least one > of the provided regexps > > Signed-off-by: Nicolas Morey-Chaisemartin > Applied. Thanks. 
Sasha From sashak at voltaire.com Thu Feb 12 12:20:20 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 12 Feb 2009 22:20:20 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_sa.c: fixing SA MAD dump In-Reply-To: <4994386B.1040703@dev.mellanox.co.il> References: <4994386B.1040703@dev.mellanox.co.il> Message-ID: <20090212202020.GE14416@sashak.voltaire.com> On 16:55 Thu 12 Feb , Yevgeny Kliteynik wrote: > Hi Sasha, > > osm_sa_send() returns the MAD to the pool after sending it, > so dumping the MAD after sending it is wrong - fixing. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. From sashak at voltaire.com Thu Feb 12 12:31:05 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 12 Feb 2009 22:31:05 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_state_mgr.c: small bug in scanning lid table In-Reply-To: <499439C2.40206@dev.mellanox.co.il> References: <499439C2.40206@dev.mellanox.co.il> Message-ID: <20090212203105.GF14416@sashak.voltaire.com> On 17:01 Thu 12 Feb , Yevgeny Kliteynik wrote: > Hi Sasha, > > ref_size and curr_size return the size of the array, > which counts LIDs from 0, so max_lid will be out of > actual LIDs that are used. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From arlin.r.davis at intel.com Thu Feb 12 14:25:50 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 12 Feb 2009 14:25:50 -0800 Subject: [ofa-general] Question on dat_ep_post_rdma_write with DAT_COMPLETION_SUPPRESS_FLAG. In-Reply-To: <499389BB.6060806@cs.anu.edu.au> References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au> Message-ID: >I am get a bit confused by description on the >DAT_COMPLETION_SUPPRESS_FLAG. > >Looks like it suppress notification after DTO operations. Is >it always true? Yes, with the exception of errors. >I have found that when I am using dat_ep_post_rdma_write to transfering >data over 128k (within 1 iov).
Event will be brought to server side >(verified >with cookie), and at client side an event with Invalid_DAT_EVENT_NUMBER >will be received. What side is server and which is client? You will not see any indication on the remote side of an rdma_write. If you see an event with invalid event number then there is a failure during the operation or the QP went into error state. What version of uDAPL are you using? 2.0 or 1.2? Is this IB or iWARP? -arlin From andy.grover at oracle.com Thu Feb 12 14:26:05 2009 From: andy.grover at oracle.com (Andy Grover) Date: Thu, 12 Feb 2009 14:26:05 -0800 Subject: [ofa-general] mlx4 changing RNR_RETRY for an established qp Message-ID: <4994A1FD.2060704@oracle.com> Hi Vlad, Bringing up an old issue... With RDS-level flow control enabled, RDS attempts to set rnr_counter to 0 on an already connected QP by transitioning through SQD state. SQD is not supported on ConnectX, and so we either need do it the right way or make other plans. Is there an alternative way to adjust rnr_counter, or should we just assume this is unchangeable once connected? Thanks -- Regards -- Andy From Jie.Cai at cs.anu.edu.au Thu Feb 12 14:42:26 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Fri, 13 Feb 2009 09:42:26 +1100 Subject: [ofa-general] Question on dat_ep_post_rdma_write with DAT_COMPLETION_SUPPRESS_FLAG. In-Reply-To: References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au> Message-ID: <4994A5D2.1040405@cs.anu.edu.au> Davis, Arlin R wrote: >> I am get a bit confused by description on the >> DAT_COMPLETION_SUPPRESS_FLAG. >> >> Looks like it suppress notification after DTO operations. Is >> it always true? >> > > Yes, with the exception of errors. > > >> I have found that when I am using dat_ep_post_rdma_write to transfering >> data over 128k (within 1 iov). Event will be brought to server side >> (verified >> with cookie), and at client side an event with Invalid_DAT_EVENT_NUMBER >> will be received. 
>> > > What side is server and which is client? Server side did the rdma write, and client side is the remote side. > You will not see any > indication on the remote side of an rdma_write. If you see an > event with invalid event number then there is a failure during > the operation or the QP went into error state. > > What version of uDAPL are you using? 2.0 or 1.2? > I am using uDAPL 2.0. However, when I used DAT_COMPLETION_SUPPRESS_FLAG at server side and the data been transfered is larger than 128KB, there is an event come to the server side with rdma write cookie. Is there an limitation on the size of data been transfered? > Is this IB or iWARP? > > This is IB, and I am using Mellanox ConnectX IB HCAs. > -arlin - Jie From sean.hefty at intel.com Thu Feb 12 14:34:49 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 12 Feb 2009 14:34:49 -0800 Subject: [ofa-general] mlx4 changing RNR_RETRY for an established qp In-Reply-To: <4994A1FD.2060704@oracle.com> References: <4994A1FD.2060704@oracle.com> Message-ID: >With RDS-level flow control enabled, RDS attempts to set rnr_counter to >0 on an already connected QP by transitioning through SQD state. SQD is >not supported on ConnectX, and so we either need do it the right way or >make other plans. Can this be set to 0 when connecting?
Yes of course, it would just be a little nicer if we could change once connected, instead of only when initiating the connection, so I wanted to find out if that was also possible. Thanks -- Regards -- Andy From rdreier at cisco.com Thu Feb 12 14:47:32 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Feb 2009 14:47:32 -0800 Subject: [ofa-general] mlx4 changing RNR_RETRY for an established qp In-Reply-To: <4994A625.9060008@oracle.com> (Andy Grover's message of "Thu, 12 Feb 2009 14:43:49 -0800") References: <4994A1FD.2060704@oracle.com> <4994A625.9060008@oracle.com> Message-ID: >> With RDS-level flow control enabled, RDS attempts to set rnr_counter to >> 0 on an already connected QP by transitioning through SQD state. SQD is >> not supported on ConnectX, and so we either need do it the right way or >> make other plans. Is SQD really not supported by ConnectX? If so it is likely a temporary firmware issue I would think. - R. From arlin.r.davis at intel.com Thu Feb 12 14:48:36 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 12 Feb 2009 14:48:36 -0800 Subject: [ofa-general] Question on dat_ep_post_rdma_write with DAT_COMPLETION_SUPPRESS_FLAG. In-Reply-To: <4994A5D2.1040405@cs.anu.edu.au> References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au> <4994A5D2.1040405@cs.anu.edu.au> Message-ID: >However, when I used DAT_COMPLETION_SUPPRESS_FLAG >at server side and the data been transfered is larger than 128KB, >there is an event come to the server side with rdma write cookie. You are most likely running into access violations or some other error. You should see the following message with any DTO error: "DTO completion ERR: status %d, op %s, vendor_err 0x%x - %s\n" What is the DTO event status on the server side? >Is there an limitation on the size of data been transfered? based on the HCA max_msg_sz, usually 2GBytes (ibv_devinfo -v). 
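Arlin's question about the DTO event status depends on decoding the right type. The `status` member of a DTO completion event is a `DAT_DTO_COMPLETION_STATUS`, not a `DAT_RETURN`, so it should not be passed to DAT_RETURN string helpers. Below is a minimal sketch of a decoder. The enum values are copied from the dat.h excerpt Arlin posts in this thread, `dto_status_str()` is a hypothetical helper, and real code should take the enum from the uDAPL headers instead of redefining it.

```c
/*
 * DTO completion status values as quoted from dat.h in this thread,
 * redefined here only so the sketch compiles on its own; real code should
 * take them from the uDAPL headers instead.
 */
typedef enum dat_dto_completion_status {
	DAT_DTO_SUCCESS = 0,
	DAT_DTO_ERR_FLUSHED = 1,
	DAT_DTO_ERR_LOCAL_LENGTH = 2,
	DAT_DTO_ERR_LOCAL_EP = 3,
	DAT_DTO_ERR_LOCAL_PROTECTION = 4,
	DAT_DTO_ERR_BAD_RESPONSE = 5,
	DAT_DTO_ERR_REMOTE_ACCESS = 6,
	DAT_DTO_ERR_REMOTE_RESPONDER = 7,
	DAT_DTO_ERR_TRANSPORT = 8,
	DAT_DTO_ERR_RECEIVER_NOT_READY = 9,
	DAT_DTO_ERR_PARTIAL_PACKET = 10,
	DAT_RMR_OPERATION_FAILED = 11
} DAT_DTO_COMPLETION_STATUS;

/* Hypothetical helper: name of a DTO completion status (not a DAT_RETURN). */
static const char *dto_status_str(DAT_DTO_COMPLETION_STATUS s)
{
	/* Indexed by value; relies on the enum being dense from 0 to 11. */
	static const char *const names[] = {
		"DAT_DTO_SUCCESS",
		"DAT_DTO_ERR_FLUSHED",
		"DAT_DTO_ERR_LOCAL_LENGTH",
		"DAT_DTO_ERR_LOCAL_EP",
		"DAT_DTO_ERR_LOCAL_PROTECTION",
		"DAT_DTO_ERR_BAD_RESPONSE",
		"DAT_DTO_ERR_REMOTE_ACCESS",
		"DAT_DTO_ERR_REMOTE_RESPONDER",
		"DAT_DTO_ERR_TRANSPORT",
		"DAT_DTO_ERR_RECEIVER_NOT_READY",
		"DAT_DTO_ERR_PARTIAL_PACKET",
		"DAT_RMR_OPERATION_FAILED",
	};

	if ((unsigned)s < sizeof(names) / sizeof(names[0]))
		return names[s];
	return "UNKNOWN_DTO_STATUS";
}
```

Passing `event.event_data.dto_completion_event_data.status` to this helper would report the value 4 seen in this thread as `DAT_DTO_ERR_LOCAL_PROTECTION`. That usually points at the local buffer not being covered by a valid memory registration, not at a resource shortage.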
-arlin From Jie.Cai at cs.anu.edu.au Thu Feb 12 15:09:24 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Fri, 13 Feb 2009 10:09:24 +1100 Subject: [ofa-general] Question on dat_ep_post_rdma_write with DAT_COMPLETION_SUPPRESS_FLAG. In-Reply-To: References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au> <4994A5D2.1040405@cs.anu.edu.au> Message-ID: <4994AC24.3090907@cs.anu.edu.au> Davis, Arlin R wrote: > > >> However, when I used DAT_COMPLETION_SUPPRESS_FLAG >> at server side and the data been transfered is larger than 128KB, >> there is an event come to the server side with rdma write cookie. >> > > You are most likely running into access violations or some other > error. You should see the following message with any DTO error: > > "DTO completion ERR: status %d, op %s, vendor_err 0x%x - %s\n" > I didn't see this error message. > What is the DTO event status on the server side? > > 12734: ERROR: DTO event status :DAT_SUCCESS DAT_RESOURCE_TEP >> Is there an limitation on the size of data been transfered? >> > > based on the HCA max_msg_sz, usually 2GBytes (ibv_devinfo -v). > > -arlin > > From arlin.r.davis at intel.com Thu Feb 12 15:52:22 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 12 Feb 2009 15:52:22 -0800 Subject: [ofa-general] Question on dat_ep_post_rdma_write with DAT_COMPLETION_SUPPRESS_FLAG. In-Reply-To: <4994AC24.3090907@cs.anu.edu.au> References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au> <4994A5D2.1040405@cs.anu.edu.au> <4994AC24.3090907@cs.anu.edu.au> Message-ID: >> You are most likely running into access violations or some other >> error. You should see the following message with any DTO error: >> >> "DTO completion ERR: status %d, op %s, vendor_err 0x%x - %s\n" >> >I didn't see this error message. What dapl packages are installed? rpm -qa | grep dapl What provider device name are you using? ofa-v2-ib0? 
>> >12734: ERROR: DTO event status :DAT_SUCCESS DAT_RESOURCE_TEP Your DTO event string mapping looks odd. You have an error minor status along with a success major status. Does event.event_number == DAT_DTO_COMPLETION_EVENT? What is event.event_data.dto_completion_event_data.status? -arlin From Jie.Cai at cs.anu.edu.au Thu Feb 12 16:13:19 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Fri, 13 Feb 2009 11:13:19 +1100 Subject: [ofa-general] Question on dat_ep_post_rdma_write with DAT_COMPLETION_SUPPRESS_FLAG. In-Reply-To: References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au> <4994A5D2.1040405@cs.anu.edu.au> <4994AC24.3090907@cs.anu.edu.au> Message-ID: <4994BB1F.8080902@cs.anu.edu.au> Davis, Arlin R wrote: > > > >>> You are most likely running into access violations or some other >>> error. You should see the following message with any DTO error: >>> >>> "DTO completion ERR: status %d, op %s, vendor_err 0x%x - %s\n" >>> >>> >> I didn't see this error message. >> > > What dapl packages are installed? rpm -qa | grep dapl > dapl-devel-2.0.7-1.ofed1.3 dapl-1.2.5-1.ofed1.3 dapl-devel-static-2.0.7-1.ofed1.3 dapl-devel-1.2.5-1.ofed1.3 dapl-utils-2.0.7-1.ofed1.3 dapl-2.0.7-1.ofed1.3 > What provider device name are you using? ofa-v2-ib0? > yes, I am using ofa-v2-ib0. > >>> >>> >> 12734: ERROR: DTO event status :DAT_SUCCESS DAT_RESOURCE_TEP >> > > Your DTO event string mapping looks odd. You have an error minor > status along with a success major status. > > Does event.event_number == DAT_DTO_COMPLETION_EVENT? > Yes, it is a DAT_DTO_COMPLETION_EVENT. > What is event.event_data.dto_completion_event_data.status? > I printed it out, the event.event_data.dto_completion_event_data.status is 4. > -arlin From arlin.r.davis at intel.com Thu Feb 12 16:05:13 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 12 Feb 2009 16:05:13 -0800 Subject: [ofa-general] Question on dat_ep_post_rdma_write with DAT_COMPLETION_SUPPRESS_FLAG. 
In-Reply-To: <4994AC24.3090907@cs.anu.edu.au> References: <49927A53.1020403@cs.anu.edu.au> <499389BB.6060806@cs.anu.edu.au> <4994A5D2.1040405@cs.anu.edu.au> <4994AC24.3090907@cs.anu.edu.au> Message-ID: >> >12734: ERROR: DTO event status :DAT_SUCCESS DAT_RESOURCE_TEP You are incorrectly using the string to return code mapping. dat_return_subtype (dat_error.h) of DAT_RESOURCE_TEP == 4 is really a DAT_DTO_ERR_LOCAL_PROTECTION error. see dat.h for DTO completion status: typedef enum dat_dto_completion_status { DAT_DTO_SUCCESS = 0, DAT_DTO_ERR_FLUSHED = 1, DAT_DTO_ERR_LOCAL_LENGTH = 2, DAT_DTO_ERR_LOCAL_EP = 3, DAT_DTO_ERR_LOCAL_PROTECTION = 4, <<<<< DAT_DTO_ERR_BAD_RESPONSE = 5, DAT_DTO_ERR_REMOTE_ACCESS = 6, DAT_DTO_ERR_REMOTE_RESPONDER = 7, DAT_DTO_ERR_TRANSPORT = 8, DAT_DTO_ERR_RECEIVER_NOT_READY = 9, DAT_DTO_ERR_PARTIAL_PACKET = 10, DAT_RMR_OPERATION_FAILED = 11 } DAT_DTO_COMPLETION_STATUS; -arlin From wangwhao at cn.ibm.com Thu Feb 12 16:05:48 2009 From: wangwhao at cn.ibm.com (Wen Hao Wang) Date: Fri, 13 Feb 2009 08:05:48 +0800 Subject: Re: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: <1234444830.10037.313.camel@firewall.xsintricity.com> Message-ID: Doug Ledford wrote on 2009-02-12 21:20:30: > On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote: > > Wen Hao Wang wrote: > > > > > > Hi all: > > > > > > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped > > > in RHEL5.3 image) by "yum groupisntall". Then I load some drivers and > > > wrote network interface configuration file ifcfg-ib0. ifup ib0 also > > > succeeded. But IB utilites report Connetion timed out. > > > > > > > > > [root at xblade06 network-scripts]# sminfo > > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out > > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > > > sminfo: iberror: failed: query > > > > > > I had to reboot the blade and rerun "openibd start". Then sminfo > > > reported correct contents.
I do not suppose this reboot is required. > > > Did I miss any configuration step? > > There was an unintentional bug in the rhel5.2 openibd init script in > that it automatically turned itself on during install (generally, most > init scripts should default to *not* turning themselves on during > install of the package, nor should they start themselves during install > of the package...this is for security reasons, imagine if you installed > the bind name server on your box and it automatically started up before > you had a chance to configure it). In rhel5.3 we fixed that bug. So, Yeah. I heard of this bug. > you may need to 'chkconfig --level 2345 openibd on' to make sure openibd > starts up each time. The error you list above is consistent with not > all of the kernel modules being loaded when you tried to use the sminfo > program. Even after reboot, service openibd is not started automatically. [root at xblade06 ~]# chkconfig --list openibd openibd 0:off 1:off 2:off 3:off 4:off 5:off 6:off I agree with you that maybe some modules were not loaded. But what's that? Before reboot, I run "/etc/init.d/openibd start" and "/etc/init.d/network restart". No error was reported. "openibd status" also looked good. > > > > Moreover, "openibd start" report one warning message about hwconf. > > > Anyone has comments about this? > > > > > > [root at xblade07 ~]# /etc/init.d/openibd start > > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such > > > file or directory > > > [ OK ] > > Can you see if the kudzu package is installed on your machine? The > openib package uses this config file written by kudzu to determine what > hardware drivers to load. I suppose I should put a specific requires in > the rpm for that. kudzu is installed. 
[root at xblade06 ~]# rpm -q kudzu kudzu-1.2.57.1.21-1 > > -- > Doug Ledford > GPG KeyID: CFBFF194 > http://people.redhat.com/dledford > > Infiniband specific RPMs available at > http://people.redhat.com/dledford/Infiniband > > [Attachment "signature.asc" deleted by Wen Hao Wang/China/IBM] Thanks! Wen Hao Wang Email: wangwhao at cn.ibm.com From sean.hefty at intel.com Thu Feb 12 16:09:01 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 12 Feb 2009 16:09:01 -0800 Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for bind In-Reply-To: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com> References: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com> Message-ID: <8D3FEF485C8346ECABC66929714483E4@amr.corp.intel.com> > It seems that even when the rdma-cm consumer binds to a specific address, > the rdma-cm address resolution code follows the order of the >devices/rules > in routing table. So the user can't really dictate an outgoing interface > based on the src address provided to rdma_resolve_addr. > >Did you had the chance to look into that? I'm running 2.6.28 with 1 HCA with 2 ports. I added debug output around calls to rdma_translate_ip() and cma_acquire_dev(). The short answer is that things appear to work as expected. ib0 is on port 1 - 192.168.0.102 ib1 is on port 2 - 192.168.0.122 If I run ucmatose -b ip_addr (with or without -s option), I see that rdma_translate_ip() returns different net_device structures for the different input addresses. cma_acquire_dev() also indicates that different ports on the same HCA are being used for the two addresses. If I unplug one of the ports, I can no longer connect if I use the IP address that corresponds to that port, but the other port still works. It doesn't matter which port I unplug, as long as I use the correct IP address.
- Sean From wangwhao at cn.ibm.com Thu Feb 12 16:10:22 2009 From: wangwhao at cn.ibm.com (Wen Hao Wang) Date: Fri, 13 Feb 2009 08:10:22 +0800 Subject: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: <49941414.2050400@ext.bull.net> Message-ID: Nicolas Morey Chaisemartin wrote on 2009-02-12 20:20:36: > Wen Hao Wang wrote: > > > > Hi all: > > > > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped > > in RHEL5.3 image) by "yum groupisntall". Then I load some drivers and > > wrote network interface configuration file ifcfg-ib0. ifup ib0 also > > succeeded. But IB utilites report Connetion timed out. > > > > > > [root at xblade06 network-scripts]# sminfo > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > > sminfo: iberror: failed: query > > > > I had to reboot the blade and rerun "openibd start". Then sminfo > > reported correct contents. I do not suppose this reboot is required. > > Did I miss any configuration step? > > > > Moreover, "openibd start" report one warning message about hwconf. > > Anyone has comments about this? > > > > [root at xblade07 ~]# /etc/init.d/openibd start > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such > > file or directory > > [ OK ] > > > > Thanks a lot! > > > > Wen Hao Wang > > Email: wangwhao at cn.ibm.com > Sounds to me as if you haven't any Subnet Manager (OpenSM or managed > switch) running.
> $sminfo > sminfo: sm lid 2 sm guid 0x8f1040041254a, activity count 751941 priority > 3 state 3 SMINFO_MASTER > $ sminfo -P 2 > ibwarn: [17975] mad_rpc: _do_madrpc failed; dport (Lid 3945) > sminfo: iberror: failed: query > > (we don't have any SM on the subnet connected to port 2) > > Your reboot might have started OpenSM. Thus making it works > > Nicolas > > OpenSM is running on another machine with Lid 9, while this machine (xblade06) has Lid 8. Here is the output after reboot: [root at xblade06 ~]# sminfo sminfo: sm lid 9 sm guid 0x2c90300013101, activity count 618300 priority 0 state 3 SMINFO_MASTER [root at xblade06 ~]# ps -ef|grep opensm root 5369 5234 0 00:08 pts/0 00:00:00 grep opensm [root at xblade06 ~]# ibv_devices device node GUID ------ ---------------- mlx4_0 0002c903000134b0 [root at xblade06 ~]# ibnetdiscover |grep 2c903000134b0 # Initiated from node 0002c903000134b0 port 0002c903000134b1 [10] "H-0002c903000134b0"[1](2c903000134b1) # "xblade06 HCA-1" lid 8 4xSDR caguid=0x2c903000134b0 Ca 2 "H-0002c903000134b0" # "xblade06 HCA-1" Thanks! Wen Hao Wang Email: wangwhao at cn.ibm.com From wangwhao at cn.ibm.com Thu Feb 12 16:17:37 2009 From: wangwhao at cn.ibm.com (Wen Hao Wang) Date: Fri, 13 Feb 2009 08:17:37 +0800 Subject: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: Message-ID: Hal Rosenstock wrote on 2009-02-12 20:04:44: > On Thu, Feb 12, 2009 at 2:37 AM, Wen Hao Wang wrote: > > Hi all: > > > > I changed my blade OS to RHEL5.3 yesterday and installed OFED (shipped in > > RHEL5.3 image) by "yum groupisntall". Then I load some drivers and wrote > > network interface configuration file ifcfg-ib0. ifup ib0 also succeeded. But > > IB utilites report Connetion timed out.
> > > > > > > > [root at xblade06 network-scripts]# sminfo > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > > sminfo: iberror: failed: query > > It looks like the SM found the blade and at least configured the SMLID > but somehow LID routing did not work between the blade and the SM (at > LID 9). Was this problem persistent (without rebooting the blade) ? > Was the blade IB port active ? > > -- Hal Before reboot, I tried following operations openibd restart network restart ibcheckerrors ibclearerrors But none of them helped. I had no idea what else I could do. So I tried reboot. If I remember correct, the port state was Linkup before rebooting. And now it is active > > > I had to reboot the blade and rerun "openibd start". Then sminfo reported > > correct contents. I do not suppose this reboot is required. Did I miss any > > configuration step? > > > > Moreover, "openibd start" report one warning message about hwconf. Anyone > > has comments about this? > > > > [root at xblade07 ~]# /etc/init.d/openibd start > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No such file or > > directory > > [ OK ] > > > > Thanks a lot! > > > > Wen Hao Wang > > Email: wangwhao at cn.ibm.com > > Thanks! Wen Hao Wang Email: wangwhao at cn.ibm.com
From hal.rosenstock at gmail.com Thu Feb 12 16:41:40 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 12 Feb 2009 19:41:40 -0500 Subject: Re: [ofa-general] [RFC] OpenSM vendor layer In-Reply-To: <20090212200025.GC14416@sashak.voltaire.com> References: <20090207123355.GP17713@sashak.voltaire.com> <20090212200025.GC14416@sashak.voltaire.com> Message-ID: Sasha, On Thu, Feb 12, 2009 at 3:00 PM, Sasha Khapyorsky wrote: > Hi Hal, > > On 07:41 Thu 12 Feb , Hal Rosenstock wrote: >> >> That's what I originally thought too but I'm not so sure looking at >> the other vendor layers. For example, osm_vendor_al.c (which I think >> is used in Windows currently) has the following code in >> osm_vendor_get_all_port_attr (and other vendor layers except umad are >> similar): >> >> for (port_num = 0; port_num < num_ports; port_num++) { >> p_attr_array[port_count] = >> *__osm_ca_info_get_port_attr_ptr(p_ca_info, >> port_num); >> port_count++; >> } >> >> and >> >> static ib_port_attr_t *__osm_ca_info_get_port_attr_ptr(IN const osm_ca_info_t * >> const p_ca_info, >> IN const uint8_t index) >> { >> return (&p_ca_info->p_attr->p_port_attr[index]); >> } >> >> so I'm thinking the tables need to be supplied by the underlying >> vendor library (al, umad, ...). Do you concur ? > > It is already supplied by libibumad - by umad_get_ca() > (ca.ports[i]->pkeys). I think you just need to copy this to > ib_port_attr_t structure. Yes but rather than using supplied pointers (as inputs for the per port pkey/guid tables), the other vendor layers require a large enough buffer for these tables and set the port pointers appropriately (on output) rather than supplying these pointers as input parameters. So if we use these as input, then we definitely break the other vendor layers.
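Hal's point concerns who owns the P_Key/GUID table storage. The other vendor layers expect the caller to pass a large enough buffer, then set each port's table pointer to the right place inside it on output. Below is a sketch of that convention, fed from umad-style per-port arrays. `struct src_port`, `struct port_attr` and `fill_port_attrs()` are hypothetical, simplified stand-ins, not the real `umad_port_t`/`ib_port_attr_t` definitions or the OpenSM vendor API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of what the umad layer already provides. */
struct src_port {
	const uint16_t *pkeys;
	unsigned pkeys_size;
};

/* Hypothetical, simplified view of what the vendor API hands back. */
struct port_attr {
	uint16_t num_pkeys;
	uint16_t *p_pkey_table; /* set on output; points into the caller's buffer */
};

/*
 * Copy each port's P_Key table into one caller-supplied buffer and point
 * attr[i].p_pkey_table at that port's slice. The caller owns the storage,
 * matching the "large enough buffer, pointers set on output" convention.
 * Returns 0, or -1 if buf is too small.
 */
static int fill_port_attrs(const struct src_port *src, unsigned nports,
			   struct port_attr *attr, uint16_t *buf,
			   size_t buf_len)
{
	size_t used = 0;

	for (unsigned i = 0; i < nports; i++) {
		if (src[i].pkeys_size > buf_len - used)
			return -1;
		memcpy(buf + used, src[i].pkeys,
		       src[i].pkeys_size * sizeof(*buf));
		attr[i].num_pkeys = (uint16_t)src[i].pkeys_size;
		attr[i].p_pkey_table = buf + used;
		used += src[i].pkeys_size;
	}
	return 0;
}
```

Because the data is copied rather than aliased, callers of the existing vendor API would keep the same ownership rules whichever layer supplies the tables.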
-- Hal > Sasha > From sean.hefty at intel.com Thu Feb 12 16:56:19 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 12 Feb 2009 16:56:19 -0800 Subject: [ofa-general] [ib-mgmt] ibdiag_common.h question Message-ID: <12C5145C5B854D78A1DAA6BB2F2CBA50@amr.corp.intel.com> I noticed the following in ibdiag_common.h: #define DEBUG if (ibdebug || ibverbose) IBWARN #define VERBOSE if (ibdebug || ibverbose > 1) IBWARN This allows for else statements to mismatch when defined. - Sean From rdreier at cisco.com Thu Feb 12 21:43:14 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Feb 2009 21:43:14 -0800 Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: remove modulo math from build_rdma_recv(). In-Reply-To: <20090211222915.19520.22647.stgit@dell3.ogc.int> (Steve Wise's message of "Wed, 11 Feb 2009 16:29:15 -0600") References: <20090211222915.19520.22647.stgit@dell3.ogc.int> Message-ID: thanks, applied From rdreier at cisco.com Thu Feb 12 21:47:34 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Feb 2009 21:47:34 -0800 Subject: [ofa-general] Re: [PATCH v3] RDMA/nes: Account for freed pbl after hw operation In-Reply-To: <20090202231521.GA6220@ctung-MOBL> (Chien Tung's message of "Mon, 2 Feb 2009 17:15:21 -0600") References: <20090202231521.GA6220@ctung-MOBL> Message-ID: looks good, applied... one comment: > Add proper pbl accounting in case nes_reg_mr failed. > > Signed-off-by: Don Wood when you forward someone else's patch, you should add your Signed-off-by line after theirs. - R. From sean.hefty at intel.com Thu Feb 12 23:21:21 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 12 Feb 2009 23:21:21 -0800 Subject: [ofa-general] [ib-diag] sminfo: add support for WinOF Message-ID: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> Allow sminfo to build and run on both Linux and Windows. Windows build files are maintained in the WinOF repository. These changes allow dropping the infiniband-diags into the WinOF build environment.
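Sean's ibdiag_common.h question is the classic dangling-else hazard. `DEBUG` expands to a bare `if`, so in `if (x) DEBUG("..."); else y();` the `else` binds to the macro's hidden `if`. Below is a sketch of one common fix: a variadic `do { ... } while (0)` wrapper using the same `fmt, ...` / `## __VA_ARGS__` form that the sminfo patch adopts for `IBERROR`. (`## __VA_ARGS__` is a GNU extension that the patch already relies on.) Here `IBWARN` is a hypothetical stand-in that prints to stderr, not libibmad's macro, and `lid_is_assigned()` is made up for the demonstration.

```c
#include <stdio.h>

static int ibdebug;
static int ibverbose;

/* Hypothetical stand-in for libibmad's IBWARN, for this sketch only. */
#define IBWARN(fmt, ...) fprintf(stderr, "ibwarn: " fmt "\n", ##__VA_ARGS__)

/*
 * Original form (unsafe): expands to a bare "if", so a following "else"
 * attaches to it:
 *   #define DEBUG if (ibdebug || ibverbose) IBWARN
 */

/* Statement-safe form: do/while(0) makes the whole expansion one statement. */
#define DEBUG(fmt, ...) \
	do { if (ibdebug || ibverbose) IBWARN(fmt, ##__VA_ARGS__); } while (0)
#define VERBOSE(fmt, ...) \
	do { if (ibdebug || ibverbose > 1) IBWARN(fmt, ##__VA_ARGS__); } while (0)

/*
 * Return 1 for a non-zero LID. With the old DEBUG, the else below would
 * bind to DEBUG's hidden "if" and never run for non-zero LIDs.
 */
static int lid_is_assigned(int lid)
{
	int assigned = 0;

	if (lid == 0)
		DEBUG("lid %d is unassigned", lid);
	else
		assigned = 1;
	VERBOSE("lid %d checked", lid);
	return assigned;
}
```

Existing call sites such as `DEBUG("...", x);` keep compiling unchanged, since callers already write the macro with parentheses.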
Signed-off-by: Sean Hefty --- Would there be any objection to including the windows source files (.c and .h) in the mgmt tree? infiniband-diags/Makefile.am | 2 + infiniband-diags/include/ibdiag_common.h | 2 + infiniband-diags/include/linux/ibdiag_osd.h | 43 +++++++++++++++++++++++++++ infiniband-diags/src/ibdiag_common.c | 13 ++++---- infiniband-diags/src/sminfo.c | 15 ++++----- 5 files changed, 58 insertions(+), 17 deletions(-) diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index f9cc5bd..0d32abd 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -1,5 +1,5 @@ -INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband +INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband -I$(srcdir)/include/linux if DEBUG DBGFLAGS = -ggdb -D_DEBUG_ diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h index 4783b8e..2dea873 100644 --- a/infiniband-diags/include/ibdiag_common.h +++ b/infiniband-diags/include/ibdiag_common.h @@ -52,7 +52,7 @@ extern int ibd_timeout; #undef DEBUG #define DEBUG if (ibdebug || ibverbose) IBWARN #define VERBOSE if (ibdebug || ibverbose > 1) IBWARN -#define IBERROR(fmt, args...) iberror(__FUNCTION__, fmt, ## args) +#define IBERROR(fmt, ...) iberror(__FUNCTION__, fmt, ## __VA_ARGS__) struct ibdiag_opt { const char *name; diff --git a/infiniband-diags/include/linux/ibdiag_osd.h b/infiniband-diags/include/linux/ibdiag_osd.h new file mode 100644 index 0000000..5c6faa9 --- /dev/null +++ b/infiniband-diags/include/linux/ibdiag_osd.h @@ -0,0 +1,43 @@ +/* + * Copyright (c) 2009 Intel Corp, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#ifndef _IBDIAG_OSD_H_ +#define _IBDIAG_OSD_H_ + +#include +#include +#include + +#define CDECL + +#endif /* _IBDIAG_OSD_H_ */ diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index bda1efa..154e00c 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -43,15 +43,14 @@ #include #include #include -#include #include -#include #include #include #include #include #include +#include "ibdiag_osd.h" int ibdebug; int ibverbose; @@ -204,7 +203,7 @@ static const struct ibdiag_opt common_opts[] = { { "usage", 'u', 0, NULL, "usage message" }, { "help", 'h', 0, NULL, "help message" }, { "version", 'V', 0, NULL, "show version" }, - {} + { 0 } }; static void make_opt(struct option *l, const struct ibdiag_opt *o, @@ -254,11 +253,11 @@ static struct option *make_long_opts(const char *exclude_str, static void make_str_opts(const struct option *o, char *p, unsigned size) { - int i, n = 0; + unsigned i, n = 0; for (n = 0; o->name && n + 2 + o->has_arg < size; o++) { - p[n++] = o->val; - for (i = 0; i < o->has_arg; i++) + p[n++] = (char) o->val; + for (i = 0; i < (unsigned) o->has_arg; i++) p[n++] = ':'; } p[n] = '\0'; @@ -273,7 +272,7 @@ int ibdiag_process_opts(int argc, char * const argv[], void *cxt, char str_opts[1024]; const struct ibdiag_opt *o; - memset(opts_map, 0, sizeof(opts_map)); + memset((void *) opts_map, 0, sizeof(opts_map)); prog_name = argv[0]; prog_args = usage_args; diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c index e96c782..7767668 100644 --- a/infiniband-diags/src/sminfo.c +++ b/infiniband-diags/src/sminfo.c @@ -37,14 +37,13 @@ #include #include -#include -#include #include #include #include #include "ibdiag_common.h" +#include "ibdiag_osd.h" static uint8_t sminfo[1024]; @@ -59,10 +58,10 @@ enum { }; char *statestr[] = { - [SMINFO_NOTACT] "SMINFO_NOTACT", - [SMINFO_DISCOVER] "SMINFO_DISCOVER", - [SMINFO_STANDBY] "SMINFO_STANDBY", - 
[SMINFO_MASTER] "SMINFO_MASTER", + "SMINFO_NOTACT", + "SMINFO_DISCOVER", + "SMINFO_STANDBY", + "SMINFO_MASTER", }; #define STATESTR(s) (((unsigned)(s)) < SMINFO_STATE_LAST ? statestr[s] : "???") @@ -88,7 +87,7 @@ static int process_opt(void *context, int ch, char *optarg) return 0; } -int main(int argc, char **argv) +int CDECL main(int argc, char **argv) { int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS}; int mod = 0; @@ -100,7 +99,7 @@ int main(int argc, char **argv) { "state", 's', 1, "<0-3>", "set SM state"}, { "priority", 'p', 1, "<0-15>", "set SM priority"}, { "activity", 'a', 1, NULL, "set activity count"}, - { } + { 0 } }; char usage_args[] = " [modifier]"; From sean.hefty at intel.com Thu Feb 12 23:31:31 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 12 Feb 2009 23:31:31 -0800 Subject: [ofa-general] [ibmad] libibmad: add MAD_EXPORT to exported calls Message-ID: <877D4427C8B64CFCB6B26E0CE0F5812A@amr.corp.intel.com> From: Stan Smith ibtracert and ibroute need xdump and smp_query_via exported from the library. Add MAD_EXPORT to the calls for Windows support. 
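This patch is needed because a Windows DLL exports only functions that are explicitly marked, for example with `__declspec(dllexport)`, or listed in a module-definition file. So `smp_query_via()` and `xdump()` must carry `MAD_EXPORT` before ibtracert and ibroute can link against them. Below is a sketch of how such an export macro is commonly defined. `MYLIB_EXPORT`, `BUILDING_MYLIB`, `mylib_sum()` and `mylib_internal_add()` are hypothetical names, and libibmad's actual `MAD_EXPORT` definition may differ.

```c
#include <stddef.h>

/* Normally passed as -DBUILDING_MYLIB by the library's own build only. */
#define BUILDING_MYLIB

#if defined(_WIN32)
#  if defined(BUILDING_MYLIB)
#    define MYLIB_EXPORT __declspec(dllexport)
#  else
#    define MYLIB_EXPORT __declspec(dllimport)
#  endif
#elif defined(__GNUC__)
/* Only matters when the library is built with -fvisibility=hidden. */
#  define MYLIB_EXPORT __attribute__((visibility("default")))
#else
#  define MYLIB_EXPORT
#endif

/* Public header: marked, like the functions the patch tags with MAD_EXPORT. */
MYLIB_EXPORT unsigned mylib_sum(const unsigned char *p, size_t size);

/*
 * Not marked: other files of the library can call it, but it is not
 * exported from a Windows DLL (or from a -fvisibility=hidden build).
 */
unsigned mylib_internal_add(unsigned acc, unsigned char b)
{
	return acc + b;
}

/* Library source: sum of the bytes in p[0..size). */
unsigned mylib_sum(const unsigned char *p, size_t size)
{
	unsigned acc = 0;

	for (size_t i = 0; i < size; i++)
		acc = mylib_internal_add(acc, p[i]);
	return acc;
}
```

Putting the macro on the prototype in the public header lets the same declaration produce an export when the library is built and an import when a tool such as ibtracert includes it.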
Signed-off-by: Stan Smith Signed-off-by: Sean Hefty --- libibmad/include/infiniband/mad.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index bd62ec7..1aaaa1b 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -748,7 +748,7 @@ MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout); MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout); -uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, +MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout, const void *srcport); uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout, const void *srcport); @@ -875,7 +875,7 @@ static inline uint64_t htonll(uint64_t x) exit(-1); \ } while(0) -void xdump(FILE * file, char *msg, void *p, int size); +MAD_EXPORT void xdump(FILE * file, char *msg, void *p, int size); END_C_DECLS #endif /* _MAD_H_ */ From nicolas.morey-chaisemartin at ext.bull.net Fri Feb 13 01:24:24 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Fri, 13 Feb 2009 10:24:24 +0100 Subject: [ofa-general] [PATCH 1/3 v2] opensm: Added io_guid_file and max_reverse_hops options In-Reply-To: References: Message-ID: <49953C48.3030203@ext.bull.net> Signed-off-by: Nicolas Morey-Chaisemartin --- Reposted as io_guid_file and max_reverse_hops were missing from the opt_tbl and wouldn't be read from the cached option file. 
opensm/include/opensm/osm_subnet.h | 6 ++++++ opensm/opensm/main.c | 26 +++++++++++++++++++++++++- opensm/opensm/osm_subnet.c | 14 ++++++++++++++ 3 files changed, 45 insertions(+), 1 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 8863e47..671b51f 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -190,6 +190,8 @@ typedef struct osm_subn_opt { char *lfts_file; char *root_guid_file; char *cn_guid_file; + char *io_guid_file; + uint16_t max_reverse_hops; char *ids_guid_file; char *guid_routing_order_file; char *sa_db_file; @@ -383,6 +385,10 @@ typedef struct osm_subn_opt { * Name of the file that contains list of compute node guids that * will be used by fat-tree routing (provided by User) * +* io_guid_file +* Name of the file that contains list of I/O node guids that +* will be used by fat-tree routing (provided by User) +* * ids_guid_file * Name of the file that contains list of ids which should be * used by Up/Down algorithm instead of node GUIDs diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index a8dc9e6..b5e3337 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -212,6 +212,12 @@ static void show_usage(void) printf("--cn_guid_file, -u \n" " Set the compute nodes for the Fat-Tree routing algorithm\n" " to the guids provided in the given file (one to a line)\n\n"); + printf("--io_guid_file, -G \n" + " Set the I/O nodes for the Fat-Tree routing algorithm\n" + " to the guids provided in the given file (one to a line)\n\n"); + printf("--max_reverse_hops, -H \n" + " Set the max number of hops the wrong way around\n" + " an I/O node is allowed to do (connectivity for I/O nodes on top swithces)\n\n"); printf("--ids_guid_file, -m \n" " Name of the map file with set of the IDs which will be used\n" " by Up/Down routing algorithm instead of node GUIDs\n" @@ -526,7 +532,7 @@ int main(int argc, char *argv[]) uint32_t val; unsigned config_file_done = 0; 
const char *const short_option = - "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:"; + "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:G:H:"; /* In the array below, the 2nd parameter specifies the number @@ -570,6 +576,8 @@ int main(int argc, char *argv[]) {"sadb_file", 1, NULL, 'S'}, {"root_guid_file", 1, NULL, 'a'}, {"cn_guid_file", 1, NULL, 'u'}, + {"io_guid_file", 1, NULL, 'G'}, + {"max_reverse_hops", 1, NULL, 'H'}, {"ids_guid_file", 1, NULL, 'm'}, {"guid_routing_order_file", 1, NULL, 'X'}, {"stay_on_fatal", 0, NULL, 'y'}, @@ -880,6 +888,22 @@ int main(int argc, char *argv[]) opt.cn_guid_file); break; + case 'G': + /* + Specifies I/O node guids file + */ + opt.io_guid_file = optarg; + printf(" I/O Node Guid File: %s\n", + opt.io_guid_file); + break; + case 'H': + /* + Specifies I/O max reverted hops + */ + opt.max_reverse_hops = atoi(optarg); + printf(" Max Reverse Hops: %d\n", + opt.max_reverse_hops); + break; case 'm': /* Specifies ids guid file */ SET_STR_OPT(opt.ids_guid_file, optarg); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 69937c1..2ee7cf7 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -146,6 +146,8 @@ static const opt_rec_t opt_tbl[] = { { "lfts_file", OPT_OFFSET(lfts_file), opts_parse_charp, NULL, 0 }, { "root_guid_file", OPT_OFFSET(root_guid_file), opts_parse_charp, NULL, 0 }, { "cn_guid_file", OPT_OFFSET(cn_guid_file), opts_parse_charp, NULL, 0 }, + { "io_guid_file", OPT_OFFSET(io_guid_file), opts_parse_charp, NULL, 0 }, + { "max_reverse_hops", OPT_OFFSET(max_reverse_hops), opts_parse_uint16, NULL, 0 }, { "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_parse_charp, NULL, 0 }, { "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file), opts_parse_charp, NULL, 0 }, { "sa_db_file", OPT_OFFSET(sa_db_file), opts_parse_charp, NULL, 0 }, @@ -578,6 +580,8 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->lfts_file = NULL; 
p_opt->root_guid_file = NULL; p_opt->cn_guid_file = NULL; + p_opt->io_guid_file = NULL; + p_opt->max_reverse_hops = 0; p_opt->ids_guid_file = NULL; p_opt->guid_routing_order_file = NULL; p_opt->sa_db_file = NULL; @@ -1393,6 +1397,16 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) p_opts->cn_guid_file ? p_opts->cn_guid_file : null_str); fprintf(out, + "# The file holding the fat-tree I/O node guids\n" + "# One guid in each line\nio_guid_file %s\n\n", + p_opts->io_guid_file ? p_opts->io_guid_file : null_str); + + fprintf(out, + "# Number of reverse hops allowed for I/O nodes \n" + "# Used for connectivity between I/O nodes connected to Top Switches\nmax_reverse_hops %d\n\n", + p_opts->max_reverse_hops); + + fprintf(out, "# The file holding the node ids which will be used by" " Up/Down algorithm instead\n# of GUIDs (one guid and" " id in each line)\nids_guid_file %s\n\n", -- 1.6.1 From prabhat.sharda at gmail.com Fri Feb 13 03:07:27 2009 From: prabhat.sharda at gmail.com (prabhat sharda) Date: Fri, 13 Feb 2009 16:37:27 +0530 Subject: [ofa-general] Installation problem: cannot find -libverbs Message-ID: <743c0f8a0902130307r2999859bk931c2759a07635b8@mail.gmail.com> Hi, I am a newbie on OFED. I am trying to install OFED-1.4 on RHEL 5.2, which according to the release notes is a supported platform.
While installing OFED through the menu with all packages selected, the process halts with the message below: "Failed to build tgt-generic RPM See /tmp/OFED.28181.logs/tgt-generic.rpmbuild.log " The log file listed above contains the following error: *************************************************************************** cc iscsi/conn.o iscsi/param.o iscsi/session.o iscsi/iscsid.o iscsi/target.o iscsi/chap.o iscsi/transport.o iscsi/iscsi_tcp.o iscsi/isns.o iscsi/libcrc32c.o bs_rdwr.o bs_aio.o iscsi/iscsi_rdma.o tgtd.o mgmt.o target.o scsi.o log.o driver.o util.o work.o parser.o spc.o sbc.o mmc.o osd.o scc.o smc.o ssc.o bs_ssc.o bs.o -o tgtd -lcrypto -L /usr/OFED/lib64 -libverbs -lrdmacm -lpthread /usr/bin/ld: cannot find -libverbs collect2: ld returned 1 exit status make: *** [tgtd] Error 1 make: Leaving directory `/var/tmp/OFED_topdir/BUILD/tgt-generic/usr' error: Bad exit status from /var/tmp/rpm-tmp.42646 (%build) RPM build errors: user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root Bad exit status from /var/tmp/rpm-tmp.42646 (%build) **************************************************************** My installation machine is 32-bit. On examining it, I found that there is no "/usr/OFED/lib64" directory, whereas "/usr/OFED/lib" exists and contains the files "libibverbs.a", "libibverbs.so", "libibverbs.so.1" and "libibverbs.so.1.0.0". Can anyone help me resolve this issue? Let me know if I missed checking something. Thanks in advance.
Regards, Prabhat From vlad at lists.openfabrics.org Fri Feb 13 03:15:05 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 13 Feb 2009 03:15:05 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090213-0200 daily build status Message-ID: <20090213111506.42481E60F24@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 
Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hal.rosenstock at gmail.com Fri Feb 13 03:58:12 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 13 Feb 2009 06:58:12 -0500 Subject: Re: Re: [ofa-general] [RFC] OpenSM vendor layer In-Reply-To: References: <20090207123355.GP17713@sashak.voltaire.com> <20090212200025.GC14416@sashak.voltaire.com> Message-ID: On Thu, Feb 12, 2009 at 7:41 PM, Hal Rosenstock wrote: > Yes but rather than using supplied pointers (as inputs for the per > port pkey/guid tables), the other vendor layers require a large enough > buffer for these tables and set the port pointers appropriately (on > output) rather than supplying these pointers as input parameters. So > if we use these as input, then we definitely break the other vendor > layers. Another choice is to ifdef these differences between Linux and Windows at least until umad is used there. -- Hal From diego.guella at sircomtech.com Fri Feb 13 07:32:28 2009 From: diego.guella at sircomtech.com (Diego Guella) Date: Fri, 13 Feb 2009 16:32:28 +0100 Subject: [ofa-general] ib_create_qp and ib_get_err_str weirdness Message-ID: <01fa01c98df0$47baed30$0100000a@DIEGO> Hello, I am using Mellanox WinOF 2.0.0 with a MHES14-XTC SDR single-port card.
I noticed a strange behavior of the ib_create_qp function: ----- memset(&qp_create, 0, sizeof(qp_create)); qp_create.qp_type = IB_QPT_RELIABLE_CONN; // Reliable Connected qp_create.sq_depth = ctx->qdepth; qp_create.rq_depth = ctx->qdepth; qp_create.sq_sge = ctx->hca_attr->max_sges; qp_create.rq_sge = ctx->hca_attr->max_sges; qp_create.h_sq_cq = ctx->cq_h; qp_create.h_rq_cq = ctx->cq_h; qp_create.h_srq = NULL; qp_create.sq_signaled = 1; ctx->qp_h = 0; rc = ib_create_qp(ctx->pd_h, &qp_create, NULL, NULL, &ctx->qp_h); ----- The return value ("rc") is 3 (=IB_INVALID_PARAMETER). I spent some time figuring out that the problem was the SQ SGE value: http://lists.openfabrics.org/pipermail/general/2006-June/023417.html According to iba/ib_al.h: ----- * IB_INVALID_MAX_SGE * The requested maximum number of scatter-gather entries for the send or * receive queue could not be supported. ----- So why isn't the return value 22 (=IB_INVALID_MAX_SGE)? In the discussion I mentioned, it turned out that ib_create_qp can fail even when hca_attr->max_sges is used, which is my case. I need to send some audio buffers (32 or more) from an I/O node to a compute node using RDMA WRITE. The buffers are owned by the audio driver, and I have no guarantee that they are contiguous. I was trying to send them using the lowest possible number of WRs, each with the highest possible number of SGEs. But given that hca_attr->max_sges is unreliable, what approach do you recommend? Should I post one WR for each buffer I want to send through RDMA WRITE? Another, less related problem: ib_get_err_str does not return the correct string for every input value. For example, ib_get_err_str(IB_INVALID_PD_HANDLE) returns IB_INVALID_MR_HANDLE. I don't know whether these problems apply to Linux too, so I'm including the general list.
Thanks and best regards, Diego From or.gerlitz at gmail.com Fri Feb 13 07:39:55 2009 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Fri, 13 Feb 2009 17:39:55 +0200 Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for bind In-Reply-To: <8D3FEF485C8346ECABC66929714483E4@amr.corp.intel.com> References: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com> <8D3FEF485C8346ECABC66929714483E4@amr.corp.intel.com> Message-ID: <15ddcffd0902130739k59abf606r17ed8616aae7c246@mail.gmail.com> On Fri, Feb 13, 2009 at 2:09 AM, Sean Hefty wrote: > If I run ucmatose -b ip_addr (with or without -s option), I see that > rdma_translate_ip() returns different net_device structures for the different > input addresses. cma_acquire_dev() also indicates that different ports on the > same HCA are being used for the two addresses. > If I unplug one of the ports, I can no long connect if I use the IP address that > corresponds to that port, but the other port still works. It doesn't matter > which port I unplug, as long as I use the correct IP address. I wasn't sure whether you actually ran the whole test or just let rdma_bind be called and observed the above. Anyway, if you send me a patch with the prints you've added, I can repeat it in my setup and we'll see. Or. From or.gerlitz at gmail.com Fri Feb 13 07:48:16 2009 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Fri, 13 Feb 2009 17:48:16 +0200 Subject: Re: [ofa-general] mlx4 changing RNR_RETRY for an established qp In-Reply-To: <4994A625.9060008@oracle.com> References: <4994A1FD.2060704@oracle.com> <4994A625.9060008@oracle.com> Message-ID: <15ddcffd0902130748n3b02c6b5v9be07d324c287692@mail.gmail.com> On Fri, Feb 13, 2009 at 12:43 AM, Andy Grover wrote: > Yes of course, it would just be a little nicer if we could change once > connected, instead of only when initiating the connection, so I wanted > to find out if that was also possible.
Hi Andy, a couple of months ago I made a comment to Olaf about an alternative way for you to change the RNR value, see http://oss.oracle.com/pipermail/rds-devel/2008-May/000595.html - the archive copy has awfully long lines - so I'll also forward it to you directly. Or. From or.gerlitz at gmail.com Fri Feb 13 07:50:00 2009 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Fri, 13 Feb 2009 17:50:00 +0200 Subject: Fwd: [ofa-general] RDS flow control In-Reply-To: <4833D223.5090007@voltaire.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <4833D223.5090007@voltaire.com> Message-ID: <15ddcffd0902130750j9720a01g4400be9b423004fd@mail.gmail.com> ---------- Forwarded message ---------- From: Or Gerlitz Date: Wed, May 21, 2008 at 9:41 AM Subject: Re: [ofa-general] RDS flow control To: Olaf Kirch Cc: Roland Dreier , rds-devel at oss.oracle.com, general at lists.openfabrics.org Roland Dreier wrote: > > > Is there a way of changing the RNR retry count back to 0 after establishing > > the connection? > > Yes... quite complicated but possible. Basically you have to transition > to the QP to the "send queue drained" (SQD) state, change the rnr retry > value in an SQD->SQD transition and then transition back to RTS. In case the RTS->SQD->SQD->RTS transition is not applicable, or just to be aware of more solutions, I gave it some thought and it seems possible for you to build a protocol that exchanges (through the private data carried by the CM messages) whether each side supports credit management and, based on that and on HW support for the IB_DEVICE_RC_RNR_NAK_GEN device capability, decides what value to place into the QP RNR retries. On the passive side of the connection it's trivial, since the rdma-cm uses the values you place into the conn_param parameters of rdma_accept.
On the active side, things are a bit more complex, but with some changes I think you would also be able to do it in a way other than the SQD one: the RNR retry count is set in the QP when it is moved to RTS (Ready-To-Send). So, if you managed to get hold of the QP --before-- the RTU is sent (since this point in time is the last synchronization step the IB CM provides to you), you could set the RNR retry value according to info carried in the REP message sent by the passive side (which you have posted in the private data to rdma_accept, etc). This would be possible if you enhance the rdma-cm to deliver the RDMA_CM_EVENT_CONNECT_RESPONSE event also to IDs created with the PS_TCP port space (e.g. conditioned on some new field in conn_param), whereas today it is supported only for PS_SDP ones. Once this change is in place, you will get the RDMA_CM_EVENT_CONNECT_RESPONSE event, decide what RNR retry value you want to use, and call rdma_accept providing this value (one more little change is needed here in cma.c); the rdma cm would then override the value set by cm_init_qp_rts_attr, see cma_modify_qp_rts -> rdma_init_qp_attr -> ib_cm_init_qp_attr -> cm_init_qp_rts_attr and you are done... Or. From chien.tin.tung at intel.com Fri Feb 13 07:59:35 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Fri, 13 Feb 2009 08:59:35 -0700 Subject: [ofa-general] RE: [PATCH v3] RDMA/nes: Account for freed pbl after hw operation In-Reply-To: References: <20090202231521.GA6220@ctung-MOBL> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA3830323437DF7@azsmsx501.amr.corp.intel.com> >when you forward someone else's patch, you should add your >Signed-off-by line after theirs. Will do.
Chien From dledford at redhat.com Fri Feb 13 08:13:32 2009 From: dledford at redhat.com (Doug Ledford) Date: Fri, 13 Feb 2009 11:13:32 -0500 Subject: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: References: Message-ID: <1234541612.751.1.camel@firewall.xsintricity.com> On Fri, 2009-02-13 at 08:05 +0800, Wen Hao Wang wrote: > Doug Ledford wrote on 2009-02-12 21:20:30: > > > On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote: > > > Wen Hao Wang wrote: > > > > > > > > Hi all: > > > > > > > > I changed my blade OS to RHEL5.3 yesterday and installed OFED > (shipped > > > > in RHEL5.3 image) by "yum groupisntall". Then I load some > drivers and > > > > wrote network interface configuration file ifcfg-ib0. ifup ib0 > also > > > > succeeded. But IB utilites report Connetion timed out. > > > > > > > > > > > > [root at xblade06 network-scripts]# sminfo > > > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out > > > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > > > > sminfo: iberror: failed: query > > > > > > > > I had to reboot the blade and rerun "openibd start". Then > sminfo > > > > reported correct contents. I do not suppose this reboot is > required. > > > > Did I miss any configuration step? > > > > There was an unintentional bug in the rhel5.2 openibd init script in > > that it automatically turned itself on during install (generally, > most > > init scripts should default to *not* turning themselves on during > > install of the package, nor should they start themselves during > install > > of the package...this is for security reasons, imagine if you > installed > > the bind name server on your box and it automatically started up > before > > you had a chance to configure it). In rhel5.3 we fixed that bug. > So, > > Yeah. I heard of this bug. > > > you may need to 'chkconfig --level 2345 openibd on' to make sure > openibd > > starts up each time. The error you list above is consistent with
The error you list above is consistent with > not > > all of the kernel modules being loaded when you tried to use the > sminfo > > program. > > Even after reboot, service openibd is not started automatically. > [root at xblade06 ~]# chkconfig --list openibd > openibd 0:off 1:off 2:off 3:off 4:off 5:off 6:off That's because you have to run the command I listed in my first email to turn it on. > I agree with you that maybe some modules were not loaded. But what's > that? > Before reboot, I run "/etc/init.d/openibd start" and > "/etc/init.d/network > restart". No error was reported. "openibd status" also looked good. Running start on a service does not enable that service at the next reboot. You must specifically enable the service in order for it to start automatically. > > > > > > Moreover, "openibd start" report one warning message about > hwconf. > > > > Anyone has comments about this? > > > > > > > > [root at xblade07 ~]# /etc/init.d/openibd start > > > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No > such > > > > file or directory > > > > [ OK ] > > > > Can you see if the kudzu package is installed on your machine? The > > openib package uses this config file written by kudzu to determine > what > > hardware drivers to load. I suppose I should put a specific > requires in > > the rpm for that. > > kudzu is installed. > [root at xblade06 ~]# rpm -q kudzu > kudzu-1.2.57.1.21-1 Make sure kudzu has been run at least once then (it would appear to be turned off on your machine or else /etc/sysconfig/hwconf would exist). You can run it manually from the command line and that should be sufficient for the openibd init script's needs. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From sean.hefty at intel.com Fri Feb 13 10:19:43 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 13 Feb 2009 10:19:43 -0800 Subject: [ofa-general] Re: pick the outgoing HCA based on the IP used for bind In-Reply-To: <15ddcffd0902130739k59abf606r17ed8616aae7c246@mail.gmail.com> References: <15ddcffd0902111252q735aa158sc69568c50314da67@mail.gmail.com> <8D3FEF485C8346ECABC66929714483E4@amr.corp.intel.com> <15ddcffd0902130739k59abf606r17ed8616aae7c246@mail.gmail.com> Message-ID: <56DD47B66EFC4D23A33D6201E4093128@amr.corp.intel.com> >I wasn't sure if you actually run the whole test or just let rdma_bind >to be called and see the above. Anyway, if you send me a patch with >the prints you've added, I can repeat it in my setup and we'll see. I let ucmatose run successfully. It's kind of a hassel for me to generate a patch for this (I made them directly on the kernel code on my test systems), but these are the changes: rdma_translate_ip (addr.c) Add printk after ip_dev_find to display the ip and dev variables. 
cma_acquire_dev (cma.c) Add printk after ib_find_cached_gid to display cma_dev and id_priv->id.port_num - Sean From sean.hefty at intel.com Fri Feb 13 10:39:48 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 13 Feb 2009 10:39:48 -0800 Subject: [ofa-general] RE: [ofw] [ib-diag] sminfo: add support for WinOF In-Reply-To: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> Message-ID: <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com> >diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband- >diags/src/ibdiag_common.c >index bda1efa..154e00c 100644 >--- a/infiniband-diags/src/ibdiag_common.c >+++ b/infiniband-diags/src/ibdiag_common.c >@@ -43,15 +43,14 @@ > #include > #include > #include >-#include > #include >-#include > #include > > #include > #include > #include > #include >+#include "ibdiag_osd.h" I think it'll be easier to just put this include in ibdiag_common.h... >diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c >index e96c782..7767668 100644 >--- a/infiniband-diags/src/sminfo.c >+++ b/infiniband-diags/src/sminfo.c >@@ -37,14 +37,13 @@ > > #include > #include >-#include >-#include > #include > > #include > #include > > #include "ibdiag_common.h" >+#include "ibdiag_osd.h" ...and avoid adding it to all the source files. I'll update my patches, but wait for comments against this patch before re-submitting. 
- Sean From andy.grover at oracle.com Fri Feb 13 11:05:27 2009 From: andy.grover at oracle.com (Andy Grover) Date: Fri, 13 Feb 2009 11:05:27 -0800 Subject: Fwd: [ofa-general] RDS flow control In-Reply-To: <15ddcffd0902130750j9720a01g4400be9b423004fd@mail.gmail.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <4833D223.5090007@voltaire.com> <15ddcffd0902130750j9720a01g4400be9b423004fd@mail.gmail.com> Message-ID: <4995C477.5010109@oracle.com> Thanks Or! This is exactly the kind of info I was looking for. Regards -- Andy Or Gerlitz wrote: > ---------- Forwarded message ---------- > From: Or Gerlitz > Date: Wed, May 21, 2008 at 9:41 AM > Subject: Re: [ofa-general] RDS flow control > To: Olaf Kirch > Cc: Roland Dreier , rds-devel at oss.oracle.com, > general at lists.openfabrics.org > > > Roland Dreier wrote: >> > Is there a way of changing the RNR retry count back to 0 after > establishing >> > the connection? >> >> Yes... quite complicated but possible. Basically you have to transition >> to the QP to the "send queue drained" (SQD) state, change the rnr retry >> value in an SQD->SQD transition and then transition back to RTS. > > In case the RTS->SQD->SQD->RTS transition is not applicable or just for the > sake of being aware to more solutions, I gave it some thought and its seems > possible for you to build a protocol which uses exchange (through the > private data carried by the CM messages) whether each side supports credit > management, and based on that && HW support of the IB_DEVICE_RC_RNR_NAK_GEN > device capability decide what value to place into the QP RNR retries. > > On the passive side of the connection its trivial, since the rdma-cm uses > the values you place into the conn_param parameters of rdma_accept. 
> > On the active side, things are a bit more complex, but with some changes, I > think you would be able to do it also in a different way than the SQD one: > the RNR retries are set into the QP once its being moved to RTS > (Ready-To-Send). So, if you managed to get the QP into your hands --before-- > the RTU is sent (since this point in time is the last synchoronization step > provided to you by the IB CM), you could set the RNR retries value accroding > to info carried in the REP message sent by the passive (which you have > posted in the private data to rdma_accept, etc). > > This would be possible, if you enhance the rdma-cm to deliver > RDMA_CM_EVENT_CONNECT_RESPONSE event also to IDs created with the PS_TCP > port space (eg conditioned on some new field in conn_param) where today its > supported only to PS_SDP ones. > > Once this change is in place, you will get RDMA_CM_EVENT_CONNECT_RESPONSE > event, decide what RNR retry value you want to use, and call rdma_accept > providing this value (one more little change is needed here in cma.c), the > rdma cm would override the value set by cm_init_qp_rts_attr, see > cma_modify_qp_rts -> rdma_init_qp_attr -> ib_cm_init_qp_attr -> > cm_init_qp_rts_attr > > and you are done... > > Or. > From ralph.campbell at qlogic.com Fri Feb 13 11:31:02 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 13 Feb 2009 11:31:02 -0800 Subject: [ofa-general] [PATCH] opensm: fix structure definition for trap 257-258 Message-ID: <1234553462.3948.31.camel@chromite.mv.qlogic.com> I was looking at a structure definition for trap messages in the opensm code and noticed this minor bug. Here is a patch to correct the problem. 
Signed-off-by: Ralph Campbell diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 0f9d110..cc92f36 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -7176,10 +7176,9 @@ typedef struct _ib_mad_notice_attr // Total Size calc Accumulated ib_net16_t pad1; // 2 ib_net16_t lid1; // 2 ib_net16_t lid2; // 2 - ib_net32_t key; // 2 - uint8_t sl; // 1 - ib_net32_t qp1; // 4 - ib_net32_t qp2; // 4 + ib_net32_t key; // 4 + ib_net32_t qp1; // 4b sl, 4b pad, 24b qp1 + ib_net32_t qp2; // 8b pad, 24b qp2 ib_gid_t gid1; // 16 ib_gid_t gid2; // 16 } PACK_SUFFIX ntc_257_258; From vst at vlnb.net Fri Feb 13 12:02:54 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Fri, 13 Feb 2009 23:02:54 +0300 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <4980B8DE.3060806@harr.org> References: <48E386F6.5040502@fusionio.com> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vl nb.net> <4980B8DE.3060806@harr.org> Message-ID: <4995D1EE.4000807@vlnb.net> Cameron Harr, on 01/28/2009 10:58 PM wrote: > I've attached a spreadsheet with some of my findings. 
In the Summary > tab, I have a baseline with no affinity set. For other 5 tests, see below. > > Vladislav Bolkhovitin wrote: >> Try the following variants: >> >> 1. Affine IRQ 82, scsi_tgt0 to CPU0, fct0-worker to CPU2, IRQs 169 and >> 177 to CPU4, scsi_tgt1 to CPU1, fct1-worker to CPU3, scsi_tgt2 to >> CPU5, fct2-worker to CPU7 >> >> 2. Affine IRQ 82 to CPU0, fct0-worker to CPU2, IRQs 169 and 177 to >> CPU4, fct1-worker to CPU3, fct2-worker to CPU7, no affinity for other >> processes. >> >> 3. Affine IRQ 82 to CPU0, IRQs 169 and 177 to CPU4, fct1-worker's to >> all CPUs, except CPU0 and CPU4, no affinity for other processes. > These are tests 1, 2 and 3, respectively >> Or other similar variants you'd like (even CPUs relate to physical >> CPU0, odd CPUs relate to physical CPU1). For instance, you can try to >> affine IRQs 169 and 177 to CPU1. > I did two other tests (Tests 4,5), that has the mlx4_core (comp) IRQ > (formerly known as IRQ 82) pinned to CPU0, the two ioDrive IRQs (169, > 177) pinned to CPU 4, fct0 and scsi_tgt0 on CPUs 2&3, fct1 and scsi_tgt1 > on CPUs 4&6 (test 4) OR fct1 and scsi_tgt1 on CPUs 5&6. >> No points to run for srptthread=1, for it just produce a baseline with >> no affinity at all. > I ran with these anyway to look at differences among the tests. Having > this thread enabled always results in better performance. >> Please do each run several times and write down an average result >> between runs and approximate variation between them in %%. Otherwise >> we can't make any reliable conclusions. > I ran each test 3 times and took the averages. In order to get a quick > look at performance per run, I added a column in the summary that sums > the IOPs for each test with SRPT thread enabled and then not enabled. > Test 4 seems to give the best results. 
> Here's a brief summary of that summary with just SRPT thread=0:
>
> Baseline: 356226.39
> Test 1:   371217.6533
> Test 2:   370553.78
> Test 3:   373295.2033
> Test 4:   399385.2233
> Test 5:   393204.5833

The Linux CPU scheduler does a really impressive job!

Interesting, I wonder whether anything will change with:

1. The latest SVN. It has some changes which might make a difference.

2. The pass-through dev handler instead of BLOCKIO, which you are using.

Thanks,
Vlad

From chien.tin.tung at intel.com  Fri Feb 13 13:24:31 2009
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 13 Feb 2009 15:24:31 -0600
Subject: [ofa-general] [PATCH] RDMA/nes: Inform hardware that asynchronous event has been handled
Message-ID: <20090213212431.GA7092@ctung-MOBL>

From: Don Wood

When asynchronous events are processed by software, it is necessary to
let the hardware know that software has handled the event. This frees
up the entry in the asynchronous event queue.

Signed-off-by: Don Wood
Signed-off-by: Chien Tung
---

diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index 5d139db..d612aec 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -2269,6 +2269,8 @@ static void nes_process_aeq(struct nes_device *nesdev, struct nes_hw_aeq *aeq)
 		if (++head >= aeq_size)
 			head = 0;
+
+		nes_write32(nesdev->regs+NES_AEQ_ALLOC, 1 << 16);
 	} while (1);
 	aeq->aeq_head = head;
diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h
index bc0b4de..498d43e 100644
--- a/drivers/infiniband/hw/nes/nes_hw.h
+++ b/drivers/infiniband/hw/nes/nes_hw.h
@@ -61,6 +61,7 @@ enum pci_regs {
 	NES_CQ_ACK = 0x0034,
 	NES_WQE_ALLOC = 0x0040,
 	NES_CQE_ALLOC = 0x0044,
+	NES_AEQ_ALLOC = 0x0048
 };
 
 enum indexed_regs {
-- 
1.5.2.2

From sean.hefty at intel.com  Fri Feb 13 14:55:17 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 13 Feb 2009 14:55:17 -0800
Subject: [ofa-general] [PATCH] [DAPL] scm: add support for WinOF
Message-ID:
<6402857E406545A895F63DF7FA784D42@amr.corp.intel.com>

Modify the openib_scm provider to support both OFED and WinOF releases.
This takes advantage of having a libibverbs compatibility library.*

Signed-off-by: Sean Hefty
---
* If only there were a sockets compatibility layer... gurgle

This is only build tested for Windows, but does run on Linux.

diff --git a/Makefile.am b/Makefile.am
index bfc93f7..5044e36 100755
--- a/Makefile.am
+++ b/Makefile.am
@@ -49,7 +49,8 @@ dapl_udapl_libdaploscm_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAG
 				-DOPENIB -DCQ_WAIT_OBJECT \
 				-I$(srcdir)/dat/include/ -I$(srcdir)/dapl/include/ \
 				-I$(srcdir)/dapl/common -I$(srcdir)/dapl/udapl/linux \
-				-I$(srcdir)/dapl/openib_scm
+				-I$(srcdir)/dapl/openib_scm \
+				-I$(srcdir)/dapl/openib_scm/linux
 
 if HAVE_LD_VERSION_SCRIPT
     dat_version_script = -Wl,--version-script=$(srcdir)/dat/udat/libdat2.map
diff --git a/dapl/openib_scm/README b/dapl/openib_scm/README
deleted file mode 100644
index 239dfe6..0000000
--- a/dapl/openib_scm/README
+++ /dev/null
@@ -1,40 +0,0 @@
-
-OpenIB uDAPL provider using socket-based CM, in leiu of uCM/uAT, to setup QP/channels.
-
-to build:
-
-cd dapl/udapl
-make VERBS=openib_scm clean
-make VERBS=openib_scm
-
-
-Modifications to common code:
-
-- added dapl/openib_scm directory
-
-	dapl/udapl/Makefile
-
-New files for openib_scm provider
-
-	dapl/openib/dapl_ib_cq.c
-	dapl/openib/dapl_ib_dto.h
-	dapl/openib/dapl_ib_mem.c
-	dapl/openib/dapl_ib_qp.c
-	dapl/openib/dapl_ib_util.c
-	dapl/openib/dapl_ib_util.h
-	dapl/openib/dapl_ib_cm.c
-
-A simple dapl test just for openib_scm testing...
-
-	test/dtest/dtest.c
-	test/dtest/makefile
-
-	server: dtest -s
-	client: dtest -h hostname
-
-known issues:
-
-	no memory windows support in ibverbs, dat_create_rmr fails.
- - - diff --git a/dapl/openib_scm/dapl_ib_cm.c b/dapl/openib_scm/dapl_ib_cm.c index 80a7d5e..9a15e42 100644 --- a/dapl/openib_scm/dapl_ib_cm.c +++ b/dapl/openib_scm/dapl_ib_cm.c @@ -52,26 +52,169 @@ #include "dapl_cr_util.h" #include "dapl_name_service.h" #include "dapl_ib_util.h" - -#include -#include -#include -#include -#include -#include - -#include -#include -#include - -#if __BYTE_ORDER == __LITTLE_ENDIAN -static inline uint64_t cpu_to_be64(uint64_t x) {return bswap_64(x);} -#elif __BYTE_ORDER == __BIG_ENDIAN -static inline uint64_t cpu_to_be64(uint64_t x) {return x;} -#endif +#include "dapl_osd.h" extern int g_scm_pipe[2]; +#if defined(_WIN32) || defined(_WIN64) +enum DAPL_FD_EVENTS { + DAPL_FD_READ = 0x1, + DAPL_FD_WRITE = 0x2, + DAPL_FD_ERROR = 0x4 +}; + +static int dapl_config_socket(DAPL_SOCKET s) +{ + unsigned long nonblocking = 1; + return ioctlsocket(s, FIONBIO, &nonblocking); +} + +static int dapl_connect_socket(DAPL_SOCKET s, struct sockaddr *addr, + int addrlen) +{ + int err; + + connect(s, addr, addrlen); + err = WSAGetLastError(); + return (err == WSAEWOULDBLOCK) ? EAGAIN : err; +} + +struct dapl_fd_set { + struct fd_set set[3]; +}; + +static struct dapl_fd_set *dapl_alloc_fd_set(void) +{ + return dapl_os_alloc(sizeof(struct dapl_fd_set)); +} + +static void dapl_fd_zero(struct dapl_fd_set *set) +{ + FD_ZERO(&set->set[0]); + FD_ZERO(&set->set[1]); + FD_ZERO(&set->set[2]); +} + +static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set, + enum DAPL_FD_EVENTS event) +{ + FD_SET(s, &set->set[(event == DAPL_FD_READ) ? 
0 : 1]); + FD_SET(s, &set->set[2]); + return 0; +} + +static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum DAPL_FD_EVENTS event) +{ + struct fd_set rw_fds; + struct fd_set err_fds; + struct timeval tv; + int ret; + + FD_ZERO(&rw_fds); + FD_ZERO(&err_fds); + FD_SET(s, &rw_fds); + FD_SET(s, &err_fds); + + tv.tv_sec = 0; + tv.tv_usec = 0; + + if (event == DAPL_FD_READ) + ret = select(1, &rw_fds, NULL, &err_fds, &tv); + else + ret = select(1, NULL, &rw_fds, &err_fds, &tv); + + if (ret == 0) + return 0; + else if (FD_ISSET(s, &rw_fds)) + return event; + else if (FD_ISSET(s, &err_fds)) + return DAPL_FD_ERROR; + else + return WSAGetLastError(); +} + +static int dapl_select(struct dapl_fd_set *set) +{ + return select(0, &set->set[0], &set->set[1], &set->set[2], NULL); +} +#else // _WIN32 || _WIN64 +enum DAPL_FD_EVENTS { + DAPL_FD_READ = POLLIN, + DAPL_FD_WRITE = POLLOUT, + DAPL_FD_ERROR = POLLERR +}; + +static int dapl_config_socket(DAPL_SOCKET s) +{ + int ret; + + ret = fcntl(s, F_GETFL); + if (ret >= 0) + ret = fcntl(s, F_SETFL, ret | O_NONBLOCK); + return ret; +} + +static int dapl_connect_socket(DAPL_SOCKET s, struct sockaddr *addr, int addrlen) +{ + int ret; + + ret = connect(s, addr, addrlen); + + return (errno == EINPROGRESS) ? 
EAGAIN : ret; +} + +struct dapl_fd_set { + int index; + struct pollfd set[DAPL_FD_SETSIZE]; +}; + +static struct dapl_fd_set *dapl_alloc_fd_set(void) +{ + return dapl_os_alloc(sizeof(struct dapl_fd_set)); +} + +static void dapl_fd_zero(struct dapl_fd_set *set) +{ + set->index = 0; +} + +static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set, + enum DAPL_FD_EVENTS event) +{ + if (set->index == DAPL_FD_SETSIZE - 1) { + dapl_log(DAPL_DBG_TYPE_ERR, + "SCM ERR: cm_thread exceeded FD_SETSIZE %d\n", + set->index + 1); + return -1; + } + + set->set[set->index].fd = s; + set->set[set->index].revents = 0; + set->set[set->index++].events = event; + return 0; +} + +static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum DAPL_FD_EVENTS event) +{ + struct pollfd fds; + int ret; + + fds.fd = s; + fds.events = event; + fds.revents = 0; + ret = poll(&fds, 1, 0); + if (ret <= 0) + return ret; + + return fds.revents; +} + +static int dapl_select(struct dapl_fd_set *set) +{ + return poll(set->set, set->index, -1); +} +#endif + static struct ib_cm_handle *dapli_cm_create(void) { struct ib_cm_handle *cm_ptr; @@ -85,7 +228,7 @@ static struct ib_cm_handle *dapli_cm_create(void) (void)dapl_os_memzero(cm_ptr, sizeof(*cm_ptr)); cm_ptr->dst.ver = htons(DSCM_VER); - cm_ptr->socket = -1; + cm_ptr->socket = DAPL_INVALID_SOCKET; return cm_ptr; bail: dapl_os_free(cm_ptr, sizeof(*cm_ptr)); @@ -100,8 +243,8 @@ static void dapli_cm_destroy(struct ib_cm_handle *cm_ptr) /* cleanup, never made it to work queue */ if (cm_ptr->state == SCM_INIT) { - if (cm_ptr->socket >= 0) - close(cm_ptr->socket); + if (cm_ptr->socket != DAPL_INVALID_SOCKET) + closesocket(cm_ptr->socket); dapl_os_free(cm_ptr, sizeof(*cm_ptr)); return; } @@ -112,9 +255,9 @@ static void dapli_cm_destroy(struct ib_cm_handle *cm_ptr) cm_ptr->ep->cm_handle = IB_INVALID_HANDLE; /* close socket if still active */ - if (cm_ptr->socket >= 0) { - close(cm_ptr->socket); - cm_ptr->socket = -1; + if (cm_ptr->socket != DAPL_INVALID_SOCKET) { + 
closesocket(cm_ptr->socket); + cm_ptr->socket = DAPL_INVALID_SOCKET; } dapl_os_unlock(&cm_ptr->lock); @@ -172,14 +315,14 @@ dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr) return DAT_SUCCESS; } else { /* send disc date, close socket, schedule destroy */ - if (cm_ptr->socket >= 0) { - if (write(cm_ptr->socket, - &disc_data, sizeof(disc_data)) == -1) + if (cm_ptr->socket != DAPL_INVALID_SOCKET) { + if (send(cm_ptr->socket, (char *) &disc_data, + sizeof(disc_data), 0) == -1) dapl_log(DAPL_DBG_TYPE_WARN, " cm_disc: write error = %s\n", strerror(errno)); - close(cm_ptr->socket); - cm_ptr->socket = -1; + closesocket(cm_ptr->socket); + cm_ptr->socket = DAPL_INVALID_SOCKET; } cm_ptr->state = SCM_DISCONNECTED; } @@ -211,7 +354,7 @@ void dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err) { int len, opt = 1; - struct iovec iovec[2]; + struct iovec iov[2]; struct dapl_ep *ep_ptr = cm_ptr->ep; if (err) { @@ -226,18 +369,21 @@ dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err) " socket connected, write QP and private data\n"); /* no delay for small packets */ - setsockopt(cm_ptr->socket,IPPROTO_TCP,TCP_NODELAY,&opt,sizeof(opt)); + setsockopt(cm_ptr->socket, IPPROTO_TCP, TCP_NODELAY, + (char *) &opt, sizeof(opt)); /* send qp info and pdata to remote peer */ - iovec[0].iov_base = &cm_ptr->dst; - iovec[0].iov_len = sizeof(ib_qp_cm_t); + iov[0].iov_base = (void *) &cm_ptr->dst; + iov[0].iov_len = sizeof(ib_qp_cm_t); if (cm_ptr->dst.p_size) { - iovec[1].iov_base = cm_ptr->p_data; - iovec[1].iov_len = ntohl(cm_ptr->dst.p_size); + iov[1].iov_base = cm_ptr->p_data; + iov[1].iov_len = ntohl(cm_ptr->dst.p_size); + len = writev(cm_ptr->socket, iov, 2); + } else { + len = writev(cm_ptr->socket, iov, 1); } - len = writev(cm_ptr->socket, iovec, (cm_ptr->dst.p_size ? 
2:1)); - if (len != (ntohl(cm_ptr->dst.p_size) + sizeof(ib_qp_cm_t))) { + if (len != (ntohl(cm_ptr->dst.p_size) + sizeof(ib_qp_cm_t))) { dapl_log(DAPL_DBG_TYPE_ERR, " CONN_PENDING write: ERR %s, wcnt=%d -> %s\n", strerror(errno), len, @@ -253,9 +399,9 @@ dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err) dapl_dbg_log(DAPL_DBG_TYPE_CM, " connected: sending SRC GID subnet %016llx id %016llx\n", (unsigned long long) - cpu_to_be64(cm_ptr->dst.gid.global.subnet_prefix), + htonll(cm_ptr->dst.gid.global.subnet_prefix), (unsigned long long) - cpu_to_be64(cm_ptr->dst.gid.global.interface_id)); + htonll(cm_ptr->dst.gid.global.interface_id)); /* queue up to work thread to avoid blocking consumer */ cm_ptr->state = SCM_RTU_PENDING; @@ -290,25 +436,23 @@ dapli_socket_connect(DAPL_EP *ep_ptr, return DAT_INSUFFICIENT_RESOURCES; /* create, connect, sockopt, and exchange QP information */ - if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) < 0 ) { + if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) == DAPL_INVALID_SOCKET) { dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); return DAT_INSUFFICIENT_RESOURCES; } - /* non-blocking */ - ret = fcntl(cm_ptr->socket, F_GETFL); - if (ret < 0 || fcntl(cm_ptr->socket, - F_SETFL, ret | O_NONBLOCK) < 0) { - dapl_log(DAPL_DBG_TYPE_ERR, - " socket connect: fcntl on socket %d ERR %d %s\n", - cm_ptr->socket, ret, - strerror(errno)); - goto bail; - } + ret = dapl_config_socket(cm_ptr->socket); + if (ret < 0) { + dapl_log(DAPL_DBG_TYPE_ERR, + " socket connect: config socket %d ERR %d %s\n", + cm_ptr->socket, ret, strerror(errno)); + goto bail; + } ((struct sockaddr_in*)r_addr)->sin_port = htons(r_qual); - ret = connect(cm_ptr->socket, r_addr, sizeof(*r_addr)); - if (ret && errno != EINPROGRESS) { + ret = dapl_connect_socket(cm_ptr->socket, (struct sockaddr *) r_addr, + sizeof(*r_addr)); + if (ret && ret != EAGAIN) { dapl_log(DAPL_DBG_TYPE_ERR, " socket connect ERROR: %s -> %s r_qual %d\n", strerror(errno), @@ -391,16 +535,13 @@ 
dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) { DAPL_EP *ep_ptr = cm_ptr->ep; int len; - struct iovec iovec[2]; short rtu_data = htons(0x0E0F); ib_cm_events_t event = IB_CME_DESTINATION_REJECT; /* read DST information into cm_ptr, overwrite SRC info */ dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: recv peer QP data\n"); - iovec[0].iov_base = &cm_ptr->dst; - iovec[0].iov_len = sizeof(ib_qp_cm_t); - len = readv(cm_ptr->socket, iovec, 1); + len = recv(cm_ptr->socket, (char *) &cm_ptr->dst, sizeof(ib_qp_cm_t), 0); if (len != sizeof(ib_qp_cm_t) || ntohs(cm_ptr->dst.ver) != DSCM_VER) { dapl_log(DAPL_DBG_TYPE_ERR, " CONN_RTU read: ERR %s, rcnt=%d, ver=%d -> %s\n", @@ -456,9 +597,7 @@ dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) /* read private data into cm_handle if any present */ dapl_dbg_log(DAPL_DBG_TYPE_EP," socket connected, read private data\n"); if (cm_ptr->dst.p_size) { - iovec[0].iov_base = cm_ptr->p_data; - iovec[0].iov_len = cm_ptr->dst.p_size; - len = readv(cm_ptr->socket, iovec, 1); + len = recv(cm_ptr->socket, cm_ptr->p_data, cm_ptr->dst.p_size, 0); if (len != cm_ptr->dst.p_size) { dapl_log(DAPL_DBG_TYPE_ERR, " CONN_RTU read pdata: ERR %s, rcnt=%d -> %s\n", @@ -495,7 +634,7 @@ dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: send RTU\n"); /* complete handshake after final QP state change */ - if (write(cm_ptr->socket, &rtu_data, sizeof(rtu_data)) == -1) { + if (send(cm_ptr->socket, (char *) &rtu_data, sizeof(rtu_data), 0) == -1) { dapl_log(DAPL_DBG_TYPE_ERR, " CONN_RTU: write error = %s\n", strerror(errno)); goto bail; @@ -564,7 +703,7 @@ dapli_socket_listen(DAPL_IA *ia_ptr, cm_ptr->hca = ia_ptr->hca_ptr; /* bind, listen, set sockopt, accept, exchange data */ - if ((cm_ptr->socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) { + if ((cm_ptr->socket = socket(AF_INET, SOCK_STREAM, 0)) == DAPL_INVALID_SOCKET) { dapl_log(DAPL_DBG_TYPE_ERR, " ERR: listen socket create: %s\n", strerror(errno)); @@ -572,7 +711,8 
@@ dapli_socket_listen(DAPL_IA *ia_ptr, goto bail; } - setsockopt(cm_ptr->socket,SOL_SOCKET,SO_REUSEADDR,&opt,sizeof(opt)); + setsockopt(cm_ptr->socket, SOL_SOCKET, SO_REUSEADDR, + (char *) &opt, sizeof(opt)); addr.sin_port = htons(serviceID); addr.sin_family = AF_INET; addr.sin_addr.s_addr = INADDR_ANY; @@ -625,7 +765,7 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) (void) dapl_os_memzero(acm_ptr, sizeof(*acm_ptr)); - acm_ptr->socket = -1; + acm_ptr->socket = DAPL_INVALID_SOCKET; acm_ptr->sp = cm_ptr->sp; acm_ptr->hca = cm_ptr->hca; @@ -633,7 +773,7 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) acm_ptr->socket = accept(cm_ptr->socket, (struct sockaddr*)&acm_ptr->dst.ia_address, (socklen_t*)&len); - if (acm_ptr->socket < 0) { + if (acm_ptr->socket == DAPL_INVALID_SOCKET) { dapl_log(DAPL_DBG_TYPE_ERR, " accept: ERR %s on FD %d l_cr %p\n", strerror(errno),cm_ptr->socket,cm_ptr); @@ -664,7 +804,7 @@ dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr) dapl_dbg_log(DAPL_DBG_TYPE_EP," socket accepted, read QP data\n"); /* read in DST QP info, IA address. 
check for private data */ - len = read(acm_ptr->socket, &acm_ptr->dst, sizeof(ib_qp_cm_t)); + len = recv(acm_ptr->socket, (char *) &acm_ptr->dst, sizeof(ib_qp_cm_t), 0); if (len != sizeof(ib_qp_cm_t) || ntohs(acm_ptr->dst.ver) != DSCM_VER) { dapl_log(DAPL_DBG_TYPE_ERR, @@ -700,8 +840,7 @@ dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr) /* read private data into cm_handle if any present */ if (acm_ptr->dst.p_size) { - len = read( acm_ptr->socket, - acm_ptr->p_data, acm_ptr->dst.p_size); + len = recv(acm_ptr->socket, acm_ptr->p_data, acm_ptr->dst.p_size, 0); if (len != acm_ptr->dst.p_size) { dapl_log(DAPL_DBG_TYPE_ERR, " accept read pdata: ERR %s, rcnt=%d\n", @@ -757,14 +896,14 @@ dapli_socket_accept_usr(DAPL_EP *ep_ptr, DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; dp_ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; ib_qp_cm_t local; - struct iovec iovec[2]; + struct iovec iov[2]; int len; if (p_size > IB_MAX_REP_PDATA_SIZE) return DAT_LENGTH_ERROR; /* must have a accepted socket */ - if (cm_ptr->socket < 0) + if (cm_ptr->socket == DAPL_INVALID_SOCKET) return DAT_INTERNAL_ERROR; dapl_dbg_log(DAPL_DBG_TYPE_EP, @@ -844,14 +983,17 @@ dapli_socket_accept_usr(DAPL_EP *ep_ptr, local.ia_address = ia_ptr->hca_ptr->hca_address; local.p_size = htonl(p_size); - iovec[0].iov_base = &local; - iovec[0].iov_len = sizeof(ib_qp_cm_t); + iov[0].iov_base = (void *) &local; + iov[0].iov_len = sizeof(ib_qp_cm_t); if (p_size) { - iovec[1].iov_base = p_data; - iovec[1].iov_len = p_size; + iov[1].iov_base = p_data; + iov[1].iov_len = p_size; + len = writev(cm_ptr->socket, iov, 2); + } else { + len = writev(cm_ptr->socket, iov, 1); } - len = writev(cm_ptr->socket, iovec, (p_size ? 
2:1)); - if (len != (p_size + sizeof(ib_qp_cm_t))) { + + if (len != (p_size + sizeof(ib_qp_cm_t))) { dapl_log(DAPL_DBG_TYPE_ERR, " ACCEPT_USR: ERR %s, wcnt=%d -> %s\n", strerror(errno), len, @@ -859,6 +1001,7 @@ dapli_socket_accept_usr(DAPL_EP *ep_ptr, &cm_ptr->dst.ia_address)->sin_addr)); goto bail; } + dapl_dbg_log(DAPL_DBG_TYPE_CM, " ACCEPT_USR: local port=0x%x lid=0x%x" " qpn=0x%x psize=%d\n", @@ -867,9 +1010,9 @@ dapli_socket_accept_usr(DAPL_EP *ep_ptr, dapl_dbg_log(DAPL_DBG_TYPE_CM, " ACCEPT_USR SRC GID subnet %016llx id %016llx\n", (unsigned long long) - cpu_to_be64(local.gid.global.subnet_prefix), + htonll(local.gid.global.subnet_prefix), (unsigned long long) - cpu_to_be64(local.gid.global.interface_id)); + htonll(local.gid.global.interface_id)); /* save state and reference to EP, queue for RTU data */ cm_ptr->ep = ep_ptr; @@ -894,7 +1037,7 @@ dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr) short rtu_data = 0; /* complete handshake after final QP state change */ - len = read(cm_ptr->socket, &rtu_data, sizeof(rtu_data)); + len = recv(cm_ptr->socket, (char *) &rtu_data, sizeof(rtu_data), 0); if (len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f) { dapl_log(DAPL_DBG_TYPE_ERR, " ACCEPT_RTU: ERR %s, rcnt=%d rdata=%x\n", @@ -1108,9 +1251,9 @@ dapls_ib_remove_conn_listener ( /* close accepted socket, free cm_srvc_handle and return */ if (cm_ptr != NULL) { - if (cm_ptr->socket >= 0) { - close(cm_ptr->socket ); - cm_ptr->socket = -1; + if (cm_ptr->socket != DAPL_INVALID_SOCKET) { + closesocket(cm_ptr->socket); + cm_ptr->socket = DAPL_INVALID_SOCKET; } /* cr_thread will free */ cm_ptr->state = SCM_DESTROY; @@ -1195,27 +1338,29 @@ dapls_ib_reject_connection( IN DAT_COUNT psize, IN const DAT_PVOID pdata) { - struct iovec iovec[2]; + struct iovec iov[2]; dapl_dbg_log (DAPL_DBG_TYPE_EP, " reject(cm %p reason %x, pdata %p, psize %d)\n", cm_ptr, reason, pdata, psize); /* write reject data to indicate reject */ - if (cm_ptr->socket >= 0) { + if (cm_ptr->socket != 
DAPL_INVALID_SOCKET) { cm_ptr->dst.rej = (uint16_t)reason; cm_ptr->dst.rej = htons(cm_ptr->dst.rej); - iovec[0].iov_base = &cm_ptr->dst; - iovec[0].iov_len = sizeof(ib_qp_cm_t); + + iov[0].iov_base = (void *) &cm_ptr->dst; + iov[0].iov_len = sizeof(ib_qp_cm_t); if (psize) { - iovec[1].iov_base = pdata; - iovec[2].iov_len = psize; - writev(cm_ptr->socket, &iovec[0], 2); - } else - writev(cm_ptr->socket, &iovec[0], 1); - - close(cm_ptr->socket); - cm_ptr->socket = -1; + iov[1].iov_base = pdata; + iov[1].iov_len = psize; + writev(cm_ptr->socket, iov, 2); + } else { + writev(cm_ptr->socket, iov, 1); + } + + closesocket(cm_ptr->socket); + cm_ptr->socket = DAPL_INVALID_SOCKET; } /* cr_thread will destroy CR */ @@ -1444,138 +1589,141 @@ dapls_ib_get_cm_event ( } /* outbound/inbound CR processing thread to avoid blocking applications */ -#define SCM_MAX_CONN 8192 void cr_thread(void *arg) { - struct dapl_hca *hca_ptr = arg; - dp_ib_cm_handle_t cr, next_cr; - int opt,ret,idx; - socklen_t opt_len; - char rbuf[2]; - struct pollfd ufds[SCM_MAX_CONN]; - - dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread: ENTER hca %p\n",hca_ptr); - - dapl_os_lock( &hca_ptr->ib_trans.lock ); - hca_ptr->ib_trans.cr_state = IB_THREAD_RUN; - while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) { - idx=0; - ufds[idx].fd = g_scm_pipe[0]; /* wakeup and process work */ - ufds[idx].events = POLLIN; - ufds[idx].revents = 0; - - if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list)) - next_cr = dapl_llist_peek_head (&hca_ptr->ib_trans.list); - else - next_cr = NULL; - - while (next_cr) { - cr = next_cr; - if ((cr->socket == -1 && cr->state == SCM_DESTROY) || - hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { - - dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: Free %p\n", cr); - next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, - (DAPL_LLIST_ENTRY*)&cr->entry ); - dapl_llist_remove_entry(&hca_ptr->ib_trans.list, - (DAPL_LLIST_ENTRY*)&cr->entry); - dapl_os_free(cr, sizeof(*cr)); - continue; - } - - if 
(idx==SCM_MAX_CONN-1) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - "SCM ERR: cm_thread exceeded FD_SETSIZE %d\n",idx+1); - continue; - } - - /* Add to ufds for poll, check for immediate work */ - ufds[++idx].fd = cr->socket; /* add listen or cr */ - ufds[idx].revents = 0; - if (cr->state == SCM_CONN_PENDING) - ufds[idx].events = POLLOUT; - else - ufds[idx].events = POLLIN; - - /* check socket for event, accept in or connect out */ - dapl_dbg_log(DAPL_DBG_TYPE_CM," poll cr=%p, fd=%d,%d\n", - cr, cr->socket, ufds[idx].fd); - dapl_os_unlock(&hca_ptr->ib_trans.lock); - ret = poll(&ufds[idx],1,0); - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " poll wakeup ret=%d cr->st=%d" - " ev=0x%x fd=%d\n", - ret,cr->state,ufds[idx].revents,ufds[idx].fd); - - /* data on listen, qp exchange, and on disconnect request */ - if ((ret == 1) && ufds[idx].revents == POLLIN) { - if (cr->socket > 0) { - if (cr->state == SCM_LISTEN) - dapli_socket_accept(cr); - else if (cr->state == SCM_ACCEPTING) - dapli_socket_accept_data(cr); - else if (cr->state == SCM_ACCEPTED) - dapli_socket_accept_rtu(cr); - else if (cr->state == SCM_RTU_PENDING) - dapli_socket_connect_rtu(cr); - else if (cr->state == SCM_CONNECTED) - dapli_socket_disconnect(cr); + struct dapl_hca *hca_ptr = arg; + dp_ib_cm_handle_t cr, next_cr; + int opt, ret; + socklen_t opt_len; + char rbuf[2]; + struct dapl_fd_set *set; + enum DAPL_FD_EVENTS event; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread: ENTER hca %p\n", hca_ptr); + set = dapl_alloc_fd_set(); + if (!set) + goto out; + + dapl_os_lock(&hca_ptr->ib_trans.lock); + hca_ptr->ib_trans.cr_state = IB_THREAD_RUN; + + while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) { + dapl_fd_zero(set); + dapl_fd_set(g_scm_pipe[0], set, DAPL_FD_READ); + + if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list)) + next_cr = dapl_llist_peek_head(&hca_ptr->ib_trans.list); + else + next_cr = NULL; + + while (next_cr) { + cr = next_cr; + if ((cr->socket == DAPL_INVALID_SOCKET && cr->state == SCM_DESTROY) || + 
hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { + next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry); + dapl_llist_remove_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry); + dapl_os_free(cr, sizeof(*cr)); + continue; + } + + event = (cr->state == SCM_CONN_PENDING) ? + DAPL_FD_WRITE : DAPL_FD_READ; + if (dapl_fd_set(cr->socket, set, event)) { + dapl_log(DAPL_DBG_TYPE_ERR, + " cr_thread: DESTROY CR st=%d fd %d" + " -> %s\n", cr->state, cr->socket, + inet_ntoa(((struct sockaddr_in*) + &cr->dst.ia_address)->sin_addr)); + dapli_cm_destroy(cr); + continue; + } + + dapl_dbg_log(DAPL_DBG_TYPE_CM, " poll cr=%p, fd=%d\n", + cr, cr->socket); + dapl_os_unlock(&hca_ptr->ib_trans.lock); + + ret = dapl_poll(cr->socket, event); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " poll wakeup ret=%d cr->st=%d fd=%d\n", + ret, cr->state, cr->socket); + + /* data on listen, qp exchange, and on disconnect request */ + if (ret == DAPL_FD_READ) { + if (cr->socket != DAPL_INVALID_SOCKET) { + switch (cr->state) { + case SCM_LISTEN: + dapli_socket_accept(cr); + break; + case SCM_ACCEPTING: + dapli_socket_accept_data(cr); + break; + case SCM_ACCEPTED: + dapli_socket_accept_rtu(cr); + break; + case SCM_RTU_PENDING: + dapli_socket_connect_rtu(cr); + break; + case SCM_CONNECTED: + dapli_socket_disconnect(cr); + break; + default: + break; + } + } + /* connect socket is writable, check status */ + } else if (ret == DAPL_FD_WRITE || ret == DAPL_FD_ERROR) { + if (cr->state == SCM_CONN_PENDING) { + opt = 0; + ret = getsockopt(cr->socket, SOL_SOCKET, + SO_ERROR, (char *) &opt, &opt_len); + if (!ret) + dapli_socket_connected(cr, opt); + else + dapli_socket_connected(cr, errno); + } else { + dapl_log(DAPL_DBG_TYPE_CM, + " CM poll ERR, wrong state(%d) -> %s SKIP\n", cr->state, + inet_ntoa(((struct sockaddr_in*)&cr->dst.ia_address)->sin_addr)); + } + } else if (ret != 0) { + dapl_log(DAPL_DBG_TYPE_CM, + " CM poll warning %s, ret=%d st=%d -> %s\n", + 
strerror(errno), ret, cr->state, + inet_ntoa(((struct sockaddr_in*) + &cr->dst.ia_address)->sin_addr)); + + /* POLLUP, NVAL, or poll error, issue event if connected */ + if (cr->state == SCM_CONNECTED) + dapli_socket_disconnect(cr); + } + + dapl_os_lock(&hca_ptr->ib_trans.lock); + next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry); } - /* connect socket is writable, check status */ - } else if ((ret == 1) && - (ufds[idx].revents & POLLOUT || - ufds[idx].revents & POLLERR)) { - if (cr->state == SCM_CONN_PENDING) { - opt = 0; - ret = getsockopt(cr->socket, SOL_SOCKET, - SO_ERROR, &opt, &opt_len); - if (!ret) - dapli_socket_connected(cr,opt); - else - dapli_socket_connected(cr,errno); - } else { - dapl_log(DAPL_DBG_TYPE_CM, - " CM poll ERR, wrong state(%d) -> %s SKIP\n", - cr->state, - inet_ntoa(((struct sockaddr_in*) - &cr->dst.ia_address)->sin_addr)); + + dapl_os_unlock(&hca_ptr->ib_trans.lock); + dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: sleep, fds=%d\n", + set->index+1); + dapl_select(set); + dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: wakeup\n"); + + /* if pipe used to wakeup, consume */ + if (dapl_poll(g_scm_pipe[0], DAPL_FD_READ) == DAPL_FD_READ) { + if (read(g_scm_pipe[0], rbuf, 2) == -1) + dapl_log(DAPL_DBG_TYPE_CM, + " cr_thread: read pipe error = %s\n", + strerror(errno)); } - } else if (ret != 0) { - dapl_log(DAPL_DBG_TYPE_CM, - " CM poll warning %s, ret=%d revnt=%x st=%d -> %s\n", - strerror(errno), ret, ufds[idx].revents, cr->state, - inet_ntoa(((struct sockaddr_in*) - &cr->dst.ia_address)->sin_addr)); - - /* POLLUP, NVAL, or poll error, issue event if connected */ - if (cr->state == SCM_CONNECTED) - dapli_socket_disconnect(cr); - } - dapl_os_lock(&hca_ptr->ib_trans.lock); - next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, - (DAPL_LLIST_ENTRY*)&cr->entry); + dapl_os_lock(&hca_ptr->ib_trans.lock); } + dapl_os_unlock(&hca_ptr->ib_trans.lock); - dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: sleep, %d\n", idx+1); - 
poll(ufds,idx+1,-1); /* infinite, all sockets and pipe */ - /* if pipe used to wakeup, consume */ - if (ufds[0].revents == POLLIN) - if (read(g_scm_pipe[0], rbuf, 2) == -1) - dapl_log(DAPL_DBG_TYPE_CM, - " cr_thread: read pipe error = %s\n", - strerror(errno)); - dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: wakeup\n"); - dapl_os_lock(&hca_ptr->ib_trans.lock); - } - dapl_os_unlock(&hca_ptr->ib_trans.lock); - hca_ptr->ib_trans.cr_state = IB_THREAD_EXIT; - dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread(hca %p) exit\n",hca_ptr); + free(set); +out: + hca_ptr->ib_trans.cr_state = IB_THREAD_EXIT; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread(hca %p) exit\n",hca_ptr); } - -/* - * Local variables: - * c-indent-level: 4 - * c-basic-offset: 4 - * tab-width: 8 - * End: - */ diff --git a/dapl/openib_scm/dapl_ib_cq.c b/dapl/openib_scm/dapl_ib_cq.c index 7d6bd4f..59fff11 100644 --- a/dapl/openib_scm/dapl_ib_cq.c +++ b/dapl/openib_scm/dapl_ib_cq.c @@ -46,97 +46,111 @@ * **************************************************************************/ +#include "openib_osd.h" #include "dapl.h" #include "dapl_adapter_util.h" #include "dapl_lmr_util.h" #include "dapl_evd_util.h" #include "dapl_ring_buffer_util.h" -#include -#include -int dapli_cq_thread_init(struct dapl_hca *hca_ptr) +#if defined(_WIN64) || defined(_WIN32) +void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr) { - DAT_RETURN dat_status; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%p)\n", hca_ptr); - dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%p)\n", hca_ptr); + if (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) + return; - /* create thread to process inbound connect request */ - hca_ptr->ib_trans.cq_state = IB_THREAD_INIT; - dat_status = dapl_os_thread_create(cq_thread, (void*)hca_ptr, &hca_ptr->ib_trans.cq_thread); - if (dat_status != DAT_SUCCESS) - { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " cq_thread_init: failed to create thread\n"); - return 1; - } + /* destroy cr_thread and lock */ + hca_ptr->ib_trans.cq_state 
= IB_THREAD_CANCEL; + SetEvent(hca_ptr->ib_trans.ib_cq->event); + dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) cancel\n",hca_ptr); + while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) { + dapl_os_sleep_usec(20000); + } + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) exit\n",dapl_os_getpid()); +} + +static void cq_thread(void *arg) +{ + struct dapl_hca *hca_ptr = arg; + struct dapl_evd *evd_ptr; + struct ibv_cq *ibv_cq = NULL; + + hca_ptr->ib_trans.cq_state = IB_THREAD_RUN; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca %p\n",hca_ptr); - /* wait for thread to start */ - while (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) { - struct timespec sleep, remain; - sleep.tv_sec = 0; - sleep.tv_nsec = 20000000; /* 20 ms */ - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " cq_thread_init: waiting for cq_thread\n"); - nanosleep (&sleep, &remain); - } - dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%d) exit\n",getpid()); - return 0; + /* wait on DTO event, or signal to abort */ + while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) { + if (!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, &ibv_cq, (void*)&evd_ptr)) { + + if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) { + ibv_ack_cq_events(ibv_cq, 1); + return; + } + + /* process DTO event via callback */ + dapl_evd_dto_callback(hca_ptr->ib_hca_handle, evd_ptr->ib_cq_handle, + (void*)evd_ptr ); + + ibv_ack_cq_events(ibv_cq, 1); + } + } + hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca %p \n", hca_ptr); } +#else // _WIN32 || _WIN64 + void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr) { - dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%p)\n", hca_ptr); + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%p)\n", hca_ptr); if (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) return; - /* destroy cr_thread and lock */ - hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL; - pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1); - dapl_dbg_log(DAPL_DBG_TYPE_CM," 
cq_thread_destroy(%p) cancel\n",hca_ptr); - while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) { - struct timespec sleep, remain; - sleep.tv_sec = 0; - sleep.tv_nsec = 2000000; /* 2 ms */ - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " cq_thread_destroy: waiting for cq_thread\n"); - nanosleep (&sleep, &remain); - } - dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) exit\n",getpid()); + /* destroy cr_thread and lock */ + hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL; + pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1); + dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) cancel\n",hca_ptr); + while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) { + dapl_os_sleep_usec(20000); + } + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) exit\n",dapl_os_getpid()); } /* catch the signal */ static void ib_cq_handler(int signum) { - return; + return; } -void cq_thread( void *arg ) +static void cq_thread(void *arg) { - struct dapl_hca *hca_ptr = arg; - struct dapl_evd *evd_ptr; - struct ibv_cq *ibv_cq = NULL; + struct dapl_hca *hca_ptr = arg; + struct dapl_evd *evd_ptr; + struct ibv_cq *ibv_cq = NULL; sigset_t sigset; sigemptyset(&sigset); - sigaddset(&sigset,SIGUSR1); - pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); - signal(SIGUSR1, ib_cq_handler); + sigaddset(&sigset,SIGUSR1); + pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); + signal(SIGUSR1, ib_cq_handler); hca_ptr->ib_trans.cq_state = IB_THREAD_RUN; - + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca %p\n",hca_ptr); - /* wait on DTO event, or signal to abort */ - while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) { - struct pollfd cq_fd = { - .fd = hca_ptr->ib_trans.ib_cq->fd, - .events = POLLIN, - .revents = 0 - }; + /* wait on DTO event, or signal to abort */ + while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) { + struct pollfd cq_fd = { + .fd = hca_ptr->ib_trans.ib_cq->fd, + .events = POLLIN, + .revents = 0 + }; if ((poll(&cq_fd, 1, -1) == 1) && - (!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, - &ibv_cq, 
(void*)&evd_ptr))) { + (!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, &ibv_cq, (void*)&evd_ptr))) { if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) { ibv_ack_cq_events(ibv_cq, 1); @@ -144,15 +158,40 @@ void cq_thread( void *arg ) } /* process DTO event via callback */ - dapl_evd_dto_callback ( hca_ptr->ib_hca_handle, - evd_ptr->ib_cq_handle, - (void*)evd_ptr ); + dapl_evd_dto_callback(hca_ptr->ib_hca_handle, + evd_ptr->ib_cq_handle, (void*)evd_ptr ); ibv_ack_cq_events(ibv_cq, 1); } - } - hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT; - dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca %p \n", hca_ptr); + } + hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca %p \n", hca_ptr); +} + +#endif // _WIN32 || _WIN64 + + +int dapli_cq_thread_init(struct dapl_hca *hca_ptr) +{ + DAT_RETURN dat_status; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%p)\n", hca_ptr); + + /* create thread to process inbound connect request */ + hca_ptr->ib_trans.cq_state = IB_THREAD_INIT; + dat_status = dapl_os_thread_create(cq_thread, (void*)hca_ptr, &hca_ptr->ib_trans.cq_thread); + if (dat_status != DAT_SUCCESS) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " cq_thread_init: failed to create thread\n"); + return 1; + } + + /* wait for thread to start */ + while (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) { + dapl_os_sleep_usec(20000); + } + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%d) exit\n",dapl_os_getpid()); + return 0; } @@ -308,11 +347,11 @@ dapls_ib_cq_alloc ( IN DAPL_EVD *evd_ptr, IN DAT_COUNT *cqlen ) { + struct ibv_comp_channel *channel = ia_ptr->hca_ptr->ib_trans.ib_cq; + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, "dapls_ib_cq_alloc: evd %p cqlen=%d \n", evd_ptr, *cqlen ); - struct ibv_comp_channel *channel = ia_ptr->hca_ptr->ib_trans.ib_cq; - #ifdef CQ_WAIT_OBJECT if (evd_ptr->cq_wait_obj_handle) channel = evd_ptr->cq_wait_obj_handle; diff --git a/dapl/openib_scm/dapl_ib_dto.h b/dapl/openib_scm/dapl_ib_dto.h index 45000b9..fa19d01 
100644 --- a/dapl/openib_scm/dapl_ib_dto.h +++ b/dapl/openib_scm/dapl_ib_dto.h @@ -147,12 +147,6 @@ dapls_ib_post_send ( IN const DAT_RMR_TRIPLET *remote_iov, IN DAT_COMPLETION_FLAGS completion_flags) { - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_snd: ep %p op %d ck %p sgs", - "%d l_iov %p r_iov %p f %d\n", - ep_ptr, op_type, cookie, segments, local_iov, - remote_iov, completion_flags); - ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; struct ibv_send_wr wr; @@ -163,6 +157,12 @@ dapls_ib_post_send ( int ret; dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p op %d ck %p sgs", + "%d l_iov %p r_iov %p f %d\n", + ep_ptr, op_type, cookie, segments, local_iov, + remote_iov, completion_flags); + + dapl_dbg_log(DAPL_DBG_TYPE_EP, " post_snd: ep %p cookie %p segs %d l_iov %p\n", ep_ptr, cookie, segments, local_iov); @@ -317,12 +317,6 @@ dapls_ib_post_ext_send ( IN DAT_COMPLETION_FLAGS completion_flags, IN DAT_IB_ADDR_HANDLE *remote_ah) { - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_ext_snd: ep %p op %d ck %p sgs", - "%d l_iov %p r_iov %p f %d\n", - ep_ptr, op_type, cookie, segments, local_iov, - remote_iov, completion_flags, remote_ah); - ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; struct ibv_send_wr wr; @@ -331,6 +325,12 @@ dapls_ib_post_ext_send ( int ret; dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_ext_snd: ep %p op %d ck %p sgs", + "%d l_iov %p r_iov %p f %d\n", + ep_ptr, op_type, cookie, segments, local_iov, + remote_iov, completion_flags, remote_ah); + + dapl_dbg_log(DAPL_DBG_TYPE_EP, " post_snd: ep %p cookie %p segs %d l_iov %p\n", ep_ptr, cookie, segments, local_iov); diff --git a/dapl/openib_scm/dapl_ib_mem.c b/dapl/openib_scm/dapl_ib_mem.c index 54340ed..9a97e5e 100644 --- a/dapl/openib_scm/dapl_ib_mem.c +++ b/dapl/openib_scm/dapl_ib_mem.c @@ -1,4 +1,4 @@ -/* + /* * Copyright (c) 2005-2007 Intel Corporation. All rights reserved. 
* * This Software is licensed under one of the following licenses: @@ -35,13 +35,6 @@ * **********************************************************************/ -#include /* for IOCTL's */ -#include /* for socket(2) and related bits and pieces */ -#include /* for socket(2) */ -#include /* for struct ifreq */ -#include /* for ARPHRD_ETHER */ -#include /* for _SC_CLK_TCK */ - #include "dapl.h" #include "dapl_adapter_util.h" #include "dapl_lmr_util.h" @@ -215,10 +208,9 @@ dapls_ib_mr_register(IN DAPL_IA *ia_ptr, lmr->param.registered_address = (DAT_VADDR)(uintptr_t)virt_addr; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " mr_register: mr=%p addr=%p h %x pd %p ctx %p " + " mr_register: mr=%p addr=%p pd %p ctx %p " "lkey=0x%x rkey=0x%x priv=%x\n", lmr->mr_handle, lmr->mr_handle->addr, - lmr->mr_handle->handle, lmr->mr_handle->pd, lmr->mr_handle->context, lmr->mr_handle->lkey, lmr->mr_handle->rkey, length, dapls_convert_privileges(privileges)); diff --git a/dapl/openib_scm/dapl_ib_util.c b/dapl/openib_scm/dapl_ib_util.c index 92b45d5..d82d3f5 100644 --- a/dapl/openib_scm/dapl_ib_util.c +++ b/dapl/openib_scm/dapl_ib_util.c @@ -49,17 +49,13 @@ static const char rcsid[] = "$Id: $"; #endif +#include "openib_osd.h" #include "dapl.h" #include "dapl_adapter_util.h" #include "dapl_ib_util.h" +#include "dapl_osd.h" #include -#include -#include -#include -#include -#include -#include int g_dapl_loopback_connection = 0; int g_scm_pipe[2]; @@ -88,52 +84,43 @@ char *dapl_ib_mtu_str(enum ibv_mtu mtu) } } -/* just get IP address for hostname */ -DAT_RETURN getipaddr( char *addr, int addr_len) +static DAT_RETURN getlocalipaddr(DAT_SOCK_ADDR *addr, int addr_len) { - struct sockaddr_in *ipv4_addr = (struct sockaddr_in*)addr; - struct hostent *h_ptr; - struct utsname ourname; + struct sockaddr_in *sin; + struct addrinfo *res, hint, *ai; + int ret; + char hostname[256]; - if (uname(&ourname) < 0) { - dapl_log(DAPL_DBG_TYPE_ERR, - " open_hca: uname err=%s\n", strerror(errno)); + if (addr_len < 
sizeof(*sin)) { return DAT_INTERNAL_ERROR; } - h_ptr = gethostbyname(ourname.nodename); - if (h_ptr == NULL) { - dapl_log(DAPL_DBG_TYPE_ERR, - " open_hca: gethostbyname err=%s\n", - strerror(errno)); - return DAT_INTERNAL_ERROR; + ret = gethostname(hostname,256); + if (ret) + return ret; + + memset(&hint, 0, sizeof hint); + hint.ai_flags = AI_PASSIVE; + hint.ai_family = AF_INET; + hint.ai_socktype = SOCK_STREAM; + hint.ai_protocol = IPPROTO_TCP; + + ret = getaddrinfo(hostname, NULL, &hint, &res); + if (ret) + return ret; + + ret = DAT_INVALID_ADDRESS; + for (ai = res; ai; ai = ai->ai_next) { + sin = (struct sockaddr_in *) ai->ai_addr; + if (*((uint32_t *) &sin->sin_addr) != htonl(0x7f000001)) { + *((struct sockaddr_in *) addr) = *sin; + ret = DAT_SUCCESS; + break; + } } - if (h_ptr->h_addrtype == AF_INET) { - int i; - struct in_addr **alist = - (struct in_addr **)h_ptr->h_addr_list; - - *(uint32_t*)&ipv4_addr->sin_addr = 0; - ipv4_addr->sin_family = AF_INET; - - /* Walk the list of addresses for host */ - for (i=0; alist[i] != NULL; i++) { - /* first non-loopback address */ - if (*(uint32_t*)alist[i] != htonl(0x7f000001)) { - dapl_os_memcpy(&ipv4_addr->sin_addr, - h_ptr->h_addr_list[i], - 4); - break; - } - } - /* if no acceptable address found */ - if (*(uint32_t*)&ipv4_addr->sin_addr == 0) - return DAT_INVALID_ADDRESS; - } else - return DAT_INVALID_ADDRESS; - - return DAT_SUCCESS; + freeaddrinfo(res); + return ret; } /* @@ -165,6 +152,28 @@ int32_t dapls_ib_release (void) return 0; } +#if defined(_WIN64) || defined(_WIN32) +int dapls_config_comp_channel(struct ibv_comp_channel *channel) +{ + return 0; +} +#else // _WIN64 || WIN32 +int dapls_config_comp_channel(struct ibv_comp_channel *channel) +{ + int opts; + + opts = fcntl(channel->fd, F_GETFL); /* uCQ */ + if (opts < 0 || fcntl(channel->fd, F_SETFL, opts | O_NONBLOCK) < 0) { + dapl_log(DAPL_DBG_TYPE_ERR, + " dapls_create_comp_channel: fcntl on ib_cq->fd %d ERR %d %s\n", + channel->fd, opts, strerror(errno)); + 
return errno; + } + + return 0; +} +#endif + /* * dapls_ib_open_hca * @@ -187,7 +196,6 @@ DAT_RETURN dapls_ib_open_hca ( IN DAPL_HCA *hca_ptr) { struct ibv_device **dev_list; - int opts; int i; DAT_RETURN dat_status = DAT_SUCCESS; @@ -219,7 +227,7 @@ found: dapl_dbg_log(DAPL_DBG_TYPE_UTIL," open_hca: Found dev %s %016llx\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev), (unsigned long long) - bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); + ntohll(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); hca_ptr->ib_hca_handle = ibv_open_device(hca_ptr->ib_trans.ib_dev); if (!hca_ptr->ib_hca_handle) { @@ -268,13 +276,7 @@ found: goto bail; } - opts = fcntl(hca_ptr->ib_trans.ib_cq->fd, F_GETFL); /* uCQ */ - if (opts < 0 || fcntl(hca_ptr->ib_trans.ib_cq->fd, - F_SETFL, opts | O_NONBLOCK) < 0) { - dapl_log(DAPL_DBG_TYPE_ERR, - " open_hca: fcntl on ib_cq->fd %d ERR %d %s\n", - hca_ptr->ib_trans.ib_cq->fd, opts, - strerror(errno)); + if (dapls_config_comp_channel(hca_ptr->ib_trans.ib_cq)) { goto bail; } @@ -309,16 +311,11 @@ found: /* wait for thread */ while (hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { - struct timespec sleep, remain; - sleep.tv_sec = 0; - sleep.tv_nsec = 2000000; /* 2 ms */ - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " open_hca: waiting for cr_thread\n"); - nanosleep (&sleep, &remain); + dapl_os_sleep_usec(20000); } /* get the IP address of the device */ - dat_status = getipaddr((char*)&hca_ptr->hca_address, + dat_status = getlocalipaddr((DAT_SOCK_ADDR*) &hca_ptr->hca_address, sizeof(DAT_SOCK_ADDR6)); dapl_dbg_log(DAPL_DBG_TYPE_UTIL, @@ -376,16 +373,13 @@ DAT_RETURN dapls_ib_close_hca ( IN DAPL_HCA *hca_ptr ) " thread_destroy: thread wakeup err = %s\n", strerror(errno)); while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) { - struct timespec sleep, remain; - sleep.tv_sec = 0; - sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: waiting for cr_thread\n"); if (write(g_scm_pipe[1], "w", sizeof "w") == -1) 
dapl_log(DAPL_DBG_TYPE_UTIL, " thread_destroy: thread wakeup err = %s\n", strerror(errno)); - nanosleep (&sleep, &remain); + dapl_os_sleep_usec(20000); } dapl_os_lock_destroy(&hca_ptr->ib_trans.lock); diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h index 863da2b..fd1c24e 100644 --- a/dapl/openib_scm/dapl_ib_util.h +++ b/dapl/openib_scm/dapl_ib_util.h @@ -49,8 +49,8 @@ #ifndef _DAPL_IB_UTIL_H_ #define _DAPL_IB_UTIL_H_ +#include "openib_osd.h" #include -#include #ifdef DAT_EXTENSIONS #include @@ -73,8 +73,6 @@ typedef struct ibv_wc ib_work_completion_t; typedef struct ibv_context *ib_hca_handle_t; typedef ib_hca_handle_t dapl_ibal_ca_t; -/* CM mappings, user CM not complete use SOCKETS */ - /* destination info to exchange, define wire protocol version */ #define DSCM_VER 3 typedef struct _ib_qp_cm @@ -86,7 +84,7 @@ typedef struct _ib_qp_cm uint32_t qpn; uint32_t p_size; DAT_SOCK_ADDR6 ia_address; - union ibv_gid gid; + union ibv_gid gid; uint16_t qp_type; } ib_qp_cm_t; @@ -110,20 +108,18 @@ struct ib_cm_handle struct dapl_llist_entry entry; DAPL_OS_LOCK lock; SCM_STATE state; - int socket; + DAPL_SOCKET socket; struct dapl_hca *hca; struct dapl_sp *sp; - struct dapl_ep *ep; + struct dapl_ep *ep; ib_qp_cm_t dst; - unsigned char p_data[256]; + unsigned char p_data[256]; /* must follow ib_qp_cm_t */ struct ibv_ah *ah; }; typedef struct ib_cm_handle *dp_ib_cm_handle_t; typedef dp_ib_cm_handle_t ib_cm_srvc_handle_t; -DAT_RETURN getipaddr(char *addr, int addr_len); - /* CM events */ typedef enum { @@ -141,9 +137,6 @@ typedef enum } ib_cm_events_t; -/* prototype for cm thread */ -void cr_thread (void *arg); - /* Operation and state mappings */ typedef enum ibv_send_flags ib_send_op_type_t; typedef struct ibv_sge ib_data_segment_t; @@ -289,7 +282,7 @@ typedef struct _ib_hca_transport DAPL_OS_LOCK cq_lock; int max_inline_send; ib_thread_state_t cq_state; - DAPL_OS_THREAD cq_thread; + DAPL_OS_THREAD cq_thread; struct ibv_comp_channel *ib_cq; int 
cr_state; DAPL_OS_THREAD thread; @@ -317,7 +310,6 @@ typedef uint32_t ib_shm_transport_t; /* prototypes */ int32_t dapls_ib_init (void); int32_t dapls_ib_release (void); -void cq_thread (void *arg); void cr_thread(void *arg); int dapli_cq_thread_init(struct dapl_hca *hca_ptr); void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr); @@ -349,7 +341,7 @@ dapl_convert_errno( IN int err, IN const char *str ) if (!err) return DAT_SUCCESS; #if DAPL_DBG - if ((err != EAGAIN) && (err != ETIME) && (err != ETIMEDOUT)) + if ((err != EAGAIN) && (err != ETIMEDOUT)) dapl_dbg_log (DAPL_DBG_TYPE_ERR," %s %s\n", str, strerror(err)); #endif @@ -357,24 +349,15 @@ dapl_convert_errno( IN int err, IN const char *str ) { case EOVERFLOW : return DAT_LENGTH_ERROR; case EACCES : return DAT_PRIVILEGES_VIOLATION; - case ENXIO : - case ERANGE : case EPERM : return DAT_PROTECTION_VIOLATION; - case EINVAL : - case EBADF : - case ENOENT : - case ENOTSOCK : return DAT_INVALID_HANDLE; + case EINVAL : return DAT_INVALID_HANDLE; case EISCONN : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_CONNECTED; case ECONNREFUSED : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_NOTREADY; - case ETIME : case ETIMEDOUT : return DAT_TIMEOUT_EXPIRED; case ENETUNREACH: return DAT_INVALID_ADDRESS | DAT_INVALID_ADDRESS_UNREACHABLE; case EADDRINUSE : return DAT_CONN_QUAL_IN_USE; case EALREADY : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_ACTCONNPENDING; - case ENOSPC : - case ENOMEM : - case E2BIG : - case EDQUOT : return DAT_INSUFFICIENT_RESOURCES; + case ENOMEM : return DAT_INSUFFICIENT_RESOURCES; case EAGAIN : return DAT_QUEUE_EMPTY; case EINTR : return DAT_INTERRUPTED_CALL; case EAFNOSUPPORT : return DAT_INVALID_ADDRESS | DAT_INVALID_ADDRESS_MALFORMED; diff --git a/dapl/openib_scm/linux/openib_osd.h b/dapl/openib_scm/linux/openib_osd.h new file mode 100644 index 0000000..235a82e --- /dev/null +++ b/dapl/openib_scm/linux/openib_osd.h @@ -0,0 +1,21 @@ +#ifndef OPENIB_OSD_H +#define OPENIB_OSD_H + +#include 
+#include + +#if __BYTE_ORDER == __BIG_ENDIAN +#define htonll(x) (x) +#define ntohll(x) (x) +#elif __BYTE_ORDER == __LITTLE_ENDIAN +#define htonll(x) bswap_64(x) +#define ntohll(x) bswap_64(x) +#endif + +#define DAPL_SOCKET int +#define DAPL_INVALID_SOCKET -1 +#define DAPL_FD_SETSIZE 8192 + +#define closesocket close + +#endif // OPENIB_OSD_H diff --git a/dapl/openib_scm/windows/openib_osd.h b/dapl/openib_scm/windows/openib_osd.h new file mode 100644 index 0000000..67c70ec --- /dev/null +++ b/dapl/openib_scm/windows/openib_osd.h @@ -0,0 +1,39 @@ +#ifndef OPENIB_OSD_H +#define OPENIB_OSD_H + +#ifndef FD_SETSIZE +#define FD_SETSIZE 1024 /* Set before including winsock2 - see select help */ +#define DAPL_FD_SETSIZE FD_SETSIZE +#endif + +#include +#include +#include +#include + +#define ntohll _byteswap_uint64 +#define htonll _byteswap_uint64 + +#define pipe(x) _pipe(x, 4096, _O_TEXT) +#define read _read +#define write _write +#define DAPL_SOCKET SOCKET +#define DAPL_INVALID_SOCKET INVALID_SOCKET + +/* allow casting to WSABUF */ +struct iovec +{ + u_long iov_len; + char FAR* iov_base; +}; + +static int writev(DAPL_SOCKET s, struct iovec *vector, int count) +{ + int len, ret; + + ret = WSASend(s, (WSABUF *) vector, count, &len, 0, NULL, NULL); + return ret ? 
ret : len; +} + +#endif // OPENIB_OSD_H + diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h index 6fef9af..ae02944 100644 --- a/dapl/udapl/linux/dapl_osd.h +++ b/dapl/udapl/linux/dapl_osd.h @@ -302,6 +302,15 @@ dapl_os_thread_create ( IN void *data, OUT DAPL_OS_THREAD *thread_id ); +STATIC _INLINE_ void +dapl_os_sleep_usec(int usec) +{ + struct timespec sleep, remain; + + sleep.tv_sec = 0; + sleep.tv_nsec = usec * 1000; + nanosleep(&sleep, &remain); +} /* * Lock Functions From vitto.giova at yahoo.it Fri Feb 13 19:34:03 2009 From: vitto.giova at yahoo.it (Vittorio) Date: Sat, 14 Feb 2009 04:34:03 +0100 Subject: [ofa-general] ***SPAM*** troubleshooting with infinband Message-ID: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com> Hello! This is my first message on the list so i hope that i'm not going to ask silly or already answered questions i'm a student and i'm porting an electromagnetic field simulator to a parallel and distributed linux cluster for my final thesis; i'm using both OpenMP and MPI over InfiniBand to achieve speed improvements the openmp part is done and now i'm facing problems with setting up MPI over InfiniBand i have correctly set up the kernel modules installed the right drivers for the board (mellanox hca) and userspace programs installed the mvapich2 mpi implementation however i fail to run all of this together: for example ibhosts correctly finds the two nodes connected Ca : 0x0002c90300018b8e ports 2 " HCA-1" Ca : 0x0002c90300018b12 ports 2 "localhost HCA-1" but ibping doesn't receive responses ibwarn: [32052] ibping: Ping..
ibwarn: [32052] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2) ibwarn: [32052] main: ibping to Lid 2 failed subsequently any other operation with MPI fails strangely enough however IPoIB works very well and i can ping and connect with no problems the two machines are identical and they use a crossover cable (point to point) lspci identifies the boards as 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0) what can be the cause of all of this? am i forgetting something? any help is greatly appreciated Thank you Vittorio From dotanba at gmail.com Fri Feb 13 23:23:40 2009 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 14 Feb 2009 09:23:40 +0200 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** troubleshooting with infinband In-Reply-To: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com> References: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com> Message-ID: <4996717C.8000005@gmail.com> Vittorio wrote: > Hello! > This is my first message on the list so i hope that i'm not going to > ask silly or already answered question > > i'm a student and i'm porting an electromagnetic field simulator to a > parallel and distributed linux cluster for final thesis; i'm using > both OpenMP and MPI over Infiniband to achieve speed improvements > > the openmp part is done and now i'm facing problem with setting up MPI > over Infinband > i have correctly set up the kernel modules > installed the right drivers for the board (mellanox hca) and userspace > programs > installed mpavich2 mpi implementation > > however i fail to run all of this together: > for example ibhost correctly find the two nodes connected > > Ca : 0x0002c90300018b8e ports 2 " HCA-1" > Ca : 0x0002c90300018b12 ports 2 "localhost HCA-1" > > but ibping doens't receive responses > > ibwarn: [32052] ibping: Ping..
> ibwarn: [32052] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2) > ibwarn: [32052] main: ibping to Lid 2 failed > > subsequently any other operation with MPI fails > strangely enough however IPoIB works very well and i can ping and > connect with no problems > > the two machines are identical and they use a crossover cable (point > to point) > lspci identifies the boards as > 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, > PCIe 2.0 2.5GT/s] (rev a0) > > what can be the cause of all of this? am i forgetting something? > any help is greatly appreciated > Thank you > Vittorio I suggest running ibv_rc_pingpong first to verify that basic IB connectivity is OK. Then run rping to check that ib_cma (the RDMA connection manager) is OK. These are a good starting point for finding the problem (do this for every active port you have). Dotan From dotanba at gmail.com Fri Feb 13 23:33:49 2009 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 14 Feb 2009 09:33:49 +0200 Subject: ***SPAM*** Re: [ofa-general] non zero lkey in send(), write() with num_sge > 1? In-Reply-To: <809230.93598.qm@web111213.mail.gq1.yahoo.com> References: <809230.93598.qm@web111213.mail.gq1.yahoo.com> Message-ID: <499673DD.6090008@gmail.com> Bill N wrote: >>> Can stack pass num_sge > 1, and lkey !=0 as part of >>> >> sg_list[] elements, in post_send() call? >> >>> >>> >> What are you trying to achieve? >> > [Bill] > I just wanted to confirm, that even when Stag !=0, > (a) there can be multiple SGEs in the list with different lkey and TO. > And > (b) HCAs have to validate each of the SGE entry against the lkey. > > Want to ensure that > - As RDMA ULP I can invoke post_send() with multiple lkeys and utilize the allocated MRs, HCAs are designed to handle that. > > Any example ULP we are aware of that does this? > > Regards, > Bill > If we are talking about the following scenario, for example: num_sge = 3.
sg_list[0].lkey=A sg_list[1].lkey=B sg_list[2].lkey=C then here is the answer: I checked the code of the ULPs that are part of the Linux kernel and noticed that none of them uses several SGEs from different memory regions: most ULPs use only one SGE, and those that use more than one use the same lkey. From my experience, I can tell you that the OFED stack supports this feature (and many HCAs support it too). If you see different behavior, there is a bug somewhere. Dotan From dotanba at gmail.com Fri Feb 13 23:42:26 2009 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 14 Feb 2009 09:42:26 +0200 Subject: [ofa-general] ib_create_qp and ib_get_err_str weirdness In-Reply-To: <01fa01c98df0$47baed30$0100000a@DIEGO> References: <01fa01c98df0$47baed30$0100000a@DIEGO> Message-ID: <499675E2.3060703@gmail.com> Hi. Diego Guella wrote: > Hello, > > I am using Mellanox WinOF 2.0.0 with a MHES14-XTC SDR single-port card. > I noticed a strange behavior of ib_create_qp function: > > ----- > memset(&qp_create, 0, sizeof(qp_create)); > qp_create.qp_type = IB_QPT_RELIABLE_CONN; // Reliable Connected > qp_create.sq_depth = ctx->qdepth; > qp_create.rq_depth = ctx->qdepth; > qp_create.sq_sge = ctx->hca_attr->max_sges; > qp_create.rq_sge = ctx->hca_attr->max_sges; > qp_create.h_sq_cq = ctx->cq_h; > qp_create.h_rq_cq = ctx->cq_h; > qp_create.h_srq = NULL; > qp_create.sq_signaled = 1; > ctx->qp_h = 0; > rc = ib_create_qp(ctx->pd_h, &qp_create, NULL, NULL, &ctx->qp_h); > ----- > return value ("rc") is 3 (=IB_INVALID_PARAMETER). > > I spent some time figuring out the problem was the SQ SGE value: > http://lists.openfabrics.org/pipermail/general/2006-June/023417.html > > According to iba/ib_al.h: > ----- > * IB_INVALID_MAX_SGE > * The requested maximum number of scatter-gather entries for the send or > * receive queue could not be supported. > ----- > so, why the return value isn't 22 (=IB_INVALID_MAX_SGE)?
> > In the discussion I mentioned, it turned out that even using > hca_attr->max_sges there is the possibility that ib_create_qp fails. > Which is my case. > I have the need to send some audio buffers (32 or more) from an IO > node to a computing node using RDMA WRITE. > The ownership of the buffers is of the audio driver, and I haven't the > guarantee that the audio buffers are contiguous. > I was trying and send them using the lowest possible number of WR, > each one with the highest possible number of sge. > But, given the hca_attr->max_sge unreliability, how do you recommend > to achieve this goal? I have seen code that is aware of this problem: it tries to create a QP with the maximum number of SGEs and, upon failure, decreases this value until the QP can be created. If you use the maximum supported number of SGEs minus a small constant (say, 2), it should always be OK. > Should I post a WR for each buffer I'd want to send through RDMA WRITE? If you are talking about local buffers, then you can send data from several buffers using the same SR (send request). If you are talking about remote buffers, then you have to use a different SR for every remote buffer that you want to fill. > > > Another less-related problem: > ib_get_err_str is not correct for every input value, for example I > noticed that for > ib_get_err_str(IB_INVALID_PD_HANDLE) the string returned is > IB_INVALID_MR_HANDLE > > > I don't know if these problems apply to linux too, so I'm including > general list. In Linux the return values are different (usually -1 just means that there was an error, nothing more). I believe the error exists only in the win-ofa code.
Dotan From vlad at lists.openfabrics.org Sat Feb 14 03:14:08 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 14 Feb 2009 03:14:08 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090214-0200 daily build status Message-ID: <20090214111408.6399DE6101C@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on 
ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hal.rosenstock at gmail.com Sat Feb 14 04:07:57 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 14 Feb 2009 07:07:57 -0500 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** troubleshooting with infinband In-Reply-To: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com> References: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com> Message-ID: On Fri, Feb 13, 2009 at 10:34 PM, Vittorio wrote: > Hello! > This is my first message on the list so i hope that i'm not going to ask > silly or already answered question > > i'm a student and i'm porting an electromagnetic field simulator to a > parallel and distributed linux cluster for final thesis; i'm using both > OpenMP and MPI over Infiniband to achieve speed improvements > > the openmp part is done and now i'm facing problem with setting up MPI over > Infinband > i have correctly set up the kernel modules > installed the right drivers for the board (mellanox hca) and userspace > programs > installed mpavich2 mpi implementation > > however i fail to run all of this together: > for example ibhost correctly find the two nodes connected > > Ca : 0x0002c90300018b8e ports 2 " HCA-1" > Ca : 0x0002c90300018b12 ports 2 "localhost HCA-1" > > but ibping doens't receive responses > > ibwarn: [32052] ibping: Ping.. > ibwarn: [32052] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2) > ibwarn: [32052] main: ibping to Lid 2 failed This would be expected if no ibping server was running on the lid 2 machine. 
-- Hal > subsequently any other operation with MPI fails > strangely enough however IPoIB works very well and i can ping and connect > with no problems > the two machines are identical and they use a crossover cable (point to > point) > lspci identifies the boards as > 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 > 2.5GT/s] (rev a0) > > what can be the cause of all of this? am i forgetting something? > any help is greatly appreciated > Thank you > Vittorio From hnrose at comcast.net Sat Feb 14 04:37:36 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Sat, 14 Feb 2009 07:37:36 -0500 Subject: [ofa-general] Test Message-ID: <20090214123736.GA25106@comcast.net> Please ignore. -- Hal From vitto.giova at yahoo.it Sat Feb 14 05:49:54 2009 From: vitto.giova at yahoo.it (Vittorio) Date: Sat, 14 Feb 2009 14:49:54 +0100 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** troubleshooting with infinband In-Reply-To: <4996717C.8000005@gmail.com> References: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com> <4996717C.8000005@gmail.com> Message-ID: <4de51c660902140549v6b3dec6byaf18d42aa06f966d@mail.gmail.com> thanks for the suggestion, but i can't understand which kind of address i should put for the two commands i tried ibping with the server (like suggested) and it works with -G or with lid but what should i put as argument of ibv_rc_pingpong and rping? thanks a lot Vittorio On Sat, Feb 14, 2009 at 8:23 AM, Dotan Barak wrote: > Vittorio wrote: > >> Hello!
>> This is my first message on the list so i hope that i'm not going to ask >> silly or already answered question >> >> i'm a student and i'm porting an electromagnetic field simulator to a >> parallel and distributed linux cluster for final thesis; i'm using both >> OpenMP and MPI over Infiniband to achieve speed improvements >> >> the openmp part is done and now i'm facing problem with setting up MPI >> over Infinband >> i have correctly set up the kernel modules >> installed the right drivers for the board (mellanox hca) and userspace >> programs >> installed mpavich2 mpi implementation >> >> however i fail to run all of this together: >> for example ibhost correctly find the two nodes connected >> >> Ca : 0x0002c90300018b8e ports 2 " HCA-1" >> Ca : 0x0002c90300018b12 ports 2 "localhost HCA-1" >> >> but ibping doens't receive responses >> >> ibwarn: [32052] ibping: Ping.. >> ibwarn: [32052] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2) >> ibwarn: [32052] main: ibping to Lid 2 failed >> >> subsequently any other operation with MPI fails >> strangely enough however IPoIB works very well and i can ping and connect >> with no problems >> >> the two machines are identical and they use a crossover cable (point to >> point) >> lspci identifies the boards as >> 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe >> 2.0 2.5GT/s] (rev a0) >> >> what can be the cause of all of this? am i forgetting something? >> any help is greatly appreciated >> Thank you >> Vittorio >> > I suggest that you will execute the ibv_rc_pingpong and see that the IB > connectivity is o.k.. > Then try to execute rping to check that the ib_cma is o.k.. > > Those will be a good start point to find the problem > (do it for all of the active ports that you have). > > > Dotan >
From hnrose at comcast.net Sat Feb 14 05:53:08 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Sat, 14 Feb 2009 08:53:08 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_helper.c: Add port counters to __osm_disp_msg_str Message-ID: <20090214135308.GB25402@comcast.net> >From d9c17a8251b874c33542a19a51d1332ea3196713 Mon Sep 17 00:00:00 2001 From: Hal Rosenstock Date: Thu, 12 Feb 2009 09:27:46 -0500 Subject: [PATCH] opensm/osm_helper.c: Add port counters to __osm_disp_msg_str Signed-off-by: Hal Rosenstock --- opensm/opensm/osm_helper.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index e2ad4e7..c56f5b2 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -2101,6 +2101,7 @@ static const char *const __osm_disp_msg_str[] = { #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) "OSM_MSG_MAD_MULTIPATH_RECORD", #endif + "OSM_MSG_MAD_PORT_COUNTERS", "UNKNOWN!!"
}; -- 1.5.6.4 From hnrose at comcast.net Sat Feb 14 05:51:39 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Sat, 14 Feb 2009 08:51:39 -0500 Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin Message-ID: <20090214135139.GA25402@comcast.net> >From 3b8e45eaaeaac7bd34b60dfd432469cafc6caef7 Mon Sep 17 00:00:00 2001 From: Hal Rosenstock Date: Tue, 10 Feb 2009 07:14:32 -0500 Subject: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prints Signed-off-by: Hal Rosenstock --- opensm/opensm/osm_ucast_mgr.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 7232fbc..e404c91 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -786,7 +786,7 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t *m) int i, num = cl_qmap_count(&m->p_subn->sw_guid_tbl); void **s = malloc(num * sizeof(*s)); if (!s) { - OSM_LOG(m->p_log, OSM_LOG_ERROR, "ERR: " + OSM_LOG(m->p_log, OSM_LOG_ERROR, "ERR 3A0C: " "No memory, skip by switch load sorting.\n"); return; } @@ -814,7 +814,7 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) if (parse_node_map(p_mgr->p_subn->opt.guid_routing_order_file, add_guid_to_order_list, p_mgr)) - OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : " + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A0D: " "cannot parse guid routing order file \'%s\'\n", p_mgr->p_subn->opt.guid_routing_order_file); } else @@ -825,7 +825,7 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) clear_prof_ignore_flag, NULL); if (parse_node_map(p_mgr->p_subn->opt.port_prof_ignore_file, mark_ignored_port, p_mgr)) { - OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : " + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A0E: " "cannot parse port prof ignore file \'%s\'\n", p_mgr->p_subn->opt.port_prof_ignore_file); } -- 1.5.6.4 From hnrose at comcast.net Sat Feb 14 05:55:50 2009 From: hnrose at comcast.net (hnrose 
at comcast.net) Date: Sat, 14 Feb 2009 08:55:50 -0500 Subject: [ofa-general] [PATCH] opensm/osm_console.c: Add missing command in help_perfmgr Message-ID: <20090214135550.GE25402@comcast.net> >From 7faaf4e757c42a8f57fd5b02f425266f2eb853b2 Mon Sep 17 00:00:00 2001 From: Hal Rosenstock Date: Fri, 13 Feb 2009 13:32:43 -0500 Subject: [PATCH] opensm/osm_console.c: Add missing command in help_perfmgr Signed-off-by: Hal Rosenstock --- opensm/opensm/osm_console.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index fe5994b..a66a7d3 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -204,7 +204,7 @@ static void help_dump_conf(FILE *out, int detail) static void help_perfmgr(FILE * out, int detail) { fprintf(out, - "perfmgr [enable|disable|clear_counters|dump_counters|sweep_time[seconds]]\n"); + "perfmgr [enable|disable|clear_counters|dump_counters|print_counters|sweep_time[seconds]]\n"); if (detail) { fprintf(out, "perfmgr -- print the performance manager state\n"); -- 1.5.6.4 From hnrose at comcast.net Sat Feb 14 05:57:00 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Sat, 14 Feb 2009 08:57:00 -0500 Subject: [ofa-general] [PATCH] ibsim/sim_net.c: In new_node, fix nodetype in nodeinfo for router nodes Message-ID: <20090214135700.GF25402@comcast.net> >From 17350f5a17ec5ec821607aae7bf94a88b84d6e74 Mon Sep 17 00:00:00 2001 From: Hal Rosenstock Date: Thu, 12 Feb 2009 10:57:20 -0500 Subject: [PATCH] ibsim/sim_net.c: In new_node, fix nodetype in nodeinfo for router nodes Signed-off-by: Hal Rosenstock --- ibsim/sim_net.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c index 7a42cb6..f0628ec 100644 --- a/ibsim/sim_net.c +++ b/ibsim/sim_net.c @@ -322,6 +322,8 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports) guids[type]++; // reserve single guid; } else {
memcpy(nd->nodeinfo, hcanodeinfo, sizeof(nd->nodeinfo)); + if (type == ROUTER_NODE) + mad_set_field(nd->nodeinfo, 0, IB_NODE_TYPE_F, ROUTER_NODE); guids[type] += nodeports + 1; // reserve guids; } -- 1.5.6.4 From hnrose at comcast.net Sat Feb 14 05:54:09 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Sat, 14 Feb 2009 08:54:09 -0500 Subject: [ofa-general] [PATCH] opensm/osm_console.c: Add list of SMs to status command Message-ID: <20090214135409.GC25402@comcast.net> >From debc6e1f5bd225449ca897264948b08ccf69de38 Mon Sep 17 00:00:00 2001 From: Hal Rosenstock Date: Fri, 13 Feb 2009 09:49:36 -0500 Subject: [PATCH] opensm/osm_console.c: Add list of SMs to status command Also, add SM priority into status command Signed-off-by: Hal Rosenstock --- opensm/opensm/osm_console.c | 38 ++++++++++++++++++++++++++++++++++---- 1 files changed, 34 insertions(+), 4 deletions(-) diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index 5bc1079..f06eb52 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2005-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses.
You may choose to be licensed under the terms of the GNU @@ -303,13 +304,13 @@ static char *sm_state_str(int state) case IB_SMINFO_STATE_DISCOVERING: return ("Discovering"); case IB_SMINFO_STATE_STANDBY: - return ("Standby"); + return ("Standby "); case IB_SMINFO_STATE_NOTACTIVE: - return ("Not Active"); + return ("Not Active "); case IB_SMINFO_STATE_MASTER: - return ("Master"); + return ("Master "); } - return ("UNKNOWN"); + return ("UNKNOWN "); } static char *sa_state_str(osm_sa_state_t state) @@ -323,6 +324,32 @@ static char *sa_state_str(osm_sa_state_t state) return ("UNKNOWN"); } +static void dump_sms(osm_opensm_t * p_osm, FILE * out) +{ + osm_subn_t *p_subn = &p_osm->subn; + osm_remote_sm_t *p_rsm; + + fprintf(out, "\n Known SMs\n" + " ---------\n"); + fprintf(out, " Port GUID SM State Priority\n"); + fprintf(out, " --------- -------- --------\n"); + fprintf(out, " 0x%" PRIx64 " %s %d SELF\n", + cl_ntoh64(p_subn->sm_port_guid), + sm_state_str(p_subn->sm_state), + p_subn->opt.sm_priority); + + CL_PLOCK_ACQUIRE(p_osm->sm.p_lock); + p_rsm = (osm_remote_sm_t *) cl_qmap_head(&p_subn->sm_guid_tbl); + while (p_rsm != (osm_remote_sm_t *) cl_qmap_end(&p_subn->sm_guid_tbl)) { + fprintf(out, " 0x%" PRIx64 " %s %d\n", + cl_ntoh64(p_rsm->smi.guid), + sm_state_str(ib_sminfo_get_state(&p_rsm->smi)), + ib_sminfo_get_priority(&p_rsm->smi)); + p_rsm = (osm_remote_sm_t *) cl_qmap_next(&p_rsm->map_item); + } + CL_PLOCK_RELEASE(p_osm->sm.p_lock); +} + static void print_status(osm_opensm_t * p_osm, FILE * out) { cl_list_item_t *item; @@ -332,6 +359,8 @@ static void print_status(osm_opensm_t * p_osm, FILE * out) fprintf(out, " OpenSM Version : %s\n", p_osm->osm_version); fprintf(out, " SM State : %s\n", sm_state_str(p_osm->subn.sm_state)); + fprintf(out, " SM Priority : %d\n", + p_osm->subn.opt.sm_priority); fprintf(out, " SA State : %s\n", sa_state_str(p_osm->sa.state)); fprintf(out, " Routing Engine : %s\n", @@ -391,6 +420,7 @@ static void print_status(osm_opensm_t * p_osm, FILE 
* out) p_osm->subn.in_sweep_hop_0, p_osm->subn.first_time_master_sweep, p_osm->subn.coming_out_of_standby); + dump_sms(p_osm, out); fprintf(out, "\n"); cl_plock_release(&p_osm->lock); } -- 1.5.6.4 From hnrose at comcast.net Sat Feb 14 05:55:04 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Sat, 14 Feb 2009 08:55:04 -0500 Subject: [ofa-general] [PATCH] opensm/osm_console.c: Eliminate some extraneous parentheses Message-ID: <20090214135504.GD25402@comcast.net> >From 8d6c1b61e43059ed80885131c0bbce51baf4eddf Mon Sep 17 00:00:00 2001 From: Hal Rosenstock Date: Fri, 13 Feb 2009 10:35:39 -0500 Subject: [PATCH] opensm/osm_console.c: Eliminate some extraneous parentheses Signed-off-by: Hal Rosenstock --- opensm/opensm/osm_console.c | 24 ++++++++++++------------ 1 files changed, 12 insertions(+), 12 deletions(-) diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index f06eb52..fe5994b 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -381,8 +381,8 @@ static void print_status(osm_opensm_t * p_osm, FILE * out) #ifdef ENABLE_OSM_PERF_MGR fprintf(out, "\n PerfMgr state/sweep state : %s/%s\n", - osm_perfmgr_get_state_str(&(p_osm->perfmgr)), - osm_perfmgr_get_sweep_state_str(&(p_osm->perfmgr))); + osm_perfmgr_get_state_str(&p_osm->perfmgr), + osm_perfmgr_get_sweep_state_str(&p_osm->perfmgr)); #endif fprintf(out, "\n MAD stats\n" " ---------\n" @@ -1135,26 +1135,26 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) p_cmd = next_token(p_last); if (p_cmd) { if (strcmp(p_cmd, "enable") == 0) { - osm_perfmgr_set_state(&(p_osm->perfmgr), + osm_perfmgr_set_state(&p_osm->perfmgr, PERFMGR_STATE_ENABLED); } else if (strcmp(p_cmd, "disable") == 0) { - osm_perfmgr_set_state(&(p_osm->perfmgr), + osm_perfmgr_set_state(&p_osm->perfmgr, PERFMGR_STATE_DISABLE); } else if (strcmp(p_cmd, "clear_counters") == 0) { - osm_perfmgr_clear_counters(&(p_osm->perfmgr)); + osm_perfmgr_clear_counters(&p_osm->perfmgr); } 
else if (strcmp(p_cmd, "dump_counters") == 0) { p_cmd = next_token(p_last); if (p_cmd && (strcmp(p_cmd, "mach") == 0)) { - osm_perfmgr_dump_counters(&(p_osm->perfmgr), + osm_perfmgr_dump_counters(&p_osm->perfmgr, PERFMGR_EVENT_DB_DUMP_MR); } else { - osm_perfmgr_dump_counters(&(p_osm->perfmgr), + osm_perfmgr_dump_counters(&p_osm->perfmgr, PERFMGR_EVENT_DB_DUMP_HR); } } else if (strcmp(p_cmd, "print_counters") == 0) { p_cmd = next_token(p_last); if (p_cmd) { - osm_perfmgr_print_counters(&(p_osm->perfmgr), + osm_perfmgr_print_counters(&p_osm->perfmgr, p_cmd, out); } else { fprintf(out, @@ -1164,7 +1164,7 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) p_cmd = next_token(p_last); if (p_cmd) { uint16_t time_s = atoi(p_cmd); - osm_perfmgr_set_sweep_time_s(&(p_osm->perfmgr), + osm_perfmgr_set_sweep_time_s(&p_osm->perfmgr, time_s); } else { fprintf(out, @@ -1179,9 +1179,9 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) "sweep state : %s\n" "sweep time : %us\n" "outstanding queries/max : %d/%u\n", - osm_perfmgr_get_state_str(&(p_osm->perfmgr)), - osm_perfmgr_get_sweep_state_str(&(p_osm->perfmgr)), - osm_perfmgr_get_sweep_time_s(&(p_osm->perfmgr)), + osm_perfmgr_get_state_str(&p_osm->perfmgr), + osm_perfmgr_get_sweep_state_str(&p_osm->perfmgr), + osm_perfmgr_get_sweep_time_s(&p_osm->perfmgr), p_osm->perfmgr.outstanding_queries, p_osm->perfmgr.max_outstanding_queries); } -- 1.5.6.4 From hal.rosenstock at gmail.com Sat Feb 14 06:08:36 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 14 Feb 2009 09:08:36 -0500 Subject: Re: [ofa-general] [PATCH] libibmad/rpc.c: In mad_rpc/mad_rpc_rmpp, set rpc attribute ID from response In-Reply-To: <15ddcffd0902061502l6c59161bq994802624ed4e6d1@mail.gmail.com> References: <1233877653.8992.516.camel@bertha1.edm.orcorp.ca> <15ddcffd0902061502l6c59161bq994802624ed4e6d1@mail.gmail.com> Message-ID: Or, On Fri, Feb 6, 2009 at 6:02 PM, Or Gerlitz wrote: > On Fri,
Feb 6, 2009 at 1:47 AM, Hal Rosenstock > wrote: >> Sasha, >> This patch sets the attribute ID based on what is in the response. > > Hal, > > Your patches can't really be reviewed when being sent as attachment, Yes, it is more work for the reviewer in this case. > any reason not > to send them embedded within the email message? Sendmail is just more fun than one should be allowed to have. FWIW, I think I have this resolved now but we'll see... -- Hal > Or. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Sat Feb 14 07:25:33 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 17:25:33 +0200 Subject: [ofa-general] [RFC] OpenSM vendor layer In-Reply-To: References: <20090207123355.GP17713@sashak.voltaire.com> <20090212200025.GC14416@sashak.voltaire.com> Message-ID: <20090214152533.GG14416@sashak.voltaire.com> Hi Hal, On 19:41 Thu 12 Feb , Hal Rosenstock wrote: > > > > It is already supplied by libibumad - by umad_get_ca() > > (ca.ports[i]->pkeys). I think you just need to copy this to > > ib_port_attr_t structure. > > Yes but rather than using supplied pointers (as inputs for the per > port pkey/guid tables), the other vendor layers require a large enough > buffer for these tables and set the port pointers appropriately (on > output) rather than supplying these pointers as input parameters. So > if we use these as input, then we definitely break the other vendor > layers. Ok, if you already have a usage example, this is even simpler - just alloc mem and copy pkey table.
Sasha From sashak at voltaire.com Sat Feb 14 07:26:18 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 17:26:18 +0200 Subject: Re: [ofa-general] [RFC] OpenSM vendor layer In-Reply-To: References: <20090207123355.GP17713@sashak.voltaire.com> <20090212200025.GC14416@sashak.voltaire.com> Message-ID: <20090214152618.GH14416@sashak.voltaire.com> On 06:58 Fri 13 Feb , Hal Rosenstock wrote: > > Another choice is to ifdef these differences between Linux and Windows > at least until umad is used there. Please try to avoid #ifdef(s). Sasha From sashak at voltaire.com Sat Feb 14 07:28:04 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 17:28:04 +0200 Subject: [ofa-general] [ib-mgmt] ibdiag_common.h question In-Reply-To: <12C5145C5B854D78A1DAA6BB2F2CBA50@amr.corp.intel.com> References: <12C5145C5B854D78A1DAA6BB2F2CBA50@amr.corp.intel.com> Message-ID: <20090214152804.GI14416@sashak.voltaire.com> Hi Sean, On 16:56 Thu 12 Feb , Sean Hefty wrote: > I noticed the following in ibdiag_common.h: > > #define DEBUG if (ibdebug || ibverbose) IBWARN > #define VERBOSE if (ibdebug || ibverbose > 1) IBWARN > > This allows for else statements to mismatch when defined. Sure, we can wrap it with 'do { ... } while (0)'. Sasha From sashak at voltaire.com Sat Feb 14 07:37:34 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 17:37:34 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/common: wrap debug macros with do {} while (0) In-Reply-To: <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com> Message-ID: <20090214153734.GJ14416@sashak.voltaire.com> Wrap debug macros which use 'if () {}' with 'do { .. } while (0)' to prevent potential 'else' statement mismatching. Also use portable __VA_ARGS__ macro.
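Sean's point about the old DEBUG/VERBOSE definitions can be made concrete with a small standalone sketch. Here ibdebug, ibverbose and IBWARN are local stand-ins (assumptions, not the real infiniband-diags globals or the libibmad macro), and DEBUG_OLD/DEBUG_NEW are hypothetical names for the old and patched forms of DEBUG. With debugging off, the caller's else binds to the hidden if inside DEBUG_OLD, while the do { ... } while (0) form keeps it attached to the caller's own if.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real globals and IBWARN. */
static int ibdebug;
static int ibverbose;
#define IBWARN(fmt, ...) fprintf(stderr, fmt "\n", ## __VA_ARGS__)

/* Old definition: expands to a bare 'if' with no braces. */
#define DEBUG_OLD if (ibdebug || ibverbose) IBWARN

/* Patched definition: a single statement that still needs a trailing ';'. */
#define DEBUG_NEW(fmt, ...) do { \
	if (ibdebug || ibverbose) IBWARN(fmt, ## __VA_ARGS__); \
} while (0)

/* Both helpers return 2 if the caller's else-branch ran, 0 otherwise. */
static int branch_old(int cond)
{
	int taken = 0;

	if (cond)
		DEBUG_OLD("cond set");
	else
		taken = 2;	/* binds to the macro's hidden 'if', not ours */
	return taken;
}

static int branch_new(int cond)
{
	int taken = 0;

	if (cond)
		DEBUG_NEW("cond set");
	else
		taken = 2;	/* binds to our 'if', as intended */
	return taken;
}

/* Expected behaviour of each form while debugging is off. */
static void check_dangling_else(void)
{
	assert(branch_old(1) == 2);	/* else ran although cond was true */
	assert(branch_old(0) == 0);	/* else skipped although cond was false */
	assert(branch_new(1) == 0);
	assert(branch_new(0) == 2);
}
```

GCC and Clang typically warn about an ambiguous else in branch_old when warnings are enabled, which is exactly the mismatch the patch removes. The `, ## __VA_ARGS__` comma elision used here (as in the patch) is a GNU extension that GCC and Clang accept.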
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/include/ibdiag_common.h | 10 +++++++--- 1 files changed, 7 insertions(+), 3 deletions(-) diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h index 4783b8e..52fd147 100644 --- a/infiniband-diags/include/ibdiag_common.h +++ b/infiniband-diags/include/ibdiag_common.h @@ -50,9 +50,13 @@ extern int ibd_timeout; /*========================================================*/ #undef DEBUG -#define DEBUG if (ibdebug || ibverbose) IBWARN -#define VERBOSE if (ibdebug || ibverbose > 1) IBWARN -#define IBERROR(fmt, args...) iberror(__FUNCTION__, fmt, ## args) +#define DEBUG(fmt, ...) do { \ + if (ibdebug || ibverbose) IBWARN(fmt, ## __VA_ARGS__); \ +} while (0) +#define VERBOSE(fmt, ...) do { \ + if (ibdebug || ibverbose > 1) IBWARN(fmt, ## __VA_ARGS__); \ +} while (0) +#define IBERROR(fmt, ...) iberror(__FUNCTION__, fmt, ## __VA_ARGS__) struct ibdiag_opt { const char *name; -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Feb 14 07:40:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 17:40:45 +0200 Subject: [ofa-general] Re: [ofw] [ib-diag] sminfo: add support for WinOF In-Reply-To: <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com> Message-ID: <20090214154045.GK14416@sashak.voltaire.com> On 10:39 Fri 13 Feb , Sean Hefty wrote: > >diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband- > >diags/src/ibdiag_common.c > >index bda1efa..154e00c 100644 > >--- a/infiniband-diags/src/ibdiag_common.c > >+++ b/infiniband-diags/src/ibdiag_common.c > >@@ -43,15 +43,14 @@ > > #include > > #include > > #include > >-#include > > #include > >-#include > > #include > > > > #include > > #include > > #include > > #include > >+#include "ibdiag_osd.h" > > I think it'll be easier to just put this include in ibdiag_common.h... 
What about adding inttypes.h and unistd.h files to the winof tree? They could be wrappers similar to ibdiag_osd.h. Sasha > > >diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c > >index e96c782..7767668 100644 > >--- a/infiniband-diags/src/sminfo.c > >+++ b/infiniband-diags/src/sminfo.c > >@@ -37,14 +37,13 @@ > > > > #include > > #include > >-#include > >-#include > > #include > > > > #include > > #include > > > > #include "ibdiag_common.h" > >+#include "ibdiag_osd.h" > > ...and avoid adding it to all the source files. I'll update my patches, but > wait for comments against this patch before re-submitting. > > - Sean > From sashak at voltaire.com Sat Feb 14 07:56:01 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 17:56:01 +0200 Subject: [ofa-general] Re: [ib-diag] sminfo: add support for WinOF In-Reply-To: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> Message-ID: <20090214155601.GL14416@sashak.voltaire.com> On 23:21 Thu 12 Feb , Sean Hefty wrote: > Allow sminfo to build and run on both Linux and Windows. Windows > build files are maintained in the WinOF repository. These changes > allow dropping the infiniband-diags into the WinOF build environment. > > Signed-off-by: Sean Hefty > --- > Would there be any objection to including the windows source files (.c and .h) > in the mgmt tree? Which files? Basically I prefer not to have unrelated things in my tree, but let's see specific needs.
> > infiniband-diags/Makefile.am | 2 + > infiniband-diags/include/ibdiag_common.h | 2 + > infiniband-diags/include/linux/ibdiag_osd.h | 43 +++++++++++++++++++++++++++ > infiniband-diags/src/ibdiag_common.c | 13 ++++---- > infiniband-diags/src/sminfo.c | 15 ++++----- > 5 files changed, 58 insertions(+), 17 deletions(-) > > diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am > index f9cc5bd..0d32abd 100644 > --- a/infiniband-diags/Makefile.am > +++ b/infiniband-diags/Makefile.am > @@ -1,5 +1,5 @@ > > -INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband > +INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband -I$(srcdir)/include/linux > > if DEBUG > DBGFLAGS = -ggdb -D_DEBUG_ > diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h > index 4783b8e..2dea873 100644 > --- a/infiniband-diags/include/ibdiag_common.h > +++ b/infiniband-diags/include/ibdiag_common.h > @@ -52,7 +52,7 @@ extern int ibd_timeout; > #undef DEBUG > #define DEBUG if (ibdebug || ibverbose) IBWARN > #define VERBOSE if (ibdebug || ibverbose > 1) IBWARN > -#define IBERROR(fmt, args...) iberror(__FUNCTION__, fmt, ## args) > +#define IBERROR(fmt, ...) iberror(__FUNCTION__, fmt, ## __VA_ARGS__) > > struct ibdiag_opt { > const char *name; > diff --git a/infiniband-diags/include/linux/ibdiag_osd.h b/infiniband-diags/include/linux/ibdiag_osd.h > new file mode 100644 > index 0000000..5c6faa9 > --- /dev/null > +++ b/infiniband-diags/include/linux/ibdiag_osd.h > @@ -0,0 +1,43 @@ > +/* > + * Copyright (c) 2009 Intel Corp, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > +#ifndef _IBDIAG_OSD_H_ > +#define _IBDIAG_OSD_H_ > + > +#include > +#include > +#include > + > +#define CDECL > + > +#endif /* _IBDIAG_OSD_H_ */ > diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c > index bda1efa..154e00c 100644 > --- a/infiniband-diags/src/ibdiag_common.c > +++ b/infiniband-diags/src/ibdiag_common.c > @@ -43,15 +43,14 @@ > #include > #include > #include > -#include > #include > -#include > #include > > #include > #include > #include > #include > +#include "ibdiag_osd.h" Wouldn't it be easier (at least for linux developers :)) instead of filtering out pretty standard header files to put such files under winof tree? 
(Including config.h, this file is generated by autotools, as far as I could see it is not used in WinOF, so it should be easy to keep this as "osd" file). > > int ibdebug; > int ibverbose; > @@ -204,7 +203,7 @@ static const struct ibdiag_opt common_opts[] = { > { "usage", 'u', 0, NULL, "usage message" }, > { "help", 'h', 0, NULL, "help message" }, > { "version", 'V', 0, NULL, "show version" }, > - {} > + { 0 } > }; > > static void make_opt(struct option *l, const struct ibdiag_opt *o, > @@ -254,11 +253,11 @@ static struct option *make_long_opts(const char *exclude_str, > > static void make_str_opts(const struct option *o, char *p, unsigned size) > { > - int i, n = 0; > + unsigned i, n = 0; > > for (n = 0; o->name && n + 2 + o->has_arg < size; o++) { > - p[n++] = o->val; > - for (i = 0; i < o->has_arg; i++) > + p[n++] = (char) o->val; > + for (i = 0; i < (unsigned) o->has_arg; i++) > p[n++] = ':'; > } > p[n] = '\0'; > @@ -273,7 +272,7 @@ int ibdiag_process_opts(int argc, char * const argv[], void *cxt, > char str_opts[1024]; > const struct ibdiag_opt *o; > > - memset(opts_map, 0, sizeof(opts_map)); > + memset((void *) opts_map, 0, sizeof(opts_map)); Hmm, why is this casting needed? 
> > prog_name = argv[0]; > prog_args = usage_args; > diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c > index e96c782..7767668 100644 > --- a/infiniband-diags/src/sminfo.c > +++ b/infiniband-diags/src/sminfo.c > @@ -37,14 +37,13 @@ > > #include > #include > -#include > -#include > #include > > #include > #include > > #include "ibdiag_common.h" > +#include "ibdiag_osd.h" > > static uint8_t sminfo[1024]; > > @@ -59,10 +58,10 @@ enum { > }; > > char *statestr[] = { > - [SMINFO_NOTACT] "SMINFO_NOTACT", > - [SMINFO_DISCOVER] "SMINFO_DISCOVER", > - [SMINFO_STANDBY] "SMINFO_STANDBY", > - [SMINFO_MASTER] "SMINFO_MASTER", > + "SMINFO_NOTACT", > + "SMINFO_DISCOVER", > + "SMINFO_STANDBY", > + "SMINFO_MASTER", > }; > > #define STATESTR(s) (((unsigned)(s)) < SMINFO_STATE_LAST ? statestr[s] : "???") > @@ -88,7 +87,7 @@ static int process_opt(void *context, int ch, char *optarg) > return 0; > } > > -int main(int argc, char **argv) > +int CDECL main(int argc, char **argv) Would compiler flag /Gd do the same without code modification? 
(http://msdn.microsoft.com/en-us/library/46t77ak2(VS.71).aspx) Sasha > { > int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS}; > int mod = 0; > @@ -100,7 +99,7 @@ int main(int argc, char **argv) > { "state", 's', 1, "<0-3>", "set SM state"}, > { "priority", 'p', 1, "<0-15>", "set SM priority"}, > { "activity", 'a', 1, NULL, "set activity count"}, > - { } > + { 0 } > }; > char usage_args[] = " [modifier]"; > > > > From dotanba at gmail.com Sat Feb 14 07:53:14 2009 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 14 Feb 2009 17:53:14 +0200 Subject: Re: [ofa-general] troubleshooting with infiniband In-Reply-To: <4de51c660902140549v6b3dec6byaf18d42aa06f966d@mail.gmail.com> References: <4de51c660902131934j79736771xfb85af348048c0b1@mail.gmail.com> <4996717C.8000005@gmail.com> <4de51c660902140549v6b3dec6byaf18d42aa06f966d@mail.gmail.com> Message-ID: <4996E8EA.1000102@gmail.com> Vittorio wrote: > thanks for the suggestion, but i can't understand which kind of > address i should put for the two commands > i tried ibping with the server (like suggested) and it works with -G > or with lid > > but what should i put as argument of ibv_rc_pingpong and rping? > > thanks a lot > Vittorio Both of them have man pages, so you can check them out. In ibv_rc_pingpong: Server side: ibv_rc_pingpong Client: ibv_rc_pingpong Sorry, but I don't remember the rping parameters ... Dotan From sashak at voltaire.com Sat Feb 14 07:58:17 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 17:58:17 +0200 Subject: [ofa-general] Re: [ibmad] libibmad: add MAD_EXPORT to exported calls In-Reply-To: <877D4427C8B64CFCB6B26E0CE0F5812A@amr.corp.intel.com> References: <877D4427C8B64CFCB6B26E0CE0F5812A@amr.corp.intel.com> Message-ID: <20090214155817.GM14416@sashak.voltaire.com> On 23:31 Thu 12 Feb , Sean Hefty wrote: > From: Stan Smith > > ibtracert and ibroute need xdump and smp_query_via exported > from the library.
Add MAD_EXPORT to the calls for Windows support. > > Signed-off-by: Stan Smith > Signed-off-by: Sean Hefty Applied. Thanks. Sasha From sashak at voltaire.com Sat Feb 14 08:11:11 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 18:11:11 +0200 Subject: [ofa-general] Re: [PATCH] opensm: fix structure definition for trap 257-258 In-Reply-To: <1234553462.3948.31.camel@chromite.mv.qlogic.com> References: <1234553462.3948.31.camel@chromite.mv.qlogic.com> Message-ID: <20090214161111.GN14416@sashak.voltaire.com> On 11:31 Fri 13 Feb , Ralph Campbell wrote: > I was looking at a structure definition for trap messages in the opensm > code and noticed this minor bug. > Here is a patch to correct the problem. > > Signed-off-by: Ralph Campbell Applied. Thanks. Sasha From sashak at voltaire.com Sat Feb 14 08:22:54 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 18:22:54 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin In-Reply-To: <20090214135139.GA25402@comcast.net> References: <20090214135139.GA25402@comcast.net> Message-ID: <20090214162254.GO14416@sashak.voltaire.com> Hi Hal, On 08:51 Sat 14 Feb , hnrose at comcast.net wrote: > > From 3b8e45eaaeaac7bd34b60dfd432469cafc6caef7 Mon Sep 17 00:00:00 2001 Please don't put this line ("From ...") in patch message body - it marks start of message in mbox file format and breaks things like 'git rebase' and similar. (At least mask this line with '> ' character). > From: Hal Rosenstock > Date: Tue, 10 Feb 2009 07:14:32 -0500 > Subject: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prints Actually there is no reason to repeat email header in a commit message. > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Sat Feb 14 08:31:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 18:31:55 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.c: Add port counters to __osm_disp_msg_str In-Reply-To: <20090214135308.GB25402@comcast.net> References: <20090214135308.GB25402@comcast.net> Message-ID: <20090214163155.GP14416@sashak.voltaire.com> On 08:53 Sat 14 Feb , hnrose at comcast.net wrote: > > From d9c17a8251b874c33542a19a51d1332ea3196713 Mon Sep 17 00:00:00 2001 > From: Hal Rosenstock > Date: Thu, 12 Feb 2009 09:27:46 -0500 > Subject: [PATCH] opensm/osm_helper.c: Add port counters to __osm_disp_msg_str > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sat Feb 14 08:36:58 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 18:36:58 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c: Add list of SMs to status command In-Reply-To: <20090214135409.GC25402@comcast.net> References: <20090214135409.GC25402@comcast.net> Message-ID: <20090214163658.GQ14416@sashak.voltaire.com> On 08:54 Sat 14 Feb , hnrose at comcast.net wrote: > > From debc6e1f5bd225449ca897264948b08ccf69de38 Mon Sep 17 00:00:00 2001 > From: Hal Rosenstock > Date: Fri, 13 Feb 2009 09:49:36 -0500 > Subject: [PATCH] opensm/osm_console.c: Add list of SMs to status command > > Also, add SM priority into status command > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Sat Feb 14 08:38:32 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 18:38:32 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c: Eliminate some extraneous parentheses In-Reply-To: <20090214135504.GD25402@comcast.net> References: <20090214135504.GD25402@comcast.net> Message-ID: <20090214163832.GR14416@sashak.voltaire.com> On 08:55 Sat 14 Feb , hnrose at comcast.net wrote: > > From 8d6c1b61e43059ed80885131c0bbce51baf4eddf Mon Sep 17 00:00:00 2001 > From: Hal Rosenstock > Date: Fri, 13 Feb 2009 10:35:39 -0500 > Subject: [PATCH] opensm/osm_console.c: Eliminate some extraneous parentheses > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sat Feb 14 08:40:51 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 18:40:51 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_console.c: Add missing command in help_perfmgr In-Reply-To: <20090214135550.GE25402@comcast.net> References: <20090214135550.GE25402@comcast.net> Message-ID: <20090214164051.GS14416@sashak.voltaire.com> On 08:55 Sat 14 Feb , hnrose at comcast.net wrote: > > From 7faaf4e757c42a8f57fd5b02f425266f2eb853b2 Mon Sep 17 00:00:00 2001 > From: Hal Rosenstock > Date: Fri, 13 Feb 2009 13:32:43 -0500 > Subject: [PATCH] opensm/osm_console.c: Add missing command in help_perfmgr > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Sat Feb 14 08:44:59 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 18:44:59 +0200 Subject: [ofa-general] Re: [PATCH] ibsim/sim_net.c: In new_node, fix nodetype in nodeinfo for router nodes In-Reply-To: <20090214135700.GF25402@comcast.net> References: <20090214135700.GF25402@comcast.net> Message-ID: <20090214164459.GT14416@sashak.voltaire.com> On 08:57 Sat 14 Feb , hnrose at comcast.net wrote: > > From 17350f5a17ec5ec821607aae7bf94a88b84d6e74 Mon Sep 17 00:00:00 2001 > From: Hal Rosenstock > Date: Thu, 12 Feb 2009 10:57:20 -0500 > Subject: [PATCH] ibsim/sim_net.c: In new_node, fix nodetype in nodeinfo for router nodes > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From hal.rosenstock at gmail.com Sat Feb 14 09:03:12 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 14 Feb 2009 12:03:12 -0500 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin In-Reply-To: <20090214162254.GO14416@sashak.voltaire.com> References: <20090214135139.GA25402@comcast.net> <20090214162254.GO14416@sashak.voltaire.com> Message-ID: Hi Sasha, On Sat, Feb 14, 2009 at 11:22 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 08:51 Sat 14 Feb , hnrose at comcast.net wrote: >> >> From 3b8e45eaaeaac7bd34b60dfd432469cafc6caef7 Mon Sep 17 00:00:00 2001 > > Please don't put this line ("From ...") in patch message body - it marks > start of message in mbox file format and breaks things like 'git rebase' > and similar. (At least mask this line with '> ' character). Looks to me like it was >From but I'll try to remember to strip this. >> From: Hal Rosenstock >> Date: Tue, 10 Feb 2009 07:14:32 -0500 >> Subject: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prints > > Actually there is no reason to repeat email header in a commit message. So you just want the email subject and that stripped from the commit log ? 
-- Hal >> >> Signed-off-by: Hal Rosenstock > > Applied. Thanks. > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Sat Feb 14 09:46:22 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 19:46:22 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin In-Reply-To: References: <20090214135139.GA25402@comcast.net> <20090214162254.GO14416@sashak.voltaire.com> Message-ID: <20090214174622.GU14416@sashak.voltaire.com> On 12:03 Sat 14 Feb , Hal Rosenstock wrote: > > > > Please don't put this line ("From ...") in patch message body - it marks > > start of message in mbox file format and breaks things like 'git rebase' > > and similar. (At least mask this line with '> ' character). > > Looks to me like it was >From but I'll try to remember to strip this. I added '>' to 'From ...' by hand during commit using 'git commit --amend' (for each patch). > > >> From: Hal Rosenstock > >> Date: Tue, 10 Feb 2009 07:14:32 -0500 > >> Subject: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prints > > > > Actually there is no reason to repeat email header in a commit message. > > So you just want the email subject and that stripped from the commit log ? Normally email subject is used as patch description and email up to '---' line as commit message. You can put any text which is not part of commit message under '---' and before diffstat lines. You may want to look at http://git.kernel.org/?p=git/git.git;a=blob_plain;f=Documentation/SubmittingPatches;hb=HEAD (or similar paper in kernel source tree) for more detailed explanations. 
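The masking Sasha describes can be sketched in C as follows. This assumes the mboxrd convention: any body line matching ">*From " gets one more leading '>', so the quoting stays reversible. The function names mbox_quote() and needs_from_quote() are hypothetical and are not taken from git or any mailing-list tool.

```c
#include <assert.h>
#include <string.h>

/* mboxrd rule: a body line matching ^>*"From " must be quoted. */
static int needs_from_quote(const char *line)
{
	while (*line == '>')
		line++;
	return strncmp(line, "From ", 5) == 0;
}

/* Copy src into dst (dstlen bytes), prefixing '>' to every line that
 * would otherwise look like an mbox message separator.
 * Returns the output length, or -1 if dst is too small. */
static int mbox_quote(const char *src, char *dst, size_t dstlen)
{
	size_t out = 0;
	int at_line_start = 1;

	assert(src != NULL && dst != NULL);
	if (dstlen == 0)
		return -1;

	for (; *src; src++) {
		if (at_line_start && needs_from_quote(src)) {
			if (out + 1 >= dstlen)
				return -1;
			dst[out++] = '>';
		}
		if (out + 1 >= dstlen)
			return -1;
		dst[out++] = *src;
		at_line_start = (*src == '\n');
	}
	dst[out] = '\0';
	return (int)out;
}
```

With this rule, a pasted "From 3b8e45ea... Mon Sep 17 00:00:00 2001" line becomes ">From 3b8e45ea...", so an mbox reader no longer treats it as the start of a new message.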
Sasha From sashak at voltaire.com Sat Feb 14 10:05:54 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 20:05:54 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin In-Reply-To: <20090214174622.GU14416@sashak.voltaire.com> References: <20090214135139.GA25402@comcast.net> <20090214162254.GO14416@sashak.voltaire.com> <20090214174622.GU14416@sashak.voltaire.com> Message-ID: <20090214180554.GV14416@sashak.voltaire.com> On 19:46 Sat 14 Feb , Sasha Khapyorsky wrote: > > > > >> From: Hal Rosenstock > > >> Date: Tue, 10 Feb 2009 07:14:32 -0500 > > >> Subject: [PATCH] opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prints > > > > > > Actually there is no reason to repeat email header in a commit message. > > > > So you just want the email subject and that stripped from the commit log ? > > Normally email subject is used as patch description and email up to '---' > line as commit message. You can put any text which is not part of > commit message under '---' and before diffstat lines. And if you need to change patch authorship put line: From: Author Name (with ":") as first non-empty line in an email message body. Sasha > > You may want to look at > > http://git.kernel.org/?p=git/git.git;a=blob_plain;f=Documentation/SubmittingPatches;hb=HEAD > > (or similar paper in kernel source tree) for more detailed explanations. 
> > Sasha From sean.hefty at intel.com Sat Feb 14 10:21:20 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 14 Feb 2009 10:21:20 -0800 Subject: [ofa-general] RE: [ofw] [ib-diag] sminfo: add support for WinOF In-Reply-To: <20090214154045.GK14416@sashak.voltaire.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com> <20090214154045.GK14416@sashak.voltaire.com> Message-ID: <53FBD52E94FB434A944908FCE21DC27F@amr.corp.intel.com> >> >+#include "ibdiag_osd.h" >> >> I think it'll be easier to just put this include in ibdiag_common.h... > >What about to add files inttypes.h and unistd.h in winof tree? It could >be wrapper similars to ibdiag_osd.h. That could be done. The files would just be empty. As a thought, if you think of the porting going the reverse direction, would you want to add a windows.h to the linux side? - Sean From sean.hefty at intel.com Sat Feb 14 10:40:57 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 14 Feb 2009 10:40:57 -0800 Subject: [ofa-general] RE: [ib-diag] sminfo: add support for WinOF In-Reply-To: <20090214155601.GL14416@sashak.voltaire.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <20090214155601.GL14416@sashak.voltaire.com> Message-ID: <4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com> >> Would there be any objection to including the windows source files (.c and >.h) >> in the mgmt tree? > >Which files? Basically I prefer to not have unrelated things in my tree, >but let's see specific needs. So far, I have windows/ibdiag_osd.h, ibdiag_windows.c, and windows/cl_nodenamemap.h. My goal is to have the ib-diags support both Windows and Linux, so Windows files are related in that respect. Making an exception for the build files is reasonable IMO, given the WinOF build environment. 
>> diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband- >diags/src/ibdiag_common.c >> index bda1efa..154e00c 100644 >> --- a/infiniband-diags/src/ibdiag_common.c >> +++ b/infiniband-diags/src/ibdiag_common.c >> @@ -43,15 +43,14 @@ >> #include >> #include >> #include >> -#include >> #include >> -#include >> #include >> >> #include >> #include >> #include >> #include >> +#include "ibdiag_osd.h" > >Wouldn't it be easier (at least for linux developers :)) instead >of filtering out pretty standard header files to put such files under >winof tree? (Including config.h, this file is generated by autotools, >as far as I could see it is not used in WinOF, so it should be easy to >keep this as "osd" file). unistd.h is an 'osd' type file, so I think it makes more sense to isolate it to an osd related area. But if you really prefer, I can abstract these. (Windows provides an errno.h file, so at least there's some precedence.) >> @@ -273,7 +272,7 @@ int ibdiag_process_opts(int argc, char * const argv[], >void *cxt, >> char str_opts[1024]; >> const struct ibdiag_opt *o; >> >> - memset(opts_map, 0, sizeof(opts_map)); >> + memset((void *) opts_map, 0, sizeof(opts_map)); > >Hmm, why is this casting needed? opts_map is declared as const - (i.e. my compiler whined at me) >> -int main(int argc, char **argv) >> +int CDECL main(int argc, char **argv) > >Would compiler flag /Gd do the same without code modification? > >(http://msdn.microsoft.com/en-us/library/46t77ak2(VS.71).aspx) I'll see if I can get this to work. My quick test gave me compiler option conflicts, so I'll have to look into this more. 
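The following sketch shows why the (void *) cast should not be needed. As quoted later in this thread, opts_map is declared as static const struct ibdiag_opt *opts_map[256]. That is an array of writable pointers to read-only structs, so clearing the array itself does not violate const. The struct below is a hypothetical two-field stand-in for the real struct ibdiag_opt. The compiler complaint Sean saw was probably MSVC warning C4090 ("different 'const' qualifiers"), which MSVC is known to emit in C mode for this kind of conversion, but that is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the real struct ibdiag_opt. */
struct ibdiag_opt {
	const char *name;
	char letter;
};

/* 256 writable pointers, each referring to a read-only option. */
static const struct ibdiag_opt *opts_map[256];

static const struct ibdiag_opt example_opts[] = {
	{ "help", 'h' },
	{ "verbose", 'v' },
};

/* Index options by their short letter. */
static void build_opts_map(const struct ibdiag_opt *opts, size_t n)
{
	size_t i;

	assert(opts != NULL || n == 0);
	/* opts_map decays to const struct ibdiag_opt **; the objects being
	 * cleared are the pointers themselves, which are not const, so
	 * passing the array to memset() is valid C without a cast.
	 * (All-bits-zero is a null pointer on the platforms in question.) */
	memset(opts_map, 0, sizeof(opts_map));
	for (i = 0; i < n; i++)
		opts_map[(unsigned char)opts[i].letter] = &opts[i];
}

static const struct ibdiag_opt *lookup_opt(char letter)
{
	return opts_map[(unsigned char)letter];
}
```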
- Sean From sean.hefty at intel.com Sat Feb 14 10:46:51 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 14 Feb 2009 10:46:51 -0800 Subject: [ofa-general] RE: [ofw] [ib-diag] sminfo: add support for WinOF In-Reply-To: <20090214154045.GK14416@sashak.voltaire.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com> <20090214154045.GK14416@sashak.voltaire.com> Message-ID: >What about to add files inttypes.h and unistd.h in winof tree? It could >be wrapper similars to ibdiag_osd.h. One advantage of using your approach is that the source files end up only including those headers that it needs. Moving everything into ibdiag_osd.h means that the source files pick up other includes. Anyway, just let me know your preference, and I'll update the patches. - Sean From sashak at voltaire.com Sat Feb 14 11:02:28 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 21:02:28 +0200 Subject: [ofa-general] Re: [ofw] [ib-diag] sminfo: add support for WinOF In-Reply-To: <53FBD52E94FB434A944908FCE21DC27F@amr.corp.intel.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com> <20090214154045.GK14416@sashak.voltaire.com> <53FBD52E94FB434A944908FCE21DC27F@amr.corp.intel.com> Message-ID: <20090214190228.GW14416@sashak.voltaire.com> On 10:21 Sat 14 Feb , Sean Hefty wrote: > >> >+#include "ibdiag_osd.h" > >> > >> I think it'll be easier to just put this include in ibdiag_common.h... > > > >What about to add files inttypes.h and unistd.h in winof tree? It could > >be wrapper similars to ibdiag_osd.h. > > That could be done. The files would just be empty. The files could be empty or as alternative to contain logically related stuff there. (For example inttypes.h can contain PRI* macros definitions). 
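Here is a sketch of what such a WinOF-side inttypes.h wrapper might contain. It assumes the compiler ships no <inttypes.h>, which was the case for MSVC releases before Visual Studio 2013 (_MSC_VER 1800). The _MSC_VER cutoff, the GUID_STR_LEN macro and format_guid() are illustrative assumptions, not code from either tree. format_guid() only mirrors the "port 0x%" PRIx64 formatting that ibsim already uses. The point of the wrapper is that common code keeps using PRIx64 unchanged on both platforms.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#if defined(_MSC_VER) && _MSC_VER < 1800
/* Hypothetical wrapper branch: older MSVC has no <inttypes.h>,
 * so supply the type and PRI* macro that the common code uses. */
typedef unsigned __int64 uint64_t;
#ifndef PRIx64
#define PRIx64 "I64x"
#endif
#else
/* C99/POSIX systems: the real header already defines both. */
#include <inttypes.h>
#endif

/* "0x" + 16 hex digits + terminating NUL */
#define GUID_STR_LEN 19

/* Format a 64-bit GUID using PRIx64; returns the string length. */
static size_t format_guid(char buf[GUID_STR_LEN], uint64_t guid)
{
	assert(buf != NULL);
	sprintf(buf, "0x%016" PRIx64, guid);
	return strlen(buf);
}
```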
Another (at least hypothetical) advantage of such method is that when (and if it will happen) WinOF will decide to use things like cygwin then the "porting" will be pretty trivial. > As a thought, if you think > of the porting going the reverse direction, would you want to add a windows.h to > the linux side? Only in case when I would start a win-centric project porting. :) About windows.h - I guess this file is actually included in all (or almost) user space *.c files, assuming so we can put in in config.h. Sasha From sashak at voltaire.com Sat Feb 14 11:04:10 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 21:04:10 +0200 Subject: [ofa-general] Re: [ofw] [ib-diag] sminfo: add support for WinOF In-Reply-To: References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <77F5818CCF984093A66E291D7B0E3AF7@amr.corp.intel.com> <20090214154045.GK14416@sashak.voltaire.com> Message-ID: <20090214190410.GX14416@sashak.voltaire.com> On 10:46 Sat 14 Feb , Sean Hefty wrote: > > One advantage of using your approach is that the source files end up only > including those headers that it needs. Moving everything into ibdiag_osd.h > means that the source files pick up other includes. Anyway, just let me know > your preference, and I'll update the patches. I would prefer to have *nix/posix style files and to minimize the needed changes in common sources. 
Sasha From sashak at voltaire.com Sat Feb 14 11:11:01 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 21:11:01 +0200 Subject: [ofa-general] Re: [ib-diag] sminfo: add support for WinOF In-Reply-To: <4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <20090214155601.GL14416@sashak.voltaire.com> <4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com> Message-ID: <20090214191101.GY14416@sashak.voltaire.com> On 10:40 Sat 14 Feb , Sean Hefty wrote: > >> Would there be any objection to including the windows source files (.c and > >.h) > >> in the mgmt tree? > > > >Which files? Basically I prefer to not have unrelated things in my tree, > >but let's see specific needs. > > So far, I have windows/ibdiag_osd.h, ibdiag_windows.c, and > windows/cl_nodenamemap.h. Isn't cl_nodenamemap.h part of complib? > > My goal is to have the ib-diags support both Windows and Linux, so Windows files > are related in that respect. Making an exception for the build files is > reasonable IMO, given the WinOF build environment. > > >> diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband- > >diags/src/ibdiag_common.c > >> index bda1efa..154e00c 100644 > >> --- a/infiniband-diags/src/ibdiag_common.c > >> +++ b/infiniband-diags/src/ibdiag_common.c > >> @@ -43,15 +43,14 @@ > >> #include > >> #include > >> #include > >> -#include > >> #include > >> -#include > >> #include > >> > >> #include > >> #include > >> #include > >> #include > >> +#include "ibdiag_osd.h" > > > >Wouldn't it be easier (at least for linux developers :)) instead > >of filtering out pretty standard header files to put such files under > >winof tree? (Including config.h, this file is generated by autotools, > >as far as I could see it is not used in WinOF, so it should be easy to > >keep this as "osd" file). > > unistd.h is an 'osd' type file, so I think it makes more sense to isolate it to > an osd related area. 
But if you really prefer, I can abstract these. (Windows > provides an errno.h file, so at least there's some precedence.) > > >> @@ -273,7 +272,7 @@ int ibdiag_process_opts(int argc, char * const argv[], > >void *cxt, > >> char str_opts[1024]; > >> const struct ibdiag_opt *o; > >> > >> - memset(opts_map, 0, sizeof(opts_map)); > >> + memset((void *) opts_map, 0, sizeof(opts_map)); > > > >Hmm, why is this casting needed? > > opts_map is declared as const - (i.e. my compiler whined at me) Probably it is reasonable to just drop const then. I don't see what this const really does. Sasha > > >> -int main(int argc, char **argv) > >> +int CDECL main(int argc, char **argv) > > > >Would compiler flag /Gd do the same without code modification? > > > >(http://msdn.microsoft.com/en-us/library/46t77ak2(VS.71).aspx) > > I'll see if I can get this to work. My quick test gave me compiler option > conflicts, so I'll have to look into this more. > > - Sean > From sean.hefty at intel.com Sat Feb 14 11:26:39 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 14 Feb 2009 11:26:39 -0800 Subject: [ofa-general] RE: [ib-diag] sminfo: add support for WinOF In-Reply-To: <20090214191101.GY14416@sashak.voltaire.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <20090214155601.GL14416@sashak.voltaire.com> <4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com> <20090214191101.GY14416@sashak.voltaire.com> Message-ID: >Isn't cl_nodenamemap.h part of complib? It's not available in windows. (Yes, sadly, even the OS abstraction code doesn't share a common codebase between the two platforms...) I'm not even sure nodenamemap is really at the same level of abstraction as other complib items, but I didn't want to try changing that area of the code at this time. (It seems like adding a cl_map_insert_copy() type operation would provide the desired funcationality.) I guess I can try adding nodenamemap to the windows version of complib for now. 
I didn't because I'm not convinced that it should be in complib. >> opts_map is declared as const - (i.e. my compiler whined at me) > >Probably it is reasonable to just drop const then. I don't see what this >const really does. If I remember correctly, I tried that and heard a different whine out of the compiler. I'll re-examine what the problem was. From sashak at voltaire.com Sat Feb 14 12:04:08 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 14 Feb 2009 22:04:08 +0200 Subject: [ofa-general] Re: [ib-diag] sminfo: add support for WinOF In-Reply-To: References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <20090214155601.GL14416@sashak.voltaire.com> <4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com> <20090214191101.GY14416@sashak.voltaire.com> Message-ID: <20090214200408.GZ14416@sashak.voltaire.com> On 11:26 Sat 14 Feb , Sean Hefty wrote: > >Isn't cl_nodenamemap.h part of complib? > > It's not available in windows. (Yes, sadly, even the OS abstraction code > doesn't share a common codebase between the two platforms...) I'm not even sure > nodenamemap is really at the same level of abstraction as other complib items, > but I didn't want to try changing that area of the code at this time. (It seems > like adding a cl_map_insert_copy() type operation would provide the desired > funcationality.) > > I guess I can try adding nodenamemap to the windows version of complib for now. > I didn't because I'm not convinced that it should be in complib. > > >> opts_map is declared as const - (i.e. my compiler whined at me) > > > >Probably it is reasonable to just drop const then. I don't see what this > >const really does. > > If I remember correctly, I tried that and heard a different whine out of the > compiler. I'll re-examine what the problem was. 
Ok, I'm starting to understand (again :)) why 'const' is there: static const struct ibdiag_opt *opts_map[256]; and later: memset(opts_map, 0, sizeof(opts_map)); opts_map is array of pointers which should refer read-only areas, memset() initializes the array itself. As far as I understand there should not be a "const violations". Sasha From hnrose at comcast.net Sat Feb 14 12:36:03 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Sat, 14 Feb 2009 15:36:03 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] ibsim/sim_client.c: Eliminate unneeded qp param from sim_init Message-ID: <20090214203603.GC32660@comcast.net> Signed-off-by: Hal Rosenstock --- umad2sim/sim_client.c | 9 ++++----- 1 files changed, 4 insertions(+), 5 deletions(-) diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c index 3fffd24..59f81d5 100644 --- a/umad2sim/sim_client.c +++ b/umad2sim/sim_client.c @@ -202,7 +202,7 @@ static int sim_disconnect(struct sim_client *sc) return sim_ctl(sc, SIM_CTL_DISCONNECT, 0, 0); } -static int sim_init(struct sim_client *sc, int qp, char *nodeid) +static int sim_init(struct sim_client *sc, char *nodeid) { union name_t name; socklen_t size; @@ -222,8 +222,7 @@ static int sim_init(struct sim_client *sc, int qp, char *nodeid) if (connect_host && *connect_host) remote_mode = 1; - DEBUG("init client pid=%d, qp=%d nodeid=%s", - pid, qp, nodeid ? nodeid : "none"); + DEBUG("init client pid=%d, nodeid=%s", pid, nodeid ? nodeid : "none"); if ((fd = socket(remote_mode ? PF_INET : PF_LOCAL, SOCK_DGRAM, 0)) < 0) IBPANIC("can't get socket (fd)"); @@ -257,7 +256,7 @@ static int sim_init(struct sim_client *sc, int qp, char *nodeid) IBPANIC("can't read data from bound socket"); port = ntohs(name.name_i.sin_port); - sc->clientid = sim_connect(sc, remote_mode ? port : pid, qp, nodeid); + sc->clientid = sim_connect(sc, remote_mode ? 
port : pid, 0, nodeid); if (sc->clientid < 0) IBPANIC("connect failed"); @@ -289,7 +288,7 @@ int sim_client_init(struct sim_client *sc) char *nodeid; nodeid = getenv("SIM_HOST"); - if (sim_init(sc, 0, nodeid) < 0) + if (sim_init(sc, nodeid) < 0) return -1; if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor, sizeof(sc->vendor)) < 0) -- 1.5.6.4 From hnrose at comcast.net Sat Feb 14 12:37:03 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Sat, 14 Feb 2009 15:37:03 -0500 Subject: [ofa-general] [PATCH] ibsim/sim_client.c: In sim_client_init, return -1 on error Message-ID: <20090214203703.GD32660@comcast.net> Signed-off-by: Hal Rosenstock --- umad2sim/sim_client.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c index 59f81d5..06bb7a8 100644 --- a/umad2sim/sim_client.c +++ b/umad2sim/sim_client.c @@ -309,7 +309,7 @@ int sim_client_init(struct sim_client *sc) _exit: sim_disconnect(sc); sc->fd_ctl = sc->fd_pktin = sc->fd_pktout = -1; - return 0; + return -1; } void sim_client_exit(struct sim_client *sc) -- 1.5.6.4 From hnrose at comcast.net Sat Feb 14 12:35:03 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Sat, 14 Feb 2009 15:35:03 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] ibsim: Eliminate unneeded argument in sim_client_init Message-ID: <20090214203503.GB32660@comcast.net> Signed-off-by: Hal Rosenstock --- umad2sim/sim_client.c | 14 ++++++++------ umad2sim/sim_client.h | 2 +- umad2sim/umad2sim.c | 3 +-- 3 files changed, 10 insertions(+), 9 deletions(-) diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c index d86de7c..3fffd24 100644 --- a/umad2sim/sim_client.c +++ b/umad2sim/sim_client.c @@ -284,19 +284,21 @@ int sim_client_set_sm(struct sim_client *sc, unsigned issm) return sim_ctl(sc, SIM_CTL_SET_ISSM, &issm, sizeof(int)); } -int sim_client_init(struct sim_client *sc, char *nodeid) +int sim_client_init(struct sim_client *sc) { - if (!nodeid) - nodeid = 
getenv("SIM_HOST"); + char *nodeid; + + nodeid = getenv("SIM_HOST"); if (sim_init(sc, 0, nodeid) < 0) return -1; - if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor, sizeof(sc->vendor)) < - 0) + if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor, + sizeof(sc->vendor)) < 0) goto _exit; if (sim_ctl(sc, SIM_CTL_GET_NODEINFO, sc->nodeinfo, sizeof(sc->nodeinfo)) < 0) goto _exit; - sc->portinfo[0] = 0; + + sc->portinfo[0] = 0; // portno requested if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, sc->portinfo, sizeof(sc->portinfo)) < 0) goto _exit; diff --git a/umad2sim/sim_client.h b/umad2sim/sim_client.h index 605b305..80ed442 100644 --- a/umad2sim/sim_client.h +++ b/umad2sim/sim_client.h @@ -47,7 +47,7 @@ struct sim_client { }; extern int sim_client_set_sm(struct sim_client *sc, unsigned issm); -extern int sim_client_init(struct sim_client *sc, char *nodeid); +extern int sim_client_init(struct sim_client *sc); extern void sim_client_exit(struct sim_client *sc); #endif /* _SIM_CLIENT_H_ */ diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c index 1236b8f..8d83a24 100644 --- a/umad2sim/umad2sim.c +++ b/umad2sim/umad2sim.c @@ -53,7 +53,6 @@ #include #include -#include #include #ifdef UMAD2SIM_NOISY_DEBUG @@ -562,7 +561,7 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) dev->num = num; strncpy(dev->name, name, sizeof(dev->name) - 1); - if (sim_client_init(&dev->sim_client, NULL) < 0) + if (sim_client_init(&dev->sim_client) < 0) goto _error; dev->port = mad_get_field(&dev->sim_client.portinfo, 0, -- 1.5.6.4 From hnrose at comcast.net Sat Feb 14 12:37:53 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Sat, 14 Feb 2009 15:37:53 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] ibsim: Add better end port simulation support Message-ID: <20090214203753.GE32660@comcast.net> Add SIM_PORT environment variable to allow for end port selection Signed-off-by: Hal Rosenstock --- ibsim/ibsim.c | 6 +- include/ibsim.h | 2 + umad2sim/sim_client.c | 49 
+++++++++- umad2sim/sim_client.h | 4 +- umad2sim/umad2sim.c | 254 ++++++++++++++++++++++++++----------------------- 5 files changed, 189 insertions(+), 126 deletions(-) diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c index f48e1f0..6a35fdc 100644 --- a/ibsim/ibsim.c +++ b/ibsim/ibsim.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This file is part of ibsim. * @@ -187,7 +188,8 @@ static int sm_exists(Node * node) return 0; } -static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *from) +static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, + union name_t *from) { union name_t name; size_t size; @@ -219,7 +221,7 @@ static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *f ctl->type = SIM_CTL_ERROR; return -1; } - cl->port = node_get_port(node, 0); + cl->port = node_get_port(node, scl->portnum); VERB("Attaching client %d at node \"%s\" port 0x%" PRIx64, i, node->nodeid, cl->port->portguid); } else { diff --git a/include/ibsim.h b/include/ibsim.h index 15fc37c..66ba6f9 100644 --- a/include/ibsim.h +++ b/include/ibsim.h @@ -1,5 +1,6 @@ /* * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This file is part of ibsim. * @@ -100,6 +101,7 @@ struct sim_client_info { uint32_t qp; uint32_t issm; /* accept request for qp 0 & 1 */ char nodeid[32]; + uint32_t portnum; }; union name_t { diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c index 06bb7a8..1c35109 100644 --- a/umad2sim/sim_client.c +++ b/umad2sim/sim_client.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This file is part of ibsim. 
* @@ -182,6 +183,7 @@ static int sim_connect(struct sim_client *sc, int id, int qp, char *nodeid) info.id = id; info.issm = 0; info.qp = qp; + info.portnum = sc->portnum; if (nodeid) strncpy(info.nodeid, nodeid, sizeof(info.nodeid) - 1); @@ -202,7 +204,7 @@ static int sim_disconnect(struct sim_client *sc) return sim_ctl(sc, SIM_CTL_DISCONNECT, 0, 0); } -static int sim_init(struct sim_client *sc, char *nodeid) +static int sim_init(struct sim_client *sc, char *nodeid, int portnum) { union name_t name; socklen_t size; @@ -238,6 +240,7 @@ static int sim_init(struct sim_client *sc, char *nodeid) DEBUG("init %d: opened ctl fd %d as \'%s\'", pid, ctlfd, get_name(&name)); + sc->portnum = portnum; port = connect_port ? atoi(connect_port) : IBSIM_DEFAULT_SERVER_PORT; size = make_name(&name, connect_host, port, "%s:ctl", socket_basename); @@ -286,9 +289,17 @@ int sim_client_set_sm(struct sim_client *sc, unsigned issm) int sim_client_init(struct sim_client *sc) { char *nodeid; + char *portno; + int i, j = 0, portnum = 0, startport = 1, endport; + uint8_t numports, nodetype; + uint8_t *portinfo; nodeid = getenv("SIM_HOST"); - if (sim_init(sc, nodeid) < 0) + portno = getenv("SIM_PORT"); + if (portno) + portnum = atoi(portno); + + if (sim_init(sc, nodeid, portnum) < 0) return -1; if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor, sizeof(sc->vendor)) < 0) @@ -296,11 +307,37 @@ int sim_client_init(struct sim_client *sc) if (sim_ctl(sc, SIM_CTL_GET_NODEINFO, sc->nodeinfo, sizeof(sc->nodeinfo)) < 0) goto _exit; + numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F); + nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F); + if (nodetype == 2) { // switch + startport = 0; + endport = 0; + } else { + if (portnum == 0) { + IBWARN("portnum 0 is not valid end port on non switch node"); + goto _exit; + } + endport = numports; + } + if (portnum > endport) { + IBWARN("portnum %d is not a valid end port number (%d)", + portnum, endport); + goto _exit; + } - sc->portinfo[0] = 0; // 
portno requested - if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, sc->portinfo, - sizeof(sc->portinfo)) < 0) + sc->portinfo = malloc(64 * (nodetype != 2 ? numports + 1 : 1)); // portinfo size x number of ports starting at 0 + if (!sc->portinfo) goto _exit; + + // loop through end ports + for (i = startport; i <= endport ; i++, j++) { + portinfo = sc->portinfo + 64 * j; + *portinfo = i + 1; // portno requested + if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, portinfo, 64) < 0) + goto _exit; + } + + // although pkeys also per port, current config same on all end ports if (sim_ctl(sc, SIM_CTL_GET_PKEYS, sc->pkeys, sizeof(sc->pkeys)) < 0) goto _exit; if (getenv("SIM_SET_ISSM")) @@ -315,5 +352,7 @@ int sim_client_init(struct sim_client *sc) void sim_client_exit(struct sim_client *sc) { sim_disconnect(sc); + if (sc->portinfo) + free(sc->portinfo); sc->fd_ctl = sc->fd_pktin = sc->fd_pktout = -1; } diff --git a/umad2sim/sim_client.h b/umad2sim/sim_client.h index 80ed442..0faca80 100644 --- a/umad2sim/sim_client.h +++ b/umad2sim/sim_client.h @@ -1,5 +1,6 @@ /* * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This file is part of ibsim. * @@ -41,8 +42,9 @@ struct sim_client { int clientid; int fd_pktin, fd_pktout, fd_ctl; struct sim_vendor vendor; + int portnum; uint8_t nodeinfo[64]; - uint8_t portinfo[64]; + uint8_t *portinfo; uint16_t pkeys[SIM_CTL_MAX_DATA/sizeof(uint16_t)]; }; diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c index 8d83a24..6e3c269 100644 --- a/umad2sim/umad2sim.c +++ b/umad2sim/umad2sim.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This file is part of ibsim. 
* @@ -179,7 +180,10 @@ static int dev_sysfs_create(struct umad2sim_dev *dev) struct sim_client *sc = &dev->sim_client; char *str; uint8_t *portinfo; - int i; + char *ports_path_end; + int i, j; + int startport = 1, endport; + uint8_t numports, nodetype; /* /sys/class/infiniband_mad/abi_version */ snprintf(path, sizeof(path), "%s", sysfs_infiniband_mad_dir); @@ -232,123 +236,138 @@ static int dev_sysfs_create(struct umad2sim_dev *dev) strncat(path, "/ports", sizeof(path) - 1); make_path(path); - portinfo = sc->portinfo; - - /* /sys/class/infiniband/mthca0/ports/1/ */ - val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); - snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val); - make_path(path); - - /* /sys/class/infiniband/mthca0/ports/1/lid_mask_count */ - val = mad_get_field(portinfo, 0, IB_PORT_LMC_F); - file_printf(path, SYS_PORT_LMC, "%d", val); - - /* /sys/class/infiniband/mthca0/ports/1/sm_lid */ - val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F); - file_printf(path, SYS_PORT_SMLID, "0x%x", val); - - /* /sys/class/infiniband/mthca0/ports/1/sm_sl */ - val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F); - file_printf(path, SYS_PORT_SMSL, "%d", val); - - /* /sys/class/infiniband/mthca0/ports/1/lid */ - val = mad_get_field(portinfo, 0, IB_PORT_LID_F); - file_printf(path, SYS_PORT_LID, "0x%x", val); - - /* /sys/class/infiniband/mthca0/ports/1/state */ - val = mad_get_field(portinfo, 0, IB_PORT_STATE_F); - if (val == 0) - str = "NOP"; - else if (val == 1) - str = "DOWN"; - else if (val == 2) - str = "INIT"; - else if (val == 3) - str = "ARMED"; - else if (val == 4) - str = "ACTIVE"; - else if (val == 5) - str = "ACTIVE_DEFER"; - else - str = ""; - file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str); - - /* /sys/class/infiniband/mthca0/ports/1/phys_state */ - val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F); - if (val == 1) - str = "Sleep"; - else if (val == 2) - str = "Polling"; - else if (val == 3) - str = "Disabled"; - else if (val 
== 4) - str = "PortConfigurationTraining"; - else if (val == 5) - str = "LinkUp"; - else if (val == 6) - str = "LinkErrorRecovery"; - else if (val == 7) - str = "Phy Test"; - else - str = ""; - file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str); - - /* /sys/class/infiniband/mthca0/ports/1/rate */ - val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); - speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F); - if (val == 1) - val = 1; - else if (val == 2) - val = 4; - else if (val == 4) - val = 8; - else if (val == 8) - val = 12; - else - val = 0; - if (speed == 2) - str = " DDR"; - else if (speed == 4) - str = " QDR"; - else - str = ""; - file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n", - (val * speed * 25) / 10, - (val * speed * 25) % 10 ? ".5" : "", val, str); - - /* /sys/class/infiniband/mthca0/ports/1/cap_mask */ - val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F); - file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val); - - /* /sys/class/infiniband/mthca0/ports/1/gids/0 */ - str = path + strlen(path); - strncat(path, "/gids", sizeof(path) - 1); - make_path(path); - *str = '\0'; - gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F); - guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + - mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); - file_printf(path, SYS_PORT_GID, - "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", - (uint16_t) ((gid >> 48) & 0xffff), - (uint16_t) ((gid >> 32) & 0xffff), - (uint16_t) ((gid >> 16) & 0xffff), - (uint16_t) ((gid >> 0) & 0xffff), - (uint16_t) ((guid >> 48) & 0xffff), - (uint16_t) ((guid >> 32) & 0xffff), - (uint16_t) ((guid >> 16) & 0xffff), - (uint16_t) ((guid >> 0) & 0xffff)); + numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F); + nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F); + if (nodetype == 2) { // switch + startport = 0; + endport = 0; + } else + endport = numports; + + ports_path_end = path + strlen(path); + + // loop through end ports + for (j = 
startport; j <= endport; j++) { + + portinfo = sc->portinfo + 64 * j; + + /* /sys/class/infiniband/mthca0/ports// */ + val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); + snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val); + make_path(path); + + /* /sys/class/infiniband/mthca0/ports//lid_mask_count */ + val = mad_get_field(portinfo, 0, IB_PORT_LMC_F); + file_printf(path, SYS_PORT_LMC, "%d", val); + + /* /sys/class/infiniband/mthca0/ports//sm_lid */ + val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F); + file_printf(path, SYS_PORT_SMLID, "0x%x", val); + + /* /sys/class/infiniband/mthca0/ports//sm_sl */ + val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F); + file_printf(path, SYS_PORT_SMSL, "%d", val); + + /* /sys/class/infiniband/mthca0/ports//lid */ + val = mad_get_field(portinfo, 0, IB_PORT_LID_F); + file_printf(path, SYS_PORT_LID, "0x%x", val); + + /* /sys/class/infiniband/mthca0/ports//state */ + val = mad_get_field(portinfo, 0, IB_PORT_STATE_F); + if (val == 0) + str = "NOP"; + else if (val == 1) + str = "DOWN"; + else if (val == 2) + str = "INIT"; + else if (val == 3) + str = "ARMED"; + else if (val == 4) + str = "ACTIVE"; + else if (val == 5) + str = "ACTIVE_DEFER"; + else + str = ""; + file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str); + + /* /sys/class/infiniband/mthca0/ports//phys_state */ + val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F); + if (val == 1) + str = "Sleep"; + else if (val == 2) + str = "Polling"; + else if (val == 3) + str = "Disabled"; + else if (val == 4) + str = "PortConfigurationTraining"; + else if (val == 5) + str = "LinkUp"; + else if (val == 6) + str = "LinkErrorRecovery"; + else if (val == 7) + str = "Phy Test"; + else + str = ""; + file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str); + + /* /sys/class/infiniband/mthca0/ports//rate */ + val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); + speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F); + if (val == 1) + val = 
1; + else if (val == 2) + val = 4; + else if (val == 4) + val = 8; + else if (val == 8) + val = 12; + else + val = 0; + if (speed == 2) + str = " DDR"; + else if (speed == 4) + str = " QDR"; + else + str = ""; + file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n", + (val * speed * 25) / 10, + (val * speed * 25) % 10 ? ".5" : "", val, str); + + /* /sys/class/infiniband/mthca0/ports//cap_mask */ + val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F); + file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val); + + /* /sys/class/infiniband/mthca0/ports//gids/0 */ + str = path + strlen(path); + strncat(path, "/gids", sizeof(path) - 1); + make_path(path); + *str = '\0'; + gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F); + guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + j; + file_printf(path, SYS_PORT_GID, + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + (uint16_t) ((gid >> 48) & 0xffff), + (uint16_t) ((gid >> 32) & 0xffff), + (uint16_t) ((gid >> 16) & 0xffff), + (uint16_t) ((gid >> 0) & 0xffff), + (uint16_t) ((guid >> 48) & 0xffff), + (uint16_t) ((guid >> 32) & 0xffff), + (uint16_t) ((guid >> 16) & 0xffff), + (uint16_t) ((guid >> 0) & 0xffff)); + + /* /sys/class/infiniband/mthca0/ports//pkeys/0 */ + str = path + strlen(path); + strncat(path, "/pkeys", sizeof(path) - 1); + make_path(path); + for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) { + char name[8]; + snprintf(name, sizeof(name), "%u", i); + file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i])); + } + *str = '\0'; - /* /sys/class/infiniband/mthca0/ports/1/pkeys/0 */ - str = path + strlen(path); - strncat(path, "/pkeys", sizeof(path) - 1); - make_path(path); - for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) { - char name[8]; - snprintf(name, sizeof(name), "%u", i); - file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i])); + *ports_path_end = '\0'; } - *str = '\0'; /* /sys/class/infiniband_mad/umad0/ */ snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir, 
@@ -564,8 +583,7 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) if (sim_client_init(&dev->sim_client) < 0) goto _error; - dev->port = mad_get_field(&dev->sim_client.portinfo, 0, - IB_PORT_LOCAL_PORT_F); + dev->port = dev->sim_client.portnum; for (i = 0; i < arrsize(dev->agents); i++) dev->agents[i].id = (uint32_t)(-1); for (i = 0; i < arrsize(dev->agent_idx); i++) -- 1.5.6.4 From sean.hefty at intel.com Sat Feb 14 17:59:47 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 14 Feb 2009 17:59:47 -0800 Subject: [ofa-general] RE: [ib-diag] sminfo: add support for WinOF In-Reply-To: <20090214200408.GZ14416@sashak.voltaire.com> References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <20090214155601.GL14416@sashak.voltaire.com> <4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com> <20090214191101.GY14416@sashak.voltaire.com> <20090214200408.GZ14416@sashak.voltaire.com> Message-ID: >static const struct ibdiag_opt *opts_map[256]; > >and later: > > memset(opts_map, 0, sizeof(opts_map)); With the above: warnings in directory c:\mshefty\scm\winof\branches\winverbs\tools\infiniband-diags\src\sminfo c:\mshefty\scm\winof\branches\winverbs\tools\infiniband-diags\src\ibdiag_common.c(276) : warning C4090: 'function' : different 'const' qualifiers (line 276 is the memset) With const removed, Windows builds fine, but then Linux build gives: src/ibdiag_common.c: In function `ibdiag_process_opts': src/ibdiag_common.c:282: warning: passing arg 3 of `make_long_opts' from incompatible pointer type So, I gave up and added a cast at the memset.
- Sean From sean.hefty at intel.com Sat Feb 14 18:04:33 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 14 Feb 2009 18:04:33 -0800 Subject: [ofa-general] RE: [ib-diag] sminfo: add support for WinOF In-Reply-To: References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <20090214155601.GL14416@sashak.voltaire.com> <4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com> <20090214191101.GY14416@sashak.voltaire.com> <20090214200408.GZ14416@sashak.voltaire.com> Message-ID: >>static const struct ibdiag_opt *opts_map[256]; >> >>and later: >> >> memset(opts_map, 0, sizeof(opts_map)); And on second thought... can we just drop the memset completely? From sashak at voltaire.com Sat Feb 14 21:07:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 15 Feb 2009 07:07:25 +0200 Subject: [ofa-general] RE: [ib-diag] sminfo: add support for WinOF In-Reply-To: References: <430FDE77B2EA44988D82EF84355CBE4A@amr.corp.intel.com> <20090214155601.GL14416@sashak.voltaire.com> <4471C3AFC992496FA9671EB87449CAA6@amr.corp.intel.com> <20090214191101.GY14416@sashak.voltaire.com> <20090214200408.GZ14416@sashak.voltaire.com> Message-ID: <20090215050725.GA14416@sashak.voltaire.com> On 18:04 Sat 14 Feb , Sean Hefty wrote: > >>static const struct ibdiag_opt *opts_map[256]; > >> > >>and later: > >> > >> memset(opts_map, 0, sizeof(opts_map)); > > And on second thought... can we just drop the memset completely? Yes, this is static and already initialized. Sasha From sashak at voltaire.com Sat Feb 14 23:45:42 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 15 Feb 2009 09:45:42 +0200 Subject: [ofa-general] [PATCH RESEND] ibutils: remove -libcommon linkage flag Message-ID: <20090215074542.GA7189@sashak.voltaire.com> Remove the -libcommon linkage flag: libibumad no longer depends on libibcommon, and libibcommon will be removed from the management tree soon.
Signed-off-by: Sasha Khapyorsky --- libibcommon has already been removed, so building ibutils against OpenSM master is currently broken. config/osm.m4 | 2 +- ibis/config/osm.m4 | 2 +- ibmgtsim/config/osm.m4 | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/config/osm.m4 b/config/osm.m4 index da9ae81..f8d92d7 100644 --- a/config/osm.m4 +++ b/config/osm.m4 @@ -137,7 +137,7 @@ if test "x$libcheck" = "xtrue"; then elif test -L $with_osm_libs/libopensm.so; then OSM_VENDOR=openib osm_vendor_sel="-DOSM_VENDOR_INTF_OPENIB " - OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad -libcommon" + OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad" else AC_MSG_ERROR([OSM: Fail to recognize vendor type]) fi diff --git a/ibis/config/osm.m4 b/ibis/config/osm.m4 index da9ae81..f8d92d7 100644 --- a/ibis/config/osm.m4 +++ b/ibis/config/osm.m4 @@ -137,7 +137,7 @@ if test "x$libcheck" = "xtrue"; then elif test -L $with_osm_libs/libopensm.so; then OSM_VENDOR=openib osm_vendor_sel="-DOSM_VENDOR_INTF_OPENIB " - OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad -libcommon" + OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad" else AC_MSG_ERROR([OSM: Fail to recognize vendor type]) fi diff --git a/ibmgtsim/config/osm.m4 b/ibmgtsim/config/osm.m4 index da9ae81..f8d92d7 100644 --- a/ibmgtsim/config/osm.m4 +++ b/ibmgtsim/config/osm.m4 @@ -137,7 +137,7 @@ if test "x$libcheck" = "xtrue"; then elif test -L $with_osm_libs/libopensm.so; then OSM_VENDOR=openib osm_vendor_sel="-DOSM_VENDOR_INTF_OPENIB " - OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad -libcommon" + OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad" else AC_MSG_ERROR([OSM: Fail to recognize vendor type]) fi -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Feb 14 23:47:01 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 15 Feb 2009 09:47:01 +0200 Subject: [ofa-general] [PATCH] ibuitls: use osm_config.h file instead of osm_build_id.h
Message-ID: <20090215074701.GB7189@sashak.voltaire.com> Use the standard osm_config.h file for OpenSM build mode detection rather than the invalid osm_build_id.h junk file, which will be removed from the OpenSM tree very soon. Signed-off-by: Sasha Khapyorsky --- config/osm.m4 | 7 ++++--- ibis/config/osm.m4 | 7 ++++--- ibmgtsim/config/osm.m4 | 7 ++++--- 3 files changed, 12 insertions(+), 9 deletions(-) diff --git a/config/osm.m4 b/config/osm.m4 index f8d92d7..cc50fdf 100644 --- a/config/osm.m4 +++ b/config/osm.m4 @@ -166,11 +166,12 @@ if test "x$libcheck" = "xtrue"; then dnl validate the defined path - so the build id header is there - AC_CHECK_FILE($osm_include_dir/opensm/osm_build_id.h,, - AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_build_id.h])) + AC_CHECK_FILE($osm_include_dir/opensm/osm_config.h,, + AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_config.h])) dnl now figure out somehow if the build was for debug or not - if test `grep debug $osm_include_dir/opensm/osm_build_id.h | wc -l` = 1; then + grep '#define OSM_DEBUG 1' $osm_include_dir/opensm/osm_config.h > /dev/null + if test $? -eq 0 ; then dnl why did they need so many ???
osm_debug_flags='-DDEBUG -D_DEBUG -D_DEBUG_ -DDBG' AC_MSG_NOTICE(OSM: compiled in DEBUG mode) diff --git a/ibis/config/osm.m4 b/ibis/config/osm.m4 index f8d92d7..cc50fdf 100644 --- a/ibis/config/osm.m4 +++ b/ibis/config/osm.m4 @@ -166,11 +166,12 @@ if test "x$libcheck" = "xtrue"; then dnl validate the defined path - so the build id header is there - AC_CHECK_FILE($osm_include_dir/opensm/osm_build_id.h,, - AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_build_id.h])) + AC_CHECK_FILE($osm_include_dir/opensm/osm_config.h,, + AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_config.h])) dnl now figure out somehow if the build was for debug or not - if test `grep debug $osm_include_dir/opensm/osm_build_id.h | wc -l` = 1; then + grep '#define OSM_DEBUG 1' $osm_include_dir/opensm/osm_config.h > /dev/null + if test $? -eq 0 ; then dnl why did they need so many ??? osm_debug_flags='-DDEBUG -D_DEBUG -D_DEBUG_ -DDBG' AC_MSG_NOTICE(OSM: compiled in DEBUG mode) diff --git a/ibmgtsim/config/osm.m4 b/ibmgtsim/config/osm.m4 index f8d92d7..cc50fdf 100644 --- a/ibmgtsim/config/osm.m4 +++ b/ibmgtsim/config/osm.m4 @@ -166,11 +166,12 @@ if test "x$libcheck" = "xtrue"; then dnl validate the defined path - so the build id header is there - AC_CHECK_FILE($osm_include_dir/opensm/osm_build_id.h,, - AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_build_id.h])) + AC_CHECK_FILE($osm_include_dir/opensm/osm_config.h,, + AC_MSG_ERROR([OSM: could not find $with_osm/include/opensm/osm_config.h])) dnl now figure out somehow if the build was for debug or not - if test `grep debug $osm_include_dir/opensm/osm_build_id.h | wc -l` = 1; then + grep '#define OSM_DEBUG 1' $osm_include_dir/opensm/osm_config.h > /dev/null + if test $? -eq 0 ; then dnl why did they need so many ??? 
osm_debug_flags='-DDEBUG -D_DEBUG -D_DEBUG_ -DDBG' AC_MSG_NOTICE(OSM: compiled in DEBUG mode) -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sun Feb 15 00:25:40 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 15 Feb 2009 10:25:40 +0200 Subject: [ofa-general] [PATCH v2] ibutils/ibis: link ibis dynamically Message-ID: <20090215082540.GC7189@sashak.voltaire.com> Otherwise, when running against ibsim with libumad2sim.so preloaded, ibis ends up with two instances of libibumad (the statically linked one and the dynamic one pulled in by the libumad2sim.so preload), each with its own internal initialization state, which makes it impossible to use ibutils in an ibsim environment. Signed-off-by: Sasha Khapyorsky --- The difference from the previous version of this patch is the use of noinst_LIBRARIES, so libibiscom will not be installed. ibis/src/Makefile.am | 7 +++---- 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am index e0b512f..cfa22f6 100644 --- a/ibis/src/Makefile.am +++ b/ibis/src/Makefile.am @@ -54,9 +54,10 @@ AM_CXXFLAGS = $(TCL_CPPFLAGS) $(OSM_CFLAGS) $(DBG) -fno-strict-aliasing -fPIC - LIB_VER_TRIPLET="1:0:0" LIB_FILE_TRIPLET=1.0.0 -lib_LTLIBRARIES = libibiscom.la libibis.la +lib_LTLIBRARIES = libibis.la +noinst_LIBRARIES = libibiscom.a -libibiscom_la_SOURCES = ibbbm.c ibcr.c ibis.c ibis_gsi_mad_ctrl.c \ +libibiscom_a_SOURCES = ibbbm.c ibcr.c ibis.c ibis_gsi_mad_ctrl.c \ ibpm.c ibsac.c ibsm.c ibvs.c ibcc.c # client library to be used by IBIS TCL package: @@ -70,11 +71,9 @@ bin_PROGRAMS = ibis # this is used for the libraries link LDADD = $(OSM_LDFLAGS) -# AM_LDFLAGS = -static ibis_SOURCES = ibissh_wrap.cpp -ibis_LDFLAGS = -static # note the order of the libraries does matter as we static link ibis_LDADD = -libiscom $(OSM_LDFLAGS) $(TCL_LIBS) -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sun Feb 15 00:27:12 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 15 Feb 2009 10:27:12 +0200 Subject: [ofa-general] [PATCH]
opensm/Makefile.am: remove osm_build_id.h junk file generation Message-ID: <20090215082712.GD7189@sashak.voltaire.com> osm_build_id.h is not a valid C file. It is used only to determine whether OpenSM was built in debug mode, and that information is now available via the OSM_DEBUG macro in osm_config.h. Signed-off-by: Sasha Khapyorsky --- opensm/Makefile.am | 6 ------ 1 files changed, 0 insertions(+), 6 deletions(-) diff --git a/opensm/Makefile.am b/opensm/Makefile.am index 2287edd..75b6dc5 100644 --- a/opensm/Makefile.am +++ b/opensm/Makefile.am @@ -7,12 +7,6 @@ ACLOCAL_AMFLAGS = -I config # we should provide a hint for other apps about the build mode of this project install-exec-hook: - mkdir -p $(DESTDIR)/$(includedir) -if DEBUG - echo "define osm_build_type \"debug\"" > $(DESTDIR)/$(includedir)/infiniband/opensm/osm_build_id.h -else - echo "define osm_build_type \"free\"" > $(DESTDIR)/$(includedir)/infiniband/opensm/osm_build_id.h -endif $(top_srcdir)/config/install-sh -m 755 -d $(DESTDIR)/$(sysconfdir)/init.d cp $(top_builddir)/scripts/opensm.init $(DESTDIR)/$(sysconfdir)/init.d/opensmd chmod 755 $(DESTDIR)/$(sysconfdir)/init.d/opensmd -- 1.6.1.2.319.gbd9e From vlad at lists.openfabrics.org Sun Feb 15 03:15:13 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 15 Feb 2009 03:15:13 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090215-0200 daily build status Message-ID: <20090215111514.11A0EE301A9@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.27 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with
linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Failed: Build failed on ia64 with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- 
Build failed on ia64 with linux-2.6.16.21-0.8-default Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16.21-0.8-default_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.16.21-0.8-default' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** 
[/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': 
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.21.1 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.21.1_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.21.1' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.23 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.23_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.23' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.22 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.22_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.22' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on 
ia64 with linux-2.6.25 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.25_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.25' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.24_ia64_check] Error 2 make[1]: Leaving 
directory `/home/vlad/kernel.org/ia64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.26 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.26_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.26' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds/iw_cm.o] Error 1 make[2]: *** 
[/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.16_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': 
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.17_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.18-8.el5 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.18-8.el5_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.18-8.el5' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds/iw_cm.c:317: warning: assignment makes pointer from integer without a cast 
/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds/iw_cm.c:325: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds/iw_cm.c: In function 'rds_iw_conn_shutdown': /home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds/iw_cm.c:693: error: implicit declaration of function 'vfree' make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds/iw_cm.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20090215-0200_linux-2.6.19_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From tziporet at dev.mellanox.co.il Sun Feb 15 07:56:30 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 15 Feb 2009 17:56:30 +0200 Subject: [ofa-general] mlx4 changing RNR_RETRY for an established qp In-Reply-To: References: <4994A1FD.2060704@oracle.com> <4994A625.9060008@oracle.com> Message-ID: <49983B2E.6000802@mellanox.co.il> Roland Dreier wrote: > Is SQD really not supported by ConnectX? If so it is likely a temporary > firmware issue I would think. > > > Its FW but we do not plan to add it in the near future Tziporet From neutronsharc at gmail.com Sun Feb 15 14:40:36 2009 From: neutronsharc at gmail.com (neutron) Date: Sun, 15 Feb 2009 17:40:36 -0500 Subject: [ofa-general] IB function calls in kernel module fail Message-ID: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com> Hi all, I'm writing a kernel module that make use of basic IB verbs to communicate, like: ib_register_client, ib_unregister_client, ib_alloc_pd, ib_create_qp, ib_reg_phys_mr, etc. I can compile the code into a kernel module: ib_rdma_lat.ko. 
This module is to test the RDMA write latency from kernel module. But when I "insmod", I got error reports at /var/log/messages: Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_unregister_client Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_unregister_client Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_create_cq Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_cq Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_reg_phys_mr Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_reg_phys_mr Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_dereg_mr Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dereg_mr Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_register_client Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_register_client Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_destroy_cq Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_cq Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_query_port Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_query_port Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_alloc_pd Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_alloc_pd Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_create_qp Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_qp Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_modify_qp Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_modify_qp Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of symbol ib_destroy_qp Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_qp Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about 
version of symbol ib_dealloc_pd Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dealloc_pd I'm running rhel5. I have rebooted the node many times but didn't help at all. [wci11-oib:~/dist_lock/ib_kernel]uname -a Linux wci11-oib 2.6.18-53.1.14.el5 #1 SMP Tue Feb 19 07:18:46 EST 2008 x86_64 x86_64 x86_64 GNU/Linux "ofed_info" is: [wci11-oib:~/dist_lock/ib_kernel]/usr/bin/ofed_info OFED-1.3.1 libibverbs: git://git.openfabrics.org/ofed_1_3/libibverbs.git ofed_1_3 commit 40b771aa6a9c0ad092b2e20775b4723d3b173792 libmthca: git://git.openfabrics.org/ofed_1_3/libmthca.git ofed_1_3 commit 9501e698d257949acfab2edc90812602966dbcc9 libmlx4: git://git.openfabrics.org/ofed_1_3/libmlx4.git ofed_1_3 ...... I'm pretty sure all IB modules are loaded already: [wci11-oib:~/dist_lock/ib_kernel]lsmod | grep ib ib_sdp 125020 0 rdma_cm 67348 2 rdma_ucm,ib_sdp ib_addr 41992 1 rdma_cm ib_ipoib 113248 0 ib_cm 67368 3 qlgc_vnic,rdma_cm,ib_ipoib ib_sa 74632 4 qlgc_vnic,rdma_cm,ib_ipoib,ib_cm ib_uverbs 75568 1 rdma_ucm ib_umad 50600 0 ib_ipath 346316 0 mlx4_ib 95932 0 mlx4_core 109008 1 mlx4_ib ib_mthca 159044 0 ib_mad 70948 5 ib_cm,ib_sa,ib_umad,mlx4_ib,ib_mthca ib_core 97664 15 rdma_ucm,qlgc_vnic,ib_sdp,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_sa,ib_uverbs,ib_umad,iw_cxgb3,ib_ipath,mlx4_ib,ib_mthca,ib_mad libiscsi 61952 1 iscsi_tcp scsi_transport_iscsi 67344 3 iscsi_tcp,libiscsi ipoib_helper 35728 2 ib_ipoib ipv6 411425 43 ib_ipoib libata 160849 1 ata_piix scsi_mod 186361 6 iscsi_tcp,libiscsi,scsi_transport_iscsi,sg,libata,sd_mod "service openibd status" reports the status is OK: [wci11-oib:~/dist_lock/ib_kernel]sudo service openibd status HCA driver loaded Configured devices: ib0 ib1 ib2 ib3 Currently active devices: ib0 ib2 The following OFED modules are loaded: rdma_ucm qlgc_vnic ib_sdp rdma_cm ib_addr ib_ipoib ib_ipath mlx4_core mlx4_ib ib_mthca ib_uverbs ib_umad ib_sa ib_cm ib_mad ib_core iw_cxgb3 I have no idea what's going on. Any suggestions? 
From wangwhao at cn.ibm.com Sun Feb 15 17:29:30 2009 From: wangwhao at cn.ibm.com (Wen Hao Wang) Date: Mon, 16 Feb 2009 09:29:30 +0800 Subject: Re: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: <1234541612.751.1.camel@firewall.xsintricity.com> Message-ID: Wen Hao Wang (王文昊) Software Engineer IBM China Software Development Laboratory Email: wangwhao at cn.ibm.com Tel: 86-10-82451055 Fax: 86-10-82782244 ext. 2312 Address: 1/F, IBM ZGC Campus. Ring Building 28, ZhongGuanCun Software Park, No.8 Dong Bei Wang West Road, Haidian District Beijing, 100193, P.R.China Doug Ledford wrote on 2009-02-14 00:13:32: > On Fri, 2009-02-13 at 08:05 +0800, Wen Hao Wang wrote: > > Doug Ledford wrote on 2009-02-12 21:20:30: > > > > > On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote: > > > > Wen Hao Wang wrote: > > > > > > > > > > Hi all: > > > > > > > > > > I changed my blade OS to RHEL5.3 yesterday and installed OFED > > (shipped > > > > > in RHEL5.3 image) by "yum groupinstall". Then I loaded some > > drivers and > > > > > wrote network interface configuration file ifcfg-ib0. ifup ib0 > > also > > > > > succeeded. But IB utilities report Connection timed out. > > > > > > > > > > > > > > > [root at xblade06 network-scripts]# sminfo > > > > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed out > > > > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > > > > > sminfo: iberror: failed: query > > > > > > > > > > I had to reboot the blade and rerun "openibd start". Then > > sminfo > > > > > reported correct contents. I do not suppose this reboot is > > required. > > > > > Did I miss any configuration step?
> > > > > > There was an unintentional bug in the rhel5.2 openibd init script in > > > that it automatically turned itself on during install (generally, > > most > > > init scripts should default to *not* turning themselves on during > > > install of the package, nor should they start themselves during > > install > > > of the package...this is for security reasons, imagine if you > > installed > > > the bind name server on your box and it automatically started up > > before > > > you had a chance to configure it). In rhel5.3 we fixed that bug. > > So, > > > > Yeah. I heard of this bug. > > > > > you may need to 'chkconfig --level 2345 openibd on' to make sure > > openibd > > > starts up each time. The error you list above is consistent with > > not > > > all of the kernel modules being loaded when you tried to use the > > sminfo > > > program. > > > > Even after reboot, service openibd is not started automatically. > > [root at xblade06 ~]# chkconfig --list openibd > > openibd 0:off 1:off 2:off 3:off 4:off 5:off 6:off > > That's because you have to run the command I listed in my first email to > turn it on. > I totally agree with this. But I am still confused why sminfo gave errors before reboot, or which steps I should take for the first OFED usage before reboot. As far as I can see, whether the service is added into system runlevel DB is not related to the sminfo error. Please correct me if that is not the case. > > I agree with you that maybe some modules were not loaded. But what's > > that? > > Before reboot, I run "/etc/init.d/openibd start" and > > "/etc/init.d/network > > restart". No error was reported. "openibd status" also looked good. > > Running start on a service does not enable that service at the next > reboot. You must specifically enable the service in order for it to > start automatically. > > > > > > > > > Moreover, "openibd start" report one warning message about > > hwconf. > > > > > Anyone has comments about this? 
> > > > > > > > > > [root at xblade07 ~]# /etc/init.d/openibd start > > > > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: No > > such > > > > > file or directory > > > > > [ OK ] > > > > > > Can you see if the kudzu package is installed on your machine? The > > > openib package uses this config file written by kudzu to determine > > what > > > hardware drivers to load. I suppose I should put a specific > > requires in > > > the rpm for that. > > > > kudzu is installed. > > [root at xblade06 ~]# rpm -q kudzu > > kudzu-1.2.57.1.21-1 > > Make sure kudzu has been run at least once then (it would appear to be > turned off on your machine or else /etc/sysconfig/hwconf would exist). > You can run it manually from the command line and that should be > sufficient for the openibd init script's needs. > Yes. After kudzu created the file on my machine, the openibd script reported no error this time. I want to know: in my scenario, is "openibd restart" needed/required? Many thanks! Wen Hao Wang Email: wangwhao at cn.ibm.com > -- > Doug Ledford > GPG KeyID: CFBFF194 > http://people.redhat.com/dledford > > Infiniband specific RPMs available at > http://people.redhat.com/dledford/Infiniband > > [Attachment "signature.asc" deleted by Wen Hao Wang/China/IBM] -------------- next part -------------- An HTML attachment was scrubbed... URL: From subbukl at gmail.com Mon Feb 16 01:51:13 2009 From: subbukl at gmail.com (subbu kl) Date: Mon, 16 Feb 2009 15:21:13 +0530 Subject: Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not working In-Reply-To: References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> Message-ID: Does anyone have any clue on this? I am seeing the same issue with a CentOS 5.2 HVM guest as well, with Xen 3.4 unstable!
~subbu On Thu, Feb 12, 2009 at 1:50 PM, subbu kl wrote: > did a quick search, > I believe it's MMIO, as it is > > in file - http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/drivers/infiniband/hw/mthca/mthca_main.c > mthca_QUERY_FW () is resulting into > > mthca_QUERY_FW() which in turn will result into mthca_cmd_post_dbell()/mthca_cmd_post_hcr() which in turn results into > __raw_writel((__force u32) cpu_to_be32(in_param >> 32), ptr + offs[0]); > > > in the file - http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/drivers/infiniband/hw/mthca/mthca_cmd.c > > OFED people should be more helpful here to comment if I have missed out > something. Roland, any clue? > > ~subbu > > > On Thu, Feb 12, 2009 at 1:31 PM, Jiang, Yunhong wrote: > >> Can you please share more information on how ib_mthca does QUERY_FW? >> Through config space access? Through MMIO access? I think more information >> will be helpful. The only thing that seems strange to me is that, from "Memory at >> fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M]", it seems the MMIO >> is disabled? >> >> Thanks >> Yunhong Jiang >> >> ------------------------------ >> *From:* subbu kl [mailto:subbukl at gmail.com] >> *Sent:* 2009-02-12 15:46 >> >> *To:* Jiang, Yunhong >> *Cc:* David Brown; xen-devel at lists.xensource.com; >> general at lists.openfabrics.org >> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not >> working >> >> So, back to square one? >> Why should QUERY_FW fail in domU? >> >> ~subbu >> >> On Thu, Feb 12, 2009 at 12:30 PM, Jiang, Yunhong > > wrote: >> >>> DomU accesses config space through pcibackend, so that message is OK.
>>> >>> ------------------------------ >>> *From:* subbu kl [mailto:subbukl at gmail.com] >>> *Sent:* 2009-02-12 14:59 >>> >>> *To:* Jiang, Yunhong >>> *Cc:* David Brown; xen-devel at lists.xensource.com; >>> general at lists.openfabrics.org >>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not >>> working >>> >>> So will getting PCI config space access in domU solve the problem? If >>> so, how can I achieve that? >>> >>> ~subbu >>> >>> On Thu, Feb 12, 2009 at 12:26 PM, Jiang, Yunhong < >>> yunhong.jiang at intel.com> wrote: >>> >>>> Sorry, it seems the original mail already tried permissive mode :$ >>>> So how will the card do the QUERY_FW command? Through config >>>> space or through MMIO? The following information is strange: why is all >>>> the MMIO range disabled? >>>> >>>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] >>>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>>> >>>> As for the following information, I think it should be harmless since >>>> domU has no config space access method. >>>> PCI: Fatal: No PCI config space access function found >>>> >>>> Thanks >>>> Yunhong Jiang >>>> >>>> ------------------------------ >>>> *From:* subbu kl [mailto:subbukl at gmail.com] >>>> *Sent:* 2009-02-12 14:43 >>>> >>>> *To:* Jiang, Yunhong >>>> *Cc:* David Brown; xen-devel at lists.xensource.com; >>>> general at lists.openfabrics.org >>>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not >>>> working >>>> >>>> oops, missed it. >>>> >>>> Well, now I don't see that enable permissive... message.
Here are the >>>> messages I got in dom0 while booting domU: >>>> >>>> tap tap-1-51712: 2 getting info >>>> pciback: vpci: 0000:0e:00.0: assign to virtual slot 0 >>>> device vif1.0 entered promiscuous mode >>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready >>>> blktap: ring-ref 9, event-channel 9, protocol 1 (x86_64-abi) >>>> PCI: Enabling device 0000:0e:00.0 (0000 -> 0002) >>>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>> PCI: Setting latency timer of device 0000:0e:00.0 to 64 >>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >>>> ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready >>>> xenbr0: topology change detected, propagating >>>> xenbr0: port 3(vif1.0) entering forwarding state >>>> >>>> Any suspicious messages? >>>> Any idea why I get this: >>>> PCI: Fatal: No PCI config space access function found >>>> rtc: IRQ 8 is not free. >>>> >>>> message in the domU boot messages? >>>> >>>> ~subbu >>>> >>>> On Thu, Feb 12, 2009 at 11:50 AM, Jiang, Yunhong < >>>> yunhong.jiang at intel.com> wrote: >>>> >>>>> So, any changes in dom0's dmesg? >>>>> >>>>> >>>>> ------------------------------ >>>>> *From:* subbu kl [mailto:subbukl at gmail.com] >>>>> *Sent:* 2009-02-12 13:52 >>>>> *To:* Jiang, Yunhong >>>>> *Cc:* David Brown; xen-devel at lists.xensource.com; >>>>> general at lists.openfabrics.org >>>>> *Subject:* Re: [Xen-devel] Re: [ofa-general] Fwd: pciback module not >>>>> working >>>>> >>>>> No luck! >>>>> dmesg in the Xen PV guest shows: >>>>> >>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) >>>>> ib_mthca: Initializing 0000:00:00.0 >>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting.
>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>>>> >>>>> even after executing the following in dom0: >>>>> >>>>> #echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/permissive >>>>> >>>>> I am getting the following messages on the console as part of the >>>>> initial boot messages of the guest: >>>>> >>>>> Started domain rhel52_64_3 >>>>> PCI: Fatal: No PCI config space access function found >>>>> rtc: IRQ 8 is not free. >>>>> i8042.c: No controller found. >>>>> >>>>> after executing the following in dom0: >>>>> #xm create -c rhel52_64_3 >>>>> >>>>> >>>>> So, the problem persists. >>>>> >>>>> ~subbu >>>>> >>>>> >>>>> 2009/2/12 Jiang, Yunhong >>>>> >>>>>> It seems this is because the PCI frontend tries to write some configuration >>>>>> space that PCIback has no config_field entry to support. >>>>>> I think you can first try doing what dom0's dmesg suggested: "see >>>>>> permissive attribute in sysfs" (it should be "set permissive attribute...", >>>>>> I think). >>>>>> >>>>>> BTW, where did you get the following log? It seems to suggest that the config space >>>>>> access function was not found. >>>>>> >>>>>> PCI: Fatal: No PCI config space access function found >>>>>> rtc: IRQ 8 is not free. >>>>>> i8042.c: No controller found." >>>>>> >>>>>> -- Yunhong Jiang >>>>>> >>>>>> ------------------------------ >>>>>> *From:* xen-devel-bounces at lists.xensource.com [mailto: >>>>>> xen-devel-bounces at lists.xensource.com] *On Behalf Of *subbu kl >>>>>> *Sent:* 2009-02-11 22:18 >>>>>> *To:* David Brown >>>>>> *Cc:* xen-devel at lists.xensource.com; general at lists.openfabrics.org >>>>>> *Subject:* [Xen-devel] Re: [ofa-general] Fwd: pciback module not >>>>>> working >>>>>> >>>>>> I am getting the same QUERY_FW failure on RHEL5.2 with a Xen >>>>>> paravirtualized guest with the pciback module. >>>>>> >>>>>> No one seems to have tried answering this question on the list, so let me >>>>>> ping xen-devel and ofed people again.
>>>>>> >>>>>> after executing in dom0 >>>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/ib_mthca/unbind >>>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/new_slot >>>>>> echo -n 0000:0e:00.0 > /sys/bus/pci/drivers/pciback/bind >>>>>> >>>>>> #dmesg >>>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >>>>>> tap tap-1-51712: 2 getting info >>>>>> tap tap-2-51712: 2 getting info >>>>>> pciback 0000:0e:00.0: seizing device >>>>>> PCI: Enabling device 0000:0e:00.0 (0140 -> 0142) >>>>>> ACPI: PCI Interrupt 0000:0e:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>>>> ACPI: PCI interrupt for device 0000:0e:00.0 disabled >>>>>> >>>>>> #xm create -c rhel52_64_3 >>>>>> >>>>>> PCI: Fatal: No PCI config space access function found >>>>>> rtc: IRQ 8 is not free. >>>>>> i8042.c: No controller found. >>>>>> >>>>>> >>>>>> GUEST dmesg: >>>>>> >>>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008) >>>>>> ib_mthca: Initializing 0000:00:00.0 >>>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. >>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>>>>> >>>>>> in dom0: >>>>>> Feb 11 19:44:37 p128 kernel: tap tap-3-51712: 2 getting info >>>>>> Feb 11 19:44:37 p128 kernel: pciback: vpci: 0000:0e:00.0: assign to >>>>>> virtual slot 0 >>>>>> Feb 11 19:44:37 p128 kernel: device vif3.0 entered promiscuous mode >>>>>> Feb 11 19:44:37 p128 kernel: ADDRCONF(NETDEV_UP): vif3.0: link is not >>>>>> ready >>>>>> Feb 11 19:44:39 p128 kernel: blktap: ring-ref 9, event-channel 9, >>>>>> protocol 1 (x86_64-abi) >>>>>> Feb 11 19:44:48 p128 kernel: pciback 0000:0e:00.0: Driver tried to >>>>>> write to a read-only configuration space field at offset 0x44, size 2. 
This >>>>>> may be harmless, but if you have problems with your device: >>>>>> Feb 11 19:44:48 p128 kernel: 1) see permissive attribute in sysfs >>>>>> Feb 11 19:44:48 p128 kernel: 2) report problems to the xen-devel >>>>>> mailing list along with details of your device obtained from lspci. >>>>>> Feb 11 19:44:48 p128 kernel: PCI: Enabling device 0000:0e:00.0 (0000 >>>>>> -> 0002) >>>>>> Feb 11 19:44:48 p128 kernel: ACPI: PCI Interrupt 0000:0e:00.0[A] -> >>>>>> GSI 16 (level, low) -> IRQ 16 >>>>>> Feb 11 19:44:49 p128 kernel: ACPI: PCI interrupt for device >>>>>> 0000:0e:00.0 disabled >>>>>> >>>>>> >>>>>> >>>>>> some more details - [root at p128 ~]# rpm -qa | grep xen >>>>>> kernel-xen-2.6.18-92.1.22.el5 >>>>>> xen-3.0.3-64.el5_2.9 >>>>>> xen-libs-3.0.3-64.el5_2.9 >>>>>> xen-libs-3.0.3-64.el5_2.9 >>>>>> >>>>>> [root at p128 ~]# ibv_devinfo >>>>>> hca_id: mthca0 >>>>>> fw_ver: 5.3.0 >>>>>> node_guid: 0002:c902:0022:cd48 >>>>>> sys_image_guid: 0002:c902:0022:cd4b >>>>>> vendor_id: 0x02c9 >>>>>> vendor_part_id: 25218 >>>>>> hw_ver: 0x20 >>>>>> board_id: MT_0370130002 >>>>>> phys_port_cnt: 2 >>>>>> port: 1 >>>>>> state: PORT_INIT (2) >>>>>> max_mtu: 2048 (4) >>>>>> active_mtu: 512 (2) >>>>>> sm_lid: 0 >>>>>> port_lid: 0 >>>>>> port_lmc: 0x00 >>>>>> >>>>>> port: 2 >>>>>> state: PORT_DOWN (1) >>>>>> max_mtu: 2048 (4) >>>>>> active_mtu: 512 (2) >>>>>> sm_lid: 0 >>>>>> port_lid: 0 >>>>>> port_lmc: 0x00 >>>>>> >>>>>> >>>>>> any help greatly appreciated. >>>>>> >>>>>> ~subbu >>>>>> >>>>>> On Sat, Oct 18, 2008 at 4:54 AM, David Brown wrote: >>>>>> >>>>>>> Okay so my question to the openfabrics guys is, why would the OFED >>>>>>> drivers fail to read the firmware? >>>>>>> >>>>>>> Any thoughts? 
>>>>>>> >>>>>>> Thanks, >>>>>>> - David Brown >>>>>>> >>>>>>> >>>>>>> ---------- Forwarded message ---------- >>>>>>> From: David Brown >>>>>>> Date: Thu, Sep 11, 2008 at 2:24 PM >>>>>>> Subject: pciback module not working >>>>>>> To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com >>>>>>> >>>>>>> >>>>>>> This issue was brought up about a year and a half ago. So I'll bring >>>>>>> it up again and see if anything happens. >>>>>>> >>>>>>> I've got an infiniband network and am attempting to pass the >>>>>>> infiniband card through the host and give it to the guest. >>>>>>> I'm working with standard CentOS 5.2 on both guest and host with >>>>>>> their >>>>>>> provided xen (3.0.3 ish). I've also attempted to install the newest >>>>>>> Xen 3.3 and use their standard host kernel and that did the same >>>>>>> thing. The guest dmesg output in the guest is similar on both >>>>>>> permissive and normal mode. >>>>>>> >>>>>>> I'm getting issues with detecting the firmware on the card for some >>>>>>> reason... >>>>>>> >>>>>>> Any help would be appreciated. >>>>>>> >>>>>>> Thanks, >>>>>>> - David Brown >>>>>>> >>>>>>> === GUEST dmesg output === >>>>>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008) >>>>>>> ib_mthca: Initializing 0000:00:00.0 >>>>>>> PCI: Enabling device 0000:00:00.0 (0000 -> 0002) >>>>>>> PCI: Setting latency timer of device 0000:00:00.0 to 64 >>>>>>> ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. 
>>>>>>> ib_mthca: probe of 0000:00:00.0 failed with error -11 >>>>>>> ======================= >>>>>>> >>>>>>> === Host modprobe.conf === >>>>>>> alias eth0 bnx2 >>>>>>> alias eth1 bnx2 >>>>>>> alias scsi_hostadapter cciss >>>>>>> options pciback hide=(41:00.0) >>>>>>> ===================== >>>>>>> >>>>>>> === Host lspci output === >>>>>>> # lspci -vs 41:00.0 >>>>>>> 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx >>>>>>> HCA] (rev 20) >>>>>>> Subsystem: Hewlett-Packard Company Unknown device 170a >>>>>>> Flags: fast devsel, IRQ 16 >>>>>>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] >>>>>>> [size=1M] >>>>>>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>>>>>> Capabilities: [40] Power Management version 2 >>>>>>> Capabilities: [48] Vital Product Data >>>>>>> Capabilities: [90] Message Signalled Interrupts: 64bit+ >>>>>>> Queue=0/5 Enable- >>>>>>> Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 >>>>>>> Capabilities: [60] Express Endpoint IRQ 0 >>>>>>> ===================== >>>>>>> >>>>>>> This makes sure it get loaded first off before anything else. >>>>>>> === Host mkinitrd cmd === >>>>>>> # mkinitrd -f --with=pciback --preload pciback >>>>>>> /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen >>>>>>> ==================== >>>>>>> >>>>>>> === Host pciback dmesg === >>>>>>> pciback 0000:41:00.0: Driver tried to write to a read-only >>>>>>> configuration space field at offset 0x44, size 2. This may be >>>>>>> harmless, but if you have problems with your device: >>>>>>> 1) see permissive attribute in sysfs >>>>>>> 2) report problems to the xen-devel mailing list along with details >>>>>>> of >>>>>>> your device obtained from lspci. 
>>>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002) >>>>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64 >>>>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled >>>>>>> ====================== >>>>>>> >>>>>>> === Host pciback dmesg (after setting it permissive) === >>>>>>> pciback 0000:41:00.0: enabling permissive mode configuration space >>>>>>> accesses! >>>>>>> pciback 0000:41:00.0: permissive mode is potentially unsafe! >>>>>>> pciback: vpci: 0000:41:00.0: assign to virtual slot 0 >>>>>>> device vif1.0 entered promiscuous mode >>>>>>> ADDRCONF(NETDEV_UP): vif1.0: link is not ready >>>>>>> blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi) >>>>>>> PCI: Enabling device 0000:41:00.0 (0000 -> 0002) >>>>>>> ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 >>>>>>> PCI: Setting latency timer of device 0000:41:00.0 to 64 >>>>>>> ACPI: PCI interrupt for device 0000:41:00.0 disabled >>>>>>> ========================================= >>>>>>> >>>>>>> === Guest lspci output === >>>>>>> # lspci -v >>>>>>> 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx >>>>>>> HCA] (rev 20) >>>>>>> Subsystem: Hewlett-Packard Company Unknown device 170a >>>>>>> Flags: fast devsel, IRQ 16 >>>>>>> Memory at fdc00000 (64-bit, non-prefetchable) [disabled] >>>>>>> [size=1M] >>>>>>> Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] >>>>>>> Capabilities: [40] Power Management version 2 >>>>>>> Capabilities: [48] Vital Product Data >>>>>>> Capabilities: [90] Message Signalled Interrupts: 64bit+ >>>>>>> Queue=0/5 Enable- >>>>>>> Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 >>>>>>> Capabilities: [60] Express Endpoint IRQ 0 >>>>>>> ===================== >>>>>>> _______________________________________________ >>>>>>> general mailing list >>>>>>> general at lists.openfabrics.org >>>>>>> 
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>>>> >>>>>>> To unsubscribe, please visit >>>>>>> http://openib.org/mailman/listinfo/openib-general >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> . . . s u b b u >>>>>> "You've got to be original, because if you're like someone else, what >>>>>> do they need you for?" >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> . . . s u b b u >>>>> "You've got to be original, because if you're like someone else, what >>>>> do they need you for?" >>>>> >>>>> >>>> >>>> >>>> -- >>>> . . . s u b b u >>>> "You've got to be original, because if you're like someone else, what do >>>> they need you for?" >>>> >>>> >>> >>> >>> -- >>> . . . s u b b u >>> "You've got to be original, because if you're like someone else, what do >>> they need you for?" >>> >>> >> >> >> -- >> . . . s u b b u >> "You've got to be original, because if you're like someone else, what do >> they need you for?" >> >> > > > -- > . . . s u b b u > "You've got to be original, because if you're like someone else, what do > they need you for?" > -- . . . s u b b u "You've got to be original, because if you're like someone else, what do they need you for?" -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vlad at lists.openfabrics.org Mon Feb 16 03:18:51 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 16 Feb 2009 03:18:51 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090216-0200 daily build status Message-ID: <20090216111851.95C9AE6106E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on 
ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From tziporet at dev.mellanox.co.il Mon Feb 16 03:19:14 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 16 Feb 2009 13:19:14 +0200 Subject: [ofa-general] IB function calls in kernel module fail In-Reply-To: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com> References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com> Message-ID: <49994BB2.3010206@mellanox.co.il> neutron wrote: > Hi all, > > I'm writing a kernel module that make use of basic IB verbs to > communicate, like: > ib_register_client, ib_unregister_client, ib_alloc_pd, > ib_create_qp, ib_reg_phys_mr, etc. > > I can compile the code into a kernel module: ib_rdma_lat.ko. This > module is to test the RDMA write latency from kernel module. 
> > But when I "insmod", I got error reports at /var/log/messages: > > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_unregister_client > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_unregister_client > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_create_cq > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_cq > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_reg_phys_mr > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_reg_phys_mr > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_dereg_mr > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dereg_mr > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_register_client > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_register_client > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_destroy_cq > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_cq > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_query_port > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_query_port > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_alloc_pd > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_alloc_pd > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_create_qp > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_qp > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_modify_qp > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_modify_qp > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of > symbol ib_destroy_qp > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_qp > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees 
about version of > symbol ib_dealloc_pd > Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dealloc_pd > > I'm running rhel5. I have rebooted the node many times but didn't > help at all. > > From OFED_tips: 4. External Module Compilation Over OFED-1.4 =============================================================================== To build kernel modules depending on OFED's modules, take the Module.symvers file from /src/openib/Module.symvers (part of the kernel-ib-devel RPM), copy it to your module's subdirectory, and then compile your module. If /src/openib/Module.symvers does not exist or is empty, use the create_Module.symvers.sh script (part of the ofed-docs RPM) to create the Module.symvers file. See "Module versioning & Module.symvers" in modules.txt from the kernel documentation (e.g. linux-2.6.20/Documentation/kbuild/modules.txt). Tziporet From john.russo at qlogic.com Mon Feb 16 05:59:18 2009 From: john.russo at qlogic.com (John Russo) Date: Mon, 16 Feb 2009 07:59:18 -0600 Subject: [ofa-general] ***SPAM*** Clearing port counters Message-ID: "When accessing port counters through /sys/class/infiniband//ports//counters/ is there a way to clear the value in a counter (or the values in multiple counters)?" __________________________ John F. Russo Manager, Engineering QLogic Corporation 780 Fifth Avenue, Suite 140 King of Prussia, PA 19406 Direct: 610-233-4866 Main: 610-233-4800 Fax: 610-233-4777 Cell: 610-246-9903 Email: John.Russo at qlogic.com www.qlogic.com True success is the undeniable truth that we have proved ourselves. -Joe Luppino-Esposito -------------- next part -------------- A non-text attachment was scrubbed...
Name: image001.jpg Type: image/jpeg Size: 3677 bytes Desc: image001.jpg URL: From hal.rosenstock at gmail.com Mon Feb 16 06:09:56 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 16 Feb 2009 09:09:56 -0500 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Clearing port counters In-Reply-To: References: Message-ID: On Mon, Feb 16, 2009 at 8:59 AM, John Russo wrote: > "When accessing port counters through > /sys/class/infiniband//ports//counters/ > is there a way to clear the value in a counter (or the values in multiple > counters)?" Only via MADs AFAIK. -- Hal > __________________________ > John F. Russo > Manager, Engineering > QLogic Corporation > 780 Fifth Avenue, Suite 140 > King of Prussia, PA 19406 > Direct: 610-233-4866 > Main: 610-233-4800 > Fax: 610-233-4777 > Cell: 610-246-9903 > Email: John.Russo at qlogic.com > www.qlogic.com > > > > True success is the undeniable truth that we have proved ourselves. > > -Joe Luppino-Esposito > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Mon Feb 16 06:52:37 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 16 Feb 2009 09:52:37 -0500 Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread In-Reply-To: <20081231170413.GD21950@sashak.voltaire.com> References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> Message-ID: Sasha, On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky wrote: > > I looked at implementation of safe_*() functions (safe_smp_query, > safe_smp_set and safe_ca_call) and found that they are not actually > "safe" as declared by its names. 
The only thread-unsafe thing which > is used there is the static 'mad_portid' structure (from rpc.c), I'm not sure that the only thread-unsafe thing in the mad rpc mechanism is the portid. > but modification of this structure is not protected by the same mutex (actually > not protected at all). A first step would be to stop keeping the portid in a static variable. If so, portid would need to be a supplied parameter to various mad routines, and the existing ones relying on madrpc_portid would be deprecated. Does this make sense to do? Would you accept such a patch? -- Hal > As far as I know nothing uses those safe_*() primitives right now outside > libibmad, so I think it is better to remove these confusing functions from > the API (with changing library version, etc.). > > The primitives madrpc_lock() and madrpc_unlock() are just wrappers around a > hidden static pthread mutex which is not controlled by the caller > application. I think that it will be more robust for a multithreaded > application to use its own synchronization methods (pthread mutex or any > other) for better control. So let's remove madrpc_lock/unlock() too.
> > Signed-off-by: Sasha Khapyorsky > --- > libibmad/include/infiniband/mad.h | 41 ------------------------------------- > libibmad/libibmad.ver | 2 +- > libibmad/src/libibmad.map | 2 - > libibmad/src/rpc.c | 15 ------------- > libibmad/src/sa.c | 5 ++- > 5 files changed, 4 insertions(+), 61 deletions(-) > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > index eff6738..89b4be5 100644 > --- a/libibmad/include/infiniband/mad.h > +++ b/libibmad/include/infiniband/mad.h > @@ -703,8 +703,6 @@ void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, > void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > int num_classes); > void madrpc_save_mad(void *madbuf, int len); > -void madrpc_lock(void); > -void madrpc_unlock(void); > void madrpc_show_errors(int set); > > void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t *id, unsigned attrid, > uint8_t * smp_set_via(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, > unsigned timeout, const void *srcport); > > -inline static uint8_t * > -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid, unsigned mod, > - unsigned timeout) > -{ > - uint8_t *p; > - > - madrpc_lock(); > - p = smp_query(rcvbuf, portid, attrid, mod, timeout); > - madrpc_unlock(); > - > - return p; > -} > - > -inline static uint8_t * > -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid, unsigned mod, > - unsigned timeout) > -{ > - uint8_t *p; > - > - madrpc_lock(); > - p = smp_set(rcvbuf, portid, attrid, mod, timeout); > - madrpc_unlock(); > - > - return p; > -} > - > /* sa.c */ > uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, > unsigned timeout); > @@ -761,19 +733,6 @@ int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id, > int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > ibmad_gid_t destgid, ib_portid_t 
*sm_id, void *buf); > > -inline static uint8_t * > -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, > - unsigned timeout) > -{ > - uint8_t *p; > - > - madrpc_lock(); > - p = sa_call(rcvbuf, portid, sa, timeout); > - madrpc_unlock(); > - > - return p; > -} > - > /* resolve.c */ > int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); > int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, > diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver > index 7e93c16..23d2dc2 100644 > --- a/libibmad/libibmad.ver > +++ b/libibmad/libibmad.ver > @@ -6,4 +6,4 @@ > # API_REV - advance on any added API > # RUNNING_REV - advance any change to the vendor files > # AGE - number of backward versions the API still supports > -LIBVERSION=5:0:4 > +LIBVERSION=2:0:0 > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > index 927e51c..f944d86 100644 > --- a/libibmad/src/libibmad.map > +++ b/libibmad/src/libibmad.map > @@ -72,14 +72,12 @@ IBMAD_1.3 { > madrpc; > madrpc_def_timeout; > madrpc_init; > - madrpc_lock; > madrpc_portid; > madrpc_rmpp; > madrpc_save_mad; > madrpc_set_retries; > madrpc_set_timeout; > madrpc_show_errors; > - madrpc_unlock; > ib_path_query; > sa_call; > sa_rpc_call; > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > index 5226540..670a936 100644 > --- a/libibmad/src/rpc.c > +++ b/libibmad/src/rpc.c > @@ -38,7 +38,6 @@ > #include > #include > #include > -#include > #include > #include > > @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data) > return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); > } > > -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; > - > -void > -madrpc_lock(void) > -{ > - pthread_mutex_lock(&rpclock); > -} > - > -void > -madrpc_unlock(void) > -{ > - pthread_mutex_unlock(&rpclock); > -} > - > void > madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) > { > diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c > index 
27b9d52..c601254 100644 > --- a/libibmad/src/sa.c > +++ b/libibmad/src/sa.c > @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid, > if (srcport) { > p = sa_rpc_call (srcport, buf, sm_id, &sa, 0); > } else { > - p = safe_sa_call(buf, sm_id, &sa, 0); > + p = sa_call(buf, sm_id, &sa, 0); > } > if (!p) { > IBWARN("sa call path_query failed"); > @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid, > mad_decode_field(p, IB_SA_PR_DLID_F, &dlid); > return dlid; > } > + > int > ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id, void *buf) > { > - return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf); > + return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf); > } > -- > 1.6.0.4.766.g6fc4a > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tzachid at mellanox.co.il Mon Feb 16 05:17:22 2009 From: tzachid at mellanox.co.il (Tzachi Dar) Date: Mon, 16 Feb 2009 15:17:22 +0200 Subject: [ofa-general] RE: [ofw] ib_create_qp and ib_get_err_str weirdness In-Reply-To: <01fa01c98df0$47baed30$0100000a@DIEGO> References: <01fa01c98df0$47baed30$0100000a@DIEGO> Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01C93FE9@mtlexch01.mtl.com> Hi Diego, If you know the hardware you are working with, it seems you can find (by experiment) the maximum number of SGEs that you can actually use (probably around 29). You can then limit your work requests to that number of SGEs. Depending on whether you are in user space or in the kernel, you may also be able to use buffers that share contiguous memory. (I don't know how much control you have over the buffers, so this is just a suggestion.)
Thanks Tzachi > -----Original Message----- > From: ofw-bounces at lists.openfabrics.org > [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Diego Guella > Sent: Friday, February 13, 2009 5:32 PM > To: ofw at lists.openfabrics.org; OpenIB General > Subject: [ofw] ib_create_qp and ib_get_err_str weirdness > > Hello, > > I am using Mellanox WinOF 2.0.0 with a MHES14-XTC SDR > single-port card. > I noticed a strange behavior of ib_create_qp function: > > ----- > memset(&qp_create, 0, sizeof(qp_create)); qp_create.qp_type = > IB_QPT_RELIABLE_CONN; // Reliable Connected > qp_create.sq_depth = ctx->qdepth; qp_create.rq_depth = > ctx->qdepth; qp_create.sq_sge = ctx->hca_attr->max_sges; > qp_create.rq_sge = ctx->hca_attr->max_sges; qp_create.h_sq_cq > = ctx->cq_h; qp_create.h_rq_cq = ctx->cq_h; qp_create.h_srq = > NULL; qp_create.sq_signaled = 1; > ctx->qp_h = 0; > rc = ib_create_qp(ctx->pd_h, &qp_create, NULL, NULL, &ctx->qp_h); > ----- > return value ("rc") is 3 (=IB_INVALID_PARAMETER). > > I spent some time figuring out the problem was the SQ SGE value: > http://lists.openfabrics.org/pipermail/general/2006-June/023417.html > > According to iba/ib_al.h: > ----- > * IB_INVALID_MAX_SGE > * The requested maximum number of scatter-gather entries for > the send or > * receive queue could not be supported. > ----- > so, why the return value isn't 22 (=IB_INVALID_MAX_SGE)? > > In the discussion I mentioned, it turned out that even using > hca_attr->max_sges there is the possibility that ib_create_qp fails. > Which is my case. > I have the need to send some audio buffers (32 or more) from > an IO node to a computing node using RDMA WRITE. > The ownership of the buffers is of the audio driver, and I > haven't the guarantee that the audio buffers are contiguous. > I was trying and send them using the lowest possible number > of WR, each one with the highest possible number of sge. 
> But, given the hca_attr->max_sge unreliability, how do you > recommend to achieve this goal? > Should I post a WR for each buffer I'd want to send through > RDMA WRITE? > > > Another less-related problem: > ib_get_err_str is not correct for every input value, for > example I noticed that for > ib_get_err_str(IB_INVALID_PD_HANDLE) the string returned is > IB_INVALID_MR_HANDLE > > > I don't know if these problems apply to linux too, so I'm > including general list. > > Thanks and best regards, > Diego > > _______________________________________________ > ofw mailing list > ofw at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw > From neutronsharc at gmail.com Mon Feb 16 07:11:21 2009 From: neutronsharc at gmail.com (neutron) Date: Mon, 16 Feb 2009 10:11:21 -0500 Subject: [ofa-general] IB function calls in kernel module fail In-Reply-To: <49994BB2.3010206@mellanox.co.il> References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com> <49994BB2.3010206@mellanox.co.il> Message-ID: <7d5928b30902160711rc24d11epd5827ad548a2256b@mail.gmail.com> The problem solved following your advice. Thanks a ton!! On Mon, Feb 16, 2009 at 6:19 AM, Tziporet Koren wrote: > neutron wrote: >> >> Hi all, >> >> I'm writing a kernel module that make use of basic IB verbs to >> communicate, like: >> ib_register_client, ib_unregister_client, ib_alloc_pd, >> ib_create_qp, ib_reg_phys_mr, etc. >> >> I can compile the code into a kernel module: ib_rdma_lat.ko. This >> module is to test the RDMA write latency from kernel module. 
>> >> But when I "insmod", I got error reports at /var/log/messages: >> >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_unregister_client >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol >> ib_unregister_client >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_create_cq >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_cq >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_reg_phys_mr >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_reg_phys_mr >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_dereg_mr >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dereg_mr >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_register_client >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol >> ib_register_client >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_destroy_cq >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_cq >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_query_port >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_query_port >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_alloc_pd >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_alloc_pd >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_create_qp >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_qp >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_modify_qp >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_modify_qp >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_destroy_qp >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_qp >> Feb 15 
16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_dealloc_pd >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dealloc_pd >> >> I'm running rhel5. I have rebooted the node many times but didn't >> help at all. >> >> > > From OFED_tips: > 4. External Module Compilation Over OFED-1.4 > =============================================================================== > > To build kernel modules depending on OFED's modules, take the > Modules.symvers > file from /src/openib/Module.symvers (part of the kernel-ib-devel > RPM) > and copy it to the modules subdir and then compile your module. > > If /src/openib/Module.symvers does not exist or it is empty, use the > create_Module.symvers.sh (a part of the ofed-docs RPM) script to create the > Module.symvers file. > > See "Module versioning & Module.symvers" in the modules.txt from kernel > documentation (e.g. linux-2.6.20/Documentation/kbuild/modules.txt). > > > Tziporet > > From neutronsharc at gmail.com Mon Feb 16 07:32:50 2009 From: neutronsharc at gmail.com (neutron) Date: Mon, 16 Feb 2009 10:32:50 -0500 Subject: [ofa-general] IB function calls in kernel module fail In-Reply-To: <49994BB2.3010206@mellanox.co.il> References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com> <49994BB2.3010206@mellanox.co.il> Message-ID: <7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com> One remaining question. In my code of kernel module, do I need to #include the header files from /src/openib/include/.... Or I just include the header files from /include/..... Thanks! On Mon, Feb 16, 2009 at 6:19 AM, Tziporet Koren wrote: > neutron wrote: >> >> Hi all, >> >> I'm writing a kernel module that make use of basic IB verbs to >> communicate, like: >> ib_register_client, ib_unregister_client, ib_alloc_pd, >> ib_create_qp, ib_reg_phys_mr, etc. >> >> I can compile the code into a kernel module: ib_rdma_lat.ko. This >> module is to test the RDMA write latency from kernel module. 
>> >> But when I "insmod", I got error reports at /var/log/messages: >> >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_unregister_client >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol >> ib_unregister_client >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_create_cq >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_cq >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_reg_phys_mr >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_reg_phys_mr >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_dereg_mr >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dereg_mr >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_register_client >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol >> ib_register_client >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_destroy_cq >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_cq >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_query_port >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_query_port >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_alloc_pd >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_alloc_pd >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_create_qp >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_create_qp >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_modify_qp >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_modify_qp >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_destroy_qp >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_destroy_qp >> Feb 15 
16:33:28 wci11 kernel: ib_rdma_lat: disagrees about version of >> symbol ib_dealloc_pd >> Feb 15 16:33:28 wci11 kernel: ib_rdma_lat: Unknown symbol ib_dealloc_pd >> >> I'm running rhel5. I have rebooted the node many times but didn't >> help at all. >> >> > > From OFED_tips: > 4. External Module Compilation Over OFED-1.4 > =============================================================================== > > To build kernel modules depending on OFED's modules, take the > Modules.symvers > file from /src/openib/Module.symvers (part of the kernel-ib-devel > RPM) > and copy it to the modules subdir and then compile your module. > > If /src/openib/Module.symvers does not exist or it is empty, use the > create_Module.symvers.sh (a part of the ofed-docs RPM) script to create the > Module.symvers file. > > See "Module versioning & Module.symvers" in the modules.txt from kernel > documentation (e.g. linux-2.6.20/Documentation/kbuild/modules.txt). > > > Tziporet > > From dledford at redhat.com Mon Feb 16 09:49:33 2009 From: dledford at redhat.com (Doug Ledford) Date: Mon, 16 Feb 2009 12:49:33 -0500 Subject: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: References: Message-ID: <1234806573.751.74.camel@firewall.xsintricity.com> On Mon, 2009-02-16 at 09:29 +0800, Wen Hao Wang wrote: > > Wen Hao Wang (王文昊) > > Software Engineer > IBM China Software Development Laboratory > Email: wangwhao at cn.ibm.com > Tel: 86-10-82451055 > Fax: 86-10-82782244 ext. 2312 > Address: 1/F, IBM ZGC Campus. 
Ring Building 28,ZhongGuanCun Software > Park,No.8 Dong Bei Wang West Road, Haidian District Beijing, 100193, > P.R.China > > > Doug Ledford wrote on 2009-02-14 00:13:32: > > > On Fri, 2009-02-13 at 08:05 +0800, Wen Hao Wang wrote: > > > Doug Ledford wrote on 2009-02-12 21:20:30: > > > > > > > On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote: > > > > > Wen Hao Wang wrote: > > > > > > > > > > > > Hi all: > > > > > > > > > > > > I changed my blade OS to RHEL5.3 yesterday and installed > OFED > > > (shipped > > > > > > in RHEL5.3 image) by "yum groupinstall". Then I load some > > > drivers and > > > > > > wrote network interface configuration file ifcfg-ib0. ifup > ib0 > > > also > > > > > > succeeded. But IB utilities report Connection timed out. > > > > > > > > > > > > > > > > > > [root at xblade06 network-scripts]# sminfo > > > > > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed > out > > > > > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > > > > > > sminfo: iberror: failed: query > > > > > > > > > > > > I had to reboot the blade and rerun "openibd start". Then > > > sminfo > > > > > > reported correct contents. I do not suppose this reboot is > > > required. > > > > > > Did I miss any configuration step? > > > > > > > > There was an unintentional bug in the rhel5.2 openibd init > script in > > > > that it automatically turned itself on during install > (generally, > > > most > > > > init scripts should default to *not* turning themselves on > during > > > > install of the package, nor should they start themselves during > > > install > > > > of the package...this is for security reasons, imagine if you > > > installed > > > > the bind name server on your box and it automatically started up > > > before > > > > you had a chance to configure it). In rhel5.3 we fixed that > bug. > > > So, > > > > > > Yeah. I heard of this bug. > > > > > > > you may need to 'chkconfig --level 2345 openibd on' to make sure > > > openibd > > > > starts up each time.
The error you list above is consistent > with > > > not > > > > all of the kernel modules being loaded when you tried to use the > > > sminfo > > > > program. > > > > > > Even after reboot, service openibd is not started automatically. > > > [root at xblade06 ~]# chkconfig --list openibd > > > openibd 0:off 1:off 2:off 3:off 4:off 5:off > 6:off > > > > That's because you have to run the command I listed in my first > email to > > turn it on. > > > > I totally agree with this. But I am still confused why sminfo gave > errors > before reboot, or which steps I should take for the first OFED usage > before > reboot. As far as I can see, whether the service is added into system > runlevel DB is not related to the sminfo error. Please correct me if > that > is not the case. It is related. The runlevel db is only consulted on boot up. If the openibd service was not enabled at startup, then adding it to the runlevel startup does *not* start it at that time. You have to both add it to the runlevel startup and also start it manually if you want things to work properly prior to reboot. The sminfo errors you first posted are consistent with some of the modules not being loaded, and it went away after you started the openibd service, which is also consistent with the problem. > > > I agree with you that maybe some modules were not loaded. But > what's > > > that? > > > Before reboot, I run "/etc/init.d/openibd start" and > > > "/etc/init.d/network > > > restart". No error was reported. "openibd status" also looked > good. > > > > Running start on a service does not enable that service at the next > > reboot. You must specifically enable the service in order for it to > > start automatically. > > > > > > > > > > > > Moreover, "openibd start" report one warning message about > > > hwconf. > > > > > > Anyone has comments about this? 
> > > > > > > > > > > > [root at xblade07 ~]# /etc/init.d/openibd start > > > > > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: > No > > > such > > > > > > file or directory > > > > > > [ OK ] > > > > > > > > Can you see if the kudzu package is installed on your machine? > The > > > > openib package uses this config file written by kudzu to > determine > > > what > > > > hardware drivers to load. I suppose I should put a specific > > > requires in > > > > the rpm for that. > > > > > > kudzu is installed. > > > [root at xblade06 ~]# rpm -q kudzu > > > kudzu-1.2.57.1.21-1 > > > > Make sure kudzu has been run at least once then (it would appear to > be > > turned off on your machine or else /etc/sysconfig/hwconf would > exist). > > You can run it manually from the command line and that should be > > sufficient for the openibd init script's needs. > > > > Yes. After kudzu created the file on my machine, openibd script had no > error > this time. I want to know in my scenario, is "openibd restart" > needed/required? It would probably be advisable, but only if you haven't rebooted since running kudzu for the first time. If you've rebooted since then, then it doesn't matter. > Many thanks! > > Wen Hao Wang > Email: wangwhao at cn.ibm.com > > > -- > > Doug Ledford > > GPG KeyID: CFBFF194 > > http://people.redhat.com/dledford > > > > Infiniband specific RPMs available at > > http://people.redhat.com/dledford/Infiniband > > > > [Attachment "signature.asc" deleted by Wen Hao Wang/China/IBM] > -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From wangwhao at cn.ibm.com Mon Feb 16 16:31:57 2009 From: wangwhao at cn.ibm.com (Wen Hao Wang) Date: Tue, 17 Feb 2009 08:31:57 +0800 Subject: ***SPAM*** Re: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: <1234806573.751.74.camel@firewall.xsintricity.com> Message-ID: OK, Doug: Thanks a lot for your detailed explanation! So if I do not want to reboot the machine, I need to run "chkconfig", "kudzu" and "openibd start". Wen Hao Wang Email: wangwhao at cn.ibm.com Doug Ledford wrote on 2009-02-17 01:49:33: > On Mon, 2009-02-16 at 09:29 +0800, Wen Hao Wang wrote: > > > > Wen Hao Wang > > > > Software Engineer > > IBM China Software Development Laboratory > > Email: wangwhao at cn.ibm.com > > Tel: 86-10-82451055 > > Fax: 86-10-82782244 ext. 2312 > > Address: 1/F, IBM ZGC Campus. Ring Building 28,ZhongGuanCun Software > > Park,No.8 Dong Bei Wang West Road, Haidian District Beijing, 100193, > > P.R.China > > > > > > Doug Ledford wrote on 2009-02-14 00:13:32: > > > > > On Fri, 2009-02-13 at 08:05 +0800, Wen Hao Wang wrote: > > > > Doug Ledford wrote on 2009-02-12 21:20:30: > > > > > > > > > On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote: > > > > > > Wen Hao Wang wrote: > > > > > > > > > > > > > > Hi all: > > > > > > > > > > > > > > I changed my blade OS to RHEL5.3 yesterday and installed > > OFED > > > > (shipped > > > > > > > in RHEL5.3 image) by "yum groupinstall". Then I load some > > > > drivers and > > > > > > > wrote network interface configuration file ifcfg-ib0. ifup > > ib0 > > > > also > > > > > > > succeeded. But IB utilities report Connection timed out.
> > > > > > > > > > > > > > > > > > > > > [root at xblade06 network-scripts]# sminfo > > > > > > > ibwarn: [32593] _do_madrpc: recv failed: Connection timed > > out > > > > > > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid 9) > > > > > > > sminfo: iberror: failed: query > > > > > > > > > > > > > > I had to reboot the blade and rerun "openibd start". Then > > > > sminfo > > > > > > > reported correct contents. I do not suppose this reboot is > > > > required. > > > > > > > Did I miss any configuration step? > > > > > > > > > > There was an unintentional bug in the rhel5.2 openibd init > > script in > > > > > that it automatically turned itself on during install > > (generally, > > > > most > > > > > init scripts should default to *not* turning themselves on > > during > > > > > install of the package, nor should they start themselves during > > > > install > > > > > of the package...this is for security reasons, imagine if you > > > > installed > > > > > the bind name server on your box and it automatically started up > > > > before > > > > > you had a chance to configure it). In rhel5.3 we fixed that > > bug. > > > > So, > > > > > > > > Yeah. I heard of this bug. > > > > > > > > > you may need to 'chkconfig --level 2345 openibd on' to make sure > > > > openibd > > > > > starts up each time. The error you list above is consistent > > with > > > > not > > > > > all of the kernel modules being loaded when you tried to use the > > > > sminfo > > > > > program. > > > > > > > > Even after reboot, service openibd is not started automatically. > > > > [root at xblade06 ~]# chkconfig --list openibd > > > > openibd 0:off 1:off 2:off 3:off 4:off 5:off > > 6:off > > > > > > That's because you have to run the command I listed in my first > > email to > > > turn it on. > > > > > > > I totally agree with this. But I am still confused why sminfo gave > > errors > > before reboot, or which steps I should take for the first OFED usage > > before > > reboot. 
As far as I can see, whether the service is added into system > > runlevel DB is not related to the sminfo error. Please correct me if > > that > > is not the case. > > It is related. The runlevel db is only consulted on boot up. If the > openibd service was not enabled at startup, then adding it to the > runlevel startup does *not* start it at that time. You have to both add > it to the runlevel startup and also start it manually if you want things > to work properly prior to reboot. The sminfo errors you first posted > are consistent with some of the modules not being loaded, and it went > away after you started the openibd service, which is also consistent > with the problem. > > > > > I agree with you that maybe some modules were not loaded. But > > what's > > > > that? > > > > Before reboot, I run "/etc/init.d/openibd start" and > > > > "/etc/init.d/network > > > > restart". No error was reported. "openibd status" also looked > > good. > > > > > > Running start on a service does not enable that service at the next > > > reboot. You must specifically enable the service in order for it to > > > start automatically. > > > > > > > > > > > > > > > Moreover, "openibd start" report one warning message about > > > > hwconf. > > > > > > > Anyone has comments about this? > > > > > > > > > > > > > > [root at xblade07 ~]# /etc/init.d/openibd start > > > > > > > Loading OpenIB kernel modules:grep: /etc/sysconfig/hwconf: > > No > > > > such > > > > > > > file or directory > > > > > > > [ OK ] > > > > > > > > > > Can you see if the kudzu package is installed on your machine? > > The > > > > > openib package uses this config file written by kudzu to > > determine > > > > what > > > > > hardware drivers to load. I suppose I should put a specific > > > > requires in > > > > > the rpm for that. > > > > > > > > kudzu is installed. 
> > > > [root at xblade06 ~]# rpm -q kudzu > > > > kudzu-1.2.57.1.21-1 > > > > > > Make sure kudzu has been run at least once then (it would appear to > > be > > > turned off on your machine or else /etc/sysconfig/hwconf would > > exist). > > > You can run it manually from the command line and that should be > > > sufficient for the openibd init script's needs. > > > > > > > Yes. After kudzu created the file on my machine, openibd script had no > > error > > this time. I want to know in my scenario, is "openibd restart" > > needed/required? > > It would probably be advisable, but only if you haven't rebooted since > running kudzu for the first time. If you've rebooted since then, then > it doesn't matter. > > > Many thanks! > > > > Wen Hao Wang > > Email: wangwhao at cn.ibm.com > > > > > -- > > > Doug Ledford > > > GPG KeyID: CFBFF194 > > > http://people.redhat.com/dledford > > > > > > Infiniband specific RPMs available at > > > http://people.redhat.com/dledford/Infiniband > > > > > > [Attachment "signature.asc" deleted by Wen Hao Wang/China/IBM] > > > -- > Doug Ledford > GPG KeyID: CFBFF194 > http://people.redhat.com/dledford > > Infiniband specific RPMs available at > http://people.redhat.com/dledford/Infiniband > > [Attachment "signature.asc" deleted by Wen Hao Wang/China/IBM] -------------- next part -------------- An HTML attachment was scrubbed... URL: From dledford at redhat.com Mon Feb 16 18:40:08 2009 From: dledford at redhat.com (Doug Ledford) Date: Mon, 16 Feb 2009 21:40:08 -0500 Subject: [ofa-general] sminfo report iberror in the first configuration on RHEL5.3 In-Reply-To: References: Message-ID: <1234838408.751.96.camel@firewall.xsintricity.com> On Tue, 2009-02-17 at 08:31 +0800, Wen Hao Wang wrote: > OK, Doug: > > Thanks a lot for your detailed explanation! So if I do not want to > reboot the machine, I need to run "chkconfig", "kudzu" and "openibd > start". Correct.
> Wen Hao Wang > Email: wangwhao at cn.ibm.com > > > Doug Ledford wrote on 2009-02-17 01:49:33: > > > On Mon, 2009-02-16 at 09:29 +0800, Wen Hao Wang wrote: > > > > > > Wen Hao Wang > > > > > > Software Engineer > > > IBM China Software Development Laboratory > > > Email: wangwhao at cn.ibm.com > > > Tel: 86-10-82451055 > > > Fax: 86-10-82782244 ext. 2312 > > > Address: 1/F, IBM ZGC Campus. Ring Building 28,ZhongGuanCun > Software > > > Park,No.8 Dong Bei Wang West Road, Haidian District Beijing, > 100193, > > > P.R.China > > > > > > > > > Doug Ledford wrote on 2009-02-14 00:13:32: > > > > > > > On Fri, 2009-02-13 at 08:05 +0800, Wen Hao Wang wrote: > > > > > Doug Ledford wrote on 2009-02-12 21:20:30: > > > > > > > > > > > On Thu, 2009-02-12 at 13:20 +0200, Tziporet Koren wrote: > > > > > > > Wen Hao Wang wrote: > > > > > > > > > > > > > > > > Hi all: > > > > > > > > > > > > > > > > I changed my blade OS to RHEL5.3 yesterday and installed > > > OFED > > > > > (shipped > > > > > > > > in RHEL5.3 image) by "yum groupinstall". Then I load > some > > > > > drivers and > > > > > > > > wrote network interface configuration file ifcfg-ib0. > ifup > > > ib0 > > > > > also > > > > > > > > succeeded. But IB utilities report Connection timed out. > > > > > > > > > > > > > > > > > > > > > > > > [root at xblade06 network-scripts]# sminfo > > > > > > > > ibwarn: [32593] _do_madrpc: recv failed: Connection > timed > > > out > > > > > > > > ibwarn: [32593] mad_rpc: _do_madrpc failed; dport (Lid > 9) > > > > > > > > sminfo: iberror: failed: query > > > > > > > > > > > > > > > > I had to reboot the blade and rerun "openibd start". > Then > > > > > sminfo > > > > > > > > reported correct contents. I do not suppose this reboot > is > > > > > required. > > > > > > > > Did I miss any configuration step?
> > > > > > > > > > > > There was an unintentional bug in the rhel5.2 openibd init > > > script in > > > > > > that it automatically turned itself on during install > > > (generally, > > > > > most > > > > > > init scripts should default to *not* turning themselves on > > > during > > > > > > install of the package, nor should they start themselves > during > > > > > install > > > > > > of the package...this is for security reasons, imagine if > you > > > > > installed > > > > > > the bind name server on your box and it automatically > started up > > > > > before > > > > > > you had a chance to configure it). In rhel5.3 we fixed that > > > bug. > > > > > So, > > > > > > > > > > Yeah. I heard of this bug. > > > > > > > > > > > you may need to 'chkconfig --level 2345 openibd on' to make > sure > > > > > openibd > > > > > > starts up each time. The error you list above is consistent > > > with > > > > > not > > > > > > all of the kernel modules being loaded when you tried to use > the > > > > > sminfo > > > > > > program. > > > > > > > > > > Even after reboot, service openibd is not started > automatically. > > > > > [root at xblade06 ~]# chkconfig --list openibd > > > > > openibd 0:off 1:off 2:off 3:off 4:off 5:off > > > 6:off > > > > > > > > That's because you have to run the command I listed in my first > > > email to > > > > turn it on. > > > > > > > > > > I totally agree with this. But I am still confused why sminfo gave > > > errors > > > before reboot, or which steps I should take for the first OFED > usage > > > before > > > reboot. As far as I can see, whether the service is added into > system > > > runlevel DB is not related to the sminfo error. Please correct me > if > > > that > > > is not the case. > > > > It is related. The runlevel db is only consulted on boot up. If > the > > openibd service was not enabled at startup, then adding it to the > > runlevel startup does *not* start it at that time. 
You have to both > add > > it to the runlevel startup and also start it manually if you want > things > > to work properly prior to reboot. The sminfo errors you first > posted > > are consistent with some of the modules not being loaded, and it > went > > away after you started the openibd service, which is also consistent > > with the problem. > > > > > > > I agree with you that maybe some modules were not loaded. But > > > what's > > > > > that? > > > > > Before reboot, I run "/etc/init.d/openibd start" and > > > > > "/etc/init.d/network > > > > > restart". No error was reported. "openibd status" also looked > > > good. > > > > > > > > Running start on a service does not enable that service at the > next > > > > reboot. You must specifically enable the service in order for > it to > > > > start automatically. > > > > > > > > > > > > > > > > > > Moreover, "openibd start" report one warning message > about > > > > > hwconf. > > > > > > > > Anyone has comments about this? > > > > > > > > > > > > > > > > [root at xblade07 ~]# /etc/init.d/openibd start > > > > > > > > Loading OpenIB kernel > modules:grep: /etc/sysconfig/hwconf: > > > No > > > > > such > > > > > > > > file or directory > > > > > > > > [ OK ] > > > > > > > > > > > > Can you see if the kudzu package is installed on your > machine? > > > The > > > > > > openib package uses this config file written by kudzu to > > > determine > > > > > what > > > > > > hardware drivers to load. I suppose I should put a specific > > > > > requires in > > > > > > the rpm for that. > > > > > > > > > > kudzu is installed. > > > > > [root at xblade06 ~]# rpm -q kudzu > > > > > kudzu-1.2.57.1.21-1 > > > > > > > > Make sure kudzu has been run at least once then (it would appear > to > > > be > > > > turned off on your machine or else /etc/sysconfig/hwconf would > > > exist). > > > > You can run it manually from the command line and that should be > > > > sufficient for the openibd init script's needs. > > > > > > > > > > Yes. 
After kudzu created the file on my machine, openibd script > had no > > > error > > > this time. I want to know in my scenario, is "openibd restart" > > > needed/required? > > > > It would probably be advisable, but only if you haven't rebooted > since > > running kudzu for the first time. If you've rebooted since then, > then > > it doesn't matter. > > > > > Many thanks! > > > > > > Wen Hao Wang > > > Email: wangwhao at cn.ibm.com > > > > > > > -- > > > > Doug Ledford > > > > GPG KeyID: CFBFF194 > > > > http://people.redhat.com/dledford > > > > > > > > Infiniband specific RPMs available at > > > > http://people.redhat.com/dledford/Infiniband > > > > > > > > [Attachment "signature.asc" deleted by Wen Hao Wang/China/IBM] > > > > > -- > > Doug Ledford > > GPG KeyID: CFBFF194 > > http://people.redhat.com/dledford > > > > Infiniband specific RPMs available at > > http://people.redhat.com/dledford/Infiniband > > > > [Attachment "signature.asc" deleted by Wen Hao Wang/China/IBM] > -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From sfr at canb.auug.org.au Mon Feb 16 23:44:03 2009 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Tue, 17 Feb 2009 18:44:03 +1100 Subject: [ofa-general] linux-next: infiniband tree build warning Message-ID: <20090217184403.7e1f18c5.sfr@canb.auug.org.au> Hi Roland, Today's linux-next build (x86_64 allmodconfig) produced these warnings: drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'build_rdma_recv': drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift Caused by commit 1557b4f052cb739a4ae1dd9641249b3e69fb6a0d ("RDMA/cxgb3: Remove modulo math from build_rdma_recv()"). -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available URL: From tziporet at dev.mellanox.co.il Tue Feb 17 01:57:52 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 17 Feb 2009 11:57:52 +0200 Subject: [ofa-general] IB function calls in kernel module fail In-Reply-To: <7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com> References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com> <49994BB2.3010206@mellanox.co.il> <7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com> Message-ID: <499A8A20.1090507@mellanox.co.il> neutron wrote: > One remaining question. > > In my code of kernel module, do I need to #include the header files > from /src/openib/include/.... > Or I just include the header files from /include/..... > > You should use the headers from ofed if you wish to use OFED kernel modules. Tziporet From vlad at lists.openfabrics.org Tue Feb 17 03:19:45 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 17 Feb 2009 03:19:45 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090217-0200 daily build status Message-ID: <20090217111945.49D8AE61047@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 
Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From Bert.Wiegers at t-systems-sfr.com Tue Feb 17 04:23:21 2009 From: Bert.Wiegers at t-systems-sfr.com (Wiegers, Bert) Date: Tue, 17 Feb 2009 13:23:21 +0100 Subject: [ofa-general] opensm logoutput Message-ID: Hi, we are using the ofed 1.4 /w OpenSM 3.2.5_20081207 with a Switch from SUN. As we are debugging our System I'm trying to understand the opensm.log's. (Where can I find any documentation to that?) 
We see frequent messages as follows:

Feb 17 10:25:34 134964 [41802940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:111 TID:0x000000000000006e
Feb 17 10:25:34 169578 [41802940] 0x02 -> osm_report_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:111 GID:fe80::14:4fa4:cff8:50
Feb 17 10:25:39 088014 [43806940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:65 (GID out of service) from LID:336 GID:fe80::3:ba00:100:3341
Feb 17 10:25:39 088030 [43806940] 0x02 -> __osm_drop_mgr_remove_port: Removed port with GUID:0x00144fa4cff8000d LID range [1047, 1047] of node:MT25408 ConnectX Mellanox Technologies
Feb 17 10:25:39 614565 [43806940] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP
Feb 17 10:25:46 662611 [41802940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:111 TID:0x000000000000006f
Feb 17 10:25:46 662703 [41802940] 0x02 -> osm_report_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:111 GID:fe80::14:4fa4:cff8:50
Feb 17 10:25:48 097096 [43806940] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
Feb 17 10:25:52 476653 [44007940] 0x01 -> __osm_sm_mad_ctrl_rcv_callback: ERR 3111: Error status = 0x1C00
Feb 17 10:25:52 476729 [44007940] 0x01 -> SMP dump:
        base_ver................0x1
        mgmt_class..............0x81
        class_ver...............0x1
        method..................0x81 (SubnGetResp)
        D bit...................0x1
        status..................0x1C00
        hop_ptr.................0x0
        hop_count...............0x4
        trans_id................0x18c08de
        attr_id.................0x15 (PortInfo)
        resv....................0x0
        attr_mod................0x6
        m_key...................0x0000000000000000
        dr_slid.................65535
        dr_dlid.................65535
        Initial path: 0,1,10,15,23
        Return path: 0,23,20,12,17
        Reserved: [0][0][0][0][0][0][0]
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        00 00 00 00 00 00 00 00 00 00 00 00 11 03 03 02
        34 52 00 23 40 40 00 08 08 04 F0 4C 00 00 00 00
        00 00 00 00 00 88 00 00 00 00 00 00 00 00 00 00

Other issues I see with messages similar to the following ones:

__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies) po
__osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
osm_vendor_send: ERR 5430: Send p_madw = 0x116d320 of size 256 failed -5 (Invalid argument)

I'm still googling, but hopefully someone can give me some answers.

Thanks and best regards
Bert

From kliteyn at dev.mellanox.co.il Tue Feb 17 04:41:12 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 17 Feb 2009 14:41:12 +0200
Subject: [ofa-general] [PATCH] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node
Message-ID: <499AB068.2020205@dev.mellanox.co.il>

Hi Sasha,

This patch fixes bugzilla issue #1515:

Topology:

   |---------------------------|
   |            SW2            |
   |---------------------------|
     |x    |y       |z    |v
     |     |        |     |
     |a    |b       |c    |d
 |-------------| |-------------|
 |     SW1     | |     SW3     |
 |-------------| |-------------|
        |               |
   HCA with SM         HCA

During the discovery:
  SM sends NodeInfo request to SW1
  SM sends NodeInfo request to SW2 through link a->x
  SM discovers new node SW2:
    - updates DR to SW2 to go through link a->x
    - creates physp x
  SM sends NodeInfo request to SW2 through link b->y
  SM discovers a known node SW2:
    - DOES NOT create physp y
    - updates DR to SW2 to go through link b->y

From now on, the DR to SW2 goes through port y, so OpenSM won't deal with port y any more, leaving it uninitialized (no physp object for this port).

The fix is to create physp for the newly discovered port of the known switch node, same way as it is done for HCAs.
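The discovery sequence and fix described above can be modelled in a few lines of plain C. This is a sketch under stated assumptions: `struct node`, `struct physp`, `ensure_physp()` and `process_existing_switch()` are hypothetical simplifications, not OpenSM's `osm_node_t`/`osm_physp_t` API, and port numbers 1 and 2 stand in for the x and y labels in the topology. It shows the invariant the patch restores: every port through which a NodeInfo response arrives ends up with a per-port object, not only the port of first discovery.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_PORTS 36 /* hypothetical upper bound for this sketch */

/* Hypothetical, heavily simplified stand-ins for osm_physp_t / osm_node_t. */
struct physp {
	uint8_t port_num;
};

struct node {
	uint64_t guid;
	struct physp *physp[MAX_PORTS + 1]; /* indexed by port number, 0 unused */
	uint8_t dr_port;                    /* port the directed route enters on */
};

/* Return the per-port object, creating it the first time the port is seen. */
static struct physp *ensure_physp(struct node *n, uint8_t port_num)
{
	assert(port_num >= 1 && port_num <= MAX_PORTS);
	if (!n->physp[port_num]) {
		struct physp *p = calloc(1, sizeof(*p));
		if (!p)
			return NULL;
		p->port_num = port_num;
		n->physp[port_num] = p;
	}
	return n->physp[port_num];
}

/* NodeInfo for an already-known switch arrived through port_num.
 * Without the fix only dr_port would be updated; with it, the arrival
 * port also gets a physp if it does not have one yet. */
static int process_existing_switch(struct node *n, uint8_t port_num)
{
	if (!ensure_physp(n, port_num))
		return -1;
	n->dr_port = port_num;
	return 0;
}
```

If the `ensure_physp()` call is dropped from `process_existing_switch()`, the second discovery moves `dr_port` to port 2 while `physp[2]` stays NULL, which is the uninitialized-port state the bug report describes.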
I also added one log message for the case that showed the problem - when one of the link sides is uninitialized (no valid ports check). Perhaps this log message should be an error message instead?

Signed-off-by: Yevgeny Kliteynik
---
 opensm/opensm/osm_node_info_rcv.c |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index c52c0d5..7da3103 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
 	 */
 	if (!osm_node_link_has_valid_ports(p_node, port_num,
 					   p_neighbor_node,
-					   p_ni_context->port_num))
+					   p_ni_context->port_num)) {
+		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+			"Link at node 0x%" PRIx64 ", port %u - no valid ports\n",
+			cl_ntoh64(osm_node_get_node_guid(p_node)), port_num);
 		goto _exit;
+	}
 
 	if (osm_node_link_exists(p_node, port_num,
 				 p_neighbor_node, p_ni_context->port_num)) {
@@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm,
 				     IN osm_node_t * const p_node,
 				     IN const osm_madw_t * const p_madw)
 {
+
+	ib_smp_t *p_smp;
+	ib_node_info_t *p_ni;
+	uint8_t port_num;
+
 	OSM_LOG_ENTER(sm->p_log);
 
+	p_smp = osm_madw_get_smp_ptr(p_madw);
+	p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp);
+	port_num = ib_node_info_get_local_port_num(p_ni);
+
+	if (!osm_node_get_physp_ptr(p_node, port_num)) {
+		OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+			"Creating physp for node GUID:0x%"
+			PRIx64 ", port %u\n",
+			cl_ntoh64(osm_node_get_node_guid(p_node)),
+			port_num);
+		osm_node_init_physp(p_node, p_madw);
+	}
+
 	/*
 	   If this switch has already been probed during this sweep,
 	   then don't bother reprobing it.
--
1.5.1.4

From Bert.Wiegers at t-systems-sfr.com Tue Feb 17 05:54:00 2009
From: Bert.Wiegers at t-systems-sfr.com (Wiegers, Bert)
Date: Tue, 17 Feb 2009 14:54:00 +0100
Subject: [ofa-general] osmtest fails
Message-ID:

Hi.
I can't start osmtest (using ofed 3.2.5 with opensm running on one node - no other subnet managers):

# osmtest -f c
Command Line Arguments
Done with args
Flow = Create Inventory
Feb 17 14:46:49 769646 [AB76BF80] 0x7f -> Setting log level to: 0x03
Feb 17 14:46:49 769830 [AB76BF80] 0x02 -> osm_vendor_init: 1000 pending umads specified
Feb 17 14:46:49 783700 [AB76BF80] 0x02 -> osm_vendor_bind: Binding to port 0x3ba0001003341
Feb 17 14:46:49 801051 [AB76BF80] 0x02 -> osmtest_validate_sa_class_port_info:
-----------------------------
SA Class Port Info:
 base_ver:1
 class_ver:2
 cap_mask:0x2602
 cap_mask2:0x0
 resp_time_val:0x10
-----------------------------
Feb 17 14:46:53 952476 [41001940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x12 attr=0x35 trans_id=0x7300000004) -- dropping
Feb 17 14:46:53 952521 [41001940] 0x01 -> umad_receiver: ERR 5410: class 0x3 LID 0x150
Feb 17 14:46:53 952535 [41001940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT)
Feb 17 14:46:53 956429 [AB76BF80] 0x01 -> osmtest_get_all_recs: ERR 0004: ib_query failed (IB_TIMEOUT)
Feb 17 14:46:53 956475 [AB76BF80] 0x01 -> osmtest_write_all_path_recs: ERR 0025: osmtest_get_all_recs failed (IB_TIMEOUT)
Feb 17 14:46:53 956543 [AB76BF80] 0x01 -> osmtest_run: ERR 0139: Inventory file create failed (IB_TIMEOUT)
OSMTEST: TEST "Create Inventory" FAIL

In the Logfile opensm.log I can see:

Feb 17 14:46:54 412573 [42804940] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x8688560 of size 75064952 failed -5 (Invalid argument)
Feb 17 14:46:54 420846 [42804940] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Feb 17 14:46:55 534830 [42003940] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x2aaab7f791d0 of size 75064952 failed -5 (Invalid argument)
Feb 17 14:46:55 546577 [42003940] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Feb 17 14:46:56 483555 [41001940] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x4b97a10 of size 75064952 failed -5 (Invalid argument)
Feb 17 14:46:56 493298 [41001940] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Feb 17 14:47:02 042134 [44007940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x92 attr=0x35 trans_id=0x7300000004) -- dropping
Feb 17 14:47:02 042168 [44007940] 0x01 -> umad_receiver: ERR 5410: class 0x3 LID 0x150
Feb 17 14:47:02 042187 [44007940] 0x01 -> umad_receiver: ERR 5412: Failed to obtain request madw for timed out MAD(method=0x92 attr=0x35 tid=0x7300000004) -- dropping

Why can't it be initialized?

Best regards,
Bert

From neutronsharc at gmail.com Tue Feb 17 06:50:21 2009
From: neutronsharc at gmail.com (neutron)
Date: Tue, 17 Feb 2009 09:50:21 -0500
Subject: [ofa-general] ***SPAM*** ib_reg_phys_mr( ) results in crash
Message-ID: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com>

Hi all,

In my kernel module program, a call to ib_reg_phys_mr( ) always results in a system crash. My code is like:

buf = dma_alloc_coherent(ctx->ib_dev->dma_device, MAX_SIZE, &dma_addr, GFP_KERNEL);
iovstart = (u64) send_buf;
mr = ib_reg_phys_mr(ctx->pd, dma_addr, 1,
		    IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_LOCAL_WRITE,
		    &iovstart);

Before calling ib_reg_phys_mr, printk() shows that all its arguments are valid. But the system always crashes immediately after entering the function ib_reg_phys_mr( ).

Any possible reasons?

Thanks!!

I'm using kernel 2.6.18-53.1.14.el5. My kernel module is built using OFED-1.3.1 modules.

From jackm at dev.mellanox.co.il Tue Feb 17 07:01:35 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 17 Feb 2009 17:01:35 +0200
Subject: [ofa-general] [PATCH] IPoIB: In unicast_arp, do path_free only for newly-created paths
Message-ID: <200902171701.36107.jackm@dev.mellanox.co.il>

If path_rec_start() returns error, call path_free() only if the path was newly-created.
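The rule stated in this patch summary is an ownership rule: on an error path, release only the object this call created. A minimal userspace model, assuming a plain linked list in place of the IPoIB path cache, is sketched below. `struct path`, `struct path_list`, `lookup_path()` and the `start_query` callback are hypothetical stand-ins for the roles of `__path_find()`, `path_rec_create()` and `path_rec_start()`; unlike the real `unicast_arp_send()`, the sketch ignores locking and the skb queue and only links newly created entries.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of the IPoIB path cache: a singly linked list keyed
 * by an int instead of a GID. Not the ipoib_main.c data structures. */
struct path {
	int key;
	int valid;          /* would be set when a path-record query completes */
	struct path *next;
};

struct path_list {
	struct path *head;
};

static struct path *path_find(struct path_list *l, int key)
{
	for (struct path *p = l->head; p; p = p->next)
		if (p->key == key)
			return p;
	return NULL;
}

/* start_query stands in for path_rec_start(); nonzero means failure.
 * Returns 0 if the path is usable or a query was started, -1 otherwise. */
static int lookup_path(struct path_list *l, int key,
		       int (*start_query)(struct path *))
{
	struct path *p = path_find(l, key);
	int new_path = 0;

	assert(start_query != NULL);
	if (p && p->valid)
		return 0;

	if (!p) {
		p = calloc(1, sizeof(*p));
		if (!p)
			return -1;
		p->key = key;
		new_path = 1;
	}

	if (start_query(p)) {
		/* An existing entry is still linked into the list; freeing it
		 * here would leave a dangling pointer behind. Only release
		 * what this call allocated. */
		if (new_path)
			free(p);
		return -1;
	}

	if (new_path) {         /* link newly created entries only */
		p->next = l->head;
		l->head = p;
	}
	return 0;
}
```

Run under a memory checker, removing the `if (new_path)` guard makes the third call in a "fail, succeed, fail" sequence free a node that `l->head` still points to, which is the userspace analogue of the path-list corruption the patch describes.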
If we free an existing path whose valid flag was zero (but do not detach it from the list), we corrupt the path list of which it is a member and get a kernel crash.

The simplest solution is to not free an existing path -- just leave it in the list as-is (i.e., with its valid flag cleared).

Thanks to Yossi Etigin of Voltaire for identifying the problem flow which caused the kernel crash.

Signed-off-by: Jack Morgenstein
Signed-off-by: Moni Shua
---
Roland,
I ran checkpatch.pl on this, and compiled it with Sparse. However, I would still like to continue using KMail. If you have any editing/formatting problems with the patch, please let me know.
The patch was generated by git diff against your kernel git/master branch.

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 0bd2a4f..2c8b15f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -660,8 +660,11 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 
 	path = __path_find(dev, phdr->hwaddr + 4);
 	if (!path || !path->valid) {
-		if (!path)
+		int new_path = 0;
+		if (!path) {
 			path = path_rec_create(dev, phdr->hwaddr + 4);
+			new_path = 1;
+		}
 		if (path) {
 			/* put pseudoheader back on for next time */
 			skb_push(skb, sizeof *phdr);
@@ -669,7 +672,8 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 
 			if (!path->query && path_rec_start(dev, path)) {
 				spin_unlock_irqrestore(&priv->lock, flags);
-				path_free(dev, path);
+				if (new_path)
+					path_free(dev, path);
 				return;
 			} else
 				__path_add(dev, path);

From jackm at dev.mellanox.co.il Tue Feb 17 07:42:38 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 17 Feb 2009 17:42:38 +0200
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver
Message-ID: <200902171742.38223.jackm@dev.mellanox.co.il>

We have found a race condition in sysfs.c which occurs when unloading
low-level modules (e.g., mlx4_ib) in the driver.

Specifically: Although the kernel takes reference counts on sysfs files, it does not take such counts on modules which implement attribute reads. For example, we have:

static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
			      char *buf)
{
	struct port_table_attribute *tab_attr =
		container_of(attr, struct port_table_attribute, attr);
	u16 pkey;
	ssize_t ret;

====>race condition HERE *****
	ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
	if (ret)
		return ret;

	return sprintf(buf, "0x%04x\n", pkey);
}

The sysfs file /sys/class/infiniband//ports/1/pkey/ is protected from destruction while we are in show_port_pkey. However, the underlying module which implements ib_query_pkey (in this case, mlx4_ib) is not. Thus, if another process is busy unloading mlx4_ib, and the time-slice of the process which is reading sysfs expires at the point indicated above in the code, ib_query_pkey() will fail with a page fault (kernel panic), since the code page it jumps to (the low-level driver's query_pkey() method, reached through the driver's virtual function table) is no longer there.

Now, when a low-level driver is unloaded, the following procedure (in sysfs.c) is called:

void ib_device_unregister_sysfs(struct ib_device *device)
{
	struct kobject *p, *t;
	struct ib_port *port;

	list_for_each_entry_safe(p, t, &device->port_list, entry) {
		list_del(&p->entry);
		port = container_of(p, struct ib_port, kobj);
		mutex_lock(&port->mutex);
		port->valid = 0;
		sysfs_remove_group(p, &pma_group);
		sysfs_remove_group(p, &port->pkey_group);
		sysfs_remove_group(p, &port->gid_group);
		mutex_unlock(&port->mutex);
		kobject_put(p);
	}

	kobject_put(device->ports_parent);
	device_unregister(&device->dev);
}

After this call, the kernel continues with unloading the low-level module. However, until device_unregister(&device->dev) is invoked, the sysfs attribute path for the low-level device is still valid.
Hence the race condition:

Process A                               Process B
---------                               ---------------
1. starts unloading low-level mod
                                        2. cat /sys/class/infiniband/...
                                        3. Time slice runs out just before
                                           accessing low-level module for
                                           requested info.
4. Low-level module is fully unloaded
                                        5. Page-fault panic when trying to
                                           access a procedure in the
                                           just-unloaded module.

Some attempt was made for some (but not all) of the "show" procedures to check if the module is alive:

	if (!ibdev_is_alive(p->ibdev))
		return -ENODEV;

This narrows the race window considerably, but does not eliminate it. (I put this fix in show_port_pkey(), and was still able to generate the kernel panic.)

The only way I was able to eliminate the kernel panic entirely was via a mutex (declaration and init not shown):

static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
			      char *buf)
{
	struct port_table_attribute *tab_attr =
		container_of(attr, struct port_table_attribute, attr);
	u16 pkey;
	ssize_t ret;

	mutex_lock(&p->mutex);
==>	if (p->valid)
		ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
	else
		ret = -EINVAL;
==>	mutex_unlock(&p->mutex);
	if (ret)
		return ret;

	return sprintf(buf, "0x%04x\n", pkey);
}

and:

void ib_device_unregister_sysfs(struct ib_device *device)
{
	struct kobject *p, *t;
	struct ib_port *port;

	list_for_each_entry_safe(p, t, &device->port_list, entry) {
		list_del(&p->entry);
		port = container_of(p, struct ib_port, kobj);
==>		mutex_lock(&port->mutex);
		port->valid = 0;
		sysfs_remove_group(p, &pma_group);
		sysfs_remove_group(p, &port->pkey_group);
		sysfs_remove_group(p, &port->gid_group);
==>		mutex_unlock(&port->mutex);
		kobject_put(p);
	}

	kobject_put(device->ports_parent);
	device_unregister(&device->dev);
}

This approach is fine for the port-based groups. What about class-device attributes themselves? I believe that the best approach is to add a sysfs_mutex to ib_device, and lock that for ALL "show" methods in this file.

Opinions?
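Jack's fix is the familiar "check a validity flag under the same lock that teardown takes" pattern. The sketch below models it in userspace with POSIX threads; `struct port_model`, `show_pkey()`, `unregister_port()` and the `query_pkey_fn` callback are hypothetical stand-ins for `struct ib_port`, `show_port_pkey()`, `ib_device_unregister_sysfs()` and the driver's `query_pkey` method, not the ib_core API. It illustrates why a bare `ibdev_is_alive()` test only narrows the window (the answer can go stale before the call), while here the test and the call into the driver are atomic with respect to unregistration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical stand-in for the low-level driver's query_pkey hook. */
typedef int (*query_pkey_fn)(int port_num, int index, unsigned short *pkey);

/* Hypothetical model of struct ib_port: the mutex serializes attribute
 * reads against unregistration; valid says whether the driver may be called. */
struct port_model {
	pthread_mutex_t mutex;
	int valid;
	int port_num;
	query_pkey_fn query_pkey;
};

static void port_model_init(struct port_model *p, int port_num,
			    query_pkey_fn fn)
{
	assert(fn != NULL);
	pthread_mutex_init(&p->mutex, NULL);
	p->valid = 1;
	p->port_num = port_num;
	p->query_pkey = fn;
}

/* Models the patched show_port_pkey(): the validity check and the call
 * into the driver happen under one lock, so unregistration cannot slip
 * in between them. */
static ssize_t show_pkey(struct port_model *p, int index, char *buf,
			 size_t len)
{
	unsigned short pkey = 0;
	int ret;

	pthread_mutex_lock(&p->mutex);
	if (p->valid)
		ret = p->query_pkey(p->port_num, index, &pkey);
	else
		ret = -EINVAL;
	pthread_mutex_unlock(&p->mutex);

	if (ret)
		return ret;
	return snprintf(buf, len, "0x%04x\n", pkey);
}

/* Models the patched unregister path: once this returns, no reader is
 * inside query_pkey and none can enter it, so the code behind the
 * pointer may go away. */
static void unregister_port(struct port_model *p)
{
	pthread_mutex_lock(&p->mutex);
	p->valid = 0;
	p->query_pkey = NULL; /* simulate the driver code disappearing */
	pthread_mutex_unlock(&p->mutex);
}
```

The per-device `sysfs_mutex` proposed for class-device attributes would presumably follow the same shape with one lock per `ib_device` rather than per port; the trade-off is that every sysfs read on that device would then serialize on a single lock.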
- Jack

From rdreier at cisco.com Tue Feb 17 09:03:15 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Feb 2009 09:03:15 -0800
Subject: [ofa-general] linux-next: infiniband tree build warning
In-Reply-To: <20090217184403.7e1f18c5.sfr@canb.auug.org.au> (Stephen Rothwell's message of "Tue, 17 Feb 2009 18:44:03 +1100")
References: <20090217184403.7e1f18c5.sfr@canb.auug.org.au>
Message-ID:

 > Today's linux-next build (x86_64 allmodconfig) produced these warnings:
 > drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'build_rdma_recv':
 > drivers/infiniband/hw/cxgb3/iwch_qp.c:266: warning: suggest parentheses around + or - inside shift

Thanks, should be fixed in the next pull of for-next.

 - R.

From arlin.r.davis at intel.com Tue Feb 17 09:06:18 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Tue, 17 Feb 2009 09:06:18 -0800
Subject: [ofa-general] [PATCH] [DAPL] scm: add support for WinOF
In-Reply-To: <6402857E406545A895F63DF7FA784D42@amr.corp.intel.com>
References: <6402857E406545A895F63DF7FA784D42@amr.corp.intel.com>
Message-ID:

Thanks, applied. Since this was in the CM code I had to do some regression testing before accepting.

>-----Original Message-----
>From: general-bounces at lists.openfabrics.org
>[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hefty, Sean
>Sent: Friday, February 13, 2009 2:55 PM
>To: Davis, Arlin R; general at lists.openfabrics.org;
>ofw at lists.openfabrics.org
>Subject: [ofa-general] [PATCH] [DAPL] scm: add support for WinOF
>
>Modify the openib_scm provider to support both OFED and WinOF releases.
>This takes advantage of having a libibverbs compatibility library.*
>
>Signed-off-by: Sean Hefty
>---
>* If only there were a sockets compatibility layer... gurgle
>This is only build tested for windows, but does run on Linux.
>
>diff --git a/Makefile.am b/Makefile.am
>index bfc93f7..5044e36 100755
>--- a/Makefile.am
>+++ b/Makefile.am
>@@ -49,7 +49,8 @@ dapl_udapl_libdaploscm_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAG
> 	-DOPENIB -DCQ_WAIT_OBJECT \
> 	-I$(srcdir)/dat/include/ -I$(srcdir)/dapl/include/ \
> 	-I$(srcdir)/dapl/common -I$(srcdir)/dapl/udapl/linux \
>-	-I$(srcdir)/dapl/openib_scm
>+	-I$(srcdir)/dapl/openib_scm \
>+	-I$(srcdir)/dapl/openib_scm/linux
> 
> if HAVE_LD_VERSION_SCRIPT
> dat_version_script = -Wl,--version-script=$(srcdir)/dat/udat/libdat2.map
>diff --git a/dapl/openib_scm/README b/dapl/openib_scm/README
>deleted file mode 100644
>index 239dfe6..0000000
>--- a/dapl/openib_scm/README
>+++ /dev/null
>@@ -1,40 +0,0 @@
>-
>-OpenIB uDAPL provider using socket-based CM, in leiu of uCM/uAT, to setup QP/channels.
>-
>-to build:
>-
>-cd dapl/udapl
>-make VERBS=openib_scm clean
>-make VERBS=openib_scm
>-
>-
>-Modifications to common code:
>-
>-- added dapl/openib_scm directory
>-
>-	dapl/udapl/Makefile
>-
>-New files for openib_scm provider
>-
>-	dapl/openib/dapl_ib_cq.c
>-	dapl/openib/dapl_ib_dto.h
>-	dapl/openib/dapl_ib_mem.c
>-	dapl/openib/dapl_ib_qp.c
>-	dapl/openib/dapl_ib_util.c
>-	dapl/openib/dapl_ib_util.h
>-	dapl/openib/dapl_ib_cm.c
>-
>-A simple dapl test just for openib_scm testing...
>-
>-	test/dtest/dtest.c
>-	test/dtest/makefile
>-
>-	server: dtest -s
>-	client: dtest -h hostname
>-
>-known issues:
>-
>-	no memory windows support in ibverbs, dat_create_rmr fails.
>-
>-
>-
>diff --git a/dapl/openib_scm/dapl_ib_cm.c b/dapl/openib_scm/dapl_ib_cm.c
>index 80a7d5e..9a15e42 100644
>--- a/dapl/openib_scm/dapl_ib_cm.c
>+++ b/dapl/openib_scm/dapl_ib_cm.c
>@@ -52,26 +52,169 @@
> #include "dapl_cr_util.h"
> #include "dapl_name_service.h"
> #include "dapl_ib_util.h"
>-
>-#include
>-#include
>-#include
>-#include
>-#include
>-#include
>-
>-#include
>-#include
>-#include
>-
>-#if __BYTE_ORDER == __LITTLE_ENDIAN
>-static inline uint64_t cpu_to_be64(uint64_t x) {return bswap_64(x);}
>-#elif __BYTE_ORDER == __BIG_ENDIAN
>-static inline uint64_t cpu_to_be64(uint64_t x) {return x;}
>-#endif
>+#include "dapl_osd.h"
> 
> extern int g_scm_pipe[2];
> 
>+#if defined(_WIN32) || defined(_WIN64)
>+enum DAPL_FD_EVENTS {
>+	DAPL_FD_READ = 0x1,
>+	DAPL_FD_WRITE = 0x2,
>+	DAPL_FD_ERROR = 0x4
>+};
>+
>+static int dapl_config_socket(DAPL_SOCKET s)
>+{
>+	unsigned long nonblocking = 1;
>+	return ioctlsocket(s, FIONBIO, &nonblocking);
>+}
>+
>+static int dapl_connect_socket(DAPL_SOCKET s, struct sockaddr *addr,
>+			       int addrlen)
>+{
>+	int err;
>+
>+	connect(s, addr, addrlen);
>+	err = WSAGetLastError();
>+	return (err == WSAEWOULDBLOCK) ? EAGAIN : err;
>+}
>+
>+struct dapl_fd_set {
>+	struct fd_set set[3];
>+};
>+
>+static struct dapl_fd_set *dapl_alloc_fd_set(void)
>+{
>+	return dapl_os_alloc(sizeof(struct dapl_fd_set));
>+}
>+
>+static void dapl_fd_zero(struct dapl_fd_set *set)
>+{
>+	FD_ZERO(&set->set[0]);
>+	FD_ZERO(&set->set[1]);
>+	FD_ZERO(&set->set[2]);
>+}
>+
>+static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set,
>+		       enum DAPL_FD_EVENTS event)
>+{
>+	FD_SET(s, &set->set[(event == DAPL_FD_READ) ? 0 : 1]);
>+	FD_SET(s, &set->set[2]);
>+	return 0;
>+}
>+
>+static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum DAPL_FD_EVENTS event)
>+{
>+	struct fd_set rw_fds;
>+	struct fd_set err_fds;
>+	struct timeval tv;
>+	int ret;
>+
>+	FD_ZERO(&rw_fds);
>+	FD_ZERO(&err_fds);
>+	FD_SET(s, &rw_fds);
>+	FD_SET(s, &err_fds);
>+
>+	tv.tv_sec = 0;
>+	tv.tv_usec = 0;
>+
>+	if (event == DAPL_FD_READ)
>+		ret = select(1, &rw_fds, NULL, &err_fds, &tv);
>+	else
>+		ret = select(1, NULL, &rw_fds, &err_fds, &tv);
>+
>+	if (ret == 0)
>+		return 0;
>+	else if (FD_ISSET(s, &rw_fds))
>+		return event;
>+	else if (FD_ISSET(s, &err_fds))
>+		return DAPL_FD_ERROR;
>+	else
>+		return WSAGetLastError();
>+}
>+
>+static int dapl_select(struct dapl_fd_set *set)
>+{
>+	return select(0, &set->set[0], &set->set[1], &set->set[2], NULL);
>+}
>+#else // _WIN32 || _WIN64
>+enum DAPL_FD_EVENTS {
>+	DAPL_FD_READ = POLLIN,
>+	DAPL_FD_WRITE = POLLOUT,
>+	DAPL_FD_ERROR = POLLERR
>+};
>+
>+static int dapl_config_socket(DAPL_SOCKET s)
>+{
>+	int ret;
>+
>+	ret = fcntl(s, F_GETFL);
>+	if (ret >= 0)
>+		ret = fcntl(s, F_SETFL, ret | O_NONBLOCK);
>+	return ret;
>+}
>+
>+static int dapl_connect_socket(DAPL_SOCKET s, struct sockaddr *addr, int addrlen)
>+{
>+	int ret;
>+
>+	ret = connect(s, addr, addrlen);
>+
>+	return (errno == EINPROGRESS) ?
EAGAIN : ret; >+} >+ >+struct dapl_fd_set { >+ int index; >+ struct pollfd set[DAPL_FD_SETSIZE]; >+}; >+ >+static struct dapl_fd_set *dapl_alloc_fd_set(void) >+{ >+ return dapl_os_alloc(sizeof(struct dapl_fd_set)); >+} >+ >+static void dapl_fd_zero(struct dapl_fd_set *set) >+{ >+ set->index = 0; >+} >+ >+static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set, >+ enum DAPL_FD_EVENTS event) >+{ >+ if (set->index == DAPL_FD_SETSIZE - 1) { >+ dapl_log(DAPL_DBG_TYPE_ERR, >+ "SCM ERR: cm_thread exceeded FD_SETSIZE %d\n", >+ set->index + 1); >+ return -1; >+ } >+ >+ set->set[set->index].fd = s; >+ set->set[set->index].revents = 0; >+ set->set[set->index++].events = event; >+ return 0; >+} >+ >+static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum >DAPL_FD_EVENTS event) >+{ >+ struct pollfd fds; >+ int ret; >+ >+ fds.fd = s; >+ fds.events = event; >+ fds.revents = 0; >+ ret = poll(&fds, 1, 0); >+ if (ret <= 0) >+ return ret; >+ >+ return fds.revents; >+} >+ >+static int dapl_select(struct dapl_fd_set *set) >+{ >+ return poll(set->set, set->index, -1); >+} >+#endif >+ > static struct ib_cm_handle *dapli_cm_create(void) > { > struct ib_cm_handle *cm_ptr; >@@ -85,7 +228,7 @@ static struct ib_cm_handle *dapli_cm_create(void) > > (void)dapl_os_memzero(cm_ptr, sizeof(*cm_ptr)); > cm_ptr->dst.ver = htons(DSCM_VER); >- cm_ptr->socket = -1; >+ cm_ptr->socket = DAPL_INVALID_SOCKET; > return cm_ptr; > bail: > dapl_os_free(cm_ptr, sizeof(*cm_ptr)); >@@ -100,8 +243,8 @@ static void dapli_cm_destroy(struct >ib_cm_handle *cm_ptr) > > /* cleanup, never made it to work queue */ > if (cm_ptr->state == SCM_INIT) { >- if (cm_ptr->socket >= 0) >- close(cm_ptr->socket); >+ if (cm_ptr->socket != DAPL_INVALID_SOCKET) >+ closesocket(cm_ptr->socket); > dapl_os_free(cm_ptr, sizeof(*cm_ptr)); > return; > } >@@ -112,9 +255,9 @@ static void dapli_cm_destroy(struct >ib_cm_handle *cm_ptr) > cm_ptr->ep->cm_handle = IB_INVALID_HANDLE; > > /* close socket if still active */ >- if (cm_ptr->socket 
>= 0) { >- close(cm_ptr->socket); >- cm_ptr->socket = -1; >+ if (cm_ptr->socket != DAPL_INVALID_SOCKET) { >+ closesocket(cm_ptr->socket); >+ cm_ptr->socket = DAPL_INVALID_SOCKET; > } > dapl_os_unlock(&cm_ptr->lock); > >@@ -172,14 +315,14 @@ >dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr) > return DAT_SUCCESS; > } else { > /* send disc date, close socket, schedule destroy */ >- if (cm_ptr->socket >= 0) { >- if (write(cm_ptr->socket, >- &disc_data, sizeof(disc_data)) == -1) >+ if (cm_ptr->socket != DAPL_INVALID_SOCKET) { >+ if (send(cm_ptr->socket, (char *) &disc_data, >+ sizeof(disc_data), 0) == -1) > dapl_log(DAPL_DBG_TYPE_WARN, > " cm_disc: write error >= %s\n", > strerror(errno)); >- close(cm_ptr->socket); >- cm_ptr->socket = -1; >+ closesocket(cm_ptr->socket); >+ cm_ptr->socket = DAPL_INVALID_SOCKET; > } > cm_ptr->state = SCM_DISCONNECTED; > } >@@ -211,7 +354,7 @@ void > dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err) > { > int len, opt = 1; >- struct iovec iovec[2]; >+ struct iovec iov[2]; > struct dapl_ep *ep_ptr = cm_ptr->ep; > > if (err) { >@@ -226,18 +369,21 @@ dapli_socket_connected(dp_ib_cm_handle_t >cm_ptr, int err) > " socket connected, write QP and private data\n"); > > /* no delay for small packets */ >- >setsockopt(cm_ptr->socket,IPPROTO_TCP,TCP_NODELAY,&opt,sizeof(opt)); >+ setsockopt(cm_ptr->socket, IPPROTO_TCP, TCP_NODELAY, >+ (char *) &opt, sizeof(opt)); > > /* send qp info and pdata to remote peer */ >- iovec[0].iov_base = &cm_ptr->dst; >- iovec[0].iov_len = sizeof(ib_qp_cm_t); >+ iov[0].iov_base = (void *) &cm_ptr->dst; >+ iov[0].iov_len = sizeof(ib_qp_cm_t); > if (cm_ptr->dst.p_size) { >- iovec[1].iov_base = cm_ptr->p_data; >- iovec[1].iov_len = ntohl(cm_ptr->dst.p_size); >+ iov[1].iov_base = cm_ptr->p_data; >+ iov[1].iov_len = ntohl(cm_ptr->dst.p_size); >+ len = writev(cm_ptr->socket, iov, 2); >+ } else { >+ len = writev(cm_ptr->socket, iov, 1); > } > >- len = writev(cm_ptr->socket, iovec, (cm_ptr->dst.p_size ? 
2:1)); >- if (len != (ntohl(cm_ptr->dst.p_size) + sizeof(ib_qp_cm_t))) { >+ if (len != (ntohl(cm_ptr->dst.p_size) + sizeof(ib_qp_cm_t))) { > dapl_log(DAPL_DBG_TYPE_ERR, > " CONN_PENDING write: ERR %s, wcnt=%d -> %s\n", > strerror(errno), len, >@@ -253,9 +399,9 @@ dapli_socket_connected(dp_ib_cm_handle_t >cm_ptr, int err) > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " connected: sending SRC GID subnet >%016llx id %016llx\n", > (unsigned long long) >- >cpu_to_be64(cm_ptr->dst.gid.global.subnet_prefix), >+ htonll(cm_ptr->dst.gid.global.subnet_prefix), > (unsigned long long) >- >cpu_to_be64(cm_ptr->dst.gid.global.interface_id)); >+ htonll(cm_ptr->dst.gid.global.interface_id)); > > /* queue up to work thread to avoid blocking consumer */ > cm_ptr->state = SCM_RTU_PENDING; >@@ -290,25 +436,23 @@ dapli_socket_connect(DAPL_EP *ep_ptr, > return DAT_INSUFFICIENT_RESOURCES; > > /* create, connect, sockopt, and exchange QP information */ >- if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) < 0 ) { >+ if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) == >DAPL_INVALID_SOCKET) { > dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); > return DAT_INSUFFICIENT_RESOURCES; > } > >- /* non-blocking */ >- ret = fcntl(cm_ptr->socket, F_GETFL); >- if (ret < 0 || fcntl(cm_ptr->socket, >- F_SETFL, ret | O_NONBLOCK) < 0) { >- dapl_log(DAPL_DBG_TYPE_ERR, >- " socket connect: fcntl on socket %d >ERR %d %s\n", >- cm_ptr->socket, ret, >- strerror(errno)); >- goto bail; >- } >+ ret = dapl_config_socket(cm_ptr->socket); >+ if (ret < 0) { >+ dapl_log(DAPL_DBG_TYPE_ERR, >+ " socket connect: config socket %d ERR %d %s\n", >+ cm_ptr->socket, ret, strerror(errno)); >+ goto bail; >+ } > > ((struct sockaddr_in*)r_addr)->sin_port = htons(r_qual); >- ret = connect(cm_ptr->socket, r_addr, sizeof(*r_addr)); >- if (ret && errno != EINPROGRESS) { >+ ret = dapl_connect_socket(cm_ptr->socket, (struct >sockaddr *) r_addr, >+ sizeof(*r_addr)); >+ if (ret && ret != EAGAIN) { > dapl_log(DAPL_DBG_TYPE_ERR, > " socket connect 
ERROR: %s -> %s r_qual %d\n", > strerror(errno), >@@ -391,16 +535,13 @@ >dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) > { > DAPL_EP *ep_ptr = cm_ptr->ep; > int len; >- struct iovec iovec[2]; > short rtu_data = htons(0x0E0F); > ib_cm_events_t event = IB_CME_DESTINATION_REJECT; > > /* read DST information into cm_ptr, overwrite SRC info */ > dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: recv peer >QP data\n"); > >- iovec[0].iov_base = &cm_ptr->dst; >- iovec[0].iov_len = sizeof(ib_qp_cm_t); >- len = readv(cm_ptr->socket, iovec, 1); >+ len = recv(cm_ptr->socket, (char *) &cm_ptr->dst, >sizeof(ib_qp_cm_t), 0); > if (len != sizeof(ib_qp_cm_t) || ntohs(cm_ptr->dst.ver) >!= DSCM_VER) { > dapl_log(DAPL_DBG_TYPE_ERR, > " CONN_RTU read: ERR %s, rcnt=%d, ver=%d -> %s\n", >@@ -456,9 +597,7 @@ dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) > /* read private data into cm_handle if any present */ > dapl_dbg_log(DAPL_DBG_TYPE_EP," socket connected, read >private data\n"); > if (cm_ptr->dst.p_size) { >- iovec[0].iov_base = cm_ptr->p_data; >- iovec[0].iov_len = cm_ptr->dst.p_size; >- len = readv(cm_ptr->socket, iovec, 1); >+ len = recv(cm_ptr->socket, cm_ptr->p_data, >cm_ptr->dst.p_size, 0); > if (len != cm_ptr->dst.p_size) { > dapl_log(DAPL_DBG_TYPE_ERR, > " CONN_RTU read pdata: ERR %s, >rcnt=%d -> %s\n", >@@ -495,7 +634,7 @@ dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) > dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: send RTU\n"); > > /* complete handshake after final QP state change */ >- if (write(cm_ptr->socket, &rtu_data, sizeof(rtu_data)) == -1) { >+ if (send(cm_ptr->socket, (char *) &rtu_data, >sizeof(rtu_data), 0) == -1) { > dapl_log(DAPL_DBG_TYPE_ERR, > " CONN_RTU: write error = %s\n", >strerror(errno)); > goto bail; >@@ -564,7 +703,7 @@ dapli_socket_listen(DAPL_IA *ia_ptr, > cm_ptr->hca = ia_ptr->hca_ptr; > > /* bind, listen, set sockopt, accept, exchange data */ >- if ((cm_ptr->socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) { >+ if ((cm_ptr->socket = 
socket(AF_INET, SOCK_STREAM, 0)) >== DAPL_INVALID_SOCKET) { > dapl_log(DAPL_DBG_TYPE_ERR, > " ERR: listen socket create: %s\n", > strerror(errno)); >@@ -572,7 +711,8 @@ dapli_socket_listen(DAPL_IA *ia_ptr, > goto bail; > } > >- >setsockopt(cm_ptr->socket,SOL_SOCKET,SO_REUSEADDR,&opt,sizeof(opt)); >+ setsockopt(cm_ptr->socket, SOL_SOCKET, SO_REUSEADDR, >+ (char *) &opt, sizeof(opt)); > addr.sin_port = htons(serviceID); > addr.sin_family = AF_INET; > addr.sin_addr.s_addr = INADDR_ANY; >@@ -625,7 +765,7 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) > > (void) dapl_os_memzero(acm_ptr, sizeof(*acm_ptr)); > >- acm_ptr->socket = -1; >+ acm_ptr->socket = DAPL_INVALID_SOCKET; > acm_ptr->sp = cm_ptr->sp; > acm_ptr->hca = cm_ptr->hca; > >@@ -633,7 +773,7 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) > acm_ptr->socket = accept(cm_ptr->socket, > (struct >sockaddr*)&acm_ptr->dst.ia_address, > (socklen_t*)&len); >- if (acm_ptr->socket < 0) { >+ if (acm_ptr->socket == DAPL_INVALID_SOCKET) { > dapl_log(DAPL_DBG_TYPE_ERR, > " accept: ERR %s on FD %d l_cr %p\n", > strerror(errno),cm_ptr->socket,cm_ptr); >@@ -664,7 +804,7 @@ >dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr) > dapl_dbg_log(DAPL_DBG_TYPE_EP," socket accepted, read >QP data\n"); > > /* read in DST QP info, IA address. 
check for private data */ >- len = read(acm_ptr->socket, &acm_ptr->dst, sizeof(ib_qp_cm_t)); >+ len = recv(acm_ptr->socket, (char *) &acm_ptr->dst, >sizeof(ib_qp_cm_t), 0); > if (len != sizeof(ib_qp_cm_t) || > ntohs(acm_ptr->dst.ver) != DSCM_VER) { > dapl_log(DAPL_DBG_TYPE_ERR, >@@ -700,8 +840,7 @@ >dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr) > > /* read private data into cm_handle if any present */ > if (acm_ptr->dst.p_size) { >- len = read( acm_ptr->socket, >- acm_ptr->p_data, acm_ptr->dst.p_size); >+ len = recv(acm_ptr->socket, acm_ptr->p_data, >acm_ptr->dst.p_size, 0); > if (len != acm_ptr->dst.p_size) { > dapl_log(DAPL_DBG_TYPE_ERR, > " accept read pdata: ERR >%s, rcnt=%d\n", >@@ -757,14 +896,14 @@ dapli_socket_accept_usr(DAPL_EP *ep_ptr, > DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; > dp_ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; > ib_qp_cm_t local; >- struct iovec iovec[2]; >+ struct iovec iov[2]; > int len; > > if (p_size > IB_MAX_REP_PDATA_SIZE) > return DAT_LENGTH_ERROR; > > /* must have a accepted socket */ >- if (cm_ptr->socket < 0) >+ if (cm_ptr->socket == DAPL_INVALID_SOCKET) > return DAT_INTERNAL_ERROR; > > dapl_dbg_log(DAPL_DBG_TYPE_EP, >@@ -844,14 +983,17 @@ dapli_socket_accept_usr(DAPL_EP *ep_ptr, > > local.ia_address = ia_ptr->hca_ptr->hca_address; > local.p_size = htonl(p_size); >- iovec[0].iov_base = &local; >- iovec[0].iov_len = sizeof(ib_qp_cm_t); >+ iov[0].iov_base = (void *) &local; >+ iov[0].iov_len = sizeof(ib_qp_cm_t); > if (p_size) { >- iovec[1].iov_base = p_data; >- iovec[1].iov_len = p_size; >+ iov[1].iov_base = p_data; >+ iov[1].iov_len = p_size; >+ len = writev(cm_ptr->socket, iov, 2); >+ } else { >+ len = writev(cm_ptr->socket, iov, 1); > } >- len = writev(cm_ptr->socket, iovec, (p_size ? 
2:1)); >- if (len != (p_size + sizeof(ib_qp_cm_t))) { >+ >+ if (len != (p_size + sizeof(ib_qp_cm_t))) { > dapl_log(DAPL_DBG_TYPE_ERR, > " ACCEPT_USR: ERR %s, wcnt=%d -> %s\n", > strerror(errno), len, >@@ -859,6 +1001,7 @@ dapli_socket_accept_usr(DAPL_EP *ep_ptr, > &cm_ptr->dst.ia_address)->sin_addr)); > goto bail; > } >+ > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " ACCEPT_USR: local port=0x%x lid=0x%x" > " qpn=0x%x psize=%d\n", >@@ -867,9 +1010,9 @@ dapli_socket_accept_usr(DAPL_EP *ep_ptr, > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " ACCEPT_USR SRC GID subnet %016llx id >%016llx\n", > (unsigned long long) >- cpu_to_be64(local.gid.global.subnet_prefix), >+ htonll(local.gid.global.subnet_prefix), > (unsigned long long) >- cpu_to_be64(local.gid.global.interface_id)); >+ htonll(local.gid.global.interface_id)); > > /* save state and reference to EP, queue for RTU data */ > cm_ptr->ep = ep_ptr; >@@ -894,7 +1037,7 @@ dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr) > short rtu_data = 0; > > /* complete handshake after final QP state change */ >- len = read(cm_ptr->socket, &rtu_data, sizeof(rtu_data)); >+ len = recv(cm_ptr->socket, (char *) &rtu_data, >sizeof(rtu_data), 0); > if (len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f) { > dapl_log(DAPL_DBG_TYPE_ERR, > " ACCEPT_RTU: ERR %s, rcnt=%d rdata=%x\n", >@@ -1108,9 +1251,9 @@ dapls_ib_remove_conn_listener ( > > /* close accepted socket, free cm_srvc_handle and return */ > if (cm_ptr != NULL) { >- if (cm_ptr->socket >= 0) { >- close(cm_ptr->socket ); >- cm_ptr->socket = -1; >+ if (cm_ptr->socket != DAPL_INVALID_SOCKET) { >+ closesocket(cm_ptr->socket); >+ cm_ptr->socket = DAPL_INVALID_SOCKET; > } > /* cr_thread will free */ > cm_ptr->state = SCM_DESTROY; >@@ -1195,27 +1338,29 @@ dapls_ib_reject_connection( > IN DAT_COUNT psize, > IN const DAT_PVOID pdata) > { >- struct iovec iovec[2]; >+ struct iovec iov[2]; > > dapl_dbg_log (DAPL_DBG_TYPE_EP, > " reject(cm %p reason %x, pdata %p, psize %d)\n", > cm_ptr, reason, pdata, psize); > > /* 
write reject data to indicate reject */ >- if (cm_ptr->socket >= 0) { >+ if (cm_ptr->socket != DAPL_INVALID_SOCKET) { > cm_ptr->dst.rej = (uint16_t)reason; > cm_ptr->dst.rej = htons(cm_ptr->dst.rej); >- iovec[0].iov_base = &cm_ptr->dst; >- iovec[0].iov_len = sizeof(ib_qp_cm_t); >+ >+ iov[0].iov_base = (void *) &cm_ptr->dst; >+ iov[0].iov_len = sizeof(ib_qp_cm_t); > if (psize) { >- iovec[1].iov_base = pdata; >- iovec[2].iov_len = psize; >- writev(cm_ptr->socket, &iovec[0], 2); >- } else >- writev(cm_ptr->socket, &iovec[0], 1); >- >- close(cm_ptr->socket); >- cm_ptr->socket = -1; >+ iov[1].iov_base = pdata; >+ iov[1].iov_len = psize; >+ writev(cm_ptr->socket, iov, 2); >+ } else { >+ writev(cm_ptr->socket, iov, 1); >+ } >+ >+ closesocket(cm_ptr->socket); >+ cm_ptr->socket = DAPL_INVALID_SOCKET; > } > > /* cr_thread will destroy CR */ >@@ -1444,138 +1589,141 @@ dapls_ib_get_cm_event ( > } > > /* outbound/inbound CR processing thread to avoid blocking >applications */ >-#define SCM_MAX_CONN 8192 > void cr_thread(void *arg) > { >- struct dapl_hca *hca_ptr = arg; >- dp_ib_cm_handle_t cr, next_cr; >- int opt,ret,idx; >- socklen_t opt_len; >- char rbuf[2]; >- struct pollfd ufds[SCM_MAX_CONN]; >- >- dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread: ENTER hca >%p\n",hca_ptr); >- >- dapl_os_lock( &hca_ptr->ib_trans.lock ); >- hca_ptr->ib_trans.cr_state = IB_THREAD_RUN; >- while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) { >- idx=0; >- ufds[idx].fd = g_scm_pipe[0]; /* wakeup and process work */ >- ufds[idx].events = POLLIN; >- ufds[idx].revents = 0; >- >- if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list)) >- next_cr = dapl_llist_peek_head (&hca_ptr->ib_trans.list); >- else >- next_cr = NULL; >- >- while (next_cr) { >- cr = next_cr; >- if ((cr->socket == -1 && cr->state == SCM_DESTROY) || >- hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { >- >- dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: Free >%p\n", cr); >- next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, >- 
>(DAPL_LLIST_ENTRY*)&cr->entry ); >- dapl_llist_remove_entry(&hca_ptr->ib_trans.list, >- (DAPL_LLIST_ENTRY*)&cr->entry); >- dapl_os_free(cr, sizeof(*cr)); >- continue; >- } >- >- if (idx==SCM_MAX_CONN-1) { >- dapl_dbg_log(DAPL_DBG_TYPE_ERR, >- "SCM ERR: cm_thread exceeded >FD_SETSIZE %d\n",idx+1); >- continue; >- } >- >- /* Add to ufds for poll, check for immediate work */ >- ufds[++idx].fd = cr->socket; /* add listen or cr */ >- ufds[idx].revents = 0; >- if (cr->state == SCM_CONN_PENDING) >- ufds[idx].events = POLLOUT; >- else >- ufds[idx].events = POLLIN; >- >- /* check socket for event, accept in or connect out */ >- dapl_dbg_log(DAPL_DBG_TYPE_CM," poll cr=%p, fd=%d,%d\n", >- cr, cr->socket, ufds[idx].fd); >- dapl_os_unlock(&hca_ptr->ib_trans.lock); >- ret = poll(&ufds[idx],1,0); >- dapl_dbg_log(DAPL_DBG_TYPE_CM, >- " poll wakeup ret=%d cr->st=%d" >- " ev=0x%x fd=%d\n", >- ret,cr->state,ufds[idx].revents,ufds[idx].fd); >- >- /* data on listen, qp exchange, and on disconnect request */ >- if ((ret == 1) && ufds[idx].revents == POLLIN) { >- if (cr->socket > 0) { >- if (cr->state == SCM_LISTEN) >- dapli_socket_accept(cr); >- else if (cr->state == SCM_ACCEPTING) >- dapli_socket_accept_data(cr); >- else if (cr->state == SCM_ACCEPTED) >- dapli_socket_accept_rtu(cr); >- else if (cr->state == SCM_RTU_PENDING) >- dapli_socket_connect_rtu(cr); >- else if (cr->state == SCM_CONNECTED) >- dapli_socket_disconnect(cr); >+ struct dapl_hca *hca_ptr = arg; >+ dp_ib_cm_handle_t cr, next_cr; >+ int opt, ret; >+ socklen_t opt_len; >+ char rbuf[2]; >+ struct dapl_fd_set *set; >+ enum DAPL_FD_EVENTS event; >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread: ENTER hca >%p\n", hca_ptr); >+ set = dapl_alloc_fd_set(); >+ if (!set) >+ goto out; >+ >+ dapl_os_lock(&hca_ptr->ib_trans.lock); >+ hca_ptr->ib_trans.cr_state = IB_THREAD_RUN; >+ >+ while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) { >+ dapl_fd_zero(set); >+ dapl_fd_set(g_scm_pipe[0], set, DAPL_FD_READ); >+ >+ if 
(!dapl_llist_is_empty(&hca_ptr->ib_trans.list)) >+ next_cr = >dapl_llist_peek_head(&hca_ptr->ib_trans.list); >+ else >+ next_cr = NULL; >+ >+ while (next_cr) { >+ cr = next_cr; >+ if ((cr->socket == DAPL_INVALID_SOCKET >&& cr->state == SCM_DESTROY) || >+ hca_ptr->ib_trans.cr_state != >IB_THREAD_RUN) { >+ next_cr = >dapl_llist_next_entry(&hca_ptr->ib_trans.list, >+ >(DAPL_LLIST_ENTRY*)&cr->entry); >+ >dapl_llist_remove_entry(&hca_ptr->ib_trans.list, >+ >(DAPL_LLIST_ENTRY*)&cr->entry); >+ dapl_os_free(cr, sizeof(*cr)); >+ continue; >+ } >+ >+ event = (cr->state == SCM_CONN_PENDING) ? >+ DAPL_FD_WRITE : DAPL_FD_READ; >+ if (dapl_fd_set(cr->socket, set, event)) { >+ dapl_log(DAPL_DBG_TYPE_ERR, >+ " cr_thread: DESTROY >CR st=%d fd %d" >+ " -> %s\n", cr->state, >cr->socket, >+ inet_ntoa(((struct >sockaddr_in*) >+ >&cr->dst.ia_address)->sin_addr)); >+ dapli_cm_destroy(cr); >+ continue; >+ } >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_CM, " poll >cr=%p, fd=%d\n", >+ cr, cr->socket); >+ dapl_os_unlock(&hca_ptr->ib_trans.lock); >+ >+ ret = dapl_poll(cr->socket, event); >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_CM, >+ " poll wakeup ret=%d cr->st=%d fd=%d\n", >+ ret, cr->state, cr->socket); >+ >+ /* data on listen, qp exchange, and on >disconnect request */ >+ if (ret == DAPL_FD_READ) { >+ if (cr->socket != DAPL_INVALID_SOCKET) { >+ switch (cr->state) { >+ case SCM_LISTEN: >+ dapli_socket_accept(cr); >+ break; >+ case SCM_ACCEPTING: >+ >dapli_socket_accept_data(cr); >+ break; >+ case SCM_ACCEPTED: >+ >dapli_socket_accept_rtu(cr); >+ break; >+ case SCM_RTU_PENDING: >+ >dapli_socket_connect_rtu(cr); >+ break; >+ case SCM_CONNECTED: >+ >dapli_socket_disconnect(cr); >+ break; >+ default: >+ break; >+ } >+ } >+ /* connect socket is writable, check status */ >+ } else if (ret == DAPL_FD_WRITE || ret >== DAPL_FD_ERROR) { >+ if (cr->state == SCM_CONN_PENDING) { >+ opt = 0; >+ ret = >getsockopt(cr->socket, SOL_SOCKET, >+ SO_ERROR, (char >*) &opt, &opt_len); >+ if (!ret) >+ >dapli_socket_connected(cr, 
opt); >+ else >+ >dapli_socket_connected(cr, errno); >+ } else { >+ dapl_log(DAPL_DBG_TYPE_CM, >+ " CM poll ERR, >wrong state(%d) -> %s SKIP\n", cr->state, >+ >inet_ntoa(((struct sockaddr_in*)&cr->dst.ia_address)->sin_addr)); >+ } >+ } else if (ret != 0) { >+ dapl_log(DAPL_DBG_TYPE_CM, >+ " CM poll warning %s, >ret=%d st=%d -> %s\n", >+ strerror(errno), ret, cr->state, >+ inet_ntoa(((struct sockaddr_in*) >+ >&cr->dst.ia_address)->sin_addr)); >+ >+ /* POLLUP, NVAL, or poll error, >issue event if connected */ >+ if (cr->state == SCM_CONNECTED) >+ dapli_socket_disconnect(cr); >+ } >+ >+ dapl_os_lock(&hca_ptr->ib_trans.lock); >+ next_cr = >dapl_llist_next_entry(&hca_ptr->ib_trans.list, >+ (DAPL_LLIST_ENTRY*)&cr->entry); > } >- /* connect socket is writable, check status */ >- } else if ((ret == 1) && >- (ufds[idx].revents & POLLOUT || >- ufds[idx].revents & POLLERR)) { >- if (cr->state == SCM_CONN_PENDING) { >- opt = 0; >- ret = getsockopt(cr->socket, SOL_SOCKET, >- SO_ERROR, &opt, &opt_len); >- if (!ret) >- dapli_socket_connected(cr,opt); >- else >- dapli_socket_connected(cr,errno); >- } else { >- dapl_log(DAPL_DBG_TYPE_CM, >- " CM poll ERR, wrong state(%d) >-> %s SKIP\n", >- cr->state, >- inet_ntoa(((struct sockaddr_in*) >- >&cr->dst.ia_address)->sin_addr)); >+ >+ dapl_os_unlock(&hca_ptr->ib_trans.lock); >+ dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: >sleep, fds=%d\n", >+ set->index+1); >+ dapl_select(set); >+ dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: wakeup\n"); >+ >+ /* if pipe used to wakeup, consume */ >+ if (dapl_poll(g_scm_pipe[0], DAPL_FD_READ) == >DAPL_FD_READ) { >+ if (read(g_scm_pipe[0], rbuf, 2) == -1) >+ dapl_log(DAPL_DBG_TYPE_CM, >+ " cr_thread: read pipe >error = %s\n", >+ strerror(errno)); > } >- } else if (ret != 0) { >- dapl_log(DAPL_DBG_TYPE_CM, >- " CM poll warning %s, ret=%d revnt=%x >st=%d -> %s\n", >- strerror(errno), ret, >ufds[idx].revents, cr->state, >- inet_ntoa(((struct sockaddr_in*) >- &cr->dst.ia_address)->sin_addr)); >- >- /* POLLUP, 
NVAL, or poll error, issue event if >connected */ >- if (cr->state == SCM_CONNECTED) >- dapli_socket_disconnect(cr); >- } >- dapl_os_lock(&hca_ptr->ib_trans.lock); >- next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, >- >(DAPL_LLIST_ENTRY*)&cr->entry); >+ dapl_os_lock(&hca_ptr->ib_trans.lock); > } >+ > dapl_os_unlock(&hca_ptr->ib_trans.lock); >- dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: sleep, %d\n", idx+1); >- poll(ufds,idx+1,-1); /* infinite, all sockets and pipe */ >- /* if pipe used to wakeup, consume */ >- if (ufds[0].revents == POLLIN) >- if (read(g_scm_pipe[0], rbuf, 2) == -1) >- dapl_log(DAPL_DBG_TYPE_CM, >- " cr_thread: read pipe error = %s\n", >- strerror(errno)); >- dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: wakeup\n"); >- dapl_os_lock(&hca_ptr->ib_trans.lock); >- } >- dapl_os_unlock(&hca_ptr->ib_trans.lock); >- hca_ptr->ib_trans.cr_state = IB_THREAD_EXIT; >- dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread(hca %p) >exit\n",hca_ptr); >+ free(set); >+out: >+ hca_ptr->ib_trans.cr_state = IB_THREAD_EXIT; >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread(hca %p) >exit\n",hca_ptr); > } >- >-/* >- * Local variables: >- * c-indent-level: 4 >- * c-basic-offset: 4 >- * tab-width: 8 >- * End: >- */ >diff --git a/dapl/openib_scm/dapl_ib_cq.c >b/dapl/openib_scm/dapl_ib_cq.c >index 7d6bd4f..59fff11 100644 >--- a/dapl/openib_scm/dapl_ib_cq.c >+++ b/dapl/openib_scm/dapl_ib_cq.c >@@ -46,97 +46,111 @@ > * > >*************************************************************** >***********/ > >+#include "openib_osd.h" > #include "dapl.h" > #include "dapl_adapter_util.h" > #include "dapl_lmr_util.h" > #include "dapl_evd_util.h" > #include "dapl_ring_buffer_util.h" >-#include >-#include > >-int dapli_cq_thread_init(struct dapl_hca *hca_ptr) >+#if defined(_WIN64) || defined(_WIN32) >+void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr) > { >- DAT_RETURN dat_status; >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," >cq_thread_destroy(%p)\n", hca_ptr); > >- dapl_dbg_log(DAPL_DBG_TYPE_UTIL," 
>cq_thread_init(%p)\n", hca_ptr); >+ if (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) >+ return; > >- /* create thread to process inbound connect request */ >- hca_ptr->ib_trans.cq_state = IB_THREAD_INIT; >- dat_status = dapl_os_thread_create(cq_thread, >(void*)hca_ptr, &hca_ptr->ib_trans.cq_thread); >- if (dat_status != DAT_SUCCESS) >- { >- dapl_dbg_log(DAPL_DBG_TYPE_ERR, >- " cq_thread_init: failed to >create thread\n"); >- return 1; >- } >+ /* destroy cr_thread and lock */ >+ hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL; >+ SetEvent(hca_ptr->ib_trans.ib_cq->event); >+ dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) >cancel\n",hca_ptr); >+ while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) { >+ dapl_os_sleep_usec(20000); >+ } >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) >exit\n",dapl_os_getpid()); >+} >+ >+static void cq_thread(void *arg) >+{ >+ struct dapl_hca *hca_ptr = arg; >+ struct dapl_evd *evd_ptr; >+ struct ibv_cq *ibv_cq = NULL; >+ >+ hca_ptr->ib_trans.cq_state = IB_THREAD_RUN; >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca >%p\n",hca_ptr); > >- /* wait for thread to start */ >- while (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) { >- struct timespec sleep, remain; >- sleep.tv_sec = 0; >- sleep.tv_nsec = 20000000; /* 20 ms */ >- dapl_dbg_log(DAPL_DBG_TYPE_UTIL, >- " cq_thread_init: waiting for >cq_thread\n"); >- nanosleep (&sleep, &remain); >- } >- dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%d) >exit\n",getpid()); >- return 0; >+ /* wait on DTO event, or signal to abort */ >+ while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) { >+ if (!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, >&ibv_cq, (void*)&evd_ptr)) { >+ >+ if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) { >+ ibv_ack_cq_events(ibv_cq, 1); >+ return; >+ } >+ >+ /* process DTO event via callback */ >+ >dapl_evd_dto_callback(hca_ptr->ib_hca_handle, evd_ptr->ib_cq_handle, >+ (void*)evd_ptr ); >+ >+ ibv_ack_cq_events(ibv_cq, 1); >+ } >+ } >+ 
hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT; >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca >%p \n", hca_ptr); > } > >+#else // _WIN32 || _WIN64 >+ > void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr) > { >- dapl_dbg_log(DAPL_DBG_TYPE_UTIL," >cq_thread_destroy(%p)\n", hca_ptr); >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," >cq_thread_destroy(%p)\n", hca_ptr); > > if (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) > return; > >- /* destroy cr_thread and lock */ >- hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL; >- pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1); >- dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) >cancel\n",hca_ptr); >- while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) { >- struct timespec sleep, remain; >- sleep.tv_sec = 0; >- sleep.tv_nsec = 2000000; /* 2 ms */ >- dapl_dbg_log(DAPL_DBG_TYPE_UTIL, >- " cq_thread_destroy: waiting for >cq_thread\n"); >- nanosleep (&sleep, &remain); >- } >- dapl_dbg_log(DAPL_DBG_TYPE_UTIL," >cq_thread_destroy(%d) exit\n",getpid()); >+ /* destroy cr_thread and lock */ >+ hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL; >+ pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1); >+ dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) >cancel\n",hca_ptr); >+ while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) { >+ dapl_os_sleep_usec(20000); >+ } >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) >exit\n",dapl_os_getpid()); > } > > /* catch the signal */ > static void ib_cq_handler(int signum) > { >- return; >+ return; > } > >-void cq_thread( void *arg ) >+static void cq_thread(void *arg) > { >- struct dapl_hca *hca_ptr = arg; >- struct dapl_evd *evd_ptr; >- struct ibv_cq *ibv_cq = NULL; >+ struct dapl_hca *hca_ptr = arg; >+ struct dapl_evd *evd_ptr; >+ struct ibv_cq *ibv_cq = NULL; > sigset_t sigset; > > sigemptyset(&sigset); >- sigaddset(&sigset,SIGUSR1); >- pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); >- signal(SIGUSR1, ib_cq_handler); >+ sigaddset(&sigset,SIGUSR1); >+ pthread_sigmask(SIG_UNBLOCK, 
&sigset, NULL); >+ signal(SIGUSR1, ib_cq_handler); > > hca_ptr->ib_trans.cq_state = IB_THREAD_RUN; >- >+ > dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca >%p\n",hca_ptr); > >- /* wait on DTO event, or signal to abort */ >- while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) { >- struct pollfd cq_fd = { >- .fd = hca_ptr->ib_trans.ib_cq->fd, >- .events = POLLIN, >- .revents = 0 >- }; >+ /* wait on DTO event, or signal to abort */ >+ while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) { >+ struct pollfd cq_fd = { >+ .fd = hca_ptr->ib_trans.ib_cq->fd, >+ .events = POLLIN, >+ .revents = 0 >+ }; > if ((poll(&cq_fd, 1, -1) == 1) && >- (!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, >- &ibv_cq, (void*)&evd_ptr))) { >+ >(!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, &ibv_cq, >(void*)&evd_ptr))) { > > if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) { > ibv_ack_cq_events(ibv_cq, 1); >@@ -144,15 +158,40 @@ void cq_thread( void *arg ) > } > > /* process DTO event via callback */ >- dapl_evd_dto_callback ( hca_ptr->ib_hca_handle, >- evd_ptr->ib_cq_handle, >- (void*)evd_ptr ); >+ dapl_evd_dto_callback(hca_ptr->ib_hca_handle, >+ evd_ptr->ib_cq_handle, (void*)evd_ptr ); > > ibv_ack_cq_events(ibv_cq, 1); > } >- } >- hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT; >- dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: >hca %p \n", hca_ptr); >+ } >+ hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT; >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca >%p \n", hca_ptr); >+} >+ >+#endif // _WIN32 || _WIN64 >+ >+ >+int dapli_cq_thread_init(struct dapl_hca *hca_ptr) >+{ >+ DAT_RETURN dat_status; >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," >cq_thread_init(%p)\n", hca_ptr); >+ >+ /* create thread to process inbound connect request */ >+ hca_ptr->ib_trans.cq_state = IB_THREAD_INIT; >+ dat_status = dapl_os_thread_create(cq_thread, >(void*)hca_ptr, &hca_ptr->ib_trans.cq_thread); >+ if (dat_status != DAT_SUCCESS) { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ " cq_thread_init: failed to create thread\n"); >+ 
return 1; >+ } >+ >+ /* wait for thread to start */ >+ while (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) { >+ dapl_os_sleep_usec(20000); >+ } >+ dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%d) >exit\n",dapl_os_getpid()); >+ return 0; > } > > >@@ -308,11 +347,11 @@ dapls_ib_cq_alloc ( > IN DAPL_EVD *evd_ptr, > IN DAT_COUNT *cqlen ) > { >+ struct ibv_comp_channel *channel = >ia_ptr->hca_ptr->ib_trans.ib_cq; >+ > dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, > "dapls_ib_cq_alloc: evd %p cqlen=%d \n", >evd_ptr, *cqlen ); > >- struct ibv_comp_channel *channel = >ia_ptr->hca_ptr->ib_trans.ib_cq; >- > #ifdef CQ_WAIT_OBJECT > if (evd_ptr->cq_wait_obj_handle) > channel = evd_ptr->cq_wait_obj_handle; >diff --git a/dapl/openib_scm/dapl_ib_dto.h >b/dapl/openib_scm/dapl_ib_dto.h >index 45000b9..fa19d01 100644 >--- a/dapl/openib_scm/dapl_ib_dto.h >+++ b/dapl/openib_scm/dapl_ib_dto.h >@@ -147,12 +147,6 @@ dapls_ib_post_send ( > IN const DAT_RMR_TRIPLET *remote_iov, > IN DAT_COMPLETION_FLAGS completion_flags) > { >- dapl_dbg_log(DAPL_DBG_TYPE_EP, >- " post_snd: ep %p op %d ck %p sgs", >- "%d l_iov %p r_iov %p f %d\n", >- ep_ptr, op_type, cookie, segments, local_iov, >- remote_iov, completion_flags); >- > ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; > ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; > struct ibv_send_wr wr; >@@ -163,6 +157,12 @@ dapls_ib_post_send ( > int ret; > > dapl_dbg_log(DAPL_DBG_TYPE_EP, >+ " post_snd: ep %p op %d ck %p sgs", >+ "%d l_iov %p r_iov %p f %d\n", >+ ep_ptr, op_type, cookie, segments, local_iov, >+ remote_iov, completion_flags); >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_EP, > " post_snd: ep %p cookie %p segs %d l_iov %p\n", > ep_ptr, cookie, segments, local_iov); > >@@ -317,12 +317,6 @@ dapls_ib_post_ext_send ( > IN DAT_COMPLETION_FLAGS completion_flags, > IN DAT_IB_ADDR_HANDLE *remote_ah) > { >- dapl_dbg_log(DAPL_DBG_TYPE_EP, >- " post_ext_snd: ep %p op %d ck %p sgs", >- "%d l_iov %p r_iov %p f %d\n", >- ep_ptr, op_type, cookie, segments, 
local_iov, >- remote_iov, completion_flags, remote_ah); >- > ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; > ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; > struct ibv_send_wr wr; >@@ -331,6 +325,12 @@ dapls_ib_post_ext_send ( > int ret; > > dapl_dbg_log(DAPL_DBG_TYPE_EP, >+ " post_ext_snd: ep %p op %d ck %p sgs", >+ "%d l_iov %p r_iov %p f %d\n", >+ ep_ptr, op_type, cookie, segments, local_iov, >+ remote_iov, completion_flags, remote_ah); >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_EP, > " post_snd: ep %p cookie %p segs %d l_iov %p\n", > ep_ptr, cookie, segments, local_iov); > >diff --git a/dapl/openib_scm/dapl_ib_mem.c >b/dapl/openib_scm/dapl_ib_mem.c >index 54340ed..9a97e5e 100644 >--- a/dapl/openib_scm/dapl_ib_mem.c >+++ b/dapl/openib_scm/dapl_ib_mem.c >@@ -1,4 +1,4 @@ >-/* >+ /* > * Copyright (c) 2005-2007 Intel Corporation. All rights reserved. > * > * This Software is licensed under one of the following licenses: >@@ -35,13 +35,6 @@ > * > >**********************************************************************/ > >-#include /* for IOCTL's */ >-#include /* for socket(2) and related bits and >pieces */ >-#include /* for socket(2) */ >-#include /* for struct ifreq */ >-#include /* for ARPHRD_ETHER */ >-#include /* for _SC_CLK_TCK */ >- > #include "dapl.h" > #include "dapl_adapter_util.h" > #include "dapl_lmr_util.h" >@@ -215,10 +208,9 @@ dapls_ib_mr_register(IN DAPL_IA *ia_ptr, > lmr->param.registered_address = (DAT_VADDR)(uintptr_t)virt_addr; > > dapl_dbg_log(DAPL_DBG_TYPE_UTIL, >- " mr_register: mr=%p addr=%p h %x pd %p ctx %p " >+ " mr_register: mr=%p addr=%p pd %p ctx %p " > "lkey=0x%x rkey=0x%x priv=%x\n", > lmr->mr_handle, lmr->mr_handle->addr, >- lmr->mr_handle->handle, > lmr->mr_handle->pd, lmr->mr_handle->context, > lmr->mr_handle->lkey, lmr->mr_handle->rkey, > length, dapls_convert_privileges(privileges)); >diff --git a/dapl/openib_scm/dapl_ib_util.c >b/dapl/openib_scm/dapl_ib_util.c >index 92b45d5..d82d3f5 100644 >--- 
a/dapl/openib_scm/dapl_ib_util.c >+++ b/dapl/openib_scm/dapl_ib_util.c >@@ -49,17 +49,13 @@ > static const char rcsid[] = "$Id: $"; > #endif > >+#include "openib_osd.h" > #include "dapl.h" > #include "dapl_adapter_util.h" > #include "dapl_ib_util.h" >+#include "dapl_osd.h" > > #include >-#include >-#include >-#include >-#include >-#include >-#include > > int g_dapl_loopback_connection = 0; > int g_scm_pipe[2]; >@@ -88,52 +84,43 @@ char *dapl_ib_mtu_str(enum ibv_mtu mtu) > } > } > >-/* just get IP address for hostname */ >-DAT_RETURN getipaddr( char *addr, int addr_len) >+static DAT_RETURN getlocalipaddr(DAT_SOCK_ADDR *addr, int addr_len) > { >- struct sockaddr_in *ipv4_addr = (struct sockaddr_in*)addr; >- struct hostent *h_ptr; >- struct utsname ourname; >+ struct sockaddr_in *sin; >+ struct addrinfo *res, hint, *ai; >+ int ret; >+ char hostname[256]; > >- if (uname(&ourname) < 0) { >- dapl_log(DAPL_DBG_TYPE_ERR, >- " open_hca: uname err=%s\n", strerror(errno)); >+ if (addr_len < sizeof(*sin)) { > return DAT_INTERNAL_ERROR; > } > >- h_ptr = gethostbyname(ourname.nodename); >- if (h_ptr == NULL) { >- dapl_log(DAPL_DBG_TYPE_ERR, >- " open_hca: gethostbyname err=%s\n", >- strerror(errno)); >- return DAT_INTERNAL_ERROR; >+ ret = gethostname(hostname,256); >+ if (ret) >+ return ret; >+ >+ memset(&hint, 0, sizeof hint); >+ hint.ai_flags = AI_PASSIVE; >+ hint.ai_family = AF_INET; >+ hint.ai_socktype = SOCK_STREAM; >+ hint.ai_protocol = IPPROTO_TCP; >+ >+ ret = getaddrinfo(hostname, NULL, &hint, &res); >+ if (ret) >+ return ret; >+ >+ ret = DAT_INVALID_ADDRESS; >+ for (ai = res; ai; ai = ai->ai_next) { >+ sin = (struct sockaddr_in *) ai->ai_addr; >+ if (*((uint32_t *) &sin->sin_addr) != >htonl(0x7f000001)) { >+ *((struct sockaddr_in *) addr) = *sin; >+ ret = DAT_SUCCESS; >+ break; >+ } > } > >- if (h_ptr->h_addrtype == AF_INET) { >- int i; >- struct in_addr **alist = >- (struct in_addr **)h_ptr->h_addr_list; >- >- *(uint32_t*)&ipv4_addr->sin_addr = 0; >- 
ipv4_addr->sin_family = AF_INET; >- >- /* Walk the list of addresses for host */ >- for (i=0; alist[i] != NULL; i++) { >- /* first non-loopback address */ >- if (*(uint32_t*)alist[i] != htonl(0x7f000001)) { >- dapl_os_memcpy(&ipv4_addr->sin_addr, >- h_ptr->h_addr_list[i], >- 4); >- break; >- } >- } >- /* if no acceptable address found */ >- if (*(uint32_t*)&ipv4_addr->sin_addr == 0) >- return DAT_INVALID_ADDRESS; >- } else >- return DAT_INVALID_ADDRESS; >- >- return DAT_SUCCESS; >+ freeaddrinfo(res); >+ return ret; > } > > /* >@@ -165,6 +152,28 @@ int32_t dapls_ib_release (void) > return 0; > } > >+#if defined(_WIN64) || defined(_WIN32) >+int dapls_config_comp_channel(struct ibv_comp_channel *channel) >+{ >+ return 0; >+} >+#else // _WIN64 || WIN32 >+int dapls_config_comp_channel(struct ibv_comp_channel *channel) >+{ >+ int opts; >+ >+ opts = fcntl(channel->fd, F_GETFL); /* uCQ */ >+ if (opts < 0 || fcntl(channel->fd, F_SETFL, opts | >O_NONBLOCK) < 0) { >+ dapl_log(DAPL_DBG_TYPE_ERR, >+ " dapls_create_comp_channel: fcntl on >ib_cq->fd %d ERR %d %s\n", >+ channel->fd, opts, strerror(errno)); >+ return errno; >+ } >+ >+ return 0; >+} >+#endif >+ > /* > * dapls_ib_open_hca > * >@@ -187,7 +196,6 @@ DAT_RETURN dapls_ib_open_hca ( > IN DAPL_HCA *hca_ptr) > { > struct ibv_device **dev_list; >- int opts; > int i; > DAT_RETURN dat_status = DAT_SUCCESS; > >@@ -219,7 +227,7 @@ found: > dapl_dbg_log(DAPL_DBG_TYPE_UTIL," open_hca: Found dev >%s %016llx\n", > ibv_get_device_name(hca_ptr->ib_trans.ib_dev), > (unsigned long long) >- >bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); >+ >ntohll(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); > > hca_ptr->ib_hca_handle = >ibv_open_device(hca_ptr->ib_trans.ib_dev); > if (!hca_ptr->ib_hca_handle) { >@@ -268,13 +276,7 @@ found: > goto bail; > } > >- opts = fcntl(hca_ptr->ib_trans.ib_cq->fd, F_GETFL); /* uCQ */ >- if (opts < 0 || fcntl(hca_ptr->ib_trans.ib_cq->fd, >- F_SETFL, opts | O_NONBLOCK) < 0) { >- 
dapl_log(DAPL_DBG_TYPE_ERR, >- " open_hca: fcntl on ib_cq->fd %d ERR >%d %s\n", >- hca_ptr->ib_trans.ib_cq->fd, opts, >- strerror(errno)); >+ if (dapls_config_comp_channel(hca_ptr->ib_trans.ib_cq)) { > goto bail; > } > >@@ -309,16 +311,11 @@ found: > > /* wait for thread */ > while (hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { >- struct timespec sleep, remain; >- sleep.tv_sec = 0; >- sleep.tv_nsec = 2000000; /* 2 ms */ >- dapl_dbg_log(DAPL_DBG_TYPE_UTIL, >- " open_hca: waiting for cr_thread\n"); >- nanosleep (&sleep, &remain); >+ dapl_os_sleep_usec(20000); > } > > /* get the IP address of the device */ >- dat_status = getipaddr((char*)&hca_ptr->hca_address, >+ dat_status = getlocalipaddr((DAT_SOCK_ADDR*) >&hca_ptr->hca_address, > sizeof(DAT_SOCK_ADDR6)); > > dapl_dbg_log(DAPL_DBG_TYPE_UTIL, >@@ -376,16 +373,13 @@ DAT_RETURN dapls_ib_close_hca ( IN >DAPL_HCA *hca_ptr ) > " thread_destroy: thread wakeup err = %s\n", > strerror(errno)); > while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) { >- struct timespec sleep, remain; >- sleep.tv_sec = 0; >- sleep.tv_nsec = 2000000; /* 2 ms */ > dapl_dbg_log(DAPL_DBG_TYPE_UTIL, > " close_hca: waiting for cr_thread\n"); > if (write(g_scm_pipe[1], "w", sizeof "w") == -1) > dapl_log(DAPL_DBG_TYPE_UTIL, > " thread_destroy: thread >wakeup err = %s\n", > strerror(errno)); >- nanosleep (&sleep, &remain); >+ dapl_os_sleep_usec(20000); > } > dapl_os_lock_destroy(&hca_ptr->ib_trans.lock); > >diff --git a/dapl/openib_scm/dapl_ib_util.h >b/dapl/openib_scm/dapl_ib_util.h >index 863da2b..fd1c24e 100644 >--- a/dapl/openib_scm/dapl_ib_util.h >+++ b/dapl/openib_scm/dapl_ib_util.h >@@ -49,8 +49,8 @@ > #ifndef _DAPL_IB_UTIL_H_ > #define _DAPL_IB_UTIL_H_ > >+#include "openib_osd.h" > #include >-#include > > #ifdef DAT_EXTENSIONS > #include >@@ -73,8 +73,6 @@ typedef struct ibv_wc >ib_work_completion_t; > typedef struct ibv_context *ib_hca_handle_t; > typedef ib_hca_handle_t dapl_ibal_ca_t; > >-/* CM mappings, user CM not complete use SOCKETS */ 
>- > /* destination info to exchange, define wire protocol version */ > #define DSCM_VER 3 > typedef struct _ib_qp_cm >@@ -86,7 +84,7 @@ typedef struct _ib_qp_cm > uint32_t qpn; > uint32_t p_size; > DAT_SOCK_ADDR6 ia_address; >- union ibv_gid gid; >+ union ibv_gid gid; > uint16_t qp_type; > } ib_qp_cm_t; > >@@ -110,20 +108,18 @@ struct ib_cm_handle > struct dapl_llist_entry entry; > DAPL_OS_LOCK lock; > SCM_STATE state; >- int socket; >+ DAPL_SOCKET socket; > struct dapl_hca *hca; > struct dapl_sp *sp; >- struct dapl_ep *ep; >+ struct dapl_ep *ep; > ib_qp_cm_t dst; >- unsigned char p_data[256]; >+ unsigned char p_data[256]; /* must follow >ib_qp_cm_t */ > struct ibv_ah *ah; > }; > > typedef struct ib_cm_handle *dp_ib_cm_handle_t; > typedef dp_ib_cm_handle_t ib_cm_srvc_handle_t; > >-DAT_RETURN getipaddr(char *addr, int addr_len); >- > /* CM events */ > typedef enum > { >@@ -141,9 +137,6 @@ typedef enum > > } ib_cm_events_t; > >-/* prototype for cm thread */ >-void cr_thread (void *arg); >- > /* Operation and state mappings */ > typedef enum ibv_send_flags ib_send_op_type_t; > typedef struct ibv_sge ib_data_segment_t; >@@ -289,7 +282,7 @@ typedef struct _ib_hca_transport > DAPL_OS_LOCK cq_lock; > int max_inline_send; > ib_thread_state_t cq_state; >- DAPL_OS_THREAD cq_thread; >+ DAPL_OS_THREAD cq_thread; > struct ibv_comp_channel *ib_cq; > int cr_state; > DAPL_OS_THREAD thread; >@@ -317,7 +310,6 @@ typedef uint32_t ib_shm_transport_t; > /* prototypes */ > int32_t dapls_ib_init (void); > int32_t dapls_ib_release (void); >-void cq_thread (void *arg); > void cr_thread(void *arg); > int dapli_cq_thread_init(struct dapl_hca *hca_ptr); > void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr); >@@ -349,7 +341,7 @@ dapl_convert_errno( IN int err, IN const >char *str ) > if (!err) return DAT_SUCCESS; > > #if DAPL_DBG >- if ((err != EAGAIN) && (err != ETIME) && (err != ETIMEDOUT)) >+ if ((err != EAGAIN) && (err != ETIMEDOUT)) > dapl_dbg_log (DAPL_DBG_TYPE_ERR," %s %s\n", str, 
strerror(err)); > #endif > >@@ -357,24 +349,15 @@ dapl_convert_errno( IN int err, IN const >char *str ) > { > case EOVERFLOW : return DAT_LENGTH_ERROR; > case EACCES : return DAT_PRIVILEGES_VIOLATION; >- case ENXIO : >- case ERANGE : > case EPERM : return DAT_PROTECTION_VIOLATION; > >- case EINVAL : >- case EBADF : >- case ENOENT : >- case ENOTSOCK : return DAT_INVALID_HANDLE; >+ case EINVAL : return DAT_INVALID_HANDLE; > case EISCONN : return DAT_INVALID_STATE | >DAT_INVALID_STATE_EP_CONNECTED; > case ECONNREFUSED : return DAT_INVALID_STATE | >DAT_INVALID_STATE_EP_NOTREADY; >- case ETIME : > case ETIMEDOUT : return DAT_TIMEOUT_EXPIRED; > case ENETUNREACH: return DAT_INVALID_ADDRESS | >DAT_INVALID_ADDRESS_UNREACHABLE; > case EADDRINUSE : return DAT_CONN_QUAL_IN_USE; > case EALREADY : return DAT_INVALID_STATE | >DAT_INVALID_STATE_EP_ACTCONNPENDING; >- case ENOSPC : >- case ENOMEM : >- case E2BIG : >- case EDQUOT : return DAT_INSUFFICIENT_RESOURCES; >+ case ENOMEM : return DAT_INSUFFICIENT_RESOURCES; > case EAGAIN : return DAT_QUEUE_EMPTY; > case EINTR : return DAT_INTERRUPTED_CALL; > case EAFNOSUPPORT : return DAT_INVALID_ADDRESS | >DAT_INVALID_ADDRESS_MALFORMED; >diff --git a/dapl/openib_scm/linux/openib_osd.h >b/dapl/openib_scm/linux/openib_osd.h >new file mode 100644 >index 0000000..235a82e >--- /dev/null >+++ b/dapl/openib_scm/linux/openib_osd.h >@@ -0,0 +1,21 @@ >+#ifndef OPENIB_OSD_H >+#define OPENIB_OSD_H >+ >+#include >+#include >+ >+#if __BYTE_ORDER == __BIG_ENDIAN >+#define htonll(x) (x) >+#define ntohll(x) (x) >+#elif __BYTE_ORDER == __LITTLE_ENDIAN >+#define htonll(x) bswap_64(x) >+#define ntohll(x) bswap_64(x) >+#endif >+ >+#define DAPL_SOCKET int >+#define DAPL_INVALID_SOCKET -1 >+#define DAPL_FD_SETSIZE 8192 >+ >+#define closesocket close >+ >+#endif // OPENIB_OSD_H >diff --git a/dapl/openib_scm/windows/openib_osd.h >b/dapl/openib_scm/windows/openib_osd.h >new file mode 100644 >index 0000000..67c70ec >--- /dev/null >+++ 
b/dapl/openib_scm/windows/openib_osd.h >@@ -0,0 +1,39 @@ >+#ifndef OPENIB_OSD_H >+#define OPENIB_OSD_H >+ >+#ifndef FD_SETSIZE >+#define FD_SETSIZE 1024 /* Set before including winsock2 - >see select help */ >+#define DAPL_FD_SETSIZE FD_SETSIZE >+#endif >+ >+#include >+#include >+#include >+#include >+ >+#define ntohll _byteswap_uint64 >+#define htonll _byteswap_uint64 >+ >+#define pipe(x) _pipe(x, 4096, _O_TEXT) >+#define read _read >+#define write _write >+#define DAPL_SOCKET SOCKET >+#define DAPL_INVALID_SOCKET INVALID_SOCKET >+ >+/* allow casting to WSABUF */ >+struct iovec >+{ >+ u_long iov_len; >+ char FAR* iov_base; >+}; >+ >+static int writev(DAPL_SOCKET s, struct iovec *vector, int count) >+{ >+ int len, ret; >+ >+ ret = WSASend(s, (WSABUF *) vector, count, &len, 0, >NULL, NULL); >+ return ret ? ret : len; >+} >+ >+#endif // OPENIB_OSD_H >+ >diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h >index 6fef9af..ae02944 100644 >--- a/dapl/udapl/linux/dapl_osd.h >+++ b/dapl/udapl/linux/dapl_osd.h >@@ -302,6 +302,15 @@ dapl_os_thread_create ( > IN void *data, > OUT DAPL_OS_THREAD *thread_id ); > >+STATIC _INLINE_ void >+dapl_os_sleep_usec(int usec) >+{ >+ struct timespec sleep, remain; >+ >+ sleep.tv_sec = 0; >+ sleep.tv_nsec = usec * 1000; >+ nanosleep(&sleep, &remain); >+} > > /* > * Lock Functions > > > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general > From weiny2 at llnl.gov Tue Feb 17 09:19:55 2009 From: weiny2 at llnl.gov (weiny2 at llnl.gov) Date: Tue, 17 Feb 2009 09:19:55 -0800 Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread In-Reply-To: References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> Message-ID: 
<20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> Quoting Hal Rosenstock : > Sasha, > > On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky > wrote: >> >> I looked at implementation of safe_*() functions (safe_smp_query, >> safe_smp_set and safe_ca_call) and found that they are not actually >> "safe" as declared by its names. The only thread-unsafe thing which >> is used there is static 'mad_portid' structure (from rpc.c), > > I'm not sure that the only thread unsafe thing in the mad rpc > mechanism is the portid. > >> but modification of this structure is not protected by same mutex (actually >> not protected at all). > > A first step would be removing the portid as static. If so, portid > would need to be a supplied parameter to various mad routines and the > existing ones relying on madrpc_portid would be deprecated. Does this > make sense to do ? Would you accept such a patch ? > Don't we already have an interface like this with mad_rpc_open_port? I don't like the void * return but it is "struct ibmadb_port" under the hood. Are those calls which use it not thread safe? Ira > -- Hal > >> As far as I know nothing uses those safe_*() primitives right now outside >> libibmad, so I think it is better to remove this confused functions from >> API (with changing library version, etc.). >> >> The primitives madrpc_lock() and madrpc_unlock() are just wrappers to >> hidden static pthread mutex which is not controlled by caller >> application. I think that it will be more robust for multithreaded >> application to use its own synchronization methods (pthread mutex or any >> other) for better control. So let's remove madrpc_lock/unlock() too. 
>> >> Signed-off-by: Sasha Khapyorsky >> --- >> libibmad/include/infiniband/mad.h | 41 >> ------------------------------------- >> libibmad/libibmad.ver | 2 +- >> libibmad/src/libibmad.map | 2 - >> libibmad/src/rpc.c | 15 ------------- >> libibmad/src/sa.c | 5 ++- >> 5 files changed, 4 insertions(+), 61 deletions(-) >> >> diff --git a/libibmad/include/infiniband/mad.h >> b/libibmad/include/infiniband/mad.h >> index eff6738..89b4be5 100644 >> --- a/libibmad/include/infiniband/mad.h >> +++ b/libibmad/include/infiniband/mad.h >> @@ -703,8 +703,6 @@ void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t >> *dport, ib_rmpp_hdr_t *rmpp, >> void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, >> int num_classes); >> void madrpc_save_mad(void *madbuf, int len); >> -void madrpc_lock(void); >> -void madrpc_unlock(void); >> void madrpc_show_errors(int set); >> >> void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, >> @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t >> *id, unsigned attrid, >> uint8_t * smp_set_via(void *buf, ib_portid_t *id, unsigned attrid, >> unsigned mod, >> unsigned timeout, const void *srcport); >> >> -inline static uint8_t * >> -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid, >> unsigned mod, >> - unsigned timeout) >> -{ >> - uint8_t *p; >> - >> - madrpc_lock(); >> - p = smp_query(rcvbuf, portid, attrid, mod, timeout); >> - madrpc_unlock(); >> - >> - return p; >> -} >> - >> -inline static uint8_t * >> -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid, >> unsigned mod, >> - unsigned timeout) >> -{ >> - uint8_t *p; >> - >> - madrpc_lock(); >> - p = smp_set(rcvbuf, portid, attrid, mod, timeout); >> - madrpc_unlock(); >> - >> - return p; >> -} >> - >> /* sa.c */ >> uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, >> unsigned timeout); >> @@ -761,19 +733,6 @@ int ib_path_query(ibmad_gid_t srcgid, >> ibmad_gid_t destgid, ib_portid_t *sm_id, >> int 
ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, >> ibmad_gid_t destgid, ib_portid_t *sm_id, >> void *buf); >> >> -inline static uint8_t * >> -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, >> - unsigned timeout) >> -{ >> - uint8_t *p; >> - >> - madrpc_lock(); >> - p = sa_call(rcvbuf, portid, sa, timeout); >> - madrpc_unlock(); >> - >> - return p; >> -} >> - >> /* resolve.c */ >> int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); >> int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, >> diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver >> index 7e93c16..23d2dc2 100644 >> --- a/libibmad/libibmad.ver >> +++ b/libibmad/libibmad.ver >> @@ -6,4 +6,4 @@ >> # API_REV - advance on any added API >> # RUNNING_REV - advance any change to the vendor files >> # AGE - number of backward versions the API still supports >> -LIBVERSION=5:0:4 >> +LIBVERSION=2:0:0 >> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map >> index 927e51c..f944d86 100644 >> --- a/libibmad/src/libibmad.map >> +++ b/libibmad/src/libibmad.map >> @@ -72,14 +72,12 @@ IBMAD_1.3 { >> madrpc; >> madrpc_def_timeout; >> madrpc_init; >> - madrpc_lock; >> madrpc_portid; >> madrpc_rmpp; >> madrpc_save_mad; >> madrpc_set_retries; >> madrpc_set_timeout; >> madrpc_show_errors; >> - madrpc_unlock; >> ib_path_query; >> sa_call; >> sa_rpc_call; >> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c >> index 5226540..670a936 100644 >> --- a/libibmad/src/rpc.c >> +++ b/libibmad/src/rpc.c >> @@ -38,7 +38,6 @@ >> #include >> #include >> #include >> -#include >> #include >> #include >> >> @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, >> ib_rmpp_hdr_t *rmpp, void *data) >> return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); >> } >> >> -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; >> - >> -void >> -madrpc_lock(void) >> -{ >> - pthread_mutex_lock(&rpclock); >> -} >> - >> -void >> -madrpc_unlock(void) >> -{ >> - 
pthread_mutex_unlock(&rpclock); >> -} >> - >> void >> madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int >> num_classes) >> { >> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c >> index 27b9d52..c601254 100644 >> --- a/libibmad/src/sa.c >> +++ b/libibmad/src/sa.c >> @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport, >> ibmad_gid_t srcgid, ibmad_gid_t destgid, >> if (srcport) { >> p = sa_rpc_call (srcport, buf, sm_id, &sa, 0); >> } else { >> - p = safe_sa_call(buf, sm_id, &sa, 0); >> + p = sa_call(buf, sm_id, &sa, 0); >> } >> if (!p) { >> IBWARN("sa call path_query failed"); >> @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport, >> ibmad_gid_t srcgid, ibmad_gid_t destgid, >> mad_decode_field(p, IB_SA_PR_DLID_F, &dlid); >> return dlid; >> } >> + >> int >> ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t >> *sm_id, void *buf) >> { >> - return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf); >> + return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf); >> } >> -- >> 1.6.0.4.766.g6fc4a From brian at sun.com Tue Feb 17 09:52:23 2009 From: brian at sun.com (Brian J.
Murrell) Date: Tue, 17 Feb 2009 12:52:23 -0500 Subject: [ofa-general] IB function calls in kernel module fail In-Reply-To: <499A8A20.1090507@mellanox.co.il> References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com> <49994BB2.3010206@mellanox.co.il> <7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com> <499A8A20.1090507@mellanox.co.il> Message-ID: <1234893143.21802.96.camel@pc.interlinx.bc.ca> On Tue, 2009-02-17 at 11:57 +0200, Tziporet Koren wrote: > neutron wrote: > > One remaining question. > > > > In my code of kernel module, do I need to #include the header files > > from /src/openib/include/.... > > Or I just include the header files from /include/..... > > > > > You should use the headers from ofed if you wish to use OFED kernel modules. Ahhh. But should he just include /src/openib/include/ or also /src/openib/kernel_addons/backport//include/ (as described in /src/openib/ofed_patch.mk) as well? And in what order should these be specified? b. From sashak at voltaire.com Tue Feb 17 10:50:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 17 Feb 2009 20:50:27 +0200 Subject: [ofa-general] [PATCH] opensm/console: dump_portguid command fixes Message-ID: <20090217185027.GJ7189@sashak.voltaire.com> Don't try to match invalid expressions, so things like 'dump_portguid *' will not crash. Free the memory allocated by regcomp() and for the regexp list.
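The commit message above describes two fixes: a pattern that regcomp() rejects is skipped (and its node freed) instead of being linked into the list, and every compiled pattern is released with regfree() when the command finishes. A minimal standalone sketch of that pattern, assuming plain strings as input; `struct regexp_node`, `build_regexp_list()` and `free_regexp_list()` are hypothetical names, not opensm's:

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical list node holding one compiled pattern. */
struct regexp_node {
	regex_t exp;
	struct regexp_node *next;
};

/*
 * Compile each pattern and push it onto a singly linked list.
 * A pattern rejected by regcomp() is reported and skipped; its node is
 * freed immediately, so nothing uncompiled is ever linked or regfree()d.
 */
static struct regexp_node *build_regexp_list(const char **patterns, int n,
					     int *skipped)
{
	struct regexp_node *head = NULL;

	assert(skipped != NULL);
	*skipped = 0;
	for (int i = 0; i < n; i++) {
		struct regexp_node *node = malloc(sizeof(*node));

		if (!node)
			break;	/* out of memory: keep what was built so far */
		if (regcomp(&node->exp, patterns[i],
			    REG_NOSUB | REG_EXTENDED) != 0) {
			fprintf(stderr,
				"Couldn't parse regular expression %s. Skipping it.\n",
				patterns[i]);
			free(node);
			(*skipped)++;
			continue;
		}
		node->next = head;
		head = node;
	}
	return head;
}

/* Release each compiled pattern (regfree) and then its node (free). */
static void free_regexp_list(struct regexp_node *head)
{
	struct regexp_node *next;

	for (; head; head = next) {
		next = head->next;
		regfree(&head->exp);
		free(head);
	}
}
```

Called with the patterns `^sw`, `[abc` and `node[0-9]+`, this keeps two compiled patterns and skips the unterminated bracket expression; a single free_regexp_list() call then releases everything.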
Signed-off-by: Sasha Khapyorsky
---
 opensm/opensm/osm_console.c | 7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index a66a7d3..0f26e51 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -1247,6 +1247,8 @@ static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 			fprintf(out,
 				"Couldn't parse regular expression %s. Skipping it.\n",
 				p_cmd);
+			free(p_regexp);
+			continue;
 		}
 		p_regexp->next = p_head_regexp;
 		p_head_regexp = p_regexp;
@@ -1292,6 +1294,11 @@ static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 
 	if (output != out)
 		fclose(output);
+	for (; p_head_regexp; p_head_regexp = p_regexp) {
+		p_regexp = p_head_regexp->next;
+		regfree(&p_head_regexp->exp);
+		free(p_head_regexp);
+	}
 }
 
 static void help_dump_portguid(FILE * out, int detail)
--
1.6.1.2.319.gbd9e

From sashak at voltaire.com Tue Feb 17 10:51:09 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 17 Feb 2009 20:51:09 +0200 Subject: [ofa-general] [PATCH] opensm/console: dump_portguid - don't duplicate matched guids Message-ID: <20090217185109.GK7189@sashak.voltaire.com> Don't repeat port GUIDs when more than one regular expression matches.
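The change described above is a first-match-wins loop: as soon as one pattern matches a port's description, the remaining patterns are not tried, so each GUID is printed at most once. A minimal standalone sketch with POSIX regexec(), using plain strings in place of opensm's port records; print_matching_once() is a hypothetical name:

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Print every name that matches at least one compiled pattern, exactly once.
 * Returns the number of names printed.
 */
static int print_matching_once(FILE *out, const char **names, int n_names,
			       const regex_t *pats, int n_pats)
{
	int printed = 0;

	assert(out != NULL);
	for (int i = 0; i < n_names; i++) {
		for (int j = 0; j < n_pats; j++) {
			if (regexec(&pats[j], names[i], 0, NULL, 0) == 0) {
				fprintf(out, "%s\n", names[i]);
				printed++;
				break;	/* first match wins: skip the remaining patterns */
			}
		}
	}
	return printed;
}
```

With the patterns `^sw` and `leaf`, the name `sw-leaf-1` matches both but is printed only once; without the `break` it would be printed twice.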
Signed-off-by: Sasha Khapyorsky
---
 opensm/opensm/osm_console.c | 4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 0f26e51..0c3cdbf 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -1285,9 +1285,11 @@ static void dump_portguid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 		     p_regexp = p_regexp->next)
 			if (regexec
 			    (&(p_regexp->exp), p_port->p_node->print_desc, 0,
-			     NULL, 0) == 0)
+			     NULL, 0) == 0) {
 				fprintf(output, "0x%" PRIxLEAST64 "\n",
 					cl_ntoh64(p_port->p_physp->port_guid));
+				break;
+			}
 	}
 
 	CL_PLOCK_RELEASE(p_osm->sm.p_lock);
--
1.6.1.2.319.gbd9e

From hal.rosenstock at gmail.com Tue Feb 17 13:12:12 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 17 Feb 2009 16:12:12 -0500 Subject: Re: Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread In-Reply-To: <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> Message-ID: On Tue, Feb 17, 2009 at 12:19 PM, wrote: > Quoting Hal Rosenstock : > >> Sasha, >> >> On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky >> wrote: >>> >>> I looked at implementation of safe_*() functions (safe_smp_query, >>> safe_smp_set and safe_ca_call) and found that they are not actually >>> "safe" as declared by its names. The only thread-unsafe thing which >>> is used there is static 'mad_portid' structure (from rpc.c), >> >> I'm not sure that the only thread unsafe thing in the mad rpc >> mechanism is the portid. >> >>> but modification of this structure is not protected by same mutex >>> (actually >>> not protected at all). >> >> A first step would be removing the portid as static.
If so, portid >> would need to be a supplied parameter to various mad routines and the >> existing ones relying on madrpc_portid would be deprecated. Does this >> make sense to do ? Would you accept such a patch ? >> > Don't we already have an interface like this with mad_rpc_open_port? I'm not sure this was carried all the way through (The basic building blocks are there but I think some additional routines are needed). Shouldn't the in tree clients be converted over and the old routines deprecated ? > I don't like the void * return but it is "struct ibmadb_port" under the hood. Is access into that currently opaque struct needed for something by the clients of the library ? > Are those calls which use it not thread safe? They look OK but I'm not 100% sure yet. -- Hal > Ira > > >> -- Hal >> >>> As far as I know nothing uses those safe_*() primitives right now outside >>> libibmad, so I think it is better to remove this confused functions from >>> API (with changing library version, etc.). >>> >>> The primitives madrpc_lock() and madrpc_unlock() are just wrappers to >>> hidden static pthread mutex which is not controlled by caller >>> application. I think that it will be more robust for multithreaded >>> application to use its own synchronization methods (pthread mutex or any >>> other) for better control. So let's remove madrpc_lock/unlock() too. 
>>> >>> Signed-off-by: Sasha Khapyorsky >>> --- >>> libibmad/include/infiniband/mad.h | 41 >>> ------------------------------------- >>> libibmad/libibmad.ver | 2 +- >>> libibmad/src/libibmad.map | 2 - >>> libibmad/src/rpc.c | 15 ------------- >>> libibmad/src/sa.c | 5 ++- >>> 5 files changed, 4 insertions(+), 61 deletions(-) >>> >>> diff --git a/libibmad/include/infiniband/mad.h >>> b/libibmad/include/infiniband/mad.h >>> index eff6738..89b4be5 100644 >>> --- a/libibmad/include/infiniband/mad.h >>> +++ b/libibmad/include/infiniband/mad.h >>> @@ -703,8 +703,6 @@ void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t >>> *dport, ib_rmpp_hdr_t *rmpp, >>> void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, >>> int num_classes); >>> void madrpc_save_mad(void *madbuf, int len); >>> -void madrpc_lock(void); >>> -void madrpc_unlock(void); >>> void madrpc_show_errors(int set); >>> >>> void * mad_rpc_open_port(char *dev_name, int dev_port, int >>> *mgmt_classes, >>> @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t *id, >>> unsigned attrid, >>> uint8_t * smp_set_via(void *buf, ib_portid_t *id, unsigned attrid, >>> unsigned mod, >>> unsigned timeout, const void *srcport); >>> >>> -inline static uint8_t * >>> -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid, >>> unsigned mod, >>> - unsigned timeout) >>> -{ >>> - uint8_t *p; >>> - >>> - madrpc_lock(); >>> - p = smp_query(rcvbuf, portid, attrid, mod, timeout); >>> - madrpc_unlock(); >>> - >>> - return p; >>> -} >>> - >>> -inline static uint8_t * >>> -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid, >>> unsigned mod, >>> - unsigned timeout) >>> -{ >>> - uint8_t *p; >>> - >>> - madrpc_lock(); >>> - p = smp_set(rcvbuf, portid, attrid, mod, timeout); >>> - madrpc_unlock(); >>> - >>> - return p; >>> -} >>> - >>> /* sa.c */ >>> uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, >>> unsigned timeout); >>> @@ -761,19 +733,6 @@ int ib_path_query(ibmad_gid_t 
srcgid, >>> ibmad_gid_t destgid, ib_portid_t *sm_id, >>> int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, >>> ibmad_gid_t destgid, ib_portid_t *sm_id, void >>> *buf); >>> >>> -inline static uint8_t * >>> -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, >>> - unsigned timeout) >>> -{ >>> - uint8_t *p; >>> - >>> - madrpc_lock(); >>> - p = sa_call(rcvbuf, portid, sa, timeout); >>> - madrpc_unlock(); >>> - >>> - return p; >>> -} >>> - >>> /* resolve.c */ >>> int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); >>> int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, >>> diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver >>> index 7e93c16..23d2dc2 100644 >>> --- a/libibmad/libibmad.ver >>> +++ b/libibmad/libibmad.ver >>> @@ -6,4 +6,4 @@ >>> # API_REV - advance on any added API >>> # RUNNING_REV - advance any change to the vendor files >>> # AGE - number of backward versions the API still supports >>> -LIBVERSION=5:0:4 >>> +LIBVERSION=2:0:0 >>> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map >>> index 927e51c..f944d86 100644 >>> --- a/libibmad/src/libibmad.map >>> +++ b/libibmad/src/libibmad.map >>> @@ -72,14 +72,12 @@ IBMAD_1.3 { >>> madrpc; >>> madrpc_def_timeout; >>> madrpc_init; >>> - madrpc_lock; >>> madrpc_portid; >>> madrpc_rmpp; >>> madrpc_save_mad; >>> madrpc_set_retries; >>> madrpc_set_timeout; >>> madrpc_show_errors; >>> - madrpc_unlock; >>> ib_path_query; >>> sa_call; >>> sa_rpc_call; >>> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c >>> index 5226540..670a936 100644 >>> --- a/libibmad/src/rpc.c >>> +++ b/libibmad/src/rpc.c >>> @@ -38,7 +38,6 @@ >>> #include >>> #include >>> #include >>> -#include >>> #include >>> #include >>> >>> @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, >>> ib_rmpp_hdr_t *rmpp, void *data) >>> return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); >>> } >>> >>> -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; >>> - >>> -void >>> 
-madrpc_lock(void) >>> -{ >>> - pthread_mutex_lock(&rpclock); >>> -} >>> - >>> -void >>> -madrpc_unlock(void) >>> -{ >>> - pthread_mutex_unlock(&rpclock); >>> -} >>> - >>> void >>> madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int >>> num_classes) >>> { >>> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c >>> index 27b9d52..c601254 100644 >>> --- a/libibmad/src/sa.c >>> +++ b/libibmad/src/sa.c >>> @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport, ibmad_gid_t >>> srcgid, ibmad_gid_t destgid, >>> if (srcport) { >>> p = sa_rpc_call (srcport, buf, sm_id, &sa, 0); >>> } else { >>> - p = safe_sa_call(buf, sm_id, &sa, 0); >>> + p = sa_call(buf, sm_id, &sa, 0); >>> } >>> if (!p) { >>> IBWARN("sa call path_query failed"); >>> @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport, ibmad_gid_t >>> srcgid, ibmad_gid_t destgid, >>> mad_decode_field(p, IB_SA_PR_DLID_F, &dlid); >>> return dlid; >>> } >>> + >>> int >>> ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t >>> *sm_id, void *buf) >>> { >>> - return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf); >>> + return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf); >>> } >>> -- >>> 1.6.0.4.766.g6fc4a From sashak at voltaire.com Tue Feb 17 13:18:48 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 17 Feb 2009 23:18:48 +0200 Subject: [ofa-general] Re: [PATCH] ibsim: Add better end port simulation support
In-Reply-To: <20090214203753.GE32660@comcast.net> References: <20090214203753.GE32660@comcast.net> Message-ID: <20090217211848.GP7189@sashak.voltaire.com> Hi Hal, On 15:37 Sat 14 Feb , hnrose at comcast.net wrote: > > Add SIM_PORT environment variable to allow for end port selection How this would handle case when SIM_PORT=N, but program tries to work via another port (for example: SIM_PORT=2 and ibnetdiscover -P 1)? IOW should port number selection be initiated natively by program rather than by using environment variables? > Signed-off-by: Hal Rosenstock > --- > ibsim/ibsim.c | 6 +- > include/ibsim.h | 2 + > umad2sim/sim_client.c | 49 +++++++++- > umad2sim/sim_client.h | 4 +- > umad2sim/umad2sim.c | 254 ++++++++++++++++++++++++++----------------------- > 5 files changed, 189 insertions(+), 126 deletions(-) > > diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c > index f48e1f0..6a35fdc 100644 > --- a/ibsim/ibsim.c > +++ b/ibsim/ibsim.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This file is part of ibsim. > * > @@ -187,7 +188,8 @@ static int sm_exists(Node * node) > return 0; > } > > -static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *from) > +static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, > + union name_t *from) > { > union name_t name; > size_t size; > @@ -219,7 +221,7 @@ static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *f > ctl->type = SIM_CTL_ERROR; > return -1; > } > - cl->port = node_get_port(node, 0); > + cl->port = node_get_port(node, scl->portnum); > VERB("Attaching client %d at node \"%s\" port 0x%" PRIx64, > i, node->nodeid, cl->port->portguid); > } else { > diff --git a/include/ibsim.h b/include/ibsim.h > index 15fc37c..66ba6f9 100644 > --- a/include/ibsim.h > +++ b/include/ibsim.h > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. 
> + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This file is part of ibsim. > * > @@ -100,6 +101,7 @@ struct sim_client_info { > uint32_t qp; > uint32_t issm; /* accept request for qp 0 & 1 */ > char nodeid[32]; > + uint32_t portnum; > }; > > union name_t { > diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c > index 06bb7a8..1c35109 100644 > --- a/umad2sim/sim_client.c > +++ b/umad2sim/sim_client.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This file is part of ibsim. > * > @@ -182,6 +183,7 @@ static int sim_connect(struct sim_client *sc, int id, int qp, char *nodeid) > info.id = id; > info.issm = 0; > info.qp = qp; > + info.portnum = sc->portnum; > > if (nodeid) > strncpy(info.nodeid, nodeid, sizeof(info.nodeid) - 1); > @@ -202,7 +204,7 @@ static int sim_disconnect(struct sim_client *sc) > return sim_ctl(sc, SIM_CTL_DISCONNECT, 0, 0); > } > > -static int sim_init(struct sim_client *sc, char *nodeid) > +static int sim_init(struct sim_client *sc, char *nodeid, int portnum) > { > union name_t name; > socklen_t size; > @@ -238,6 +240,7 @@ static int sim_init(struct sim_client *sc, char *nodeid) > DEBUG("init %d: opened ctl fd %d as \'%s\'", > pid, ctlfd, get_name(&name)); > > + sc->portnum = portnum; > port = connect_port ? 
atoi(connect_port) : IBSIM_DEFAULT_SERVER_PORT; > size = make_name(&name, connect_host, port, "%s:ctl", socket_basename); > > @@ -286,9 +289,17 @@ int sim_client_set_sm(struct sim_client *sc, unsigned issm) > int sim_client_init(struct sim_client *sc) > { > char *nodeid; > + char *portno; > + int i, j = 0, portnum = 0, startport = 1, endport; > + uint8_t numports, nodetype; > + uint8_t *portinfo; > > nodeid = getenv("SIM_HOST"); > - if (sim_init(sc, nodeid) < 0) > + portno = getenv("SIM_PORT"); > + if (portno) > + portnum = atoi(portno); > + > + if (sim_init(sc, nodeid, portnum) < 0) > return -1; > if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor, > sizeof(sc->vendor)) < 0) > @@ -296,11 +307,37 @@ int sim_client_init(struct sim_client *sc) > if (sim_ctl(sc, SIM_CTL_GET_NODEINFO, sc->nodeinfo, > sizeof(sc->nodeinfo)) < 0) > goto _exit; > + numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F); > + nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F); > + if (nodetype == 2) { // switch > + startport = 0; > + endport = 0; > + } else { > + if (portnum == 0) { > + IBWARN("portnum 0 is not valid end port on non switch node"); > + goto _exit; > + } This makes exporting SIM_PORT environment variable to be mandatory, which doesn't look like a good idea for me (personally I will need to rewrite some amount of my scripts). I think that SIM_HOST should be optional and the default behavior should be preserved. > + endport = numports; > + } > + if (portnum > endport) { > + IBWARN("portnum %d is not a valid end port number (%d)", > + portnum, endport); > + goto _exit; > + } > > - sc->portinfo[0] = 0; // portno requested > - if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, sc->portinfo, > - sizeof(sc->portinfo)) < 0) > + sc->portinfo = malloc(64 * (nodetype != 2 ? 
numports + 1 : 1)); // portinfo size x number of ports starting at 0 > + if (!sc->portinfo) > goto _exit; > + > + // loop through end ports > + for (i = startport; i <= endport ; i++, j++) { > + portinfo = sc->portinfo + 64 * j; You don't need 'j' - just move portinfo pointer. > + *portinfo = i + 1; // portno requested > + if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, portinfo, 64) < 0) > + goto _exit; > + } > + > + // although pkeys also per port, current config same on all end ports Which is not correct really. Sasha > if (sim_ctl(sc, SIM_CTL_GET_PKEYS, sc->pkeys, sizeof(sc->pkeys)) < 0) > goto _exit; > if (getenv("SIM_SET_ISSM")) > @@ -315,5 +352,7 @@ int sim_client_init(struct sim_client *sc) > void sim_client_exit(struct sim_client *sc) > { > sim_disconnect(sc); > + if (sc->portinfo) > + free(sc->portinfo); > sc->fd_ctl = sc->fd_pktin = sc->fd_pktout = -1; > } > diff --git a/umad2sim/sim_client.h b/umad2sim/sim_client.h > index 80ed442..0faca80 100644 > --- a/umad2sim/sim_client.h > +++ b/umad2sim/sim_client.h > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This file is part of ibsim. > * > @@ -41,8 +42,9 @@ struct sim_client { > int clientid; > int fd_pktin, fd_pktout, fd_ctl; > struct sim_vendor vendor; > + int portnum; > uint8_t nodeinfo[64]; > - uint8_t portinfo[64]; > + uint8_t *portinfo; > uint16_t pkeys[SIM_CTL_MAX_DATA/sizeof(uint16_t)]; > }; > > diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c > index 8d83a24..6e3c269 100644 > --- a/umad2sim/umad2sim.c > +++ b/umad2sim/umad2sim.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This file is part of ibsim. 
> * > @@ -179,7 +180,10 @@ static int dev_sysfs_create(struct umad2sim_dev *dev) > struct sim_client *sc = &dev->sim_client; > char *str; > uint8_t *portinfo; > - int i; > + char *ports_path_end; > + int i, j; > + int startport = 1, endport; > + uint8_t numports, nodetype; > > /* /sys/class/infiniband_mad/abi_version */ > snprintf(path, sizeof(path), "%s", sysfs_infiniband_mad_dir); > @@ -232,123 +236,138 @@ static int dev_sysfs_create(struct umad2sim_dev *dev) > strncat(path, "/ports", sizeof(path) - 1); > make_path(path); > > - portinfo = sc->portinfo; > - > - /* /sys/class/infiniband/mthca0/ports/1/ */ > - val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); > - snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val); > - make_path(path); > - > - /* /sys/class/infiniband/mthca0/ports/1/lid_mask_count */ > - val = mad_get_field(portinfo, 0, IB_PORT_LMC_F); > - file_printf(path, SYS_PORT_LMC, "%d", val); > - > - /* /sys/class/infiniband/mthca0/ports/1/sm_lid */ > - val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F); > - file_printf(path, SYS_PORT_SMLID, "0x%x", val); > - > - /* /sys/class/infiniband/mthca0/ports/1/sm_sl */ > - val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F); > - file_printf(path, SYS_PORT_SMSL, "%d", val); > - > - /* /sys/class/infiniband/mthca0/ports/1/lid */ > - val = mad_get_field(portinfo, 0, IB_PORT_LID_F); > - file_printf(path, SYS_PORT_LID, "0x%x", val); > - > - /* /sys/class/infiniband/mthca0/ports/1/state */ > - val = mad_get_field(portinfo, 0, IB_PORT_STATE_F); > - if (val == 0) > - str = "NOP"; > - else if (val == 1) > - str = "DOWN"; > - else if (val == 2) > - str = "INIT"; > - else if (val == 3) > - str = "ARMED"; > - else if (val == 4) > - str = "ACTIVE"; > - else if (val == 5) > - str = "ACTIVE_DEFER"; > - else > - str = ""; > - file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str); > - > - /* /sys/class/infiniband/mthca0/ports/1/phys_state */ > - val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F); > - if 
(val == 1) > - str = "Sleep"; > - else if (val == 2) > - str = "Polling"; > - else if (val == 3) > - str = "Disabled"; > - else if (val == 4) > - str = "PortConfigurationTraining"; > - else if (val == 5) > - str = "LinkUp"; > - else if (val == 6) > - str = "LinkErrorRecovery"; > - else if (val == 7) > - str = "Phy Test"; > - else > - str = ""; > - file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str); > - > - /* /sys/class/infiniband/mthca0/ports/1/rate */ > - val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); > - speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F); > - if (val == 1) > - val = 1; > - else if (val == 2) > - val = 4; > - else if (val == 4) > - val = 8; > - else if (val == 8) > - val = 12; > - else > - val = 0; > - if (speed == 2) > - str = " DDR"; > - else if (speed == 4) > - str = " QDR"; > - else > - str = ""; > - file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n", > - (val * speed * 25) / 10, > - (val * speed * 25) % 10 ? ".5" : "", val, str); > - > - /* /sys/class/infiniband/mthca0/ports/1/cap_mask */ > - val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F); > - file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val); > - > - /* /sys/class/infiniband/mthca0/ports/1/gids/0 */ > - str = path + strlen(path); > - strncat(path, "/gids", sizeof(path) - 1); > - make_path(path); > - *str = '\0'; > - gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F); > - guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + > - mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); > - file_printf(path, SYS_PORT_GID, > - "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", > - (uint16_t) ((gid >> 48) & 0xffff), > - (uint16_t) ((gid >> 32) & 0xffff), > - (uint16_t) ((gid >> 16) & 0xffff), > - (uint16_t) ((gid >> 0) & 0xffff), > - (uint16_t) ((guid >> 48) & 0xffff), > - (uint16_t) ((guid >> 32) & 0xffff), > - (uint16_t) ((guid >> 16) & 0xffff), > - (uint16_t) ((guid >> 0) & 0xffff)); > + numports = mad_get_field(sc->nodeinfo, 0, 
IB_NODE_NPORTS_F); > + nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F); > + if (nodetype == 2) { // switch > + startport = 0; > + endport = 0; > + } else > + endport = numports; > + > + ports_path_end = path + strlen(path); > + > + // loop through end ports > + for (j = startport; j <= endport; j++) { > + > + portinfo = sc->portinfo + 64 * j; > + > + /* /sys/class/infiniband/mthca0/ports// */ > + val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); > + snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val); > + make_path(path); > + > + /* /sys/class/infiniband/mthca0/ports//lid_mask_count */ > + val = mad_get_field(portinfo, 0, IB_PORT_LMC_F); > + file_printf(path, SYS_PORT_LMC, "%d", val); > + > + /* /sys/class/infiniband/mthca0/ports//sm_lid */ > + val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F); > + file_printf(path, SYS_PORT_SMLID, "0x%x", val); > + > + /* /sys/class/infiniband/mthca0/ports//sm_sl */ > + val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F); > + file_printf(path, SYS_PORT_SMSL, "%d", val); > + > + /* /sys/class/infiniband/mthca0/ports//lid */ > + val = mad_get_field(portinfo, 0, IB_PORT_LID_F); > + file_printf(path, SYS_PORT_LID, "0x%x", val); > + > + /* /sys/class/infiniband/mthca0/ports//state */ > + val = mad_get_field(portinfo, 0, IB_PORT_STATE_F); > + if (val == 0) > + str = "NOP"; > + else if (val == 1) > + str = "DOWN"; > + else if (val == 2) > + str = "INIT"; > + else if (val == 3) > + str = "ARMED"; > + else if (val == 4) > + str = "ACTIVE"; > + else if (val == 5) > + str = "ACTIVE_DEFER"; > + else > + str = ""; > + file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str); > + > + /* /sys/class/infiniband/mthca0/ports//phys_state */ > + val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F); > + if (val == 1) > + str = "Sleep"; > + else if (val == 2) > + str = "Polling"; > + else if (val == 3) > + str = "Disabled"; > + else if (val == 4) > + str = "PortConfigurationTraining"; > + else if (val == 5) > + str = 
"LinkUp"; > + else if (val == 6) > + str = "LinkErrorRecovery"; > + else if (val == 7) > + str = "Phy Test"; > + else > + str = ""; > + file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str); > + > + /* /sys/class/infiniband/mthca0/ports//rate */ > + val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); > + speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F); > + if (val == 1) > + val = 1; > + else if (val == 2) > + val = 4; > + else if (val == 4) > + val = 8; > + else if (val == 8) > + val = 12; > + else > + val = 0; > + if (speed == 2) > + str = " DDR"; > + else if (speed == 4) > + str = " QDR"; > + else > + str = ""; > + file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n", > + (val * speed * 25) / 10, > + (val * speed * 25) % 10 ? ".5" : "", val, str); > + > + /* /sys/class/infiniband/mthca0/ports//cap_mask */ > + val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F); > + file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val); > + > + /* /sys/class/infiniband/mthca0/ports//gids/0 */ > + str = path + strlen(path); > + strncat(path, "/gids", sizeof(path) - 1); > + make_path(path); > + *str = '\0'; > + gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F); > + guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + j; > + file_printf(path, SYS_PORT_GID, > + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", > + (uint16_t) ((gid >> 48) & 0xffff), > + (uint16_t) ((gid >> 32) & 0xffff), > + (uint16_t) ((gid >> 16) & 0xffff), > + (uint16_t) ((gid >> 0) & 0xffff), > + (uint16_t) ((guid >> 48) & 0xffff), > + (uint16_t) ((guid >> 32) & 0xffff), > + (uint16_t) ((guid >> 16) & 0xffff), > + (uint16_t) ((guid >> 0) & 0xffff)); > + > + /* /sys/class/infiniband/mthca0/ports//pkeys/0 */ > + str = path + strlen(path); > + strncat(path, "/pkeys", sizeof(path) - 1); > + make_path(path); > + for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) { > + char name[8]; > + snprintf(name, sizeof(name), "%u", i); > + file_printf(path, name, "0x%04x\n", 
ntohs(sc->pkeys[i])); > + } > + *str = '\0'; > > - /* /sys/class/infiniband/mthca0/ports/1/pkeys/0 */ > - str = path + strlen(path); > - strncat(path, "/pkeys", sizeof(path) - 1); > - make_path(path); > - for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) { > - char name[8]; > - snprintf(name, sizeof(name), "%u", i); > - file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i])); > + *ports_path_end = '\0'; > } > - *str = '\0'; > > /* /sys/class/infiniband_mad/umad0/ */ > snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir, > @@ -564,8 +583,7 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) > if (sim_client_init(&dev->sim_client) < 0) > goto _error; > > - dev->port = mad_get_field(&dev->sim_client.portinfo, 0, > - IB_PORT_LOCAL_PORT_F); > + dev->port = dev->sim_client.portnum; > for (i = 0; i < arrsize(dev->agents); i++) > dev->agents[i].id = (uint32_t)(-1); > for (i = 0; i < arrsize(dev->agent_idx); i++) > -- > 1.5.6.4 > From sashak at voltaire.com Tue Feb 17 13:27:42 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 17 Feb 2009 23:27:42 +0200 Subject: [ofa-general] Re: [PATCH] ibsim/sim_client.c: In sim_client_init, return -1 on error In-Reply-To: <20090214203703.GD32660@comcast.net> References: <20090214203703.GD32660@comcast.net> Message-ID: <20090217212729.GQ7189@sashak.voltaire.com> On 15:37 Sat 14 Feb , hnrose at comcast.net wrote: > > Signed-off-by: Hal Rosenstock Those three patches are applied. Thanks. 
Sasha From hal.rosenstock at gmail.com Tue Feb 17 13:28:40 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 17 Feb 2009 16:28:40 -0500 Subject: Re: [ofa-general] Re: [PATCH] ibsim: Add better end port simulation support In-Reply-To: <20090217211848.GP7189@sashak.voltaire.com> References: <20090214203753.GE32660@comcast.net> <20090217211848.GP7189@sashak.voltaire.com> Message-ID: Sasha, On Tue, Feb 17, 2009 at 4:18 PM, Sasha Khapyorsky wrote: > Hi Hal, > > On 15:37 Sat 14 Feb , hnrose at comcast.net wrote: >> >> Add SIM_PORT environment variable to allow for end port selection > > How this would handle case when SIM_PORT=N, but program tries to work > via another port (for example: SIM_PORT=2 and ibnetdiscover -P 1)? That's a configuration error. SIM_PORT needs to be set to the same port as the program intends to use. > IOW should port number selection be initiated natively by program rather > than by using environment variables? That would've been nice, but AFAICT the simulation layer needs the port number earlier than the program can supply it. Maybe that could be changed, but I didn't dig into that. >> Signed-off-by: Hal Rosenstock >> --- >> ibsim/ibsim.c | 6 +- >> include/ibsim.h | 2 + >> umad2sim/sim_client.c | 49 +++++++++- >> umad2sim/sim_client.h | 4 +- >> umad2sim/umad2sim.c | 254 ++++++++++++++++++++++++----------------------- >> 5 files changed, 189 insertions(+), 126 deletions(-) >> >> diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c >> index f48e1f0..6a35fdc 100644 >> --- a/ibsim/ibsim.c >> +++ b/ibsim/ibsim.c >> @@ -1,5 +1,6 @@ >> /* >> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >> * >> * This file is part of ibsim.
>> * >> @@ -187,7 +188,8 @@ static int sm_exists(Node * node) >> return 0; >> } >> >> -static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *from) >> +static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, >> + union name_t *from) >> { >> union name_t name; >> size_t size; >> @@ -219,7 +221,7 @@ static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *f >> ctl->type = SIM_CTL_ERROR; >> return -1; >> } >> - cl->port = node_get_port(node, 0); >> + cl->port = node_get_port(node, scl->portnum); >> VERB("Attaching client %d at node \"%s\" port 0x%" PRIx64, >> i, node->nodeid, cl->port->portguid); >> } else { >> diff --git a/include/ibsim.h b/include/ibsim.h >> index 15fc37c..66ba6f9 100644 >> --- a/include/ibsim.h >> +++ b/include/ibsim.h >> @@ -1,5 +1,6 @@ >> /* >> * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >> * >> * This file is part of ibsim. >> * >> @@ -100,6 +101,7 @@ struct sim_client_info { >> uint32_t qp; >> uint32_t issm; /* accept request for qp 0 & 1 */ >> char nodeid[32]; >> + uint32_t portnum; >> }; >> >> union name_t { >> diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c >> index 06bb7a8..1c35109 100644 >> --- a/umad2sim/sim_client.c >> +++ b/umad2sim/sim_client.c >> @@ -1,5 +1,6 @@ >> /* >> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >> * >> * This file is part of ibsim. 
>> * >> @@ -182,6 +183,7 @@ static int sim_connect(struct sim_client *sc, int id, int qp, char *nodeid) >> info.id = id; >> info.issm = 0; >> info.qp = qp; >> + info.portnum = sc->portnum; >> >> if (nodeid) >> strncpy(info.nodeid, nodeid, sizeof(info.nodeid) - 1); >> @@ -202,7 +204,7 @@ static int sim_disconnect(struct sim_client *sc) >> return sim_ctl(sc, SIM_CTL_DISCONNECT, 0, 0); >> } >> >> -static int sim_init(struct sim_client *sc, char *nodeid) >> +static int sim_init(struct sim_client *sc, char *nodeid, int portnum) >> { >> union name_t name; >> socklen_t size; >> @@ -238,6 +240,7 @@ static int sim_init(struct sim_client *sc, char *nodeid) >> DEBUG("init %d: opened ctl fd %d as \'%s\'", >> pid, ctlfd, get_name(&name)); >> >> + sc->portnum = portnum; >> port = connect_port ? atoi(connect_port) : IBSIM_DEFAULT_SERVER_PORT; >> size = make_name(&name, connect_host, port, "%s:ctl", socket_basename); >> >> @@ -286,9 +289,17 @@ int sim_client_set_sm(struct sim_client *sc, unsigned issm) >> int sim_client_init(struct sim_client *sc) >> { >> char *nodeid; >> + char *portno; >> + int i, j = 0, portnum = 0, startport = 1, endport; >> + uint8_t numports, nodetype; >> + uint8_t *portinfo; >> >> nodeid = getenv("SIM_HOST"); >> - if (sim_init(sc, nodeid) < 0) >> + portno = getenv("SIM_PORT"); >> + if (portno) >> + portnum = atoi(portno); >> + >> + if (sim_init(sc, nodeid, portnum) < 0) >> return -1; >> if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor, >> sizeof(sc->vendor)) < 0) >> @@ -296,11 +307,37 @@ int sim_client_init(struct sim_client *sc) >> if (sim_ctl(sc, SIM_CTL_GET_NODEINFO, sc->nodeinfo, >> sizeof(sc->nodeinfo)) < 0) >> goto _exit; >> + numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F); >> + nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F); >> + if (nodetype == 2) { // switch >> + startport = 0; >> + endport = 0; >> + } else { >> + if (portnum == 0) { >> + IBWARN("portnum 0 is not valid end port on non switch node"); >> + goto _exit; >> + } > 
> This makes exporting SIM_PORT environment variable to be mandatory, > which doesn't look like a good idea for me (personally I will need to > rewrite some amount of my scripts). > > I think that SIM_HOST should be optional and the default behavior > should be preserved. > >> + endport = numports; >> + } >> + if (portnum > endport) { >> + IBWARN("portnum %d is not a valid end port number (%d)", >> + portnum, endport); >> + goto _exit; >> + } >> >> - sc->portinfo[0] = 0; // portno requested >> - if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, sc->portinfo, >> - sizeof(sc->portinfo)) < 0) >> + sc->portinfo = malloc(64 * (nodetype != 2 ? numports + 1 : 1)); // portinfo size x number of ports starting at 0 >> + if (!sc->portinfo) >> goto _exit; >> + >> + // loop through end ports >> + for (i = startport; i <= endport ; i++, j++) { >> + portinfo = sc->portinfo + 64 * j; > > You don't need 'j' - just move portinfo pointer. OK. >> + *portinfo = i + 1; // portno requested >> + if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, portinfo, 64) < 0) >> + goto _exit; >> + } >> + >> + // although pkeys also per port, current config same on all end ports > > Which is not correct really. What are you referring to ? Is there some config for end port pkeys in the simulator ? -- Hal > Sasha > >> if (sim_ctl(sc, SIM_CTL_GET_PKEYS, sc->pkeys, sizeof(sc->pkeys)) < 0) >> goto _exit; >> if (getenv("SIM_SET_ISSM")) >> @@ -315,5 +352,7 @@ int sim_client_init(struct sim_client *sc) >> void sim_client_exit(struct sim_client *sc) >> { >> sim_disconnect(sc); >> + if (sc->portinfo) >> + free(sc->portinfo); >> sc->fd_ctl = sc->fd_pktin = sc->fd_pktout = -1; >> } >> diff --git a/umad2sim/sim_client.h b/umad2sim/sim_client.h >> index 80ed442..0faca80 100644 >> --- a/umad2sim/sim_client.h >> +++ b/umad2sim/sim_client.h >> @@ -1,5 +1,6 @@ >> /* >> * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >> * >> * This file is part of ibsim. 
>> * >> @@ -41,8 +42,9 @@ struct sim_client { >> int clientid; >> int fd_pktin, fd_pktout, fd_ctl; >> struct sim_vendor vendor; >> + int portnum; >> uint8_t nodeinfo[64]; >> - uint8_t portinfo[64]; >> + uint8_t *portinfo; >> uint16_t pkeys[SIM_CTL_MAX_DATA/sizeof(uint16_t)]; >> }; >> >> diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c >> index 8d83a24..6e3c269 100644 >> --- a/umad2sim/umad2sim.c >> +++ b/umad2sim/umad2sim.c >> @@ -1,5 +1,6 @@ >> /* >> * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >> * >> * This file is part of ibsim. >> * >> @@ -179,7 +180,10 @@ static int dev_sysfs_create(struct umad2sim_dev *dev) >> struct sim_client *sc = &dev->sim_client; >> char *str; >> uint8_t *portinfo; >> - int i; >> + char *ports_path_end; >> + int i, j; >> + int startport = 1, endport; >> + uint8_t numports, nodetype; >> >> /* /sys/class/infiniband_mad/abi_version */ >> snprintf(path, sizeof(path), "%s", sysfs_infiniband_mad_dir); >> @@ -232,123 +236,138 @@ static int dev_sysfs_create(struct umad2sim_dev *dev) >> strncat(path, "/ports", sizeof(path) - 1); >> make_path(path); >> >> - portinfo = sc->portinfo; >> - >> - /* /sys/class/infiniband/mthca0/ports/1/ */ >> - val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); >> - snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val); >> - make_path(path); >> - >> - /* /sys/class/infiniband/mthca0/ports/1/lid_mask_count */ >> - val = mad_get_field(portinfo, 0, IB_PORT_LMC_F); >> - file_printf(path, SYS_PORT_LMC, "%d", val); >> - >> - /* /sys/class/infiniband/mthca0/ports/1/sm_lid */ >> - val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F); >> - file_printf(path, SYS_PORT_SMLID, "0x%x", val); >> - >> - /* /sys/class/infiniband/mthca0/ports/1/sm_sl */ >> - val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F); >> - file_printf(path, SYS_PORT_SMSL, "%d", val); >> - >> - /* /sys/class/infiniband/mthca0/ports/1/lid */ >> - val = 
mad_get_field(portinfo, 0, IB_PORT_LID_F); >> - file_printf(path, SYS_PORT_LID, "0x%x", val); >> - >> - /* /sys/class/infiniband/mthca0/ports/1/state */ >> - val = mad_get_field(portinfo, 0, IB_PORT_STATE_F); >> - if (val == 0) >> - str = "NOP"; >> - else if (val == 1) >> - str = "DOWN"; >> - else if (val == 2) >> - str = "INIT"; >> - else if (val == 3) >> - str = "ARMED"; >> - else if (val == 4) >> - str = "ACTIVE"; >> - else if (val == 5) >> - str = "ACTIVE_DEFER"; >> - else >> - str = ""; >> - file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str); >> - >> - /* /sys/class/infiniband/mthca0/ports/1/phys_state */ >> - val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F); >> - if (val == 1) >> - str = "Sleep"; >> - else if (val == 2) >> - str = "Polling"; >> - else if (val == 3) >> - str = "Disabled"; >> - else if (val == 4) >> - str = "PortConfigurationTraining"; >> - else if (val == 5) >> - str = "LinkUp"; >> - else if (val == 6) >> - str = "LinkErrorRecovery"; >> - else if (val == 7) >> - str = "Phy Test"; >> - else >> - str = ""; >> - file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str); >> - >> - /* /sys/class/infiniband/mthca0/ports/1/rate */ >> - val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); >> - speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F); >> - if (val == 1) >> - val = 1; >> - else if (val == 2) >> - val = 4; >> - else if (val == 4) >> - val = 8; >> - else if (val == 8) >> - val = 12; >> - else >> - val = 0; >> - if (speed == 2) >> - str = " DDR"; >> - else if (speed == 4) >> - str = " QDR"; >> - else >> - str = ""; >> - file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n", >> - (val * speed * 25) / 10, >> - (val * speed * 25) % 10 ? 
".5" : "", val, str); >> - >> - /* /sys/class/infiniband/mthca0/ports/1/cap_mask */ >> - val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F); >> - file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val); >> - >> - /* /sys/class/infiniband/mthca0/ports/1/gids/0 */ >> - str = path + strlen(path); >> - strncat(path, "/gids", sizeof(path) - 1); >> - make_path(path); >> - *str = '\0'; >> - gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F); >> - guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + >> - mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); >> - file_printf(path, SYS_PORT_GID, >> - "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", >> - (uint16_t) ((gid >> 48) & 0xffff), >> - (uint16_t) ((gid >> 32) & 0xffff), >> - (uint16_t) ((gid >> 16) & 0xffff), >> - (uint16_t) ((gid >> 0) & 0xffff), >> - (uint16_t) ((guid >> 48) & 0xffff), >> - (uint16_t) ((guid >> 32) & 0xffff), >> - (uint16_t) ((guid >> 16) & 0xffff), >> - (uint16_t) ((guid >> 0) & 0xffff)); >> + numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F); >> + nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F); >> + if (nodetype == 2) { // switch >> + startport = 0; >> + endport = 0; >> + } else >> + endport = numports; >> + >> + ports_path_end = path + strlen(path); >> + >> + // loop through end ports >> + for (j = startport; j <= endport; j++) { >> + >> + portinfo = sc->portinfo + 64 * j; >> + >> + /* /sys/class/infiniband/mthca0/ports// */ >> + val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); >> + snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val); >> + make_path(path); >> + >> + /* /sys/class/infiniband/mthca0/ports//lid_mask_count */ >> + val = mad_get_field(portinfo, 0, IB_PORT_LMC_F); >> + file_printf(path, SYS_PORT_LMC, "%d", val); >> + >> + /* /sys/class/infiniband/mthca0/ports//sm_lid */ >> + val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F); >> + file_printf(path, SYS_PORT_SMLID, "0x%x", val); >> + >> + /* /sys/class/infiniband/mthca0/ports//sm_sl */ >> + 
val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F); >> + file_printf(path, SYS_PORT_SMSL, "%d", val); >> + >> + /* /sys/class/infiniband/mthca0/ports//lid */ >> + val = mad_get_field(portinfo, 0, IB_PORT_LID_F); >> + file_printf(path, SYS_PORT_LID, "0x%x", val); >> + >> + /* /sys/class/infiniband/mthca0/ports//state */ >> + val = mad_get_field(portinfo, 0, IB_PORT_STATE_F); >> + if (val == 0) >> + str = "NOP"; >> + else if (val == 1) >> + str = "DOWN"; >> + else if (val == 2) >> + str = "INIT"; >> + else if (val == 3) >> + str = "ARMED"; >> + else if (val == 4) >> + str = "ACTIVE"; >> + else if (val == 5) >> + str = "ACTIVE_DEFER"; >> + else >> + str = ""; >> + file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str); >> + >> + /* /sys/class/infiniband/mthca0/ports//phys_state */ >> + val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F); >> + if (val == 1) >> + str = "Sleep"; >> + else if (val == 2) >> + str = "Polling"; >> + else if (val == 3) >> + str = "Disabled"; >> + else if (val == 4) >> + str = "PortConfigurationTraining"; >> + else if (val == 5) >> + str = "LinkUp"; >> + else if (val == 6) >> + str = "LinkErrorRecovery"; >> + else if (val == 7) >> + str = "Phy Test"; >> + else >> + str = ""; >> + file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str); >> + >> + /* /sys/class/infiniband/mthca0/ports//rate */ >> + val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); >> + speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F); >> + if (val == 1) >> + val = 1; >> + else if (val == 2) >> + val = 4; >> + else if (val == 4) >> + val = 8; >> + else if (val == 8) >> + val = 12; >> + else >> + val = 0; >> + if (speed == 2) >> + str = " DDR"; >> + else if (speed == 4) >> + str = " QDR"; >> + else >> + str = ""; >> + file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n", >> + (val * speed * 25) / 10, >> + (val * speed * 25) % 10 ? 
".5" : "", val, str); >> + >> + /* /sys/class/infiniband/mthca0/ports//cap_mask */ >> + val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F); >> + file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val); >> + >> + /* /sys/class/infiniband/mthca0/ports//gids/0 */ >> + str = path + strlen(path); >> + strncat(path, "/gids", sizeof(path) - 1); >> + make_path(path); >> + *str = '\0'; >> + gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F); >> + guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + j; >> + file_printf(path, SYS_PORT_GID, >> + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", >> + (uint16_t) ((gid >> 48) & 0xffff), >> + (uint16_t) ((gid >> 32) & 0xffff), >> + (uint16_t) ((gid >> 16) & 0xffff), >> + (uint16_t) ((gid >> 0) & 0xffff), >> + (uint16_t) ((guid >> 48) & 0xffff), >> + (uint16_t) ((guid >> 32) & 0xffff), >> + (uint16_t) ((guid >> 16) & 0xffff), >> + (uint16_t) ((guid >> 0) & 0xffff)); >> + >> + /* /sys/class/infiniband/mthca0/ports//pkeys/0 */ >> + str = path + strlen(path); >> + strncat(path, "/pkeys", sizeof(path) - 1); >> + make_path(path); >> + for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) { >> + char name[8]; >> + snprintf(name, sizeof(name), "%u", i); >> + file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i])); >> + } >> + *str = '\0'; >> >> - /* /sys/class/infiniband/mthca0/ports/1/pkeys/0 */ >> - str = path + strlen(path); >> - strncat(path, "/pkeys", sizeof(path) - 1); >> - make_path(path); >> - for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) { >> - char name[8]; >> - snprintf(name, sizeof(name), "%u", i); >> - file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i])); >> + *ports_path_end = '\0'; >> } >> - *str = '\0'; >> >> /* /sys/class/infiniband_mad/umad0/ */ >> snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir, >> @@ -564,8 +583,7 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) >> if (sim_client_init(&dev->sim_client) < 0) >> goto _error; >> >> - 
dev->port = mad_get_field(&dev->sim_client.portinfo, 0, >> - IB_PORT_LOCAL_PORT_F); >> + dev->port = dev->sim_client.portnum; >> for (i = 0; i < arrsize(dev->agents); i++) >> dev->agents[i].id = (uint32_t)(-1); >> for (i = 0; i < arrsize(dev->agent_idx); i++) >> -- >> 1.5.6.4 >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Tue Feb 17 13:33:26 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 17 Feb 2009 23:33:26 +0200 Subject: [ofa-general] Re: [PATCH] ibsim: Add better end port simulation support In-Reply-To: <20090214203753.GE32660@comcast.net> References: <20090214203753.GE32660@comcast.net> Message-ID: <20090217213326.GR7189@sashak.voltaire.com> On 15:37 Sat 14 Feb , hnrose at comcast.net wrote: > > Add SIM_PORT environment variable to allow for end port selection Also this patch looks like a mix of two independent ones - fetching all node ports and showing it in sysfs simulation and SIM_PORT. Likely more descriptive commit message would be helpful here. Sasha > > Signed-off-by: Hal Rosenstock > --- > ibsim/ibsim.c | 6 +- > include/ibsim.h | 2 + > umad2sim/sim_client.c | 49 +++++++++- > umad2sim/sim_client.h | 4 +- > umad2sim/umad2sim.c | 254 ++++++++++++++++++++++++++----------------------- > 5 files changed, 189 insertions(+), 126 deletions(-) > > diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c > index f48e1f0..6a35fdc 100644 > --- a/ibsim/ibsim.c > +++ b/ibsim/ibsim.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This file is part of ibsim. 
> * > @@ -187,7 +188,8 @@ static int sm_exists(Node * node) > return 0; > } > > -static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *from) > +static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, > + union name_t *from) > { > union name_t name; > size_t size; > @@ -219,7 +221,7 @@ static int sim_ctl_new_client(Client * cl, struct sim_ctl * ctl, union name_t *f > ctl->type = SIM_CTL_ERROR; > return -1; > } > - cl->port = node_get_port(node, 0); > + cl->port = node_get_port(node, scl->portnum); > VERB("Attaching client %d at node \"%s\" port 0x%" PRIx64, > i, node->nodeid, cl->port->portguid); > } else { > diff --git a/include/ibsim.h b/include/ibsim.h > index 15fc37c..66ba6f9 100644 > --- a/include/ibsim.h > +++ b/include/ibsim.h > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This file is part of ibsim. > * > @@ -100,6 +101,7 @@ struct sim_client_info { > uint32_t qp; > uint32_t issm; /* accept request for qp 0 & 1 */ > char nodeid[32]; > + uint32_t portnum; > }; > > union name_t { > diff --git a/umad2sim/sim_client.c b/umad2sim/sim_client.c > index 06bb7a8..1c35109 100644 > --- a/umad2sim/sim_client.c > +++ b/umad2sim/sim_client.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This file is part of ibsim. 
> * > @@ -182,6 +183,7 @@ static int sim_connect(struct sim_client *sc, int id, int qp, char *nodeid) > info.id = id; > info.issm = 0; > info.qp = qp; > + info.portnum = sc->portnum; > > if (nodeid) > strncpy(info.nodeid, nodeid, sizeof(info.nodeid) - 1); > @@ -202,7 +204,7 @@ static int sim_disconnect(struct sim_client *sc) > return sim_ctl(sc, SIM_CTL_DISCONNECT, 0, 0); > } > > -static int sim_init(struct sim_client *sc, char *nodeid) > +static int sim_init(struct sim_client *sc, char *nodeid, int portnum) > { > union name_t name; > socklen_t size; > @@ -238,6 +240,7 @@ static int sim_init(struct sim_client *sc, char *nodeid) > DEBUG("init %d: opened ctl fd %d as \'%s\'", > pid, ctlfd, get_name(&name)); > > + sc->portnum = portnum; > port = connect_port ? atoi(connect_port) : IBSIM_DEFAULT_SERVER_PORT; > size = make_name(&name, connect_host, port, "%s:ctl", socket_basename); > > @@ -286,9 +289,17 @@ int sim_client_set_sm(struct sim_client *sc, unsigned issm) > int sim_client_init(struct sim_client *sc) > { > char *nodeid; > + char *portno; > + int i, j = 0, portnum = 0, startport = 1, endport; > + uint8_t numports, nodetype; > + uint8_t *portinfo; > > nodeid = getenv("SIM_HOST"); > - if (sim_init(sc, nodeid) < 0) > + portno = getenv("SIM_PORT"); > + if (portno) > + portnum = atoi(portno); > + > + if (sim_init(sc, nodeid, portnum) < 0) > return -1; > if (sim_ctl(sc, SIM_CTL_GET_VENDOR, &sc->vendor, > sizeof(sc->vendor)) < 0) > @@ -296,11 +307,37 @@ int sim_client_init(struct sim_client *sc) > if (sim_ctl(sc, SIM_CTL_GET_NODEINFO, sc->nodeinfo, > sizeof(sc->nodeinfo)) < 0) > goto _exit; > + numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F); > + nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F); > + if (nodetype == 2) { // switch > + startport = 0; > + endport = 0; > + } else { > + if (portnum == 0) { > + IBWARN("portnum 0 is not valid end port on non switch node"); > + goto _exit; > + } > + endport = numports; > + } > + if (portnum > endport) { > 
+ IBWARN("portnum %d is not a valid end port number (%d)", > + portnum, endport); > + goto _exit; > + } > > - sc->portinfo[0] = 0; // portno requested > - if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, sc->portinfo, > - sizeof(sc->portinfo)) < 0) > + sc->portinfo = malloc(64 * (nodetype != 2 ? numports + 1 : 1)); // portinfo size x number of ports starting at 0 > + if (!sc->portinfo) > goto _exit; > + > + // loop through end ports > + for (i = startport; i <= endport ; i++, j++) { > + portinfo = sc->portinfo + 64 * j; > + *portinfo = i + 1; // portno requested > + if (sim_ctl(sc, SIM_CTL_GET_PORTINFO, portinfo, 64) < 0) > + goto _exit; > + } > + > + // although pkeys also per port, current config same on all end ports > if (sim_ctl(sc, SIM_CTL_GET_PKEYS, sc->pkeys, sizeof(sc->pkeys)) < 0) > goto _exit; > if (getenv("SIM_SET_ISSM")) > @@ -315,5 +352,7 @@ int sim_client_init(struct sim_client *sc) > void sim_client_exit(struct sim_client *sc) > { > sim_disconnect(sc); > + if (sc->portinfo) > + free(sc->portinfo); > sc->fd_ctl = sc->fd_pktin = sc->fd_pktout = -1; > } > diff --git a/umad2sim/sim_client.h b/umad2sim/sim_client.h > index 80ed442..0faca80 100644 > --- a/umad2sim/sim_client.h > +++ b/umad2sim/sim_client.h > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This file is part of ibsim. > * > @@ -41,8 +42,9 @@ struct sim_client { > int clientid; > int fd_pktin, fd_pktout, fd_ctl; > struct sim_vendor vendor; > + int portnum; > uint8_t nodeinfo[64]; > - uint8_t portinfo[64]; > + uint8_t *portinfo; > uint16_t pkeys[SIM_CTL_MAX_DATA/sizeof(uint16_t)]; > }; > > diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c > index 8d83a24..6e3c269 100644 > --- a/umad2sim/umad2sim.c > +++ b/umad2sim/umad2sim.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. 
> * > * This file is part of ibsim. > * > @@ -179,7 +180,10 @@ static int dev_sysfs_create(struct umad2sim_dev *dev) > struct sim_client *sc = &dev->sim_client; > char *str; > uint8_t *portinfo; > - int i; > + char *ports_path_end; > + int i, j; > + int startport = 1, endport; > + uint8_t numports, nodetype; > > /* /sys/class/infiniband_mad/abi_version */ > snprintf(path, sizeof(path), "%s", sysfs_infiniband_mad_dir); > @@ -232,123 +236,138 @@ static int dev_sysfs_create(struct umad2sim_dev *dev) > strncat(path, "/ports", sizeof(path) - 1); > make_path(path); > > - portinfo = sc->portinfo; > - > - /* /sys/class/infiniband/mthca0/ports/1/ */ > - val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); > - snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val); > - make_path(path); > - > - /* /sys/class/infiniband/mthca0/ports/1/lid_mask_count */ > - val = mad_get_field(portinfo, 0, IB_PORT_LMC_F); > - file_printf(path, SYS_PORT_LMC, "%d", val); > - > - /* /sys/class/infiniband/mthca0/ports/1/sm_lid */ > - val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F); > - file_printf(path, SYS_PORT_SMLID, "0x%x", val); > - > - /* /sys/class/infiniband/mthca0/ports/1/sm_sl */ > - val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F); > - file_printf(path, SYS_PORT_SMSL, "%d", val); > - > - /* /sys/class/infiniband/mthca0/ports/1/lid */ > - val = mad_get_field(portinfo, 0, IB_PORT_LID_F); > - file_printf(path, SYS_PORT_LID, "0x%x", val); > - > - /* /sys/class/infiniband/mthca0/ports/1/state */ > - val = mad_get_field(portinfo, 0, IB_PORT_STATE_F); > - if (val == 0) > - str = "NOP"; > - else if (val == 1) > - str = "DOWN"; > - else if (val == 2) > - str = "INIT"; > - else if (val == 3) > - str = "ARMED"; > - else if (val == 4) > - str = "ACTIVE"; > - else if (val == 5) > - str = "ACTIVE_DEFER"; > - else > - str = ""; > - file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str); > - > - /* /sys/class/infiniband/mthca0/ports/1/phys_state */ > - val = mad_get_field(portinfo, 
0, IB_PORT_PHYS_STATE_F); > - if (val == 1) > - str = "Sleep"; > - else if (val == 2) > - str = "Polling"; > - else if (val == 3) > - str = "Disabled"; > - else if (val == 4) > - str = "PortConfigurationTraining"; > - else if (val == 5) > - str = "LinkUp"; > - else if (val == 6) > - str = "LinkErrorRecovery"; > - else if (val == 7) > - str = "Phy Test"; > - else > - str = ""; > - file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str); > - > - /* /sys/class/infiniband/mthca0/ports/1/rate */ > - val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); > - speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F); > - if (val == 1) > - val = 1; > - else if (val == 2) > - val = 4; > - else if (val == 4) > - val = 8; > - else if (val == 8) > - val = 12; > - else > - val = 0; > - if (speed == 2) > - str = " DDR"; > - else if (speed == 4) > - str = " QDR"; > - else > - str = ""; > - file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n", > - (val * speed * 25) / 10, > - (val * speed * 25) % 10 ? 
".5" : "", val, str); > - > - /* /sys/class/infiniband/mthca0/ports/1/cap_mask */ > - val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F); > - file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val); > - > - /* /sys/class/infiniband/mthca0/ports/1/gids/0 */ > - str = path + strlen(path); > - strncat(path, "/gids", sizeof(path) - 1); > - make_path(path); > - *str = '\0'; > - gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F); > - guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + > - mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); > - file_printf(path, SYS_PORT_GID, > - "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", > - (uint16_t) ((gid >> 48) & 0xffff), > - (uint16_t) ((gid >> 32) & 0xffff), > - (uint16_t) ((gid >> 16) & 0xffff), > - (uint16_t) ((gid >> 0) & 0xffff), > - (uint16_t) ((guid >> 48) & 0xffff), > - (uint16_t) ((guid >> 32) & 0xffff), > - (uint16_t) ((guid >> 16) & 0xffff), > - (uint16_t) ((guid >> 0) & 0xffff)); > + numports = mad_get_field(sc->nodeinfo, 0, IB_NODE_NPORTS_F); > + nodetype = mad_get_field(sc->nodeinfo, 0, IB_NODE_TYPE_F); > + if (nodetype == 2) { // switch > + startport = 0; > + endport = 0; > + } else > + endport = numports; > + > + ports_path_end = path + strlen(path); > + > + // loop through end ports > + for (j = startport; j <= endport; j++) { > + > + portinfo = sc->portinfo + 64 * j; > + > + /* /sys/class/infiniband/mthca0/ports// */ > + val = mad_get_field(portinfo, 0, IB_PORT_LOCAL_PORT_F); > + snprintf(path + strlen(path), sizeof(path) - strlen(path), "/%u", val); > + make_path(path); > + > + /* /sys/class/infiniband/mthca0/ports//lid_mask_count */ > + val = mad_get_field(portinfo, 0, IB_PORT_LMC_F); > + file_printf(path, SYS_PORT_LMC, "%d", val); > + > + /* /sys/class/infiniband/mthca0/ports//sm_lid */ > + val = mad_get_field(portinfo, 0, IB_PORT_SMLID_F); > + file_printf(path, SYS_PORT_SMLID, "0x%x", val); > + > + /* /sys/class/infiniband/mthca0/ports//sm_sl */ > + val = mad_get_field(portinfo, 0, IB_PORT_SMSL_F); > + 
file_printf(path, SYS_PORT_SMSL, "%d", val); > + > + /* /sys/class/infiniband/mthca0/ports//lid */ > + val = mad_get_field(portinfo, 0, IB_PORT_LID_F); > + file_printf(path, SYS_PORT_LID, "0x%x", val); > + > + /* /sys/class/infiniband/mthca0/ports//state */ > + val = mad_get_field(portinfo, 0, IB_PORT_STATE_F); > + if (val == 0) > + str = "NOP"; > + else if (val == 1) > + str = "DOWN"; > + else if (val == 2) > + str = "INIT"; > + else if (val == 3) > + str = "ARMED"; > + else if (val == 4) > + str = "ACTIVE"; > + else if (val == 5) > + str = "ACTIVE_DEFER"; > + else > + str = ""; > + file_printf(path, SYS_PORT_STATE, "%d: %s\n", val, str); > + > + /* /sys/class/infiniband/mthca0/ports//phys_state */ > + val = mad_get_field(portinfo, 0, IB_PORT_PHYS_STATE_F); > + if (val == 1) > + str = "Sleep"; > + else if (val == 2) > + str = "Polling"; > + else if (val == 3) > + str = "Disabled"; > + else if (val == 4) > + str = "PortConfigurationTraining"; > + else if (val == 5) > + str = "LinkUp"; > + else if (val == 6) > + str = "LinkErrorRecovery"; > + else if (val == 7) > + str = "Phy Test"; > + else > + str = ""; > + file_printf(path, SYS_PORT_PHY_STATE, "%d: %s\n", val, str); > + > + /* /sys/class/infiniband/mthca0/ports//rate */ > + val = mad_get_field(portinfo, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); > + speed = mad_get_field(portinfo, 0, IB_PORT_LINK_SPEED_ACTIVE_F); > + if (val == 1) > + val = 1; > + else if (val == 2) > + val = 4; > + else if (val == 4) > + val = 8; > + else if (val == 8) > + val = 12; > + else > + val = 0; > + if (speed == 2) > + str = " DDR"; > + else if (speed == 4) > + str = " QDR"; > + else > + str = ""; > + file_printf(path, SYS_PORT_RATE, "%d%s Gb/sec (%dX%s)\n", > + (val * speed * 25) / 10, > + (val * speed * 25) % 10 ? 
".5" : "", val, str); > + > + /* /sys/class/infiniband/mthca0/ports//cap_mask */ > + val = mad_get_field(portinfo, 0, IB_PORT_CAPMASK_F); > + file_printf(path, SYS_PORT_CAPMASK, "0x%08x", val); > + > + /* /sys/class/infiniband/mthca0/ports//gids/0 */ > + str = path + strlen(path); > + strncat(path, "/gids", sizeof(path) - 1); > + make_path(path); > + *str = '\0'; > + gid = mad_get_field64(portinfo, 0, IB_PORT_GID_PREFIX_F); > + guid = mad_get_field64(sc->nodeinfo, 0, IB_NODE_GUID_F) + j; > + file_printf(path, SYS_PORT_GID, > + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", > + (uint16_t) ((gid >> 48) & 0xffff), > + (uint16_t) ((gid >> 32) & 0xffff), > + (uint16_t) ((gid >> 16) & 0xffff), > + (uint16_t) ((gid >> 0) & 0xffff), > + (uint16_t) ((guid >> 48) & 0xffff), > + (uint16_t) ((guid >> 32) & 0xffff), > + (uint16_t) ((guid >> 16) & 0xffff), > + (uint16_t) ((guid >> 0) & 0xffff)); > + > + /* /sys/class/infiniband/mthca0/ports//pkeys/0 */ > + str = path + strlen(path); > + strncat(path, "/pkeys", sizeof(path) - 1); > + make_path(path); > + for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) { > + char name[8]; > + snprintf(name, sizeof(name), "%u", i); > + file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i])); > + } > + *str = '\0'; > > - /* /sys/class/infiniband/mthca0/ports/1/pkeys/0 */ > - str = path + strlen(path); > - strncat(path, "/pkeys", sizeof(path) - 1); > - make_path(path); > - for (i = 0; i < sizeof(sc->pkeys)/sizeof(sc->pkeys[0]); i++) { > - char name[8]; > - snprintf(name, sizeof(name), "%u", i); > - file_printf(path, name, "0x%04x\n", ntohs(sc->pkeys[i])); > + *ports_path_end = '\0'; > } > - *str = '\0'; > > /* /sys/class/infiniband_mad/umad0/ */ > snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir, > @@ -564,8 +583,7 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) > if (sim_client_init(&dev->sim_client) < 0) > goto _error; > > - dev->port = mad_get_field(&dev->sim_client.portinfo, 0, > - 
IB_PORT_LOCAL_PORT_F); > + dev->port = dev->sim_client.portnum; > for (i = 0; i < arrsize(dev->agents); i++) > dev->agents[i].id = (uint32_t)(-1); > for (i = 0; i < arrsize(dev->agent_idx); i++) > -- > 1.5.6.4 > From sashak at voltaire.com Tue Feb 17 13:55:40 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 17 Feb 2009 23:55:40 +0200 Subject: [ofa-general] Re: [PATCH] ibsim: Add better end port simulation support In-Reply-To: References: <20090214203753.GE32660@comcast.net> <20090217211848.GP7189@sashak.voltaire.com> Message-ID: <20090217215533.GS7189@sashak.voltaire.com> On 16:28 Tue 17 Feb , Hal Rosenstock wrote: > Sasha, > > On Tue, Feb 17, 2009 at 4:18 PM, Sasha Khapyorsky wrote: > > Hi Hal, > > > > On 15:37 Sat 14 Feb , hnrose at comcast.net wrote: > >> > >> Add SIM_PORT environment variable to allow for end port selection > > > > How would this handle the case when SIM_PORT=N, but the program tries to work > > via another port (for example: SIM_PORT=2 and ibnetdiscover -P 1)? > > That's a configuration error. SIM_PORT needs to be set to the same port as the > program intends to use. These are different things: the program doesn't have to know about the simulator at all, so a dependency between '-C' and SIM_PORT is not a good idea. Actually, I think SIM_PORT is not needed at all; see below. > > IOW, should port number selection be initiated natively by the program rather > > than by using environment variables? > > That would've been nice, but AFAICT the simulation layer needs the port > number earlier than the program can supply it. That is true only of the current implementation, where the sysfs tree is generated (simulated) for just one port. Now, if you are going to fetch all PortInfo(s) anyway, the application can choose the port number through its regular mechanisms; there is no need for any SIM_PORT variable. (You will likely need an additional sim_ctl() call, triggered by umad open(), to set the port number on ibsim's client side.)
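Sasha's suggestion, fetching every PortInfo up front and letting the application pick the port through its usual options instead of SIM_PORT, could look roughly like the following. This is a hypothetical sketch, not ibsim code: `struct port_table`, `port_table_init()` and `port_table_lookup()` are invented names, and the layout (the record for port p stored at offset 64 * p) is an assumption chosen so that a port number indexes the buffer directly. The validity rules mirror the patch: a switch exposes only port 0 as an end port, while a CA exposes ports 1..numports.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PORTINFO_SIZE 64   /* bytes per PortInfo record, as in the patch */
#define NODE_TYPE_SWITCH 2 /* IB NodeType value for a switch */

/* Hypothetical container for all PortInfo records fetched at init time. */
struct port_table {
	uint8_t nodetype;
	uint8_t numports;
	uint8_t *blocks; /* record for port p lives at blocks + PORTINFO_SIZE * p */
};

/* Allocate room for ports 0..numports so a port number indexes directly. */
static int port_table_init(struct port_table *t, uint8_t nodetype,
			   uint8_t numports)
{
	t->nodetype = nodetype;
	t->numports = numports;
	t->blocks = calloc((size_t)numports + 1, PORTINFO_SIZE);
	return t->blocks ? 0 : -1;
}

/* Return the PortInfo record for the port the application asked for
 * (e.g. via its own -P option), or NULL if it is not a valid end port. */
static uint8_t *port_table_lookup(const struct port_table *t, unsigned port)
{
	if (t->nodetype == NODE_TYPE_SWITCH) {
		if (port != 0) /* only switch port 0 is an end port */
			return NULL;
	} else if (port == 0 || port > t->numports) {
		return NULL;
	}
	return t->blocks + (size_t)PORTINFO_SIZE * port;
}

static void port_table_free(struct port_table *t)
{
	free(t->blocks);
	t->blocks = NULL;
}
```

With this layout the port choice can be deferred until the application opens the device, which is the point Sasha makes about not needing the port number at simulator init time.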
> Maybe that could be > changed but I didn't dig into that. > > >> Signed-off-by: Hal Rosenstock [snip...] > >> + > >> + // although pkeys also per port, current config same on all end ports > > > > Which is not really correct. > > What are you referring to? Is there some config for end port pkeys in > the simulator? Each port on the ibsim side has its own pkey table (it has a default preset value and can be configured using OpenSM and maybe ibutils, so a special "out-of-band" config is not needed). We need to display it properly for each port. Sasha From swise at opengridcomputing.com Tue Feb 17 14:00:00 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 17 Feb 2009 16:00:00 -0600 Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Handle EEH events for active connections. Message-ID: <20090217215959.16117.17150.stgit@NTAC> - Wrap calls into cxgb3 and fail them if we're in the middle of an EEH event. - Correctly unwind and release the endpoint and other resources when we are in an EEH event. - Post a DEVICE_FATAL event on all active QPs when cxgb3 notifies iw_cxgb3 of a fatal error.
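The wrapper pattern described in the first bullet can be illustrated with a small user-space model. This is a hedged sketch, not driver code: `struct fake_rdev`, `lower_send()` and `guarded_send()` are hypothetical stand-ins for `struct cxio_rdev`, `cxgb3_ofld_send()` and the patch's `iwch_cxgb3_ofld_send()`, and `malloc`/`free` stand in for skb allocation and `kfree_skb()`. Like the patch, it assumes the lower-level send does not consume the buffer when it returns an error.

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-ins for the kernel objects; all names here are hypothetical. */
struct fake_skb { int len; };

struct fake_rdev {
	unsigned long flags; /* nonzero once a fatal error (e.g. EEH) was seen */
	int sent;            /* counts buffers handed to the "hardware" */
	int fail_next;       /* force the lower-level send to fail once */
};

/* Lower-level send: takes ownership of the buffer only on success. */
static int lower_send(struct fake_rdev *rdev, struct fake_skb *skb)
{
	if (rdev->fail_next) {
		rdev->fail_next = 0;
		return -ENOMEM; /* buffer is still owned by the caller */
	}
	rdev->sent++;
	free(skb); /* stand-in for the device consuming the buffer */
	return 0;
}

/* Wrapper: refuse to send during a fatal error and free the buffer on
 * every failure path, so callers never leak it and can just return err. */
static int guarded_send(struct fake_rdev *rdev, struct fake_skb *skb)
{
	int err;

	if (rdev->flags) {
		free(skb);
		return -EIO;
	}
	err = lower_send(rdev, skb);
	if (err)
		free(skb);
	return err;
}
```

Because the wrapper owns cleanup on failure, call sites such as `send_halfclose()` can simply return its result instead of returning 0 unconditionally, which is the change the patch makes throughout `iwch_cm.c`.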
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 8 +-- drivers/infiniband/hw/cxgb3/cxio_hal.h | 1 drivers/infiniband/hw/cxgb3/iwch.c | 26 +++++++++ drivers/infiniband/hw/cxgb3/iwch.h | 5 ++ drivers/infiniband/hw/cxgb3/iwch_cm.c | 90 +++++++++++++++++++++++--------- drivers/infiniband/hw/cxgb3/iwch_qp.c | 4 + 6 files changed, 101 insertions(+), 33 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index eeae5f5..99d114d 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -152,7 +152,7 @@ static int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) sge_cmd = qpid << 8 | 3; wqe->sge_cmd = cpu_to_be64(sge_cmd); skb->priority = CPL_PRIORITY_CONTROL; - return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); + return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); } int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) @@ -571,7 +571,7 @@ static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) (unsigned long long) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq, 1 << T3_CTRL_QP_SIZE_LOG2); skb->priority = CPL_PRIORITY_CONTROL; - return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); + return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); err: kfree_skb(skb); return err; @@ -858,7 +858,7 @@ int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size); wqe->irs = cpu_to_be32(attr->irs); skb->priority = 0; /* 0=>ToeQ; 1=>CtrlQ */ - return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); + return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); } void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb) @@ -1024,9 +1024,9 @@ void cxio_rdev_close(struct cxio_rdev *rdev_p) cxio_hal_pblpool_destroy(rdev_p); cxio_hal_rqtpool_destroy(rdev_p); list_del(&rdev_p->entry); - rdev_p->t3cdev_p->ulp = NULL; cxio_hal_destroy_ctrl_qp(rdev_p); cxio_hal_destroy_resource(rdev_p->rscp); + 
rdev_p->t3cdev_p->ulp = NULL; } } diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 9ed65b0..6cbf216 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -185,6 +185,7 @@ void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count); void cxio_flush_hw_cq(struct t3_cq *cq); int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, u8 *cqe_flushed, u64 *cookie, u32 *credit); +int iwch_cxgb3_ofld_send(struct t3cdev *tdev, struct sk_buff *skb); #define MOD "iw_cxgb3: " #define PDBG(fmt, args...) pr_debug(MOD fmt, ## args) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c index 37a4fc2..e5d57fa 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.c +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -162,15 +162,37 @@ static void close_rnic_dev(struct t3cdev *tdev) mutex_unlock(&dev_mutex); } +static int iwch_post_qp_fatal(int id, void *p, void *data) +{ + struct ib_event event; + struct iwch_qp *qhp = p; + + event.event = IB_EVENT_DEVICE_FATAL; + event.device = qhp->ibqp.device; + event.element.qp = &qhp->ibqp; + BUG_ON(qhp->rhp != data); + BUG_ON(qhp->wq.qpid != id); + if (qhp->ibqp.event_handler) { + PDBG("%s posting DEVICE_FATAL for qpid %u\n", + __func__, qhp->wq.qpid); + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); + } + return 0; +} + static void iwch_err_handler(struct t3cdev *tdev, u32 status, u32 error) { struct cxio_rdev *rdev = tdev->ulp; + struct iwch_dev *rnicp = rdev_to_iwch_dev(rdev); - if (status == OFFLOAD_STATUS_DOWN) + if (status == OFFLOAD_STATUS_DOWN) { rdev->flags = CXIO_ERROR_FATAL; + spin_lock_irq(&rnicp->lock); + idr_for_each(&rnicp->qpidr, iwch_post_qp_fatal, rnicp); + spin_unlock_irq(&rnicp->lock); + } return; - } static int __init iwch_init_module(void) diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h index 3773453..8473550 100644 --- 
a/drivers/infiniband/hw/cxgb3/iwch.h +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -117,6 +117,11 @@ static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) return container_of(ibdev, struct iwch_dev, ibdev); } +static inline struct iwch_dev *rdev_to_iwch_dev(struct cxio_rdev *rdev) +{ + return container_of(rdev, struct iwch_dev, rdev); +} + static inline int t3b_device(const struct iwch_dev *rhp) { return rhp->rdev.t3cdev_p->type == T3B; diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 8699947..8ef670d 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -139,6 +139,38 @@ static void stop_ep_timer(struct iwch_ep *ep) put_ep(&ep->com); } +int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e) +{ + int error=0; + struct cxio_rdev *rdev; + + rdev = (struct cxio_rdev *)tdev->ulp; + if (rdev->flags) { + kfree_skb(skb); + return -EIO; + } + error = l2t_send(tdev, skb, l2e); + if (error) + kfree_skb(skb); + return error; +} + +int iwch_cxgb3_ofld_send(struct t3cdev *tdev, struct sk_buff *skb) +{ + int error=0; + struct cxio_rdev *rdev; + + rdev = (struct cxio_rdev *)tdev->ulp; + if (rdev->flags) { + kfree_skb(skb); + return -EIO; + } + error = cxgb3_ofld_send(tdev, skb); + if (error) + kfree_skb(skb); + return error; +} + static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) { struct cpl_tid_release *req; @@ -150,7 +182,7 @@ static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid)); skb->priority = CPL_PRIORITY_SETUP; - cxgb3_ofld_send(tdev, skb); + iwch_cxgb3_ofld_send(tdev, skb); return; } @@ -172,8 +204,7 @@ int iwch_quiesce_tid(struct iwch_ep *ep) req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE); skb->priority = CPL_PRIORITY_DATA; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + 
return iwch_cxgb3_ofld_send(ep->com.tdev, skb); } int iwch_resume_tid(struct iwch_ep *ep) @@ -194,8 +225,7 @@ int iwch_resume_tid(struct iwch_ep *ep) req->val = 0; skb->priority = CPL_PRIORITY_DATA; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + return iwch_cxgb3_ofld_send(ep->com.tdev, skb); } static void set_emss(struct iwch_ep *ep, u16 opt) @@ -382,7 +412,7 @@ static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb) PDBG("%s t3cdev %p\n", __func__, dev); req->cmd = CPL_ABORT_NO_RST; - cxgb3_ofld_send(dev, skb); + iwch_cxgb3_ofld_send(dev, skb); } static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) @@ -402,8 +432,7 @@ static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON)); req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid)); - l2t_send(ep->com.tdev, skb, ep->l2t); - return 0; + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); } static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) @@ -424,8 +453,7 @@ static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid)); req->cmd = CPL_ABORT_SEND_RST; - l2t_send(ep->com.tdev, skb, ep->l2t); - return 0; + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); } static int send_connect(struct iwch_ep *ep) @@ -469,8 +497,7 @@ static int send_connect(struct iwch_ep *ep) req->opt0l = htonl(opt0l); req->params = 0; req->opt2 = htonl(opt2); - l2t_send(ep->com.tdev, skb, ep->l2t); - return 0; + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); } static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) @@ -527,7 +554,7 @@ static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) req->sndseq = htonl(ep->snd_seq); BUG_ON(ep->mpa_skb); ep->mpa_skb = skb; - l2t_send(ep->com.tdev, skb, ep->l2t); + iwch_l2t_send(ep->com.tdev, skb, ep->l2t); 
start_ep_timer(ep); state_set(&ep->com, MPA_REQ_SENT); return; @@ -578,8 +605,7 @@ static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen) req->sndseq = htonl(ep->snd_seq); BUG_ON(ep->mpa_skb); ep->mpa_skb = skb; - l2t_send(ep->com.tdev, skb, ep->l2t); - return 0; + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); } static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) @@ -630,8 +656,7 @@ static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) req->sndseq = htonl(ep->snd_seq); ep->mpa_skb = skb; state_set(&ep->com, MPA_REP_SENT); - l2t_send(ep->com.tdev, skb, ep->l2t); - return 0; + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); } static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) @@ -795,7 +820,7 @@ static int update_rx_credits(struct iwch_ep *ep, u32 credits) OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid)); req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1)); skb->priority = CPL_PRIORITY_ACK; - cxgb3_ofld_send(ep->com.tdev, skb); + iwch_cxgb3_ofld_send(ep->com.tdev, skb); return credits; } @@ -1203,8 +1228,7 @@ static int listen_start(struct iwch_listen_ep *ep) req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); skb->priority = 1; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + return iwch_cxgb3_ofld_send(ep->com.tdev, skb); } static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) @@ -1237,8 +1261,7 @@ static int listen_stop(struct iwch_listen_ep *ep) req->cpu_idx = 0; OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); skb->priority = 1; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + return iwch_cxgb3_ofld_send(ep->com.tdev, skb); } static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, @@ -1286,7 +1309,7 @@ static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) rpl->opt2 = htonl(opt2); rpl->rsvd = rpl->opt2; /* workaround for HW bug */ 
skb->priority = CPL_PRIORITY_SETUP; - l2t_send(ep->com.tdev, skb, ep->l2t); + iwch_l2t_send(ep->com.tdev, skb, ep->l2t); return; } @@ -1315,7 +1338,7 @@ static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip, rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); rpl->opt2 = 0; rpl->rsvd = rpl->opt2; - cxgb3_ofld_send(tdev, skb); + iwch_cxgb3_ofld_send(tdev, skb); } } @@ -1613,7 +1636,7 @@ static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid)); rpl->cmd = CPL_ABORT_NO_RST; - cxgb3_ofld_send(ep->com.tdev, rpl_skb); + iwch_cxgb3_ofld_send(ep->com.tdev, rpl_skb); out: if (release) release_ep_resources(ep); @@ -2017,8 +2040,11 @@ int iwch_destroy_listen(struct iw_cm_id *cm_id) ep->com.rpl_done = 0; ep->com.rpl_err = 0; err = listen_stop(ep); + if (err) + goto done; wait_event(ep->com.waitq, ep->com.rpl_done); cxgb3_free_stid(ep->com.tdev, ep->stid); +done: err = ep->com.rpl_err; cm_id->rem_ref(cm_id); put_ep(&ep->com); @@ -2030,12 +2056,22 @@ int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) int ret=0; unsigned long flags; int close = 0; + int fatal = 0; + struct t3cdev *tdev; + struct cxio_rdev *rdev; spin_lock_irqsave(&ep->com.lock, flags); PDBG("%s ep %p state %s, abrupt %d\n", __func__, ep, states[ep->com.state], abrupt); + tdev = (struct t3cdev *)ep->com.tdev; + rdev = (struct cxio_rdev *)tdev->ulp; + if (rdev->flags) { + fatal = 1; + close_complete_upcall(ep); + ep->com.state = DEAD; + } switch (ep->com.state) { case MPA_REQ_WAIT: case MPA_REQ_SENT: @@ -2075,7 +2111,11 @@ int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) ret = send_abort(ep, NULL, gfp); else ret = send_halfclose(ep, gfp); + if (ret) + fatal = 1; } + if (fatal) + release_ep_resources(ep); return ret; } diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index aa72d18..9324aa1 100644 --- 
a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -751,7 +751,7 @@ int iwch_post_zb_read(struct iwch_qp *qhp) wqe->send.wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(qhp->ep->hwtid)| V_FW_RIWR_LEN(flit_cnt)); skb->priority = CPL_PRIORITY_DATA; - return cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb); + return iwch_cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb); } /* @@ -783,7 +783,7 @@ int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg) V_FW_RIWR_FLAGS(T3_COMPLETION_FLAG | T3_NOTIFY_FLAG)); wqe->send.wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(qhp->ep->hwtid)); skb->priority = CPL_PRIORITY_DATA; - return cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb); + return iwch_cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb); } /* From sashak at voltaire.com Tue Feb 17 14:09:33 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 00:09:33 +0200 Subject: [ofa-general] [PATCH] opensm: pre-scan command line for config file option Message-ID: <20090217220933.GT7189@sashak.voltaire.com> Scan the command line for the config file option and, if it is found, parse the config file before processing the other command line options. This prevents options listed before '-F' from being set more than once (previously the whole command line was rescanned after the config file was loaded).
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/main.c | 37 ++++++++++++++++++++++--------------- 1 files changed, 22 insertions(+), 15 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index a8dc9e6..a632cd7 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -522,9 +522,8 @@ int main(int argc, char *argv[]) boolean_t run_once_flag = FALSE; int32_t vendor_debug = 0; uint32_t next_option; - char *conf_template = NULL; + char *conf_template = NULL, *config_file = NULL; uint32_t val; - unsigned config_file_done = 0; const char *const short_option = "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:"; @@ -608,7 +607,26 @@ int main(int argc, char *argv[]) osm_subn_set_default_opt(&opt); if (osm_subn_parse_conf_file(OSM_DEFAULT_CONFIG_FILE, &opt) < 0) - printf("\nosm_subn_parse_conf_file failed!\n"); + printf("\nFail to parse config file \'%s\'\n", + OSM_DEFAULT_CONFIG_FILE); + + do { + next_option = getopt_long_only(argc, argv, short_option, + long_option, NULL); + switch (next_option) { + case 'F': + config_file = optarg; + printf("Config file is `%s`:\n", config_file); + break; + default: + break; + } + } while (next_option != -1); + + optind = 0; /* reset command line */ + + if (config_file && osm_subn_parse_conf_file(config_file, &opt) < 0) + printf("\nFail to parse config file \'%s\'\n", config_file); printf("Command Line Arguments:\n"); do { @@ -619,16 +637,6 @@ int main(int argc, char *argv[]) exit(0); break; case 'F': - if (config_file_done) - break; - printf("Reloading config from `%s`:\n", optarg); - if (osm_subn_parse_conf_file(optarg, &opt)) { - printf("cannot parse config file.\n"); - exit(1); - } - printf("Rescaning command line:\n"); - config_file_done = 1; - optind = 0; break; case 'c': conf_template = optarg; @@ -936,8 +944,7 @@ int main(int argc, char *argv[]) default: /* something wrong */ abort(); } - } - while (next_option != -1); + } while (next_option != -1); if (opt.log_file != NULL) printf(" 
Log File: %s\n", opt.log_file); -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Tue Feb 17 14:25:19 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 00:25:19 +0200 Subject: [ofa-general] [PATCH] opensm/osm_subnet.c: move parse and setup functions Message-ID: <20090217222519.GU7189@sashak.voltaire.com> Move options parse and setup functions above options rec struct initialization - eliminate prototyping, typedefs, etc. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_subnet.c | 421 +++++++++++++++++++++----------------------- 1 files changed, 204 insertions(+), 217 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 69937c1..f12685e 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -73,25 +73,217 @@ static const char null_str[] = "(null)"; #define OPT_OFFSET(opt) offsetof(osm_subn_opt_t, opt) -typedef void (setup_fn_t)(osm_subn_t *p_subn, void *p_val); -typedef void (parse_fn_t)(osm_subn_t *p_subn, char *p_key, char *p_val_str, - void *p_val, setup_fn_t *f); - typedef struct opt_rec { const char *name; unsigned long opt_offset; - parse_fn_t *parse_fn; - setup_fn_t *setup_fn; + void (*parse_fn)(osm_subn_t *p_subn, char *p_key, char *p_val_str, + void *p_val, void (*)(osm_subn_t *, void *)); + void (*setup_fn)(osm_subn_t *p_subn, void *p_val); int can_update; } opt_rec_t; -static parse_fn_t opts_parse_uint8, opts_parse_uint16, opts_parse_net16, - opts_parse_uint32, opts_parse_int32, opts_parse_net64, - opts_parse_charp, opts_parse_boolean; +static void log_report(const char *fmt, ...) +{ + char buf[128]; + va_list args; + va_start(args, fmt); + vsnprintf(buf, sizeof(buf), fmt, args); + va_end(args); + printf("%s", buf); + cl_log_event("OpenSM", CL_LOG_INFO, buf, NULL, 0); +} + +static void log_config_value(char *name, const char *fmt, ...) 
+{ + char buf[128]; + va_list args; + unsigned n; + va_start(args, fmt); + n = snprintf(buf, sizeof(buf), " Loading Cached Option:%s = ", name); + if (n > sizeof(buf)) + n = sizeof(buf); + n += vsnprintf(buf + n, sizeof(buf) - n, fmt, args); + if (n > sizeof(buf)) + n = sizeof(buf); + snprintf(buf + n, sizeof(buf) - n, "\n"); + va_end(args); + printf("%s", buf); + cl_log_event("OpenSM", CL_LOG_INFO, buf, NULL, 0); +} + +static void opts_setup_log_flags(osm_subn_t *p_subn, void *p_val) +{ + p_subn->p_osm->log.level = *((uint8_t *) p_val); +} + +static void opts_setup_force_log_flush(osm_subn_t *p_subn, void *p_val) +{ + p_subn->p_osm->log.flush = *((boolean_t *) p_val); +} + +static void opts_setup_accum_log_file(osm_subn_t *p_subn, void *p_val) +{ + p_subn->p_osm->log.accum_log_file = *((boolean_t *) p_val); +} + +static void opts_setup_log_max_size(osm_subn_t *p_subn, void *p_val) +{ + uint32_t log_max_size = *((uint32_t *) p_val); + + p_subn->p_osm->log.max_size = log_max_size << 20; /* convert from MB to bytes */ +} + +static void opts_setup_sminfo_polling_timeout(osm_subn_t *p_subn, void *p_val) +{ + osm_sm_t *p_sm = &p_subn->p_osm->sm; + uint32_t sminfo_polling_timeout = *((uint32_t *) p_val); + + cl_timer_stop(&p_sm->polling_timer); + cl_timer_start(&p_sm->polling_timer, sminfo_polling_timeout); +} + +static void opts_setup_sm_priority(osm_subn_t *p_subn, void *p_val) +{ + osm_sm_t *p_sm = &p_subn->p_osm->sm; + uint8_t sm_priority = *((uint8_t *) p_val); + + osm_set_sm_priority(p_sm, sm_priority); +} + +static void opts_parse_net64(IN osm_subn_t *p_subn, IN char *p_key, + IN char *p_val_str, IN void *p_v, + void (*pfn)(osm_subn_t *, void *)) +{ + uint64_t *p_val = p_v; + uint64_t val = strtoull(p_val_str, NULL, 0); + + if (cl_hton64(val) != *p_val) { + log_config_value(p_key, "0x%016" PRIx64, val); + if (pfn) + pfn(p_subn, &val); + *p_val = cl_ntoh64(val); + } +} + +static void opts_parse_uint32(IN osm_subn_t *p_subn, IN char *p_key, + IN char *p_val_str, IN 
void *p_v, + void (*pfn)(osm_subn_t *, void *)) +{ + uint32_t *p_val = p_v; + uint32_t val = strtoul(p_val_str, NULL, 0); + + if (val != *p_val) { + log_config_value(p_key, "%u", val); + if (pfn) + pfn(p_subn, &val); + *p_val = val; + } +} + +static void opts_parse_int32(IN osm_subn_t *p_subn, IN char *p_key, + IN char *p_val_str, IN void *p_v, + void (*pfn)(osm_subn_t *, void *)) +{ + int32_t *p_val = p_v; + int32_t val = strtol(p_val_str, NULL, 0); + + if (val != *p_val) { + log_config_value(p_key, "%d", val); + if (pfn) + pfn(p_subn, &val); + *p_val = val; + } +} + +static void opts_parse_uint16(IN osm_subn_t *p_subn, IN char *p_key, + IN char *p_val_str, IN void *p_v, + void (*pfn)(osm_subn_t *, void *)) +{ + uint16_t *p_val = p_v; + uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0); + + if (val != *p_val) { + log_config_value(p_key, "%u", val); + if (pfn) + pfn(p_subn, &val); + *p_val = val; + } +} + +static void opts_parse_net16(IN osm_subn_t *p_subn, IN char *p_key, + IN char *p_val_str, IN void *p_v, + void (*pfn)(osm_subn_t *, void *)) +{ + uint16_t *p_val = p_v; + uint16_t val = strtoul(p_val_str, NULL, 0); + + CL_ASSERT(val < 0x10000); + if (cl_hton16(val) != *p_val) { + log_config_value(p_key, "0x%04x", val); + if (pfn) + pfn(p_subn, &val); + *p_val = cl_hton16(val); + } +} + +static void opts_parse_uint8(IN osm_subn_t *p_subn, IN char *p_key, + IN char *p_val_str, IN void *p_v, + void (*pfn)(osm_subn_t *, void *)) +{ + uint8_t *p_val = p_v; + uint8_t val = strtoul(p_val_str, NULL, 0); + + CL_ASSERT(val < 0x100); + if (val != *p_val) { + log_config_value(p_key, "%u", val); + if (pfn) + pfn(p_subn, &val); + *p_val = val; + } +} + +static void opts_parse_boolean(IN osm_subn_t *p_subn, IN char *p_key, + IN char *p_val_str, IN void *p_v, + void (*pfn)(osm_subn_t *, void *)) +{ + boolean_t *p_val = p_v; + boolean_t val; -static setup_fn_t opts_setup_log_flags, opts_setup_log_max_size, - opts_setup_force_log_flush, opts_setup_accum_log_file, - 
opts_setup_sminfo_polling_timeout, opts_setup_sm_priority; + if (!p_val_str) + return; + + if (strcmp("TRUE", p_val_str)) + val = FALSE; + else + val = TRUE; + + if (val != *p_val) { + log_config_value(p_key, "%s", p_val_str); + if (pfn) + pfn(p_subn, &val); + *p_val = val; + } +} + +static void opts_parse_charp(IN osm_subn_t *p_subn, IN char *p_key, + IN char *p_val_str, IN void *p_v, + void (*pfn)(osm_subn_t *, void *)) +{ + char **p_val = p_v; + const char *current_str = *p_val ? *p_val : null_str ; + + if (p_val_str && strcmp(p_val_str, current_str)) { + char *new; + log_config_value(p_key, "%s", p_val_str); + /* special case the "(null)" string */ + new = strcmp(null_str, p_val_str) ? strdup(p_val_str) : NULL; + if (pfn) + pfn(p_subn, new); + if (*p_val) + free(*p_val); + *p_val = new; + } +} static const opt_rec_t opt_tbl[] = { { "guid", OPT_OFFSET(guid), opts_parse_net64, NULL, 0 }, @@ -196,45 +388,6 @@ static const opt_rec_t opt_tbl[] = { {0} }; -static void opts_setup_log_flags(osm_subn_t *p_subn, void *p_val) -{ - p_subn->p_osm->log.level = *((uint8_t *) p_val); -} - -static void opts_setup_force_log_flush(osm_subn_t *p_subn, void *p_val) -{ - p_subn->p_osm->log.flush = *((boolean_t *) p_val); -} - -static void opts_setup_accum_log_file(osm_subn_t *p_subn, void *p_val) -{ - p_subn->p_osm->log.accum_log_file = *((boolean_t *) p_val); -} - -static void opts_setup_log_max_size(osm_subn_t *p_subn, void *p_val) -{ - uint32_t log_max_size = *((uint32_t *) p_val); - - p_subn->p_osm->log.max_size = log_max_size << 20; /* convert from MB to bytes */ -} - -static void opts_setup_sminfo_polling_timeout(osm_subn_t *p_subn, void *p_val) -{ - osm_sm_t *p_sm = &p_subn->p_osm->sm; - uint32_t sminfo_polling_timeout = *((uint32_t *) p_val); - - cl_timer_stop(&p_sm->polling_timer); - cl_timer_start(&p_sm->polling_timer, sminfo_polling_timeout); -} - -static void opts_setup_sm_priority(osm_subn_t *p_subn, void *p_val) -{ - osm_sm_t *p_sm = &p_subn->p_osm->sm; - uint8_t 
sm_priority = *((uint8_t *) p_val); - - osm_set_sm_priority(p_sm, sm_priority); -} - /********************************************************************** **********************************************************************/ void osm_subn_construct(IN osm_subn_t * const p_subn) @@ -596,172 +749,6 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) /********************************************************************** **********************************************************************/ -static void log_report(const char *fmt, ...) -{ - char buf[128]; - va_list args; - va_start(args, fmt); - vsnprintf(buf, sizeof(buf), fmt, args); - va_end(args); - printf("%s", buf); - cl_log_event("OpenSM", CL_LOG_INFO, buf, NULL, 0); -} - -static void log_config_value(char *name, const char *fmt, ...) -{ - char buf[128]; - va_list args; - unsigned n; - va_start(args, fmt); - n = snprintf(buf, sizeof(buf), " Loading Cached Option:%s = ", name); - if (n > sizeof(buf)) - n = sizeof(buf); - n += vsnprintf(buf + n, sizeof(buf) - n, fmt, args); - if (n > sizeof(buf)) - n = sizeof(buf); - snprintf(buf + n, sizeof(buf) - n, "\n"); - va_end(args); - printf("%s", buf); - cl_log_event("OpenSM", CL_LOG_INFO, buf, NULL, 0); -} - -static void opts_parse_net64(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, - IN setup_fn_t pfn) -{ - uint64_t *p_val = p_v; - uint64_t val = strtoull(p_val_str, NULL, 0); - - if (cl_hton64(val) != *p_val) { - log_config_value(p_key, "0x%016" PRIx64, val); - if (pfn) - pfn(p_subn, &val); - *p_val = cl_ntoh64(val); - } -} - -static void opts_parse_uint32(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, - IN setup_fn_t pfn) -{ - uint32_t *p_val = p_v; - uint32_t val = strtoul(p_val_str, NULL, 0); - - if (val != *p_val) { - log_config_value(p_key, "%u", val); - if (pfn) - pfn(p_subn, &val); - *p_val = val; - } -} - -static void opts_parse_int32(IN osm_subn_t *p_subn, IN char *p_key, - IN char 
*p_val_str, IN void *p_v, - IN setup_fn_t pfn) -{ - int32_t *p_val = p_v; - int32_t val = strtol(p_val_str, NULL, 0); - - if (val != *p_val) { - log_config_value(p_key, "%d", val); - if (pfn) - pfn(p_subn, &val); - *p_val = val; - } -} - -static void opts_parse_uint16(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, - IN setup_fn_t pfn) -{ - uint16_t *p_val = p_v; - uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0); - - if (val != *p_val) { - log_config_value(p_key, "%u", val); - if (pfn) - pfn(p_subn, &val); - *p_val = val; - } -} - -static void opts_parse_net16(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, - IN setup_fn_t pfn) -{ - uint16_t *p_val = p_v; - uint16_t val = strtoul(p_val_str, NULL, 0); - - CL_ASSERT(val < 0x10000); - if (cl_hton16(val) != *p_val) { - log_config_value(p_key, "0x%04x", val); - if (pfn) - pfn(p_subn, &val); - *p_val = cl_hton16(val); - } -} - -static void opts_parse_uint8(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, - IN setup_fn_t pfn) -{ - uint8_t *p_val = p_v; - uint8_t val = strtoul(p_val_str, NULL, 0); - - CL_ASSERT(val < 0x100); - if (val != *p_val) { - log_config_value(p_key, "%u", val); - if (pfn) - pfn(p_subn, &val); - *p_val = val; - } -} - -static void opts_parse_boolean(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, - IN setup_fn_t pfn) -{ - boolean_t *p_val = p_v; - boolean_t val; - - if (!p_val_str) - return; - - if (strcmp("TRUE", p_val_str)) - val = FALSE; - else - val = TRUE; - - if (val != *p_val) { - log_config_value(p_key, "%s", p_val_str); - if (pfn) - pfn(p_subn, &val); - *p_val = val; - } -} - -static void opts_parse_charp(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, - IN setup_fn_t pfn) -{ - char **p_val = p_v; - const char *current_str = *p_val ? 
*p_val : null_str ; - - if (p_val_str && strcmp(p_val_str, current_str)) { - char *new; - log_config_value(p_key, "%s", p_val_str); - /* special case the "(null)" string */ - new = strcmp(null_str, p_val_str) ? strdup(p_val_str) : NULL; - if (pfn) - pfn(p_subn, new); - if (*p_val) - free(*p_val); - *p_val = new; - } -} - -/********************************************************************** - **********************************************************************/ static char *clean_val(char *val) { char *p = val; -- 1.6.1.2.319.gbd9e From rdreier at cisco.com Tue Feb 17 14:27:28 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Feb 2009 14:27:28 -0800 Subject: [ofa-general] Re: [PATCH] RDMA/nes: Inform hardware that asynchronous event has been handled In-Reply-To: <20090213212431.GA7092@ctung-MOBL> (Chien Tung's message of "Fri, 13 Feb 2009 15:24:31 -0600") References: <20090213212431.GA7092@ctung-MOBL> Message-ID: thanks,applied From sean.hefty at intel.com Tue Feb 17 14:27:35 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Feb 2009 14:27:35 -0800 Subject: [ofa-general] [PATCH 0/8] ib-mgmt: add support for WinOF Message-ID: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Enable IB management diagnostic tools to support both OFED and WinOF releases. Only 8 of the diags have been ported to both platforms. These changes allow the management.git tree to drop into the WinOF build environment. The following applies only to WinOF. 
The WinOF environment adds the following:
src/ibdiag_windows.c - windows specific source file built as part of all diags (includes getopt.c)
include/windows/ - directory for windows version of include files
  config.h - included by all diags as an 'OS independent' file; mainly #defines to map stuff like foo to _foo
  ibdiag_version.h - defines IBDIAG_VERSION
  inttypes.h - empty include file
  unistd.h - empty include file
  netinet/in.h - empty include file
cl_nodenammemap - was added to Windows user complib
I'll submit the patches to the WinOF tree that touch areas outside of the tools/infiniband-diags directory separately to the ofw mail list. Signed-off-by: Sean Hefty From sean.hefty at intel.com Tue Feb 17 14:30:54 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Feb 2009 14:30:54 -0800 Subject: [ofa-general] [PATCH 1/8] [ib-diag] sminfo: add support for WinOF In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: <875F4D297F9E4C0297A87743053C95C7@amr.corp.intel.com> Allow sminfo to build and run on both Linux and Windows. Windows build files are maintained in the WinOF repository. These changes allow dropping the infiniband-diags into the WinOF build environment.
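Most of the sminfo changes are C portability fixes: empty `{}` table terminators become `{ 0 }` (empty initializer braces are a GNU extension, only standardized in C23, and most likely rejected by Microsoft's C compiler), GCC's obsolete `[index] value` designators become positional initializers, and `make_str_opts()` gains explicit casts. The sketch below shows how a `{ 0 }`-terminated `struct option` table can be walked to build a getopt() short-option string. It is a re-sketch in the spirit of `make_str_opts()` in ibdiag_common.c, not that function; `build_short_opts` is a hypothetical name.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Build a getopt() short-option string such as "hp:d::" from a
 * sentinel-terminated struct option table. Each entry contributes its
 * letter plus one ':' per has_arg level (2 = optional argument). */
static void build_short_opts(const struct option *o, char *p, size_t size)
{
	size_t n = 0;

	assert(o != NULL && p != NULL);
	if (size == 0)
		return;
	/* Stop at the { 0 } sentinel (name == NULL), or when the next
	 * entry plus the terminating NUL might not fit in the buffer. */
	for (; o->name && n + 2 + (size_t)o->has_arg < size; o++) {
		int i;

		p[n++] = (char)o->val;
		for (i = 0; i < o->has_arg; i++)
			p[n++] = ':';
	}
	p[n] = '\0';
}
```

Ending the table with `{ 0 }` zero-initializes every member of the last entry in any C standard, so the `o->name` check finds the end of the table on both the Linux and Windows toolchains.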
Signed-off-by: Sean Hefty --- Note: all patches are also available at: git://git.openfabrics.org/~shefty/ib-mgmt.git master infiniband-diags/src/ibdiag_common.c | 10 ++++------ infiniband-diags/src/sminfo.c | 10 +++++----- 2 files changed, 9 insertions(+), 11 deletions(-) diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index bda1efa..5f2472d 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -204,7 +204,7 @@ static const struct ibdiag_opt common_opts[] = { { "usage", 'u', 0, NULL, "usage message" }, { "help", 'h', 0, NULL, "help message" }, { "version", 'V', 0, NULL, "show version" }, - {} + { 0 } }; static void make_opt(struct option *l, const struct ibdiag_opt *o, @@ -254,11 +254,11 @@ static struct option *make_long_opts(const char *exclude_str, static void make_str_opts(const struct option *o, char *p, unsigned size) { - int i, n = 0; + unsigned i, n = 0; for (n = 0; o->name && n + 2 + o->has_arg < size; o++) { - p[n++] = o->val; - for (i = 0; i < o->has_arg; i++) + p[n++] = (char) o->val; + for (i = 0; i < (unsigned) o->has_arg; i++) p[n++] = ':'; } p[n] = '\0'; @@ -273,8 +273,6 @@ int ibdiag_process_opts(int argc, char * const argv[], void *cxt, char str_opts[1024]; const struct ibdiag_opt *o; - memset(opts_map, 0, sizeof(opts_map)); - prog_name = argv[0]; prog_args = usage_args; prog_examples = usage_examples; diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c index e96c782..549cb81 100644 --- a/infiniband-diags/src/sminfo.c +++ b/infiniband-diags/src/sminfo.c @@ -59,10 +59,10 @@ enum { }; char *statestr[] = { - [SMINFO_NOTACT] "SMINFO_NOTACT", - [SMINFO_DISCOVER] "SMINFO_DISCOVER", - [SMINFO_STANDBY] "SMINFO_STANDBY", - [SMINFO_MASTER] "SMINFO_MASTER", + "SMINFO_NOTACT", + "SMINFO_DISCOVER", + "SMINFO_STANDBY", + "SMINFO_MASTER", }; #define STATESTR(s) (((unsigned)(s)) < SMINFO_STATE_LAST ? 
 statestr[s] : "???") @@ -100,7 +100,7 @@ int main(int argc, char **argv) { "state", 's', 1, "<0-3>", "set SM state"}, { "priority", 'p', 1, "<0-15>", "set SM priority"}, { "activity", 'a', 1, NULL, "set activity count"}, - { } + { 0 } }; char usage_args[] = " [modifier]"; From sean.hefty at intel.com Tue Feb 17 14:31:31 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Feb 2009 14:31:31 -0800 Subject: [ofa-general] [PATCH 2/8] [ib-diag] vendstat: add support for WinOF In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: <056C849D67044206B1E98248885FAA00@amr.corp.intel.com> Allow vendstat to build and run on both Linux and Windows. Windows build files are maintained in the WinOF repository. These changes allow dropping the infiniband-diags into the WinOF build environment. Signed-off-by: Sean Hefty --- infiniband-diags/src/vendstat.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c index 7e8b162..db87e38 100644 --- a/infiniband-diags/src/vendstat.c +++ b/infiniband-diags/src/vendstat.c @@ -134,7 +134,7 @@ int main(int argc, char **argv) const struct ibdiag_opt opts[] = { { "N", 'N', 0, NULL, "show IS3 general information"}, { "w", 'w', 0, NULL, "show IS3 port xmit wait counters"}, - {} + { 0 } }; char usage_args[] = ""; const char *usage_examples[] = { From sean.hefty at intel.com Tue Feb 17 14:32:05 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Feb 2009 14:32:05 -0800 Subject: [ofa-general] [PATCH 3/8] [ib-diag] ibaddr: add support for WinOF In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: <9C13C064B6594BDF8A52393F37267001@amr.corp.intel.com> Allow ibaddr to build and run on both Linux and Windows. 
Windows build files are maintained in the WinOF repository.
These changes allow dropping the infiniband-diags into the WinOF build environment. Signed-off-by: Sean Hefty --- infiniband-diags/src/ibaddr.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c index 88ad904..9098699 100644 --- a/infiniband-diags/src/ibaddr.c +++ b/infiniband-diags/src/ibaddr.c @@ -112,7 +112,7 @@ int main(int argc, char **argv) { "gid_show", 'g', 0, NULL, "show gid address only"}, { "lid_show", 'l', 0, NULL, "show lid range only"}, { "Lid_show", 'L', 0, NULL, "show lid range (in decimal) only"}, - {} + { 0 } }; char usage_args[] = "[]"; const char *usage_examples[] = { From sean.hefty at intel.com Tue Feb 17 14:32:38 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Feb 2009 14:32:38 -0800 Subject: [ofa-general] [PATCH 4/8] [ib-diag] perfquery: add support for WinOF In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: Allow perfquery to build and run on both Linux and Windows. Windows build files are maintained in the WinOF repository. These changes allow dropping the infiniband-diags into the WinOF build environment.
Signed-off-by: Sean Hefty --- infiniband-diags/src/perfquery.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 8786027..6292743 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -353,7 +353,7 @@ int main(int argc, char **argv) { "loop_ports", 'l', 0, NULL, "iterate through each port" }, { "reset_after_read", 'r', 0, NULL, "reset counters after read" }, { "Reset_only", 'R', 0, NULL, "only reset counters" }, - { } + { 0 } }; char usage_args[] = " [ [[port] [reset_mask]]]"; const char *usage_examples[] = { From weiny2 at llnl.gov Tue Feb 17 14:28:59 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 17 Feb 2009 14:28:59 -0800 Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread In-Reply-To: References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> Message-ID: <20090217142859.9e7a7e22.weiny2@llnl.gov> On Tue, 17 Feb 2009 16:12:12 -0500 Hal Rosenstock wrote: > On Tue, Feb 17, 2009 at 12:19 PM, wrote: > > Quoting Hal Rosenstock : > > > >> Sasha, > >> > >> On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky > >> wrote: > >>> > >>> I looked at implementation of safe_*() functions (safe_smp_query, > >>> safe_smp_set and safe_ca_call) and found that they are not actually > >>> "safe" as declared by its names. The only thread-unsafe thing which > >>> is used there is static 'mad_portid' structure (from rpc.c), > >> > >> I'm not sure that the only thread unsafe thing in the mad rpc > >> mechanism is the portid. > >> > >>> but modification of this structure is not protected by same mutex > >>> (actually > >>> not protected at all). > >> > >> A first step would be removing the portid as static. 
If so, portid > >> would need to be a supplied parameter to various mad routines and the > >> existing ones relying on madrpc_portid would be deprecated. Does this > >> make sense to do ? Would you accept such a patch ? > >> > > > Don't we already have an interface like this with mad_rpc_open_port? > > I'm not sure this was carried all the way through (The basic building > blocks are there but I think some additional routines are needed). > > Shouldn't the in tree clients be converted over and the old routines > deprecated ? For utilities which run once through I think the old functions work just fine. However, it is pretty confusing which interface to use... [or even that there are 2 interfaces, but I digress] (see below) > > > I don't like the void * return but it is "struct ibmadb_port" under the hood. > > Is access into that currently opaque struct needed for something by > the clients of the library ? There is nothing the clients need to access but it would be much better to return some named data type. This along with some documentation would clarify what the difference between madrpc and mad_rpc really is. Furthermore, a named type will help to "self document" other functions like "mad_rpc". For example: void *mad_rpc(const ibmad_port_t *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata); Oh now I found it... Check out smp_[query|set]_via... Here the interface changes the parameter name and one has no idea what the type is (without looking at the code that is! ;-) uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout, const void *srcport); ^^^^ uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout, const void *srcport); ^^^^ And here is one more... int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); > > > Are those calls which use it not thread safe? 
> > They look OK but I'm not 100% sure yet. Yeah, they look thread safe but I am not sure either. :-( I would be in favor of making all the utils use mad_rpc_open_port but it is up to Sasha if we go down this path. Ira > > -- Hal > > > Ira > > > > > >> -- Hal > >> > >>> As far as I know nothing uses those safe_*() primitives right now outside > >>> libibmad, so I think it is better to remove this confused functions from > >>> API (with changing library version, etc.). > >>> > >>> The primitives madrpc_lock() and madrpc_unlock() are just wrappers to > >>> hidden static pthread mutex which is not controlled by caller > >>> application. I think that it will be more robust for multithreaded > >>> application to use its own synchronization methods (pthread mutex or any > >>> other) for better control. So let's remove madrpc_lock/unlock() too. > >>> > >>> Signed-off-by: Sasha Khapyorsky > >>> --- > >>> libibmad/include/infiniband/mad.h | 41 > >>> ------------------------------------- > >>> libibmad/libibmad.ver | 2 +- > >>> libibmad/src/libibmad.map | 2 - > >>> libibmad/src/rpc.c | 15 ------------- > >>> libibmad/src/sa.c | 5 ++- > >>> 5 files changed, 4 insertions(+), 61 deletions(-) > >>> > >>> diff --git a/libibmad/include/infiniband/mad.h > >>> b/libibmad/include/infiniband/mad.h > >>> index eff6738..89b4be5 100644 > >>> --- a/libibmad/include/infiniband/mad.h > >>> +++ b/libibmad/include/infiniband/mad.h > >>> @@ -703,8 +703,6 @@ void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t > >>> *dport, ib_rmpp_hdr_t *rmpp, > >>> void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > >>> int num_classes); > >>> void madrpc_save_mad(void *madbuf, int len); > >>> -void madrpc_lock(void); > >>> -void madrpc_unlock(void); > >>> void madrpc_show_errors(int set); > >>> > >>> void * mad_rpc_open_port(char *dev_name, int dev_port, int > >>> *mgmt_classes, > >>> @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t *id, > >>> unsigned attrid, > >>> uint8_t *
smp_set_via(void *buf, ib_portid_t *id, unsigned attrid, > >>> unsigned mod, > >>> unsigned timeout, const void *srcport); > >>> > >>> -inline static uint8_t * > >>> -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid, > >>> unsigned mod, > >>> - unsigned timeout) > >>> -{ > >>> - uint8_t *p; > >>> - > >>> - madrpc_lock(); > >>> - p = smp_query(rcvbuf, portid, attrid, mod, timeout); > >>> - madrpc_unlock(); > >>> - > >>> - return p; > >>> -} > >>> - > >>> -inline static uint8_t * > >>> -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid, > >>> unsigned mod, > >>> - unsigned timeout) > >>> -{ > >>> - uint8_t *p; > >>> - > >>> - madrpc_lock(); > >>> - p = smp_set(rcvbuf, portid, attrid, mod, timeout); > >>> - madrpc_unlock(); > >>> - > >>> - return p; > >>> -} > >>> - > >>> /* sa.c */ > >>> uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, > >>> unsigned timeout); > >>> @@ -761,19 +733,6 @@ int ib_path_query(ibmad_gid_t srcgid, > >>> ibmad_gid_t destgid, ib_portid_t *sm_id, > >>> int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > >>> ibmad_gid_t destgid, ib_portid_t *sm_id, void > >>> *buf); > >>> > >>> -inline static uint8_t * > >>> -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, > >>> - unsigned timeout) > >>> -{ > >>> - uint8_t *p; > >>> - > >>> - madrpc_lock(); > >>> - p = sa_call(rcvbuf, portid, sa, timeout); > >>> - madrpc_unlock(); > >>> - > >>> - return p; > >>> -} > >>> - > >>> /* resolve.c */ > >>> int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); > >>> int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, > >>> diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver > >>> index 7e93c16..23d2dc2 100644 > >>> --- a/libibmad/libibmad.ver > >>> +++ b/libibmad/libibmad.ver > >>> @@ -6,4 +6,4 @@ > >>> # API_REV - advance on any added API > >>> # RUNNING_REV - advance any change to the vendor files > >>> # AGE - number of backward versions the API still supports > >>> 
-LIBVERSION=5:0:4 > >>> +LIBVERSION=2:0:0 > >>> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > >>> index 927e51c..f944d86 100644 > >>> --- a/libibmad/src/libibmad.map > >>> +++ b/libibmad/src/libibmad.map > >>> @@ -72,14 +72,12 @@ IBMAD_1.3 { > >>> madrpc; > >>> madrpc_def_timeout; > >>> madrpc_init; > >>> - madrpc_lock; > >>> madrpc_portid; > >>> madrpc_rmpp; > >>> madrpc_save_mad; > >>> madrpc_set_retries; > >>> madrpc_set_timeout; > >>> madrpc_show_errors; > >>> - madrpc_unlock; > >>> ib_path_query; > >>> sa_call; > >>> sa_rpc_call; > >>> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > >>> index 5226540..670a936 100644 > >>> --- a/libibmad/src/rpc.c > >>> +++ b/libibmad/src/rpc.c > >>> @@ -38,7 +38,6 @@ > >>> #include > >>> #include > >>> #include > >>> -#include > >>> #include > >>> #include > >>> > >>> @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, > >>> ib_rmpp_hdr_t *rmpp, void *data) > >>> return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); > >>> } > >>> > >>> -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; > >>> - > >>> -void > >>> -madrpc_lock(void) > >>> -{ > >>> - pthread_mutex_lock(&rpclock); > >>> -} > >>> - > >>> -void > >>> -madrpc_unlock(void) > >>> -{ > >>> - pthread_mutex_unlock(&rpclock); > >>> -} > >>> - > >>> void > >>> madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int > >>> num_classes) > >>> { > >>> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c > >>> index 27b9d52..c601254 100644 > >>> --- a/libibmad/src/sa.c > >>> +++ b/libibmad/src/sa.c > >>> @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport, ibmad_gid_t > >>> srcgid, ibmad_gid_t destgid, > >>> if (srcport) { > >>> p = sa_rpc_call (srcport, buf, sm_id, &sa, 0); > >>> } else { > >>> - p = safe_sa_call(buf, sm_id, &sa, 0); > >>> + p = sa_call(buf, sm_id, &sa, 0); > >>> } > >>> if (!p) { > >>> IBWARN("sa call path_query failed"); > >>> @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport, 
 ibmad_gid_t > >>> srcgid, ibmad_gid_t destgid, > >>> mad_decode_field(p, IB_SA_PR_DLID_F, &dlid); > >>> return dlid; > >>> } > >>> + > >>> int > >>> ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t > >>> *sm_id, void *buf) > >>> { > >>> - return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf); > >>> + return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf); > >>> } > >>> -- > >>> 1.6.0.4.766.g6fc4a -- Ira Weiny From sean.hefty at intel.com Tue Feb 17 14:33:09 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Feb 2009 14:33:09 -0800 Subject: [ofa-general] [PATCH 5/8] [ib-diag] ibportstate: add support for WinOF In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: <134CCD11D025456C86E0BB067B25A0CD@amr.corp.intel.com> Allow ibportstate to build and run on both Linux and Windows. Windows build files are maintained in the WinOF repository. These changes allow dropping the infiniband-diags into the WinOF build environment.
Signed-off-by: Sean Hefty --- infiniband-diags/src/ibportstate.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c index d1a112b..c0b9b34 100644 --- a/infiniband-diags/src/ibportstate.c +++ b/infiniband-diags/src/ibportstate.c @@ -311,12 +311,12 @@ int main(int argc, char **argv) /* Setup portid for peer port */ memcpy(&peerportid, &portid, sizeof(peerportid)); peerportid.drpath.cnt = 1; - peerportid.drpath.p[1] = portnum; + peerportid.drpath.p[1] = (uint8_t) portnum; /* Set DrSLID to local lid */ if (ib_resolve_self(&selfportid, &selfport, 0) < 0) IBERROR("could not resolve self"); - peerportid.drpath.drslid = selfportid.lid; + peerportid.drpath.drslid = (uint16_t) selfportid.lid; peerportid.drpath.drdlid = 0xffff; /* Get peer port NodeInfo to obtain peer port number */ From sean.hefty at intel.com Tue Feb 17 14:35:40 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Feb 2009 14:35:40 -0800 Subject: [ofa-general] [PATCH 6/8] [ib-diag] ibstat: add support for WinOF In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: <74CAD5D6EE354A18A32A84C742458C90@amr.corp.intel.com> Allow ibstat to build and run on both Linux and Windows. Windows build files are maintained in the WinOF repository. These changes allow dropping the infiniband-diags into the WinOF build environment. Signed-off-by: Sean Hefty --- The patch is also attached. Given the length of the lines in the code, I'm guessing that my mailer may wrap them. The patch is also available through my ib-mgmt.git tree.
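The ibstat diff makes two portability changes: the non-standard `uint` typedef (supplied by glibc's <sys/types.h>, not by ISO C) becomes plain `unsigned`, and the empty `{ }` option-table terminator becomes `{ 0 }`, since an empty initializer list is a GNU extension in C before C23. The `{ 0 }` change recurs in the other patches of this series. A minimal sketch of both idioms; `struct opt`, the name table, and the helper functions are hypothetical stand-ins, not the real `ibdiag_opt` or `node_type_str`:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical option record, loosely modeled on the tables in the diffs. */
struct opt {
	const char *name;
	char letter;
	int has_arg;
};

/* `{ 0 }` zero-initializes every member of the sentinel entry. */
static const struct opt opts[] = {
	{ "list_of_cas", 'l', 0 },
	{ "short", 's', 0 },
	{ "port_list", 'p', 0 },
	{ 0 }
};

/* Count entries up to the zeroed sentinel. */
static size_t count_opts(const struct opt *o)
{
	size_t n = 0;

	while (o[n].name)
		n++;
	return n;
}

#define NODE_TYPE_MAX 3

/* Bounds-checked lookup in the style of ca_dump(): casting to unsigned
 * turns a negative value into a large one, so one comparison rejects both
 * negative and too-large indices. */
static const char *type_str(int t)
{
	static const char *const names[] = { "???", "CA", "Switch", "Router" };

	return (unsigned)t <= NODE_TYPE_MAX ? names[t] : "???";
}
```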
infiniband-diags/src/ibstat.c | 16 ++++++++-------- 1 files changed, 8 insertions(+), 8 deletions(-) diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c index 5add690..7985be1 100644 --- a/infiniband-diags/src/ibstat.c +++ b/infiniband-diags/src/ibstat.c @@ -62,8 +62,8 @@ ca_dump(umad_ca_t *ca) { if (!ca->node_type) return; - printf("%s '%s'\n", ((uint)ca->node_type <= IB_NODE_MAX ? node_type_str[ca->node_type] : "???"), ca->ca_name); - printf("\t%s type: %s\n", ((uint)ca->node_type <= IB_NODE_MAX ? node_type_str[ca->node_type] : "???"),ca->ca_type); + printf("%s '%s'\n", ((unsigned)ca->node_type <= IB_NODE_MAX ? node_type_str[ca->node_type] : "???"), ca->ca_name); + printf("\t%s type: %s\n", ((unsigned)ca->node_type <= IB_NODE_MAX ? node_type_str[ca->node_type] : "???"),ca->ca_type); printf("\tNumber of ports: %d\n", ca->numports); printf("\tFirmware version: %s\n", ca->fw_ver); printf("\tHardware version: %s\n", ca->hw_ver); @@ -105,13 +105,13 @@ port_dump(umad_port_t *port, int alone) } printf("%sPort %d:\n", hdrpre, port->portnum); - printf("%sState: %s\n", pre, (uint)port->state <= 4 ? port_state_str[port->state] : "???"); - printf("%sPhysical state: %s\n", pre, (uint)port->state <= 7 ? port_phy_state_str[port->phys_state] : "???"); + printf("%sState: %s\n", pre, (unsigned)port->state <= 4 ? port_state_str[port->state] : "???"); + printf("%sPhysical state: %s\n", pre, (unsigned)port->state <= 7 ? 
port_phy_state_str[port->phys_state] : "???"); printf("%sRate: %d\n", pre, port->rate); printf("%sBase lid: %d\n", pre, port->base_lid); printf("%sLMC: %d\n", pre, port->lmc); printf("%sSM lid: %d\n", pre, port->sm_lid); - printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohl(port->capmask)); + printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohll(port->capmask)); printf("%sPort GUID: 0x%016llx\n", pre, (long long unsigned)ntohll(port->port_guid)); return 0; } @@ -131,11 +131,11 @@ ca_stat(char *ca_name, int portnum, int no_ports) if (!no_ports && portnum >= 0) { if (portnum > ca.numports || !ca.ports[portnum]) { IBWARN("%s: '%s' has no port number %d - max (%d)", - ((uint)ca.node_type <= IB_NODE_MAX ? node_type_str[ca.node_type] : "???"), + ((unsigned)ca.node_type <= IB_NODE_MAX ? node_type_str[ca.node_type] : "???"), ca_name, portnum, ca.numports); return -1; } - printf("%s: '%s'\n", ((uint)ca.node_type <= IB_NODE_MAX ? node_type_str[ca.node_type] : "???"), ca.ca_name); + printf("%s: '%s'\n", ((unsigned)ca.node_type <= IB_NODE_MAX ? node_type_str[ca.node_type] : "???"), ca.ca_name); port_dump(ca.ports[portnum], 1); return 0; } @@ -200,7 +200,7 @@ int main(int argc, char *argv[]) { "list_of_cas", 'l', 0, NULL, "list all IB devices" }, { "short", 's', 0, NULL, "short output" }, { "port_list", 'p', 0, NULL, "show port list" }, - { } + { 0 } }; char usage_args[] = " [portnum]"; const char *usage_examples[] = { -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 06-win-ibstat Type: application/octet-stream Size: 3330 bytes Desc: not available URL: From sean.hefty at intel.com Tue Feb 17 14:36:20 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Feb 2009 14:36:20 -0800 Subject: [ofa-general] [PATCH 7/8] [ib-diags] smpdump: add support for WinOF In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: Allow smpdump to build and run on both Linux and Windows. Windows build files are maintained in the WinOF repository. These changes allow dropping the infiniband-diags into the WinOF build environment. Signed-off-by: Sean Hefty --- infiniband-diags/src/smpdump.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c index 8618121..6c7f84c 100644 --- a/infiniband-diags/src/smpdump.c +++ b/infiniband-diags/src/smpdump.c @@ -102,7 +102,7 @@ drsmp_get_init(void *umad, DRPath *path, int attr, int mod) if (path) memcpy(smp->initial_path, path->path, path->hop_cnt+1); - smp->hop_cnt = path->hop_cnt; + smp->hop_cnt = (uint8_t) path->hop_cnt; } void @@ -146,7 +146,7 @@ drsmp_set_init(void *umad, DRPath *path, int attr, int mod, void *data) if (data) memcpy(smp->data, data, sizeof smp->data); - smp->hop_cnt = path->hop_cnt; + smp->hop_cnt = (uint8_t) path->hop_cnt; } char * @@ -172,7 +172,7 @@ str2DRPath(char *str, DRPath *path) while (str && *str) { if ((s = strchr(str, ','))) *s = 0; - path->path[++path->hop_cnt] = atoi(str); + path->path[++path->hop_cnt] = (char) atoi(str); if (!s) break; str = s+1; @@ -221,7 +221,7 @@ int main(int argc, char *argv[]) const struct ibdiag_opt opts[] = { { "sring", 's', 0, NULL, ""}, - { } + { 0 } }; char usage_args[] = " [mod]"; const char *usage_examples[] = { From sean.hefty at intel.com Tue Feb 17 14:37:28 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Feb 2009 14:37:28 -0800 Subject: [ofa-general]
[PATCH 8/8] [ib-diags] smpquery: add support for WinOF In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: <8B21199DAF6B4010B109838D36505522@amr.corp.intel.com> Allow smpquery to build and run on both Linux and Windows. Windows build files are maintained in the WinOF repository. These changes allow dropping the infiniband-diags into the WinOF build environment. Signed-off-by: Sean Hefty --- infiniband-diags/src/smpquery.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c index 44280e1..2d3d91b 100644 --- a/infiniband-diags/src/smpquery.c +++ b/infiniband-diags/src/smpquery.c @@ -47,7 +47,7 @@ #include #include -#include +#include #include "ibdiag_common.h" @@ -191,7 +191,7 @@ pkey_table(ib_portid_t *dest, char **argv, int argc) } else mad_decode_field(data, IB_NODE_PARTITION_CAP_F, &n); - for (i = 0; i < (n + 31) / 32; i++) { + for (i = 0; i < (uint32_t) ((n + 31) / 32); i++) { mod = i | (portnum << 16); if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, mod, 0)) return "pkey table query failed"; @@ -353,7 +353,7 @@ guid_info(ib_portid_t *dest, char **argv, int argc) return "port info failed"; mad_decode_field(data, IB_PORT_GUID_CAP_F, &n); - for (i = 0; i < (n + 7) / 8; i++) { + for (i = 0; i < (uint32_t) ((n + 7) / 8); i++) { mod = i; if (!smp_query(data, dest, IB_ATTR_GUID_INFO, mod, 0)) return "guid info query failed"; @@ -412,7 +412,7 @@ int main(int argc, char **argv) const struct ibdiag_opt opts[] = { { "combined", 'c', 0, NULL, "use Combined route address argument"}, { "node-name-map", 1, 1, "", "node name map file"}, - {} + { 0 } }; const char *usage_examples[] = { "portinfo 3 1\t\t\t\t# portinfo by lid, with port modifier", From sashak at voltaire.com Tue Feb 17 14:45:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 02:45:27 +0200 Subject:
[ofa-general] [PATCH] opensm: proper config file rescan Message-ID: <20090217224527.GV7189@sashak.voltaire.com> Now we have more config options (once it was QoS parameters only) which can be changed in the OpenSM config file "on the fly". However, this introduces a problem: unconditionally rescanning config parameters from the config file overwrites command line and console settings, which should be the "higher priority" user interfaces. As a result, things like 'opensm -F ./config.file -v' may not work as expected; in this example '-v' is only in effect from OpenSM start-up until the first sweep starts. This patch attempts to address this issue: first OpenSM parses the config file, and then command line options and console commands are able to overwrite those settings. When OpenSM rescans the config file, it sets only the config parameters which were changed in the file (not everything, as it does now); for this, the last copy of the config parameters parsed from the config file is stored. So the "last user intervention" becomes the active one.
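The two-copy rescan rule described above can be sketched for a single integer option. This is a simplified illustration, not OpenSM's code: the patch does the equivalent inside its opts_parse_*() functions, comparing against the stored file copy (`p_opts->file_opts`) and, on a change, writing both that copy and the active options. The function name below is hypothetical.

```c
#include <stdint.h>

/*
 * Apply one value read from the config file (hypothetical helper).
 *   file_val: value seen in the file at the last (re)scan
 *   active:   value currently in effect; the command line or console
 *             may have changed it since then
 * Only a change in the file itself overrides the active value.
 * Returns 1 if the active value was updated, 0 otherwise.
 */
static int rescan_uint32(uint32_t new_val, uint32_t *file_val, uint32_t *active)
{
	if (new_val == *file_val)
		return 0;	/* unchanged in the file: keep any user override */
	*file_val = *active = new_val;
	return 1;
}
```

For example, if the file sets an option to 10 and the console then changes it to 5, a rescan that still reads 10 leaves 5 in effect, while editing the file to 20 makes 20 active.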
Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_subnet.h | 1 + opensm/opensm/main.c | 1 - opensm/opensm/osm_subnet.c | 92 ++++++++++++++++++++---------------- 3 files changed, 52 insertions(+), 42 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 8863e47..2dfccda 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -217,6 +217,7 @@ typedef struct osm_subn_opt { char *node_name_map_name; char *prefix_routes_file; boolean_t consolidate_ipv6_snm_req; + struct osm_subn_opt *file_opts; /* used for update */ } osm_subn_opt_t; /* * FIELDS diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index a632cd7..e22c2c4 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -508,7 +508,6 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm) /********************************************************************** **********************************************************************/ #define SET_STR_OPT(opt, val) do { \ - if (opt) free(opt); \ opt = val ? 
strdup(val) : NULL ; \ } while (0) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index f12685e..01478be 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -77,7 +77,8 @@ typedef struct opt_rec { const char *name; unsigned long opt_offset; void (*parse_fn)(osm_subn_t *p_subn, char *p_key, char *p_val_str, - void *p_val, void (*)(osm_subn_t *, void *)); + void *p_val1, void *p_val2, + void (*)(osm_subn_t *, void *)); void (*setup_fn)(osm_subn_t *p_subn, void *p_val); int can_update; } opt_rec_t; @@ -151,102 +152,102 @@ static void opts_setup_sm_priority(osm_subn_t *p_subn, void *p_val) } static void opts_parse_net64(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, + IN char *p_val_str, void *p_v1, void *p_v2, void (*pfn)(osm_subn_t *, void *)) { - uint64_t *p_val = p_v; + uint64_t *p_val1 = p_v1, *p_val2 = p_v2; uint64_t val = strtoull(p_val_str, NULL, 0); - if (cl_hton64(val) != *p_val) { + if (cl_hton64(val) != *p_val1) { log_config_value(p_key, "0x%016" PRIx64, val); if (pfn) pfn(p_subn, &val); - *p_val = cl_ntoh64(val); + *p_val1 = *p_val2 = cl_ntoh64(val); } } static void opts_parse_uint32(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, + IN char *p_val_str, void *p_v1, void *p_v2, void (*pfn)(osm_subn_t *, void *)) { - uint32_t *p_val = p_v; + uint32_t *p_val1 = p_v1, *p_val2 = p_v2; uint32_t val = strtoul(p_val_str, NULL, 0); - if (val != *p_val) { + if (val != *p_val1) { log_config_value(p_key, "%u", val); if (pfn) pfn(p_subn, &val); - *p_val = val; + *p_val1 = *p_val2 = val; } } static void opts_parse_int32(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, + IN char *p_val_str, void *p_v1, void *p_v2, void (*pfn)(osm_subn_t *, void *)) { - int32_t *p_val = p_v; + int32_t *p_val1 = p_v1, *p_val2 = p_v2; int32_t val = strtol(p_val_str, NULL, 0); - if (val != *p_val) { + if (val != *p_val1) { log_config_value(p_key, "%d", val); if (pfn) 
pfn(p_subn, &val); - *p_val = val; + *p_val1 = *p_val2 = val; } } static void opts_parse_uint16(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, + IN char *p_val_str, void *p_v1, void *p_v2, void (*pfn)(osm_subn_t *, void *)) { - uint16_t *p_val = p_v; + uint16_t *p_val1 = p_v1, *p_val2 = p_v2; uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0); - if (val != *p_val) { + if (val != *p_val1) { log_config_value(p_key, "%u", val); if (pfn) pfn(p_subn, &val); - *p_val = val; + *p_val1 = *p_val2 = val; } } static void opts_parse_net16(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, + IN char *p_val_str, void *p_v1, void *p_v2, void (*pfn)(osm_subn_t *, void *)) { - uint16_t *p_val = p_v; + uint16_t *p_val1 = p_v1, *p_val2 = p_v2; uint16_t val = strtoul(p_val_str, NULL, 0); CL_ASSERT(val < 0x10000); - if (cl_hton16(val) != *p_val) { + if (cl_hton16(val) != *p_val1) { log_config_value(p_key, "0x%04x", val); if (pfn) pfn(p_subn, &val); - *p_val = cl_hton16(val); + *p_val1 = *p_val2 = cl_hton16(val); } } static void opts_parse_uint8(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, + IN char *p_val_str, void *p_v1, void *p_v2, void (*pfn)(osm_subn_t *, void *)) { - uint8_t *p_val = p_v; + uint8_t *p_val1 = p_v1, *p_val2 = p_v2; uint8_t val = strtoul(p_val_str, NULL, 0); CL_ASSERT(val < 0x100); - if (val != *p_val) { + if (val != *p_val1) { log_config_value(p_key, "%u", val); if (pfn) pfn(p_subn, &val); - *p_val = val; + *p_val1 = *p_val2 = val; } } static void opts_parse_boolean(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, + IN char *p_val_str, void *p_v1, void *p_v2, void (*pfn)(osm_subn_t *, void *)) { - boolean_t *p_val = p_v; + boolean_t *p_val1 = p_v1, *p_val2 = p_v2; boolean_t val; if (!p_val_str) @@ -257,20 +258,20 @@ static void opts_parse_boolean(IN osm_subn_t *p_subn, IN char *p_key, else val = TRUE; - if (val != *p_val) { + if (val != *p_val1) { 
log_config_value(p_key, "%s", p_val_str); if (pfn) pfn(p_subn, &val); - *p_val = val; + *p_val1 = *p_val2 = val; } } static void opts_parse_charp(IN osm_subn_t *p_subn, IN char *p_key, - IN char *p_val_str, IN void *p_v, + IN char *p_val_str, void *p_v1, void *p_v2, void (*pfn)(osm_subn_t *, void *)) { - char **p_val = p_v; - const char *current_str = *p_val ? *p_val : null_str ; + char **p_val1 = p_v1, **p_val2 = p_v2; + const char *current_str = *p_val1 ? *p_val1 : null_str ; if (p_val_str && strcmp(p_val_str, current_str)) { char *new; @@ -279,9 +280,11 @@ static void opts_parse_charp(IN osm_subn_t *p_subn, IN char *p_key, new = strcmp(null_str, p_val_str) ? strdup(p_val_str) : NULL; if (pfn) pfn(p_subn, new); - if (*p_val) - free(*p_val); - *p_val = new; + if (*p_val1 && *p_val1 != *p_val2) + free(*p_val1); + if (*p_val2) + free(*p_val2); + *p_val1 = *p_val2 = new; } } @@ -1121,7 +1124,7 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) FILE *opts_file; char *p_key, *p_val; const opt_rec_t *r; - void *p_field; + void *p_field1, *p_field2; opts_file = fopen(file_name, "r"); if (!opts_file) { @@ -1136,6 +1139,9 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) cl_log_event("OpenSM", CL_LOG_INFO, line, NULL, 0); p_opts->config_file = file_name; + if (!p_opts->file_opts && !(p_opts->file_opts = malloc(sizeof(*p_opts)))) + return -1; + memcpy(p_opts->file_opts, p_opts, sizeof(*p_opts)); while (fgets(line, 1023, opts_file) != NULL) { /* get the first token */ @@ -1149,9 +1155,11 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) if (strcmp(r->name, p_key)) continue; - p_field = (void *)p_opts + r->opt_offset; + p_field1 = (void *)p_opts->file_opts + r->opt_offset; + p_field2 = (void *)p_opts + r->opt_offset; /* don't call setup function first time */ - r->parse_fn(NULL, p_key, p_val, p_field, NULL); + r->parse_fn(NULL, p_key, p_val, p_field1, p_field2, + NULL); break; } } @@ 
-1169,7 +1177,7 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn) const opt_rec_t *r; FILE *opts_file; char *p_key, *p_val; - void *p_field; + void *p_field1, *p_field2; if (!p_opts->config_file) return 0; @@ -1202,8 +1210,10 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn) if (!r->can_update || strcmp(r->name, p_key)) continue; - p_field = (void *)p_opts + r->opt_offset; - r->parse_fn(p_subn, p_key, p_val, p_field, r->setup_fn); + p_field1 = (void *)p_opts->file_opts + r->opt_offset; + p_field2 = (void *)p_opts + r->opt_offset; + r->parse_fn(p_subn, p_key, p_val, p_field1, p_field2, + r->setup_fn); break; } } -- 1.6.1.2.319.gbd9e From rdreier at cisco.com Tue Feb 17 14:54:36 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Feb 2009 14:54:36 -0800 Subject: [ofa-general] Re: [PATCH] IPoIB: In unicast_arp, do path_free only for newly-created paths In-Reply-To: <200902171701.36107.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 17 Feb 2009 17:01:35 +0200") References: <200902171701.36107.jackm@dev.mellanox.co.il> Message-ID: thanks, applied... > Signed-off-by: Jack Morgenstein > Signed-off-by: Moni Shua This doesn't make any sense... Moni was not involved in sending this patch at all, and in any case since you are sending the patch your s-o-b should be last. If you want to give credit to Moni then include it in the description as you did for Yossi. > I ran checkpatch.pl on this, and compiled it with Sparse. However, I would still like to continue > using KMail. If you have any editing/formatting problems with the patch, please let me know. > The patch was generated by git diff against your kernel git/master branch. Everything came through fine so no problem with your MUA. - R. 
From hal.rosenstock at gmail.com Tue Feb 17 14:56:26 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 17 Feb 2009 17:56:26 -0500 Subject: [ofa-general] opensm/osm_inform.c:__match_inf_rec question Message-ID: In opensm/osm_inform.c:__match_inf_rec, around line 123, there is: /* if inform_info.gid is not zero, ignore lid range */ if (!memcmp(&p_infr_rec->inform_record.inform_info.gid, &all_zero_gid, sizeof(p_infr_rec->inform_record.inform_info.gid))) { Shouldn't this be if (memcmp) rather than if (!memcmp) ? -- Hal From hal.rosenstock at gmail.com Tue Feb 17 15:21:02 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 17 Feb 2009 18:21:02 -0500 Subject: Re: Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread In-Reply-To: <20090217142859.9e7a7e22.weiny2@llnl.gov> References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> <20090217142859.9e7a7e22.weiny2@llnl.gov> Message-ID: On 2/17/09, Ira Weiny wrote: > On Tue, 17 Feb 2009 16:12:12 -0500 > Hal Rosenstock wrote: > >> On Tue, Feb 17, 2009 at 12:19 PM, wrote: >> > Quoting Hal Rosenstock : >> > >> >> Sasha, >> >> >> >> On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky >> >> >> >> wrote: >> >>> >> >>> I looked at implementation of safe_*() functions (safe_smp_query, >> >>> safe_smp_set and safe_ca_call) and found that they are not actually >> >>> "safe" as declared by its names. The only thread-unsafe thing which >> >>> is used there is static 'mad_portid' structure (from rpc.c), >> >> >> >> I'm not sure that the only thread unsafe thing in the mad rpc >> >> mechanism is the portid. >> >> >> >>> but modification of this structure is not protected by same mutex >> >>> (actually >> >>> not protected at all). >> >> >> >> A first step would be removing the portid as static.
If so, portid >> >> would need to be a supplied parameter to various mad routines and the >> >> existing ones relying on madrpc_portid would be deprecated. Does this >> >> make sense to do ? Would you accept such a patch ? >> >> >> >> > Don't we already have an interface like this with mad_rpc_open_port? >> >> I'm not sure this was carried all the way through (The basic building >> blocks are there but I think some additional routines are needed). >> >> Shouldn't the in tree clients be converted over and the old routines >> deprecated ? > > For utilities which run once through I think the old functions work just > fine. Well, sort of... Aren't mad_portid "collisions" possible when multiple programs are run concurrently ? > However, it is pretty confusing which interface to use... [or even that > there > are 2 interfaces, but I digress] (see below) I don't think the newer improved interfaces were ever documented. >> > I don't like the void * return but it is "struct ibmadb_port" under the >> > hood. >> >> Is access into that currently opaque struct needed for something by >> the clients of the library ? > > There is nothing the clients need to access but it would be much better to > return some named data type. This along with some documentation would > clarify > what the difference between madrpc and mad_rpc really is. Furthermore, a > named type will help to "self document" other functions like "mad_rpc". For > example: > > void *mad_rpc(const ibmad_port_t *ibmad_port, ib_rpc_t * rpc, ib_portid_t > * dport, > void *payload, void *rcvdata); > > Oh now I found it... Check out smp_[query|set]_via... Here the interface > changes the parameter name and one has no idea what the type is (without > looking at the code that is! 
;-) > > uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, > unsigned mod, unsigned timeout, const void *srcport); > ^^^^ > > uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, > unsigned mod, > unsigned timeout, const void *srcport); > ^^^^ > And here is one more... > int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); Are you referring to how srcport is used to call either the old madrpc or the newer mad_rpc API and if the newer one srcport is really a pointer to a struct ibmad_port ? -- Hal >> > Are those calls which use it not thread safe? >> >> They look OK but I'm not 100% sure yet. > > Yea, they look thread safe but I am not sure either. :-( > > I would be in favor of making all the utils use mad_rpc_open_port but it is > up > to Shasha if we go down this path. > > Ira > >> >> -- Hal >> >> > Ira >> > >> > >> >> -- Hal >> >> >> >>> As far as I know nothing uses those safe_*() primitives right now >> >>> outside >> >>> libibmad, so I think it is better to remove this confused functions >> >>> from >> >>> API (with changing library version, etc.). >> >>> >> >>> The primitives madrpc_lock() and madrpc_unlock() are just wrappers to >> >>> hidden static pthread mutex which is not controlled by caller >> >>> application. I think that it will be more robust for multithreaded >> >>> application to use its own synchronization methods (pthread mutex or >> >>> any >> >>> other) for better control. So let's remove madrpc_lock/unlock() too. 
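Sasha's suggestion above, that a multithreaded application use its own synchronization rather than a library-internal lock like madrpc_lock()/madrpc_unlock(), can be sketched as follows. `unsafe_query()` is a hypothetical stand-in for any call that touches hidden static state; it is not a libibmad function. Such a lock only serializes the callers that actually go through it.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a call that uses hidden static state
 * (in libibmad's case, the static madrpc port). Not thread-safe. */
static int shared_state;

static int unsafe_query(int arg)
{
	int tmp = shared_state;		/* unsynchronized read-modify-write */

	shared_state = tmp + arg;
	return shared_state;
}

/* Lock owned by the application, not hidden inside the library. */
static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

static int locked_query(int arg)
{
	int r;

	pthread_mutex_lock(&app_lock);
	r = unsafe_query(arg);
	pthread_mutex_unlock(&app_lock);
	return r;
}

/* Worker thread: many serialized calls through the application lock. */
static void *worker(void *unused)
{
	int i;

	(void)unused;
	for (i = 0; i < 10000; i++)
		locked_query(1);
	return NULL;
}
```

Because the application owns the mutex, it can choose the lock granularity, for example holding it across several related queries.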
>> >>> >> >>> Signed-off-by: Sasha Khapyorsky >> >>> --- >> >>> libibmad/include/infiniband/mad.h | 41 >> >>> ------------------------------------- >> >>> libibmad/libibmad.ver | 2 +- >> >>> libibmad/src/libibmad.map | 2 - >> >>> libibmad/src/rpc.c | 15 ------------- >> >>> libibmad/src/sa.c | 5 ++- >> >>> 5 files changed, 4 insertions(+), 61 deletions(-) >> >>> >> >>> diff --git a/libibmad/include/infiniband/mad.h >> >>> b/libibmad/include/infiniband/mad.h >> >>> index eff6738..89b4be5 100644 >> >>> --- a/libibmad/include/infiniband/mad.h >> >>> +++ b/libibmad/include/infiniband/mad.h >> >>> @@ -703,8 +703,6 @@ void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t >> >>> *dport, ib_rmpp_hdr_t *rmpp, >> >>> void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, >> >>> int num_classes); >> >>> void madrpc_save_mad(void *madbuf, int len); >> >>> -void madrpc_lock(void); >> >>> -void madrpc_unlock(void); >> >>> void madrpc_show_errors(int set); >> >>> >> >>> void * mad_rpc_open_port(char *dev_name, int dev_port, int >> >>> *mgmt_classes, >> >>> @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t >> >>> *id, >> >>> unsigned attrid, >> >>> uint8_t * smp_set_via(void *buf, ib_portid_t *id, unsigned attrid, >> >>> unsigned mod, >> >>> unsigned timeout, const void *srcport); >> >>> >> >>> -inline static uint8_t * >> >>> -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid, >> >>> unsigned mod, >> >>> - unsigned timeout) >> >>> -{ >> >>> - uint8_t *p; >> >>> - >> >>> - madrpc_lock(); >> >>> - p = smp_query(rcvbuf, portid, attrid, mod, timeout); >> >>> - madrpc_unlock(); >> >>> - >> >>> - return p; >> >>> -} >> >>> - >> >>> -inline static uint8_t * >> >>> -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid, >> >>> unsigned mod, >> >>> - unsigned timeout) >> >>> -{ >> >>> - uint8_t *p; >> >>> - >> >>> - madrpc_lock(); >> >>> - p = smp_set(rcvbuf, portid, attrid, mod, timeout); >> >>> - madrpc_unlock(); >> >>> - >> >>> - return p; 
>> >>> -} >> >>> - >> >>> /* sa.c */ >> >>> uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t >> >>> *sa, >> >>> unsigned timeout); >> >>> @@ -761,19 +733,6 @@ int ib_path_query(ibmad_gid_t srcgid, >> >>> ibmad_gid_t destgid, ib_portid_t *sm_id, >> >>> int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, >> >>> ibmad_gid_t destgid, ib_portid_t *sm_id, void >> >>> *buf); >> >>> >> >>> -inline static uint8_t * >> >>> -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, >> >>> - unsigned timeout) >> >>> -{ >> >>> - uint8_t *p; >> >>> - >> >>> - madrpc_lock(); >> >>> - p = sa_call(rcvbuf, portid, sa, timeout); >> >>> - madrpc_unlock(); >> >>> - >> >>> - return p; >> >>> -} >> >>> - >> >>> /* resolve.c */ >> >>> int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); >> >>> int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, >> >>> diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver >> >>> index 7e93c16..23d2dc2 100644 >> >>> --- a/libibmad/libibmad.ver >> >>> +++ b/libibmad/libibmad.ver >> >>> @@ -6,4 +6,4 @@ >> >>> # API_REV - advance on any added API >> >>> # RUNNING_REV - advance any change to the vendor files >> >>> # AGE - number of backward versions the API still supports >> >>> -LIBVERSION=5:0:4 >> >>> +LIBVERSION=2:0:0 >> >>> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map >> >>> index 927e51c..f944d86 100644 >> >>> --- a/libibmad/src/libibmad.map >> >>> +++ b/libibmad/src/libibmad.map >> >>> @@ -72,14 +72,12 @@ IBMAD_1.3 { >> >>> madrpc; >> >>> madrpc_def_timeout; >> >>> madrpc_init; >> >>> - madrpc_lock; >> >>> madrpc_portid; >> >>> madrpc_rmpp; >> >>> madrpc_save_mad; >> >>> madrpc_set_retries; >> >>> madrpc_set_timeout; >> >>> madrpc_show_errors; >> >>> - madrpc_unlock; >> >>> ib_path_query; >> >>> sa_call; >> >>> sa_rpc_call; >> >>> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c >> >>> index 5226540..670a936 100644 >> >>> --- a/libibmad/src/rpc.c >> >>> +++ b/libibmad/src/rpc.c >> 
>>> @@ -38,7 +38,6 @@ >> >>> #include >> >>> #include >> >>> #include >> >>> -#include >> >>> #include >> >>> #include >> >>> >> >>> @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, >> >>> ib_rmpp_hdr_t *rmpp, void *data) >> >>> return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); >> >>> } >> >>> >> >>> -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; >> >>> - >> >>> -void >> >>> -madrpc_lock(void) >> >>> -{ >> >>> - pthread_mutex_lock(&rpclock); >> >>> -} >> >>> - >> >>> -void >> >>> -madrpc_unlock(void) >> >>> -{ >> >>> - pthread_mutex_unlock(&rpclock); >> >>> -} >> >>> - >> >>> void >> >>> madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int >> >>> num_classes) >> >>> { >> >>> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c >> >>> index 27b9d52..c601254 100644 >> >>> --- a/libibmad/src/sa.c >> >>> +++ b/libibmad/src/sa.c >> >>> @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport, >> >>> ibmad_gid_t >> >>> srcgid, ibmad_gid_t destgid, >> >>> if (srcport) { >> >>> p = sa_rpc_call (srcport, buf, sm_id, &sa, 0); >> >>> } else { >> >>> - p = safe_sa_call(buf, sm_id, &sa, 0); >> >>> + p = sa_call(buf, sm_id, &sa, 0); >> >>> } >> >>> if (!p) { >> >>> IBWARN("sa call path_query failed"); >> >>> @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport, >> >>> ibmad_gid_t >> >>> srcgid, ibmad_gid_t destgid, >> >>> mad_decode_field(p, IB_SA_PR_DLID_F, &dlid); >> >>> return dlid; >> >>> } >> >>> + >> >>> int >> >>> ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t >> >>> *sm_id, void *buf) >> >>> { >> >>> - return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf); >> >>> + return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf); >> >>> } >> >>> -- >> >>> 1.6.0.4.766.g6fc4a >> >>> >> >>> _______________________________________________ >> >>> general mailing list >> >>> general at lists.openfabrics.org >> >>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >>> >> >>> To unsubscribe, 
please visit http:// >> >>> openib.org/mailman/listinfo/openib-general >> >>> >> >> _______________________________________________ >> >> general mailing list >> >> general at lists.openfabrics.org >> >> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> >> >> To unsubscribe, please visit http:// >> >> openib.org/mailman/listinfo/openib-general >> >> >> >> >> > >> > >> > >> > >> > > > -- > Ira Weiny > From sean.hefty at intel.com Tue Feb 17 16:05:45 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Feb 2009 16:05:45 -0800 Subject: [ofa-general] [PATCH 9/8] [ib-diag] ibping: add support for WinOF In-Reply-To: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: Allow ibping to build and run on both Linux and Windows. Windows build files are maintained in the WinOF repository. These changes allow dropping the infiniband-diags into the WinOF build environment. For portability, use complib to obtain time stamps. Signed-off-by: Sean Hefty --- Converted another diag this afternoon. I was able to build and run it, but apparently nothing on my fabric responds to the pings.
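The ibping diff replaces the gettimeofday()-based getcurrenttime() helper with complib's cl_get_time_stamp(); both are presumably used as microsecond counters for the round-trip time. For comparison, here is a POSIX-only sketch of the same timing pattern that uses clock_gettime() with CLOCK_MONOTONIC, so the result is not affected by wall-clock adjustments. `now_us()`, `time_call()`, and the two stub senders are hypothetical, and this is not how complib implements its timer:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Microseconds from a monotonic clock (POSIX only). */
static uint64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)ts.tv_nsec / 1000;
}

/* Time one request the way ibping() does: return the round trip in
 * microseconds, or ~0ull if the request failed. */
static uint64_t time_call(int (*send_request)(void))
{
	uint64_t start = now_us();

	if (!send_request())
		return ~0ull;
	return now_us() - start;
}

/* Hypothetical stub senders for demonstration. */
static int stub_ok(void) { return 1; }
static int stub_fail(void) { return 0; }
```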
infiniband-diags/src/ibping.c | 22 +++++++--------------- 1 files changed, 7 insertions(+), 15 deletions(-) diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c index 29c98c2..1994eba 100644 --- a/infiniband-diags/src/ibping.c +++ b/infiniband-diags/src/ibping.c @@ -41,24 +41,16 @@ #include #include #include -#include #include #include +#include #include "ibdiag_common.h" static char host_and_domain[IB_VENDOR_RANGE2_DATA_SIZE]; static char last_host[IB_VENDOR_RANGE2_DATA_SIZE]; -static uint64_t getcurrenttime(void) -{ - struct timeval tv; - - gettimeofday(&tv, 0); - return (uint64_t)tv.tv_sec * 1000000 + tv.tv_usec; -} - static void get_host_and_domain(char *data, int sz) { @@ -118,7 +110,7 @@ ibping(ib_portid_t *portid, int quiet) DEBUG("Ping.."); - start = getcurrenttime(); + start = cl_get_time_stamp(); call.method = IB_MAD_METHOD_GET; call.mgmt_class = IB_VENDOR_OPENIB_PING_CLASS; @@ -129,9 +121,9 @@ ibping(ib_portid_t *portid, int quiet) memset(&call.rmpp, 0, sizeof call.rmpp); if (!ib_vendor_call(data, portid, &call)) - return ~0llu; + return ~0ull; - rtt = getcurrenttime() - start; + rtt = cl_get_time_stamp() - start; if (!last_host[0]) memcpy(last_host, data, sizeof last_host); @@ -149,7 +141,7 @@ static ib_portid_t portid = {0}; void report(int sig) { - total_time = getcurrenttime() - start; + total_time = cl_get_time_stamp() - start; DEBUG("out due signal %d", sig); @@ -203,7 +195,7 @@ int main(int argc, char **argv) { "flood", 'f', 0, NULL, "flood destination" }, { "oui", 'o', 1, NULL, "use specified OUI number" }, { "Server", 'S', 0, NULL, "start in server mode" }, - { } + { 0 } }; char usage_args[] = ""; @@ -238,7 +230,7 @@ int main(int argc, char **argv) signal(SIGINT, report); signal(SIGTERM, report); - start = getcurrenttime(); + start = cl_get_time_stamp(); while (count-- > 0) { ntrans++; From sashak at voltaire.com Tue Feb 17 16:28:39 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 02:28:39 +0200 
Subject: Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread
In-Reply-To: <20090217142859.9e7a7e22.weiny2@llnl.gov>
References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> <20090217142859.9e7a7e22.weiny2@llnl.gov>
Message-ID: <20090218002839.GW7189@sashak.voltaire.com>

On 14:28 Tue 17 Feb , Ira Weiny wrote:
> >
> > > Are those calls which use it not thread safe?
> >
> > They look OK but I'm not 100% sure yet.
>
> Yea, they look thread safe but I am not sure either. :-(

Could you, Guys, be more explicit? Really... :)

> I would be in favor of making all the utils use mad_rpc_open_port but it is up
> to Sasha if we go down this path.

The idea looks fine to me, let's review the patch.

Sasha

From sashak at voltaire.com Tue Feb 17 16:33:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 02:33:55 +0200
Subject: Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread
In-Reply-To:
References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> <20090217142859.9e7a7e22.weiny2@llnl.gov>
Message-ID: <20090218003355.GX7189@sashak.voltaire.com>

On 18:21 Tue 17 Feb , Hal Rosenstock wrote:
> >
> > For utilities which run once through I think the old functions work just
> > fine.
>
> Well, sort of... Aren't mad_portid "collisions" possible when multiple
> programs are run concurrently ?

No.

> > However, it is pretty confusing which interface to use... [or even that
> > there
> > are 2 interfaces, but I digress] (see below)
>
> I don't think the newer improved interfaces were ever documented.

The old interfaces were not documented too. So it is at least consistent :).
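Whether the old calls are safe comes down to where the port state lives: the old madrpc path keeps it in a single static variable in rpc.c (mad_portid), whereas an open-port style API such as mad_rpc_open_port() hands each caller its own object. The C sketch below contrasts the two using a made-up struct demo_port and made-up functions (demo_rpc_global, demo_port_open, demo_rpc, demo_port_close); none of these are libibmad symbols. It also declares the handle as a named struct rather than void *, which is the kind of typing Ira argues for in this thread.

```c
#include <assert.h>
#include <stdlib.h>

/* "Header" part: an opaque but typed handle. Callers can pass it
 * around without seeing its fields, and the compiler rejects a wrong
 * pointer type, which a bare void * would silently accept. */
struct demo_port;

/* "Implementation" part: in a real library this definition would
 * live only in the .c file. */
struct demo_port {
	int fd;		/* stand-in for the underlying umad port id */
	unsigned tid;	/* per-port transaction counter */
};

/* Old style: one hidden global shared by every caller in the process. */
static struct demo_port global_port = { -1, 0 };

static unsigned demo_rpc_global(void)
{
	return ++global_port.tid;	/* all callers mutate the same state */
}

/* New style: each caller opens and owns its own port. */
static struct demo_port *demo_port_open(int fd)
{
	struct demo_port *p = calloc(1, sizeof(*p));

	if (p)
		p->fd = fd;
	return p;
}

static void demo_port_close(struct demo_port *p)
{
	free(p);
}

static unsigned demo_rpc(struct demo_port *p)
{
	return ++p->tid;	/* state is confined to this handle */
}
```

Separate processes each get their own copy of a static variable, so the shared-state problem illustrated here arises between threads of one process rather than between concurrently running utilities.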
Sasha

From sashak at voltaire.com Tue Feb 17 16:39:57 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 02:39:57 +0200
Subject: [ofa-general] [PATCH] libibmad: remove functions which use pthread
In-Reply-To:
References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com>
Message-ID: <20090218003957.GY7189@sashak.voltaire.com>

On 09:52 Mon 16 Feb , Hal Rosenstock wrote:
>
> A first step would be removing the portid as static. If so, portid
> would need to be a supplied parameter to various mad routines and the
> existing ones relying on madrpc_portid would be deprecated. Does this
> make sense to do ?

A first step would be converting all clients and internal usage in
libibmad (if any) to use a newer interface. If this will go smoothly
and things will not become overcomplicated, we could move forward -
to deprecate old interface... etc.. Nothing new.

Sasha

From sean.hefty at intel.com Tue Feb 17 16:36:06 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Feb 2009 16:36:06 -0800
Subject: [ofa-general] RE: [PATCH 9/8] [ib-diag] ibping: add support for WinOF
In-Reply-To:
References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com>
Message-ID:

> signal(SIGINT, report);
> signal(SIGTERM, report);

Btw - I worked around adding cdecl before main by disabling the warning. Since main must be cdecl by default, the compiler fixes it, but spits out a warning. For some reason unknown to me, the warning only occurs when building 32-bit apps.

However, signal() requires that the function be cdecl as well. The above two calls fail to compile on 32-bit Windows platforms, so I'm still working on this. The simple approach of changing the compiler options doesn't work as easily as it looks like it should. The WDK build environment is 'special'.
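Since signal() needs a cdecl handler on 32-bit Windows, one common way to keep a single source file building on both platforms is a calling-convention macro that expands to __cdecl on Windows and to nothing elsewhere. The sketch below assumes that approach; the SIGCALL macro, on_signal, install_handlers and stop_signal are hypothetical names, not what ibping actually adopted. It also keeps the handler minimal by only recording the signal number in a volatile sig_atomic_t, which in ISO C is the portable way for a handler to communicate with the rest of the program.

```c
#include <assert.h>
#include <signal.h>

/* On Windows, spell out __cdecl so the handler matches the C runtime's
 * signal() prototype even if the build uses another default calling
 * convention; on other platforms the default already matches. */
#if defined(_WIN32)
#define SIGCALL __cdecl
#else
#define SIGCALL
#endif

/* Written by the handler, polled by the main loop. */
static volatile sig_atomic_t stop_signal;

static void SIGCALL on_signal(int sig)
{
	stop_signal = sig;	/* defer statistics output to normal context */
}

static void install_handlers(void)
{
	signal(SIGINT, on_signal);
	signal(SIGTERM, on_signal);
}
```

Deferring the work to the main loop is a design choice of this sketch: ibping's report() handler computes and prints its statistics directly inside the handler, which is simpler but relies on calls that are not async-signal-safe.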
- Sean From weiny2 at llnl.gov Tue Feb 17 16:52:26 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 17 Feb 2009 16:52:26 -0800 Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread In-Reply-To: References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> <20090217142859.9e7a7e22.weiny2@llnl.gov> Message-ID: <20090217165226.e04949d8.weiny2@llnl.gov> On Tue, 17 Feb 2009 18:21:02 -0500 Hal Rosenstock wrote: > On 2/17/09, Ira Weiny wrote: > > On Tue, 17 Feb 2009 16:12:12 -0500 > > Hal Rosenstock wrote: > > > >> On Tue, Feb 17, 2009 at 12:19 PM, wrote: > >> > Quoting Hal Rosenstock : > >> > > >> >> Sasha, > >> >> > >> >> On Wed, Dec 31, 2008 at 12:04 PM, Sasha Khapyorsky > >> >> > >> >> wrote: > >> >>> > >> >>> I looked at implementation of safe_*() functions (safe_smp_query, > >> >>> safe_smp_set and safe_ca_call) and found that they are not actually > >> >>> "safe" as declared by its names. The only thread-unsafe thing which > >> >>> is used there is static 'mad_portid' structure (from rpc.c), > >> >> > >> >> I'm not sure that the only thread unsafe thing in the mad rpc > >> >> mechanism is the portid. > >> >> > >> >>> but modification of this structure is not protected by same mutex > >> >>> (actually > >> >>> not protected at all). > >> >> > >> >> A first step would be removing the portid as static. If so, portid > >> >> would need to be a supplied parameter to various mad routines and the > >> >> existing ones relying on madrpc_portid would be deprecated. Does this > >> >> make sense to do ? Would you accept such a patch ? > >> >> > >> > >> > Don't we already have an interface like this with mad_rpc_open_port? > >> > >> I'm not sure this was carried all the way through (The basic building > >> blocks are there but I think some additional routines are needed). 
> >> > >> Shouldn't the in tree clients be converted over and the old routines > >> deprecated ? > > > > For utilities which run once through I think the old functions work just > > fine. > > Well, sort of... Aren't mad_portid "collisions" possible when multiple > programs are run concurrently ? I was only thinking of threading but I guess you are right. > > > However, it is pretty confusing which interface to use... [or even that > > there > > are 2 interfaces, but I digress] (see below) > > I don't think the newer improved interfaces were ever documented. > > >> > I don't like the void * return but it is "struct ibmadb_port" under the > >> > hood. > >> > >> Is access into that currently opaque struct needed for something by > >> the clients of the library ? > > > > There is nothing the clients need to access but it would be much better to > > return some named data type. This along with some documentation would > > clarify > > what the difference between madrpc and mad_rpc really is. Furthermore, a > > named type will help to "self document" other functions like "mad_rpc". For > > example: > > > > void *mad_rpc(const ibmad_port_t *ibmad_port, ib_rpc_t * rpc, ib_portid_t > > * dport, > > void *payload, void *rcvdata); > > > > Oh now I found it... Check out smp_[query|set]_via... Here the interface > > changes the parameter name and one has no idea what the type is (without > > looking at the code that is! ;-) > > > > uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, > > unsigned mod, unsigned timeout, const void *srcport); > > ^^^^ > > > > uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, > > unsigned mod, > > unsigned timeout, const void *srcport); > > ^^^^ > > And here is one more... 
> > int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); > > Are you referring to how srcport is used to call either the old madrpc > or the newer mad_rpc API and if the newer one srcport is really a > pointer to a struct ibmad_port ? Ok, I did not catch that srcport could be NULL to use the old interface, but that could just be documented... Currently mad_rpc takes a void *ibmad_port. But ib_path_query_via takes a void *srcport. If you look under the covers they are the same type "struct ibmad_port", if you need them. mad_rpc names it ibmad_port which gives you some clue about the type however srcport is generic altogether. Whoa! Then you look in rpc.c and mad_rpc takes void *port_id. mad.h: void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata); rpc.c: void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata) I figured this all out for libibnetdisc since I am using the "mad_rpc" interface but I could see where someone could get very confused, or at best waste a lot of time looking at the code to figure out how to use the interface. Ira > -- Hal > > >> > Are those calls which use it not thread safe? > >> > >> They look OK but I'm not 100% sure yet. > > > > Yea, they look thread safe but I am not sure either. :-( > > > > I would be in favor of making all the utils use mad_rpc_open_port but it is > > up > > to Shasha if we go down this path. > > > > Ira > > > >> > >> -- Hal > >> > >> > Ira > >> > > >> > > >> >> -- Hal > >> >> > >> >>> As far as I know nothing uses those safe_*() primitives right now > >> >>> outside > >> >>> libibmad, so I think it is better to remove this confused functions > >> >>> from > >> >>> API (with changing library version, etc.). 
> >> >>> > >> >>> The primitives madrpc_lock() and madrpc_unlock() are just wrappers to > >> >>> hidden static pthread mutex which is not controlled by caller > >> >>> application. I think that it will be more robust for multithreaded > >> >>> application to use its own synchronization methods (pthread mutex or > >> >>> any > >> >>> other) for better control. So let's remove madrpc_lock/unlock() too. > >> >>> > >> >>> Signed-off-by: Sasha Khapyorsky > >> >>> --- > >> >>> libibmad/include/infiniband/mad.h | 41 > >> >>> ------------------------------------- > >> >>> libibmad/libibmad.ver | 2 +- > >> >>> libibmad/src/libibmad.map | 2 - > >> >>> libibmad/src/rpc.c | 15 ------------- > >> >>> libibmad/src/sa.c | 5 ++- > >> >>> 5 files changed, 4 insertions(+), 61 deletions(-) > >> >>> > >> >>> diff --git a/libibmad/include/infiniband/mad.h > >> >>> b/libibmad/include/infiniband/mad.h > >> >>> index eff6738..89b4be5 100644 > >> >>> --- a/libibmad/include/infiniband/mad.h > >> >>> +++ b/libibmad/include/infiniband/mad.h > >> >>> @@ -703,8 +703,6 @@ void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t > >> >>> *dport, ib_rmpp_hdr_t *rmpp, > >> >>> void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > >> >>> int num_classes); > >> >>> void madrpc_save_mad(void *madbuf, int len); > >> >>> -void madrpc_lock(void); > >> >>> -void madrpc_unlock(void); > >> >>> void madrpc_show_errors(int set); > >> >>> > >> >>> void * mad_rpc_open_port(char *dev_name, int dev_port, int > >> >>> *mgmt_classes, > >> >>> @@ -725,32 +723,6 @@ uint8_t * smp_query_via(void *buf, ib_portid_t > >> >>> *id, > >> >>> unsigned attrid, > >> >>> uint8_t * smp_set_via(void *buf, ib_portid_t *id, unsigned attrid, > >> >>> unsigned mod, > >> >>> unsigned timeout, const void *srcport); > >> >>> > >> >>> -inline static uint8_t * > >> >>> -safe_smp_query(void *rcvbuf, ib_portid_t *portid, unsigned attrid, > >> >>> unsigned mod, > >> >>> - unsigned timeout) > >> >>> -{ > >> >>> - uint8_t *p; > >> >>> - > >> 
>>> - madrpc_lock(); > >> >>> - p = smp_query(rcvbuf, portid, attrid, mod, timeout); > >> >>> - madrpc_unlock(); > >> >>> - > >> >>> - return p; > >> >>> -} > >> >>> - > >> >>> -inline static uint8_t * > >> >>> -safe_smp_set(void *rcvbuf, ib_portid_t *portid, unsigned attrid, > >> >>> unsigned mod, > >> >>> - unsigned timeout) > >> >>> -{ > >> >>> - uint8_t *p; > >> >>> - > >> >>> - madrpc_lock(); > >> >>> - p = smp_set(rcvbuf, portid, attrid, mod, timeout); > >> >>> - madrpc_unlock(); > >> >>> - > >> >>> - return p; > >> >>> -} > >> >>> - > >> >>> /* sa.c */ > >> >>> uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t > >> >>> *sa, > >> >>> unsigned timeout); > >> >>> @@ -761,19 +733,6 @@ int ib_path_query(ibmad_gid_t srcgid, > >> >>> ibmad_gid_t destgid, ib_portid_t *sm_id, > >> >>> int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > >> >>> ibmad_gid_t destgid, ib_portid_t *sm_id, void > >> >>> *buf); > >> >>> > >> >>> -inline static uint8_t * > >> >>> -safe_sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, > >> >>> - unsigned timeout) > >> >>> -{ > >> >>> - uint8_t *p; > >> >>> - > >> >>> - madrpc_lock(); > >> >>> - p = sa_call(rcvbuf, portid, sa, timeout); > >> >>> - madrpc_unlock(); > >> >>> - > >> >>> - return p; > >> >>> -} > >> >>> - > >> >>> /* resolve.c */ > >> >>> int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); > >> >>> int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, > >> >>> diff --git a/libibmad/libibmad.ver b/libibmad/libibmad.ver > >> >>> index 7e93c16..23d2dc2 100644 > >> >>> --- a/libibmad/libibmad.ver > >> >>> +++ b/libibmad/libibmad.ver > >> >>> @@ -6,4 +6,4 @@ > >> >>> # API_REV - advance on any added API > >> >>> # RUNNING_REV - advance any change to the vendor files > >> >>> # AGE - number of backward versions the API still supports > >> >>> -LIBVERSION=5:0:4 > >> >>> +LIBVERSION=2:0:0 > >> >>> diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > >> >>> index 
927e51c..f944d86 100644 > >> >>> --- a/libibmad/src/libibmad.map > >> >>> +++ b/libibmad/src/libibmad.map > >> >>> @@ -72,14 +72,12 @@ IBMAD_1.3 { > >> >>> madrpc; > >> >>> madrpc_def_timeout; > >> >>> madrpc_init; > >> >>> - madrpc_lock; > >> >>> madrpc_portid; > >> >>> madrpc_rmpp; > >> >>> madrpc_save_mad; > >> >>> madrpc_set_retries; > >> >>> madrpc_set_timeout; > >> >>> madrpc_show_errors; > >> >>> - madrpc_unlock; > >> >>> ib_path_query; > >> >>> sa_call; > >> >>> sa_rpc_call; > >> >>> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > >> >>> index 5226540..670a936 100644 > >> >>> --- a/libibmad/src/rpc.c > >> >>> +++ b/libibmad/src/rpc.c > >> >>> @@ -38,7 +38,6 @@ > >> >>> #include > >> >>> #include > >> >>> #include > >> >>> -#include > >> >>> #include > >> >>> #include > >> >>> > >> >>> @@ -286,20 +285,6 @@ madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, > >> >>> ib_rmpp_hdr_t *rmpp, void *data) > >> >>> return mad_rpc_rmpp(&port, rpc, dport, rmpp, data); > >> >>> } > >> >>> > >> >>> -static pthread_mutex_t rpclock = PTHREAD_MUTEX_INITIALIZER; > >> >>> - > >> >>> -void > >> >>> -madrpc_lock(void) > >> >>> -{ > >> >>> - pthread_mutex_lock(&rpclock); > >> >>> -} > >> >>> - > >> >>> -void > >> >>> -madrpc_unlock(void) > >> >>> -{ > >> >>> - pthread_mutex_unlock(&rpclock); > >> >>> -} > >> >>> - > >> >>> void > >> >>> madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int > >> >>> num_classes) > >> >>> { > >> >>> diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c > >> >>> index 27b9d52..c601254 100644 > >> >>> --- a/libibmad/src/sa.c > >> >>> +++ b/libibmad/src/sa.c > >> >>> @@ -132,7 +132,7 @@ ib_path_query_via(const void *srcport, > >> >>> ibmad_gid_t > >> >>> srcgid, ibmad_gid_t destgid, > >> >>> if (srcport) { > >> >>> p = sa_rpc_call (srcport, buf, sm_id, &sa, 0); > >> >>> } else { > >> >>> - p = safe_sa_call(buf, sm_id, &sa, 0); > >> >>> + p = sa_call(buf, sm_id, &sa, 0); > >> >>> } > >> >>> if (!p) { > >> >>> IBWARN("sa call path_query 
failed"); > >> >>> @@ -142,8 +142,9 @@ ib_path_query_via(const void *srcport, > >> >>> ibmad_gid_t > >> >>> srcgid, ibmad_gid_t destgid, > >> >>> mad_decode_field(p, IB_SA_PR_DLID_F, &dlid); > >> >>> return dlid; > >> >>> } > >> >>> + > >> >>> int > >> >>> ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t > >> >>> *sm_id, void *buf) > >> >>> { > >> >>> - return ib_path_query_via (NULL, srcgid, destgid, sm_id, buf); > >> >>> + return ib_path_query_via(NULL, srcgid, destgid, sm_id, buf); > >> >>> } > >> >>> -- > >> >>> 1.6.0.4.766.g6fc4a > >> >>> > >> >>> _______________________________________________ > >> >>> general mailing list > >> >>> general at lists.openfabrics.org > >> >>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> >>> > >> >>> To unsubscribe, please visit http:// > >> >>> openib.org/mailman/listinfo/openib-general > >> >>> > >> >> _______________________________________________ > >> >> general mailing list > >> >> general at lists.openfabrics.org > >> >> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> >> > >> >> To unsubscribe, please visit http:// > >> >> openib.org/mailman/listinfo/openib-general > >> >> > >> >> > >> > > >> > > >> > > >> > > >> > > > > > > -- > > Ira Weiny > > > -- Ira Weiny From sashak at voltaire.com Tue Feb 17 17:03:03 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 03:03:03 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node In-Reply-To: <499AB068.2020205@dev.mellanox.co.il> References: <499AB068.2020205@dev.mellanox.co.il> Message-ID: <20090218010303.GZ7189@sashak.voltaire.com> Hi Yevgeny, On 14:41 Tue 17 Feb , Yevgeny Kliteynik wrote: > > This patch fixes bugzilla issue #1515: > > Topology: > |---------------| > | SW2 | > |---------------| > |x |y |z |v > |----| | | |----| > | | | | > | |----| |----| | > | | | | > a| b| c| d| > |---------------| |---------------| > | 
SW1 | | SW3 | > |---------------| |---------------| > | | > | | > HCA with SM HCA > > During the discovery: > > SM sends NodeInfo request to SW1 > SM sends NodeInfo request to SW2 through link a->x > SM discovers new node SW2: > - updates DR to SW2 to go through link a->x > - creates physp x And requests SwitchInfo from SW2, and on response sends PortInfo to all switch ports. PortInfo receiver will initialize all switch ports. Isn't it? Sasha > SM sends NodeInfo request to SW2 through link b->y > SM discovers a known node SW2 > - DOES NOT create physp y > - updates DR to SW2 to go through link b->y > > From now on, the DR to SW2 is going through port y, so OpenSM won't deal with > port y any more, leaving it uninitialized (no physp object for this port). > > The fix is to create physp for the newly discovered port of the known > switch node, same way as it is done for HCAs. > I also added one log message for the case that showed the problem - when > one of the link sides is uninitialized (no valid ports check). Perhaps > this log message should be an error message instead? 
> > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/osm_node_info_rcv.c | 24 +++++++++++++++++++++++- > 1 files changed, 23 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c > index c52c0d5..7da3103 100644 > --- a/opensm/opensm/osm_node_info_rcv.c > +++ b/opensm/opensm/osm_node_info_rcv.c > @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, > */ > if (!osm_node_link_has_valid_ports(p_node, port_num, > p_neighbor_node, > - p_ni_context->port_num)) > + p_ni_context->port_num)) { > + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > + "Link at node 0x%" PRIx64 ", port %u - no valid ports\n", > + cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); > goto _exit; > + } > > if (osm_node_link_exists(p_node, port_num, > p_neighbor_node, p_ni_context->port_num)) { > @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm, > IN osm_node_t * const p_node, > IN const osm_madw_t * const p_madw) > { > + > + ib_smp_t *p_smp; > + ib_node_info_t *p_ni; > + uint8_t port_num; > + > OSM_LOG_ENTER(sm->p_log); > > + p_smp = osm_madw_get_smp_ptr(p_madw); > + p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); > + port_num = ib_node_info_get_local_port_num(p_ni); > + > + if (!osm_node_get_physp_ptr(p_node, port_num)) { > + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > + "Creating physp for node GUID:0x%" > + PRIx64 ", port %u\n", > + cl_ntoh64(osm_node_get_node_guid(p_node)), > + port_num); > + osm_node_init_physp(p_node, p_madw); > + } > + > /* > If this switch has already been probed during this sweep, > then don't bother reprobing it. 
> -- 
> 1.5.1.4
>

From sashak at voltaire.com Tue Feb 17 17:15:05 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 03:15:05 +0200
Subject: [ofa-general] opensm/osm_inform.c:__match_inf_rec question
In-Reply-To:
References:
Message-ID: <20090218011457.GA7189@sashak.voltaire.com>

On 17:56 Tue 17 Feb , Hal Rosenstock wrote:
> In opensm/osm_inform.c:__match_inf_rec, around line 123, there is:
>
> /* if inform_info.gid is not zero, ignore lid range */
> if (!memcmp(&p_infr_rec->inform_record.inform_info.gid, &all_zero_gid,
>             sizeof(p_infr_rec->inform_record.inform_info.gid))) {
>
> Shouldn't this be if (memcmp) rather than if (!memcmp) ?

Yes, seems it should be without '!'. I can track it up to:

commit ce7f839355b9674c8d806747169d404066194235
Author: Yevgeny Kliteynik
Date:   Mon Nov 27 16:08:42 2006 +0000

    r10169: OpenSM: Comparing InformInfo records

, where this code was introduced.

Yevgeny! Do you remember was it just a typo?

Sasha

From weiny2 at llnl.gov Tue Feb 17 17:23:02 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 17:23:02 -0800
Subject: Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread
In-Reply-To: <20090218002839.GW7189@sashak.voltaire.com>
References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> <20090217142859.9e7a7e22.weiny2@llnl.gov> <20090218002839.GW7189@sashak.voltaire.com>
Message-ID: <20090217172302.49a11f17.weiny2@llnl.gov>

On Wed, 18 Feb 2009 02:28:39 +0200
Sasha Khapyorsky wrote:

> On 14:28 Tue 17 Feb , Ira Weiny wrote:
> > >
> > > > Are those calls which use it not thread safe?
> > >
> > > They look OK but I'm not 100% sure yet.
> >
> > Yea, they look thread safe but I am not sure either. :-(
>
> Could you, Guys, be more explicit? Really... :)

Neither interface is thread safe without the user implementing some
sort of locking around the calls.
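With Sasha's patch removing madrpc_lock()/madrpc_unlock() and the safe_*() wrappers, a multithreaded caller of the old interface has to serialize calls itself, which is what his patch description recommends ("use its own synchronization methods"). A minimal pthread sketch of that caller-side locking, assuming a hypothetical shared_query() that stands in for a call touching hidden static state; shared_query, locked_query and worker are illustrative names, not libibmad functions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a library call that uses hidden static
 * state, like the old madrpc path with its static port id. */
static unsigned shared_tid;

static unsigned shared_query(void)
{
	return ++shared_tid;	/* read-modify-write: not thread safe */
}

/* Application-owned lock: the caller, not the library, decides what
 * must be serialized and for how long. */
static pthread_mutex_t query_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned locked_query(void)
{
	unsigned tid;

	pthread_mutex_lock(&query_lock);
	tid = shared_query();
	pthread_mutex_unlock(&query_lock);
	return tid;
}

/* Thread body: issue *arg serialized queries. */
static void *worker(void *arg)
{
	int i, n = *(int *)arg;

	for (i = 0; i < n; i++)
		locked_query();
	return NULL;
}
```

Holding the application's own mutex also lets a caller group several calls (for example a query followed by a dependent set) into one critical section, which a per-call wrapper inside the library cannot express.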
>
> > I would be in favor of making all the utils use mad_rpc_open_port but it is up
> > to Sasha if we go down this path.
>
> The idea looks fine to me, let's review the patch.

Working on a patch series now... This is mainly to clean up the interface. Thread safety is a 2nd consideration...

Ira

From sashak at voltaire.com Tue Feb 17 17:49:55 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Feb 2009 03:49:55 +0200
Subject: Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread
In-Reply-To: <20090217172302.49a11f17.weiny2@llnl.gov>
References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> <20090217142859.9e7a7e22.weiny2@llnl.gov> <20090218002839.GW7189@sashak.voltaire.com> <20090217172302.49a11f17.weiny2@llnl.gov>
Message-ID: <20090218014955.GB7189@sashak.voltaire.com>

On 17:23 Tue 17 Feb , Ira Weiny wrote:
>
> Neither interface is thread safe without the user implementing some
> sort of locking around the calls.

Really? What about this:

int plus_three(int a)
{
	return a + 3;
}

We could extrapolate of course.

Sasha

From hal.rosenstock at gmail.com Tue Feb 17 18:18:32 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 17 Feb 2009 21:18:32 -0500
Subject: [ofa-general] [PATCH] libibmad: remove functions which use pthread
In-Reply-To: <20090218003957.GY7189@sashak.voltaire.com>
References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090218003957.GY7189@sashak.voltaire.com>
Message-ID:

On Tue, Feb 17, 2009 at 7:39 PM, Sasha Khapyorsky wrote:
> On 09:52 Mon 16 Feb , Hal Rosenstock wrote:
>>
>> A first step would be removing the portid as static. If so, portid
>> would need to be a supplied parameter to various mad routines and the
>> existing ones relying on madrpc_portid would be deprecated. Does this
>> make sense to do ?
>
> A first step would be converting all clients and internal usage in
> libibmad (if any) to use a newer interface. If this will go smoothly
> and things will not become overcomplicated, we could move forward -
> to deprecate old interface... etc.. Nothing new.

Why nothing new ? I think there are higher level support functions
which need to support the newer API.

-- Hal

> Sasha
>

From weiny2 at llnl.gov Tue Feb 17 20:38:58 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 20:38:58 -0800
Subject: Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread
In-Reply-To: <20090218014955.GB7189@sashak.voltaire.com>
References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> <20090217142859.9e7a7e22.weiny2@llnl.gov> <20090218002839.GW7189@sashak.voltaire.com> <20090217172302.49a11f17.weiny2@llnl.gov> <20090218014955.GB7189@sashak.voltaire.com>
Message-ID: <20090217203858.46abf45a.weiny2@llnl.gov>

On Wed, 18 Feb 2009 03:49:55 +0200
Sasha Khapyorsky wrote:

> On 17:23 Tue 17 Feb , Ira Weiny wrote:
> >
> > Neither interface is thread safe without the user implementing some
> > sort of locking around the calls.
>
> Really? What about this:
>
> int plus_three(int a)
> {
> 	return a + 3;
> }
>
> We could extrapolate of course.
>

I don't get it? Having static data like:

static int mad_portid = -1;
static int class_agent[MAX_CLASS];

Makes some functions dangerous. Then the other interface does not provide any locking...

Oh I guess you are saying that "a" is ibmad_port thingy... Then yes if you don't have threads modifying it at the same time you will be ok.

Ira

From weiny2 at llnl.gov Tue Feb 17 21:06:39 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 17 Feb 2009 21:06:39 -0800
Subject: [ofa-general] [PATCH 0/8] libibmad/infiniband-diags -- begin converting to "new" interface.
Message-ID: <20090217210639.9ef74a75.weiny2@llnl.gov> Here are 8 patches which move a long way toward using just the new interface. ibping caused some new functions to be created. Like I said before this has less to do with thread safeness than it does with creating a common clean interface. If nothing else it moves toward a more complete "new" interface. Let me know what you think, Ira -- Ira Weiny From weiny2 at llnl.gov Tue Feb 17 21:06:42 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 17 Feb 2009 21:06:42 -0800 Subject: [ofa-general] [PATCH 1/8] Clean up "new" interface Message-ID: <20090217210642.41c64624.weiny2@llnl.gov> >From bac9afe0da7772f97190b3ce758d3e5bfa1fcb65 Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Tue, 17 Feb 2009 17:32:15 -0800 Subject: [PATCH] Clean up "new" interface type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * Create new mad_rpc_portid(struct ibmad_port *srcport) function which mirrors madrpc_portid(void) Signed-off-by: weiny2 at llnl.gov --- libibmad/include/infiniband/mad.h | 58 ++++++++++++++++++++++-------------- libibmad/src/gs.c | 19 ++++++------ libibmad/src/libibmad.map | 1 + libibmad/src/resolve.c | 10 ++++-- libibmad/src/rpc.c | 29 +++++++++--------- libibmad/src/sa.c | 4 +- libibmad/src/smp.c | 4 +- 7 files changed, 71 insertions(+), 54 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 1aaaa1b..56b87e6 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -724,42 +724,49 @@ static inline int mad_is_vendor_range2(int mgmt) } /* rpc.c */ +/* Depricated interface */ MAD_EXPORT int madrpc_portid(void); -MAD_EXPORT int madrpc_set_retries(int retries); -MAD_EXPORT int madrpc_set_timeout(int timeout); void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata); void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data); MAD_EXPORT void madrpc_init(char 
*dev_name, int dev_port, int *mgmt_classes, int num_classes); void madrpc_save_mad(void *madbuf, int len); -MAD_EXPORT void madrpc_show_errors(int set); -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, +/* New interface */ +MAD_EXPORT void madrpc_show_errors(int set); +MAD_EXPORT int madrpc_set_retries(int retries); +MAD_EXPORT int madrpc_set_timeout(int timeout); +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, int num_classes); -void mad_rpc_close_port(void *ibmad_port); -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, +void mad_rpc_close_port(struct ibmad_port *srcport); +void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata); -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, +void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data); +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); /* smp.c */ MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout); MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout); + +/* smp.c new interface */ MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, - unsigned mod, unsigned timeout, const void *srcport); + unsigned mod, unsigned timeout, const struct ibmad_port *srcport); uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, - unsigned timeout, const void *srcport); + unsigned timeout, const struct ibmad_port *srcport); /* sa.c */ uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, unsigned timeout); -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, - ib_sa_call_t * sa, unsigned timeout); MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, 
ib_portid_t * sm_id, void *buf); /* returns lid */ -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, + +/* sa.c new interface */ +uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid, + ib_sa_call_t * sa, unsigned timeout); +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* resolve.c */ @@ -771,14 +778,17 @@ MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid); -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); +/* resolve.c new interface */ +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, + const struct ibmad_port *srcport); int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, - ib_portid_t * sm_id, int timeout, const void *srcport); + ib_portid_t * sm_id, int timeout, + const struct ibmad_port *srcport); int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, enum MAD_DEST dest, ib_portid_t * sm_id, - const void *srcport); + const struct ibmad_port *srcport); int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, - const void *srcport); + const struct ibmad_port *srcport); /* gs.c */ MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, @@ -798,26 +808,28 @@ MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout); +/* gs.c new interface */ uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport); + const struct ibmad_port *srcport); uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, - unsigned timeout, const void *srcport); + unsigned timeout, const struct ibmad_port *srcport); uint8_t 
*port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, unsigned timeout, - const void *srcport); + const struct ibmad_port *srcport); uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport); + const struct ibmad_port *srcport); uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, - unsigned timeout, const void *srcport); + unsigned timeout, + const struct ibmad_port *srcport); uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport); + const struct ibmad_port *srcport); uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport); + const struct ibmad_port *srcport); /* dump.c */ MAD_EXPORT ib_mad_dump_fn mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c index d2c4574..e302caf 100644 --- a/libibmad/src/gs.c +++ b/libibmad/src/gs.c @@ -47,7 +47,7 @@ static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, unsigned id, - const void *srcport) + const struct ibmad_port *srcport) { ib_rpc_t rpc = { 0 }; int lid = dest->lid; @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport) + const struct ibmad_port *srcport) { return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO, srcport); @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port, } uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, - unsigned timeout, const void *srcport) + unsigned timeout, const struct ibmad_port *srcport) { return pma_query_via(rcvbuf, dest, port, timeout, IB_GSI_PORT_COUNTERS, 
srcport); @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port, static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, unsigned timeout, - unsigned id, const void *srcport) + unsigned id, const struct ibmad_port *srcport) { ib_rpc_t rpc = { 0 }; int lid = dest->lid; @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port, uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, unsigned timeout, - const void *srcport) + const struct ibmad_port *srcport) { return performance_reset_via(rcvbuf, dest, port, mask, timeout, IB_GSI_PORT_COUNTERS, srcport); @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port, uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport) + const struct ibmad_port *srcport) { return pma_query_via(rcvbuf, dest, port, timeout, IB_GSI_PORT_COUNTERS_EXT, srcport); @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port, uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, - unsigned timeout, const void *srcport) + unsigned timeout, + const struct ibmad_port *srcport) { return performance_reset_via(rcvbuf, dest, port, mask, timeout, IB_GSI_PORT_COUNTERS_EXT, srcport); @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port, uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport) + const struct ibmad_port *srcport) { return pma_query_via(rcvbuf, dest, port, timeout, IB_GSI_PORT_SAMPLES_CONTROL, srcport); @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port, uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, int port, 
unsigned timeout, - const void *srcport) + const struct ibmad_port *srcport) { return pma_query_via(rcvbuf, dest, port, timeout, IB_GSI_PORT_SAMPLES_RESULT, srcport); diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index f944d86..94d7762 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -69,6 +69,7 @@ IBMAD_1.3 { mad_rpc_close_port; mad_rpc; mad_rpc_rmpp; + mad_rpc_portid; madrpc; madrpc_def_timeout; madrpc_init; diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index 553949d..3291f43 100644 --- a/libibmad/src/resolve.c +++ b/libibmad/src/resolve.c @@ -45,7 +45,8 @@ #undef DEBUG #define DEBUG if (ibdebug) IBWARN -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport) +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, + const struct ibmad_port *srcport) { ib_portid_t self = { 0 }; uint8_t portinfo[64]; @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) } int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, - ib_portid_t * sm_id, int timeout, const void *srcport) + ib_portid_t * sm_id, int timeout, + const struct ibmad_port *srcport) { ib_portid_t sm_portid; char buf[IB_SA_DATA_SIZE] = { 0 }; @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, enum MAD_DEST dest_type, ib_portid_t * sm_id, - const void *srcport) + const struct ibmad_port *srcport) { uint64_t guid; int lid; @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, } int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, - const void *srcport) + const struct ibmad_port *srcport) { ib_portid_t self = { 0 }; uint8_t portinfo[64]; diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index e811526..d47873b 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -100,6 +100,11 @@ int madrpc_portid(void) return mad_portid; } +int 
mad_rpc_portid(struct ibmad_port *srcport) +{ + return (srcport->port_id); +} + static int _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, int timeout) @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, return -1; } -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata) { - const struct ibmad_port *p = port_id; int status, len; uint8_t sndbuf[1024], rcvbuf[1024], *mad; @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) return 0; - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, - p->class_agents[rpc->mgtclass], + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, + port->class_agents[rpc->mgtclass], len, rpc->timeout)) < 0) { IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); return 0; @@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, return rcvdata; } -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) { - const struct ibmad_port *p = port_id; int status, len; uint8_t sndbuf[1024], rcvbuf[1024], *mad; @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) return 0; - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, - p->class_agents[rpc->mgtclass], + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, + port->class_agents[rpc->mgtclass], len, rpc->timeout)) < 0) { IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); return 0; @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) } } -void 
*mad_rpc_open_port(char *dev_name, int dev_port, +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) { struct ibmad_port *p; @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port, return p; } -void mad_rpc_close_port(void *port_id) +void mad_rpc_close_port(struct ibmad_port *port) { - struct ibmad_port *p = port_id; - - umad_close_port(p->port_id); - free(p); + umad_close_port(port->port_id); + free(port); } uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c index 7403d4f..ddeb152 100644 --- a/libibmad/src/sa.c +++ b/libibmad/src/sa.c @@ -44,7 +44,7 @@ #undef DEBUG #define DEBUG if (ibdebug) IBWARN -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, unsigned timeout) { ib_rpc_t rpc = { 0 }; @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, IB_PR_COMPMASK_SGID |\ IB_PR_COMPMASK_NUMBPATH) -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf) { int npath; diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c index fad263c..e5489b3 100644 --- a/libibmad/src/smp.c +++ b/libibmad/src/smp.c @@ -45,7 +45,7 @@ #define DEBUG if (ibdebug) IBWARN uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid, - unsigned mod, unsigned timeout, const void *srcport) + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) { ib_rpc_t rpc = { 0 }; @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid, } uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid, - unsigned mod, unsigned timeout, const void *srcport) + unsigned mod, unsigned 
timeout, const struct ibmad_port *srcport) { ib_rpc_t rpc = { 0 }; -- 1.5.4.5 From weiny2 at llnl.gov Tue Feb 17 21:06:45 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 17 Feb 2009 21:06:45 -0800 Subject: [ofa-general] [PATCH 2/8] Remove unused function madrpc_save_mad Message-ID: <20090217210645.e4762c94.weiny2@llnl.gov> >From 17ff2ea4947b64453e00b93ab1dfd639a69a7c35 Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Tue, 17 Feb 2009 17:40:01 -0800 Subject: [PATCH] Remove unused function madrpc_save_mad including the internal data used by it. Signed-off-by: weiny2 at llnl.gov --- libibmad/include/infiniband/mad.h | 1 - libibmad/src/libibmad.map | 1 - libibmad/src/rpc.c | 14 -------------- 3 files changed, 0 insertions(+), 16 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 56b87e6..5806e70 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -731,7 +731,6 @@ void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data); MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes); -void madrpc_save_mad(void *madbuf, int len); /* New interface */ MAD_EXPORT void madrpc_show_errors(int set); diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 94d7762..6f0c0b5 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -75,7 +75,6 @@ IBMAD_1.3 { madrpc_init; madrpc_portid; madrpc_rmpp; - madrpc_save_mad; madrpc_set_retries; madrpc_set_timeout; madrpc_show_errors; diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index d47873b..20eeb89 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -57,8 +57,6 @@ static int iberrs; static int madrpc_retries = MAD_DEF_RETRIES; static int def_madrpc_timeout = MAD_DEF_TIMEOUT_MS; -static void *save_mad; -static int save_mad_len = 256; #undef DEBUG #define DEBUG if (ibdebug) IBWARN @@ -71,12 +69,6 @@ void madrpc_show_errors(int 
set) iberrs = set; } -void madrpc_save_mad(void *madbuf, int len) -{ - save_mad = madbuf; - save_mad_len = len; -} - int madrpc_set_retries(int retries) { if (retries > 0) @@ -121,12 +113,6 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, xdump(stderr, "send buf\n", sndbuf, umad_size() + len); } - if (save_mad) { - memcpy(save_mad, umad_get_mad(sndbuf), - save_mad_len < len ? save_mad_len : len); - save_mad = 0; - } - trid = (uint32_t) mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F); -- 1.5.4.5 From weiny2 at llnl.gov Tue Feb 17 21:06:46 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 17 Feb 2009 21:06:46 -0800 Subject: [ofa-general] [PATCH 3/8] Convert ibaddr to "new" ibmad interface Message-ID: <20090217210646.5e74b9ed.weiny2@llnl.gov> >From 5bdf4bdf8ccba45f1a9a56b1c617fc711e73300d Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Tue, 17 Feb 2009 17:56:12 -0800 Subject: [PATCH] Convert ibaddr to "new" ibmad interface Signed-off-by: weiny2 at llnl.gov --- infiniband-diags/src/ibaddr.c | 17 ++++++++++++----- 1 files changed, 12 insertions(+), 5 deletions(-) diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c index 88ad904..fa62dbc 100644 --- a/infiniband-diags/src/ibaddr.c +++ b/infiniband-diags/src/ibaddr.c @@ -45,6 +45,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + static int ib_resolve_addr(ib_portid_t *portid, int portnum, int show_lid, int show_gid) { @@ -55,10 +57,10 @@ ib_resolve_addr(ib_portid_t *portid, int portnum, int show_lid, int show_gid) ibmad_gid_t gid; int lmc; - if (!smp_query(nodeinfo, portid, IB_ATTR_NODE_INFO, 0, 0)) + if (!smp_query_via(nodeinfo, portid, IB_ATTR_NODE_INFO, 0, 0, srcport)) return -1; - if (!smp_query(portinfo, portid, IB_ATTR_PORT_INFO, portnum, 0)) + if (!smp_query_via(portinfo, portid, IB_ATTR_PORT_INFO, portnum, 0, srcport)) return -1; mad_decode_field(portinfo, IB_PORT_LID_F, &portid->lid); @@ -137,17 +139,22 @@ int main(int argc, char 
**argv) if (!show_lid && !show_gid) show_lid = show_gid = 1; - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); if (argc) { - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[0]); } else { - if (ib_resolve_self(&portid, &port, 0) < 0) + if (ib_resolve_self_via(&portid, &port, 0, srcport) < 0) IBERROR("can't resolve self port %s", argv[0]); } if (ib_resolve_addr(&portid, port, show_lid, show_gid) < 0) IBERROR("can't resolve requested address"); + + mad_rpc_close_port(srcport); exit(0); } -- 1.5.4.5 From weiny2 at llnl.gov Tue Feb 17 21:06:48 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 17 Feb 2009 21:06:48 -0800 Subject: [ofa-general] [PATCH 4/8] convert ibping to "new" ibmad interface Message-ID: <20090217210648.403309e0.weiny2@llnl.gov> >From d109788f46b5839698f3f4a1f75bcbfe22a3b46d Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Tue, 17 Feb 2009 20:08:53 -0800 Subject: [PATCH] convert ibping to "new" ibmad interface To do this I needed the following additional functions mad_register_client_via mad_register_server_via mad_send_via mad_receive_via mad_respond_via ib_vendor_call_via Also further mark some functions as deprecated and clean up the interface a bit more.
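For reviewers, here is a minimal, self-contained sketch of the compatibility pattern the deprecated wrappers in this patch follow (for example mad_send() and ib_vendor_call()): the old entry point copies the implicit global state into a temporary struct ibmad_port on the stack and forwards to the new _via function. The struct layout mirrors mad_internal.h from this patch, but send_via(), send_legacy() and the legacy_* globals are hypothetical, simplified stand-ins, not libibmad functions.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CLASS 256

/* Simplified stand-in for struct ibmad_port from this patch's
 * mad_internal.h; real callers must treat the type as opaque. */
struct ibmad_port {
	int port_id;                 /* umad file descriptor */
	int class_agents[MAX_CLASS]; /* mgmt class -> agent id, -1 if none */
};

/* Hypothetical globals standing in for the state that madrpc_init()
 * sets up for the old, implicit-port interface. */
static int legacy_port_id = -1;
static int legacy_class_agent[MAX_CLASS];

/* New-style call: all state comes from the explicit srcport argument.
 * Returns the agent id that would be used for the send, or -1. */
static int send_via(int mgmt, const struct ibmad_port *srcport)
{
	if (!srcport || mgmt < 0 || mgmt >= MAX_CLASS)
		return -1;
	return srcport->class_agents[mgmt];
}

/* Deprecated wrapper: build a temporary port from the globals and
 * forward to the _via variant. Only the entry for 'mgmt' is filled
 * in, which is all send_via() reads. */
static int send_legacy(int mgmt)
{
	struct ibmad_port port;

	if (mgmt < 0 || mgmt >= MAX_CLASS)
		return -1;
	port.port_id = legacy_port_id;
	port.class_agents[mgmt] = legacy_class_agent[mgmt];
	return send_via(mgmt, &port);
}
```

The design keeps a single implementation (the _via function) while old callers keep working unchanged; the cost is that each wrapper must copy every piece of global state the _via function might read, which is why mad_respond() in this patch fills the whole class_agents table.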
Signed-off-by: weiny2 at llnl.gov --- infiniband-diags/src/ibping.c | 21 +++++++++---- libibmad/include/infiniband/mad.h | 29 ++++++++++++++++++ libibmad/src/libibmad.map | 5 +++ libibmad/src/mad_internal.h | 44 ++++++++++++++++++++++++++++ libibmad/src/register.c | 58 ++++++++++++++++++++++++++++++------- libibmad/src/rpc.c | 8 +---- libibmad/src/serv.c | 39 +++++++++++++++++++++++-- libibmad/src/vendor.c | 15 ++++++++- 8 files changed, 190 insertions(+), 29 deletions(-) create mode 100644 libibmad/src/mad_internal.h diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c index 29c98c2..7d458bf 100644 --- a/infiniband-diags/src/ibping.c +++ b/infiniband-diags/src/ibping.c @@ -48,6 +48,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + static char host_and_domain[IB_VENDOR_RANGE2_DATA_SIZE]; static char last_host[IB_VENDOR_RANGE2_DATA_SIZE]; @@ -90,7 +92,7 @@ ibping_serv(void) DEBUG("starting to serve..."); - while ((umad = mad_receive(0, -1))) { + while ((umad = mad_receive_via(0, -1, srcport))) { mad = umad_get_mad(umad); data = (char *)mad + IB_VENDOR_RANGE2_DATA_OFFS; @@ -99,7 +101,7 @@ ibping_serv(void) DEBUG("Pong: %s", data); - if (mad_respond(umad, 0, 0) < 0) + if (mad_respond_via(umad, 0, 0, srcport) < 0) DEBUG("respond failed"); mad_free(umad); @@ -128,7 +130,7 @@ ibping(ib_portid_t *portid, int quiet) call.timeout = 0; memset(&call.rmpp, 0, sizeof call.rmpp); - if (!ib_vendor_call(data, portid, &call)) + if (!ib_vendor_call_via(data, portid, &call, srcport)) return ~0llu; rtt = getcurrenttime() - start; @@ -216,10 +218,12 @@ int main(int argc, char **argv) if (!argc && !server) ibdiag_show_usage(); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); if (server) { - if (mad_register_server(ping_class, 0, 0, oui) < 0) + if (mad_register_server_via(ping_class, 0, 0, oui, 
srcport) < 0) IBERROR("can't serve class %d on this port", ping_class); get_host_and_domain(host_and_domain, sizeof host_and_domain); @@ -229,10 +233,11 @@ int main(int argc, char **argv) exit(0); } - if (mad_register_client(ping_class, 0) < 0) + if (mad_register_client_via(ping_class, 0, srcport) < 0) IBERROR("can't register ping class %d on this port", ping_class); - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[0]); signal(SIGINT, report); @@ -260,5 +265,7 @@ int main(int argc, char **argv) report(0); + mad_rpc_close_port(srcport); + exit(-1); } diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 5806e70..8e61395 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -650,6 +650,7 @@ enum MAD_NODE_TYPE { }; /******************************************************************************/ +struct ibmad_port; /* portid.c */ MAD_EXPORT char *portid2str(ib_portid_t * portid); @@ -692,26 +693,50 @@ MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data); /* register.c */ +/* depricated */ MAD_EXPORT int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); MAD_EXPORT int mad_register_client(int mgmt, uint8_t rmpp_version); MAD_EXPORT int mad_register_server(int mgmt, uint8_t rmpp_version, long method_mask[16 / sizeof(long)], uint32_t class_oui); + +/* register.c new interface */ +MAD_EXPORT int mad_register_client_via(int mgmt, uint8_t rmpp_version, + struct ibmad_port *srcport); +MAD_EXPORT int mad_register_server_via(int mgmt, uint8_t rmpp_version, + long method_mask[16 / sizeof(long)], + uint32_t class_oui, + struct ibmad_port *srcport); MAD_EXPORT int mad_class_agent(int mgmt); MAD_EXPORT int mad_agent_class(int agent); /* serv.c */ +/* depricated */ 
MAD_EXPORT int mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data); MAD_EXPORT void *mad_receive(void *umad, int timeout); MAD_EXPORT int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus); + +/* serv.c new interface */ +MAD_EXPORT int mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport, + ib_rmpp_hdr_t * rmpp, void *data, + struct ibmad_port *srcport); +MAD_EXPORT void *mad_receive_via(void *umad, int timeout, + struct ibmad_port *srcport); +MAD_EXPORT int mad_respond_via(void *umad, ib_portid_t * portid, uint32_t rstatus, + struct ibmad_port *srcport); MAD_EXPORT void *mad_alloc(void); MAD_EXPORT void mad_free(void *umad); /* vendor.c */ +/* depricated */ MAD_EXPORT uint8_t *ib_vendor_call(void *data, ib_portid_t * portid, ib_vendor_call_t * call); +/* vendor.c new interface */ +MAD_EXPORT uint8_t *ib_vendor_call_via(void *data, ib_portid_t * portid, + ib_vendor_call_t * call, + struct ibmad_port *srcport); static inline int mad_is_vendor_range1(int mgmt) { @@ -746,6 +771,7 @@ void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); /* smp.c */ +/* depricated */ MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout); MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid, @@ -758,6 +784,7 @@ uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout, const struct ibmad_port *srcport); /* sa.c */ +/* depricated */ uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, unsigned timeout); MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */ @@ -769,6 +796,7 @@ int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* resolve.c */ +/* depricated */ MAD_EXPORT int 
ib_resolve_smlid(ib_portid_t * sm_id, int timeout); MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid, ib_portid_t * sm_id, int timeout); @@ -790,6 +818,7 @@ int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, const struct ibmad_port *srcport); /* gs.c */ +/* depricated */ MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout); MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 6f0c0b5..ee1804a 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -60,6 +60,8 @@ IBMAD_1.3 { mad_class_agent; mad_register_client; mad_register_server; + mad_register_client_via; + mad_register_server_via; ib_resolve_guid; ib_resolve_portid_str; ib_resolve_self; @@ -85,10 +87,13 @@ IBMAD_1.3 { mad_free; mad_receive; mad_respond; + mad_receive_via; + mad_respond_via; mad_send; smp_query; smp_set; ib_vendor_call; + ib_vendor_call_via; smp_query_via; smp_set_via; ib_path_query_via; diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h new file mode 100644 index 0000000..9afe7a9 --- /dev/null +++ b/libibmad/src/mad_internal.h @@ -0,0 +1,44 @@ +/* + * Copyright (c) 2004-2006 Voltaire Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef _MAD_INTERNAL_H_ +#define _MAD_INTERNAL_H_ + +#define MAX_CLASS 256 + +struct ibmad_port { + int port_id; /* file descriptor returned by umad_open() */ + int class_agents[MAX_CLASS]; /* class2agent mapper */ +}; + +#endif /* _MAD_INTERNAL_H_ */ diff --git a/libibmad/src/register.c b/libibmad/src/register.c index 4d91ff8..4aabd7c 100644 --- a/libibmad/src/register.c +++ b/libibmad/src/register.c @@ -43,10 +43,11 @@ #include #include +#include "mad_internal.h" + #undef DEBUG #define DEBUG if (ibdebug) IBWARN -#define MAX_CLASS 256 #define MAX_AGENTS 256 static int class_agent[MAX_CLASS]; @@ -136,22 +137,57 @@ int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version) int mad_register_client(int mgmt, uint8_t rmpp_version) { + int rc = 0; + struct ibmad_port port; + + port.port_id = madrpc_portid(); + rc = mad_register_client_via(mgmt, rmpp_version, &port); + if (rc < 0) + return rc; + return register_agent(port.class_agents[mgmt], mgmt); +} + +int mad_register_client_via(int mgmt, uint8_t rmpp_version, + struct ibmad_port *srcport) +{ int agent; - agent = mad_register_port_client(madrpc_portid(), mgmt, rmpp_version); + if (!srcport) + return -1; + + agent = mad_register_port_client(mad_rpc_portid(srcport), mgmt, 
rmpp_version); if (agent < 0) return agent; - return register_agent(agent, mgmt); + srcport->class_agents[mgmt] = agent; + return 0; } int mad_register_server(int mgmt, uint8_t rmpp_version, long method_mask[], uint32_t class_oui) { + int rc = 0; + struct ibmad_port port; + + port.port_id = madrpc_portid(); + port.class_agents[mgmt] = class_agent[mgmt]; + rc = mad_register_server_via(mgmt, rmpp_version, + method_mask, class_oui, + &port); + if (rc < 0) + return rc; + return register_agent(port.class_agents[mgmt], mgmt); +} + +int +mad_register_server_via(int mgmt, uint8_t rmpp_version, + long method_mask[], uint32_t class_oui, + struct ibmad_port *srcport) +{ long class_method_mask[16 / sizeof(long)]; uint8_t oui[3]; - int agent, vers, mad_portid; + int agent, vers; if (method_mask) memcpy(class_method_mask, method_mask, @@ -159,11 +195,12 @@ mad_register_server(int mgmt, uint8_t rmpp_version, else memset(class_method_mask, 0xff, sizeof(class_method_mask)); - if ((mad_portid = madrpc_portid()) < 0) + if (!srcport) return -1; - if (class_agent[mgmt] >= 0) { - DEBUG("Class 0x%x already registered", mgmt); + if (srcport->class_agents[mgmt] >= 0) { + DEBUG("Class 0x%x already registered %d", + mgmt, srcport->class_agents[mgmt]); return -1; } if ((vers = mgmt_class_vers(mgmt)) <= 0) { @@ -175,19 +212,18 @@ mad_register_server(int mgmt, uint8_t rmpp_version, oui[0] = (class_oui >> 16) & 0xff; oui[1] = (class_oui >> 8) & 0xff; oui[2] = class_oui & 0xff; - if ((agent = umad_register_oui(mad_portid, mgmt, rmpp_version, + if ((agent = umad_register_oui(srcport->port_id, mgmt, rmpp_version, oui, class_method_mask)) < 0) { DEBUG("Can't register agent for class %d", mgmt); return -1; } - } else if ((agent = umad_register(mad_portid, mgmt, vers, rmpp_version, + } else if ((agent = umad_register(srcport->port_id, mgmt, vers, rmpp_version, class_method_mask)) < 0) { DEBUG("Can't register agent for class %d", mgmt); return -1; } - if (register_agent(agent, mgmt) < 0) - return -1; + 
srcport->class_agents[mgmt] = agent; return agent; } diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index 20eeb89..bcb0a75 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -43,12 +43,7 @@ #include #include -#define MAX_CLASS 256 - -struct ibmad_port { - int port_id; /* file descriptor returned by umad_open() */ - int class_agents[MAX_CLASS]; /* class2agent mapper */ -}; +#include "mad_internal.h" int ibdebug; @@ -325,6 +320,7 @@ struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, return NULL; } + memset(p->class_agents, 0xff, sizeof p->class_agents); while (num_classes--) { uint8_t rmpp_version = 0; int mgmt = *mgmt_classes++; diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c index c7631bb..0ce1660 100644 --- a/libibmad/src/serv.c +++ b/libibmad/src/serv.c @@ -42,12 +42,25 @@ #include #include +#include "mad_internal.h" + #undef DEBUG #define DEBUG if (ibdebug) IBWARN int mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) { + struct ibmad_port port; + + port.port_id = madrpc_portid(); + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); + return mad_send_via(rpc, dport, rmpp, data, &port); +} + +int +mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data, + struct ibmad_port *srcport) +{ uint8_t pktbuf[1024]; void *umad = pktbuf; @@ -64,7 +77,7 @@ mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) (char *)umad_get_mad(umad) + rpc->dataoffs, rpc->datasz); } - if (umad_send(madrpc_portid(), mad_class_agent(rpc->mgtclass), + if (umad_send(srcport->port_id, srcport->class_agents[rpc->mgtclass], umad, IB_MAD_SIZE, rpc->timeout, 0) < 0) { IBWARN("send failed; %m"); return -1; @@ -75,6 +88,18 @@ mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus) { + int i = 0; + struct ibmad_port port; + + port.port_id = madrpc_portid(); + for (i = 1; i < 
MAX_CLASS; i++) + port.class_agents[i] = mad_class_agent(i); + return mad_respond_via(umad, portid, rstatus, &port); +} + +int mad_respond_via(void *umad, ib_portid_t * portid, uint32_t rstatus, + struct ibmad_port *srcport) +{ uint8_t *mad = umad_get_mad(umad); ib_mad_addr_t *mad_addr; ib_rpc_t rpc = { 0 }; @@ -138,7 +163,7 @@ int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus) if (ibdebug > 1) xdump(stderr, "mad respond pkt\n", mad, IB_MAD_SIZE); - if (umad_send(madrpc_portid(), mad_class_agent(rpc.mgtclass), umad, + if (umad_send(srcport->port_id, srcport->class_agents[rpc.mgtclass], umad, IB_MAD_SIZE, rpc.timeout, 0) < 0) { DEBUG("send failed; %m"); return -1; @@ -149,11 +174,19 @@ int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus) void *mad_receive(void *umad, int timeout) { + struct ibmad_port port; + + port.port_id = madrpc_portid(); + return mad_receive_via(umad, timeout, &port); +} + +void *mad_receive_via(void *umad, int timeout, struct ibmad_port *srcport) +{ void *mad = umad ? 
umad : umad_alloc(1, umad_size() + IB_MAD_SIZE); int agent; int length = IB_MAD_SIZE; - if ((agent = umad_recv(madrpc_portid(), mad, &length, timeout)) < 0) { + if ((agent = umad_recv(srcport->port_id, mad, &length, timeout)) < 0) { if (!umad) umad_free(mad); DEBUG("recv failed: %m"); diff --git a/libibmad/src/vendor.c b/libibmad/src/vendor.c index 50a878e..1a129e5 100644 --- a/libibmad/src/vendor.c +++ b/libibmad/src/vendor.c @@ -40,6 +40,7 @@ #include #include +#include "mad_internal.h" #undef DEBUG #define DEBUG if (ibdebug) IBWARN @@ -53,6 +54,16 @@ static inline int response_expected(int method) uint8_t *ib_vendor_call(void *data, ib_portid_t * portid, ib_vendor_call_t * call) { + struct ibmad_port port; + + port.port_id = madrpc_portid(); + return ib_vendor_call_via(data, portid, call, &port); +} + +uint8_t *ib_vendor_call_via(void *data, ib_portid_t * portid, + ib_vendor_call_t * call, + struct ibmad_port *srcport) +{ ib_rpc_t rpc = { 0 }; int range1 = 0, resp_expected; @@ -90,7 +101,7 @@ uint8_t *ib_vendor_call(void *data, ib_portid_t * portid, portid->qkey = IB_DEFAULT_QP1_QKEY; if (resp_expected) - return madrpc_rmpp(&rpc, portid, 0, data); /* FIXME: no RMPP for now */ + return mad_rpc_rmpp(srcport, &rpc, portid, 0, data); /* FIXME: no RMPP for now */ - return mad_send(&rpc, portid, 0, data) < 0 ? 0 : data; /* FIXME: no RMPP for now */ + return mad_send_via(&rpc, portid, 0, data, srcport) < 0 ? 
0 : data; /* FIXME: no RMPP for now */ } -- 1.5.4.5 From weiny2 at llnl.gov Tue Feb 17 21:06:50 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 17 Feb 2009 21:06:50 -0800 Subject: [ofa-general] [PATCH 5/8] Convert ibportstate to "new" ibmad interface Message-ID: <20090217210650.3397dd72.weiny2@llnl.gov> >From dacabed9a22d308d9f61beb6f4906f2414a5ee29 Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Tue, 17 Feb 2009 20:21:14 -0800 Subject: [PATCH] Convert ibportstate to "new" ibmad interface Signed-off-by: weiny2 at llnl.gov --- infiniband-diags/src/ibportstate.c | 16 +++++++++++----- 1 files changed, 11 insertions(+), 5 deletions(-) diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c index d1a112b..4edafd0 100644 --- a/infiniband-diags/src/ibportstate.c +++ b/infiniband-diags/src/ibportstate.c @@ -46,6 +46,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + /*******************************************/ static int @@ -53,7 +55,7 @@ get_node_info(ib_portid_t *dest, uint8_t *data) { int node_type; - if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport)) return -1; node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); @@ -69,7 +71,7 @@ get_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op) char buf[2048]; char val[64]; - if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) + if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) return -1; if (port_op != 4) { @@ -223,9 +225,12 @@ int main(int argc, char **argv) if (argc < 2) ibdiag_show_usage(); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) 
IBERROR("can't resolve destination port %s", argv[0]); /* First, make sure it is a switch port if it is a "set" */ @@ -314,7 +319,8 @@ int main(int argc, char **argv) peerportid.drpath.p[1] = portnum; /* Set DrSLID to local lid */ - if (ib_resolve_self(&selfportid, &selfport, 0) < 0) + if (ib_resolve_self_via(&selfportid, + &selfport, 0, srcport) < 0) IBERROR("could not resolve self"); peerportid.drpath.drslid = selfportid.lid; peerportid.drpath.drdlid = 0xffff; -- 1.5.4.5 From weiny2 at llnl.gov Tue Feb 17 21:06:53 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 17 Feb 2009 21:06:53 -0800 Subject: [ofa-general] [PATCH 6/8] Convert ibroute to "new" ibmad interface Message-ID: <20090217210653.9c88786f.weiny2@llnl.gov> >From 2edbb6ec9d7828bfd75777dbaab8918675d3bd06 Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Tue, 17 Feb 2009 20:28:21 -0800 Subject: [PATCH] Convert ibroute to "new" ibmad interface Signed-off-by: weiny2 at llnl.gov --- infiniband-diags/src/ibroute.c | 30 +++++++++++++++++++----------- 1 files changed, 19 insertions(+), 11 deletions(-) diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c index 144d1b2..60bfdd8 100644 --- a/infiniband-diags/src/ibroute.c +++ b/infiniband-diags/src/ibroute.c @@ -49,6 +49,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + static int brief, dump_all, multicast; /*******************************************/ @@ -61,12 +63,12 @@ check_switch(ib_portid_t *portid, int *nports, uint64_t *guid, int type; DEBUG("checking node type"); - if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, 0)) { + if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, 0, srcport)) { xdump(stderr, "nodeinfo\n", ni, sizeof ni); return "node info failed: valid addr?"; } - if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, 0)) + if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, 0, srcport)) return "node desc failed"; mad_decode_field(ni, IB_NODE_TYPE_F, &type); @@ -77,7 +79,7 @@ check_switch(ib_portid_t 
*portid, int *nports, uint64_t *guid, mad_decode_field(ni, IB_NODE_NPORTS_F, nports); mad_decode_field(ni, IB_NODE_GUID_F, guid); - if (!smp_query(sw, portid, IB_ATTR_SWITCH_INFO, 0, 0)) + if (!smp_query_via(sw, portid, IB_ATTR_SWITCH_INFO, 0, 0, srcport)) return "switch info failed: is a switch node?"; return 0; @@ -195,7 +197,8 @@ dump_multicast_tables(ib_portid_t *portid, int startlid, int endlid) mod = (block - IB_MIN_MCAST_LID/IB_MLIDS_IN_BLOCK) | (j << 28); DEBUG("reading block %x chunk %d mod %x", block, j, mod); - if (!smp_query(mft + j, portid, IB_ATTR_MULTICASTFORWTBL, mod, 0)) + if (!smp_query_via(mft + j, portid, + IB_ATTR_MULTICASTFORWTBL, mod, 0, srcport)) return "multicast forwarding table get failed"; } @@ -259,9 +262,9 @@ dump_lid(char *str, int strlen, int lid, int valid) portguid = 0; lidport.lid = lid; - if (!smp_query(nd, &lidport, IB_ATTR_NODE_DESC, 0, 100) || - !smp_query(pi, &lidport, IB_ATTR_PORT_INFO, 0, 100) || - !smp_query(ni, &lidport, IB_ATTR_NODE_INFO, 0, 100)) + if (!smp_query_via(nd, &lidport, IB_ATTR_NODE_DESC, 0, 100, srcport) || + !smp_query_via(pi, &lidport, IB_ATTR_PORT_INFO, 0, 100, srcport) || + !smp_query_via(ni, &lidport, IB_ATTR_NODE_INFO, 0, 100, srcport)) return snprintf(str, strlen, ": (unknown node and type)"); mad_decode_field(ni, IB_NODE_PORT_GUID_F, &portguid); @@ -316,7 +319,8 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid) endblock = ALIGN(endlid, IB_SMP_DATA_SIZE) / IB_SMP_DATA_SIZE; for (block = startblock; block <= endblock; block++) { DEBUG("reading block %d", block); - if (!smp_query(lft, portid, IB_ATTR_LINEARFORWTBL, block, 0)) + if (!smp_query_via(lft, portid, IB_ATTR_LINEARFORWTBL, block, + 0, srcport)) return "linear forwarding table get failed"; i = block * IB_SMP_DATA_SIZE; e = i + IB_SMP_DATA_SIZE; @@ -403,12 +407,15 @@ int main(int argc, char **argv) if (argc > 2) endlid = strtoul(argv[2], 0, 0); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = 
mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); if (!argc) { - if (ib_resolve_self(&portid, 0, 0) < 0) + if (ib_resolve_self_via(&portid, 0, 0, srcport) < 0) IBERROR("can't resolve self addr"); - } else if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + } else if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[1]); if (multicast) @@ -419,5 +426,6 @@ int main(int argc, char **argv) if (err) IBERROR("dump tables: %s", err); + mad_rpc_close_port(srcport); exit(0); } -- 1.5.4.5 From weiny2 at llnl.gov Tue Feb 17 21:06:54 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 17 Feb 2009 21:06:54 -0800 Subject: [ofa-general] [PATCH 7/8] Convert ibsendtrap to "new" ibmad interface Message-ID: <20090217210654.b70a38d3.weiny2@llnl.gov> >From ac3d76c8ed77ab406a3297c1ba15598ae7cc15d2 Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Tue, 17 Feb 2009 20:45:16 -0800 Subject: [PATCH] Convert ibsendtrap to "new" ibmad interface also make mad_send_via public to do the conversion Signed-off-by: weiny2 at llnl.gov --- infiniband-diags/src/ibsendtrap.c | 13 +++++++++---- libibmad/src/libibmad.map | 1 + 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index ba6aa8b..d038dff 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -47,6 +47,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + static int send_144_node_desc_update(void) { ib_portid_t sm_port; @@ -55,10 +57,10 @@ static int send_144_node_desc_update(void) ib_rpc_t trap_rpc; ib_mad_notice_attr_t notice; - if (ib_resolve_self(&selfportid, &selfport, NULL)) + if (ib_resolve_self_via(&selfportid, &selfport, NULL, srcport)) IBERROR("can't resolve self"); - if (ib_resolve_smlid(&sm_port, 0)) + if 
(ib_resolve_smlid_via(&sm_port, 0, srcport)) IBERROR("can't resolve SM destination port"); memset(&trap_rpc, 0, sizeof(trap_rpc)); @@ -80,7 +82,7 @@ static int send_144_node_desc_update(void) notice.data_details.ntc_144.change_flgs = TRAP_144_MASK_NODE_DESCRIPTION_CHANGE; - return (mad_send(&trap_rpc, &sm_port, NULL, &notice)); + return (mad_send_via(&trap_rpc, &sm_port, NULL, &notice, srcport)); } typedef struct _trap_def { @@ -137,7 +139,10 @@ int main(int argc, char **argv) } madrpc_show_errors(1); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2); + + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); return (send_trap(trap_name)); } diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index ee1804a..4a44f02 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -90,6 +90,7 @@ IBMAD_1.3 { mad_receive_via; mad_respond_via; mad_send; + mad_send_via; smp_query; smp_set; ib_vendor_call; -- 1.5.4.5 From weiny2 at llnl.gov Tue Feb 17 21:06:56 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 17 Feb 2009 21:06:56 -0800 Subject: [ofa-general] [PATCH 8/8] Convert ibtracert to "new" ibmad interface Message-ID: <20090217210656.598be400.weiny2@llnl.gov> >From 69db58d3e525031f5a975403574b36d6b9b3adf2 Mon Sep 17 00:00:00 2001 From: weiny2 at llnl.gov Date: Tue, 17 Feb 2009 20:56:40 -0800 Subject: [PATCH] Convert ibtracert to "new" ibmad interface Signed-off-by: weiny2 at llnl.gov --- infiniband-diags/src/ibtracert.c | 36 ++++++++++++++++++++++++------------ 1 files changed, 24 insertions(+), 12 deletions(-) diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c index ea5662b..1965aa0 100644 --- a/infiniband-diags/src/ibtracert.c +++ b/infiniband-diags/src/ibtracert.c @@ -50,6 +50,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + #define MAXHOPS 63 static char *node_type_str[] = { @@ -116,10 +118,10 @@
get_node(Node *node, Port *port, ib_portid_t *portid) void *pi = port->portinfo, *ni = node->nodeinfo, *nd = node->nodedesc; char *s, *e; - if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout)) + if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout, srcport)) return -1; - if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, timeout)) + if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout, srcport)) return -1; for (s = nd, e = s + 64; s < e; s++) { @@ -129,7 +131,7 @@ get_node(Node *node, Port *port, ib_portid_t *portid) *s = ' '; } - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, 0, timeout)) + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout, srcport)) return -1; mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid); @@ -151,7 +153,7 @@ switch_lookup(Switch *sw, ib_portid_t *portid, int lid) { void *si = sw->switchinfo, *fdb = sw->fdb; - if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout)) + if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport)) return -1; mad_decode_field(si, IB_SW_LINEAR_FDB_CAP_F, &sw->linearcap); @@ -160,7 +162,8 @@ switch_lookup(Switch *sw, ib_portid_t *portid, int lid) if (lid > sw->linearcap && lid > sw->linearFDBtop) return -1; - if (!smp_query(fdb, portid, IB_ATTR_LINEARFORWTBL, lid / 64, timeout)) + if (!smp_query_via(fdb, portid, IB_ATTR_LINEARFORWTBL, lid / 64, + timeout, srcport)) return -1; DEBUG("portid %s: forward lid %d to port %d", @@ -382,7 +385,8 @@ get_port(Port *port, int portnum, ib_portid_t *portid) port->portnum = portnum; - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout)) + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout, + srcport)) return -1; mad_decode_field(pi, IB_PORT_LID_F, &port->lid); @@ -439,7 +443,7 @@ switch_mclookup(Node *node, ib_portid_t *portid, int mlid, char *map) memset(map, 0, 256); - if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout)) + if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, 
srcport)) return -1; mlid -= 0xc000; @@ -453,8 +457,8 @@ switch_mclookup(Node *node, ib_portid_t *portid, int mlid, char *map) maxsets = (node->numports + 15) / 16; /* round up */ for (set = 0; set < maxsets; set++) { - if (!smp_query(mdb, portid, IB_ATTR_MULTICASTFORWTBL, - block | (set << 28), timeout)) + if (!smp_query_via(mdb, portid, IB_ATTR_MULTICASTFORWTBL, + block | (set << 28), timeout, srcport)) return -1; for (i = 0; i < 16; i++, map++) { @@ -746,13 +750,18 @@ int main(int argc, char **argv) if (ibd_timeout) timeout = ibd_timeout; - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + node_name_map = open_node_name_map(node_name_map_file); - if (ib_resolve_portid_str(&src_portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&src_portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve source port %s", argv[0]); - if (ib_resolve_portid_str(&dest_portid, argv[1], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&dest_portid, argv[1], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[1]); if (ibd_dest_type == IB_DEST_DRPATH) { @@ -796,5 +805,8 @@ int main(int argc, char **argv) dump_mcpath(endnode, dumplevel); close_node_name_map(node_name_map); + + mad_rpc_close_port(srcport); + exit(0); } -- 1.5.4.5 From jackm at dev.mellanox.co.il Tue Feb 17 23:13:15 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 18 Feb 2009 09:13:15 +0200 Subject: [ofa-general] Re: [PATCH] IPoIB: In unicast_arp, do path_free only for newly-created paths In-Reply-To: References: <200902171701.36107.jackm@dev.mellanox.co.il> Message-ID: <200902180913.16171.jackm@dev.mellanox.co.il> On Wednesday 18 February 2009 00:54, Roland Dreier wrote: >  > Signed-off-by: Jack Morgenstein >  > Signed-off-by: Moni Shua > > This 
doesn't make any sense... Moni was not involved in sending this > patch at all, and in any case since you are sending the patch your s-o-b > should be last.  If you want to give credit to Moni then include it in > the description as you did for Yossi. > Yossi identified the problem flow. I wrote and tested the actual patch. Moni reviewed it, and I wrote the final version. I always thought that the first s-o-b was for the patch writer. Next time, I'll do it right. From monis at Voltaire.COM Wed Feb 18 00:07:08 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Wed, 18 Feb 2009 10:07:08 +0200 Subject: [ofa-general] Re: [PATCH] IPoIB: In unicast_arp, do path_free only for newly-created paths In-Reply-To: References: <200902171701.36107.jackm@dev.mellanox.co.il> Message-ID: <499BC1AC.6010908@Voltaire.COM> Roland Dreier wrote: > thanks, applied... > > > Signed-off-by: Jack Morgenstein > > Signed-off-by: Moni Shua > > This doesn't make any sense... Moni was not involved in sending this > patch at all, and in any case since you are sending the patch your s-o-b > should be last. If you want to give credit to Moni then include it in > the description as you did for Yossi. 
> This is fine with me (if it's still relevant) From kliteyn at dev.mellanox.co.il Wed Feb 18 01:15:02 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 18 Feb 2009 11:15:02 +0200 Subject: [ofa-general] ***SPAM*** opensm/osm_inform.c:__match_inf_rec question In-Reply-To: <20090218011457.GA7189@sashak.voltaire.com> References: <20090218011457.GA7189@sashak.voltaire.com> Message-ID: <499BD196.8070504@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > On 17:56 Tue 17 Feb , Hal Rosenstock wrote: >> In opensm/osm_inform.c:__match_inf_rec, around line 123, there is: >> >> /* if inform_info.gid is not zero, ignore lid range */ >> if (!memcmp(&p_infr_rec->inform_record.inform_info.gid, &all_zero_gid, >> sizeof(p_infr_rec->inform_record.inform_info.gid))) { >> >> Shouldn't this be if (memcmp) rather than if (!memcmp) ? > > Yes, seems it should be without '!'. I can track it up to: > > commit ce7f839355b9674c8d806747169d404066194235 > Author: Yevgeny Kliteynik > Date: Mon Nov 27 16:08:42 2006 +0000 > > r10169: OpenSM: Comparing InformInfo records > > , where this code was introduced. > > Yevgeny! Do you remember was it just a typo? Can't think of any reason for the '!' to be there. Looks like a typo. 
-- Yevgeny > Sasha > From kliteyn at dev.mellanox.co.il Wed Feb 18 01:31:07 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 18 Feb 2009 11:31:07 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node In-Reply-To: <20090218010303.GZ7189@sashak.voltaire.com> References: <499AB068.2020205@dev.mellanox.co.il> <20090218010303.GZ7189@sashak.voltaire.com> Message-ID: <499BD55B.3090606@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 14:41 Tue 17 Feb , Yevgeny Kliteynik wrote: >> This patch fixes bugzilla issue #1515: >> >> Topology: >> |---------------| >> | SW2 | >> |---------------| >> |x |y |z |v >> |----| | | |----| >> | | | | >> | |----| |----| | >> | | | | >> a| b| c| d| >> |---------------| |---------------| >> | SW1 | | SW3 | >> |---------------| |---------------| >> | | >> | | >> HCA with SM HCA >> >> During the discovery: >> >> SM sends NodeInfo request to SW1 >> SM sends NodeInfo request to SW2 through link a->x >> SM discovers new node SW2: >> - updates DR to SW2 to go through link a->x >> - creates physp x > > And requests SwitchInfo from SW2, and on response sends PortInfo to all > switch ports. PortInfo receiver will initialize all switch ports. Isn't > it? Links are created only upon receiving a NodeInfo response. Without the fix, when SW1 gets NodeInfo from SW2 through link b->y, it doesn't initialize the physp for y, hence the link can't be created. So the only remaining chance for the link to be created is a NodeInfo request sent from SW2 to SW1 through link y->b. But this doesn't happen: the DR path to SW2 has been updated to contain this link, so the SM doesn't probe the remote side of y, to avoid a loop. BTW, the same thing happens with every other link that connects the same pair of nodes. In the example above, link v<->d will be missing as well.
-- Yevgeny > Sasha > >> SM sends NodeInfo request to SW2 through link b->y >> SM discovers a known node SW2 >> - DOES NOT create physp y >> - updates DR to SW2 to go through link b->y >> >> From now on, the DR to SW2 is going through port y, so OpenSM won't deal with >> port y any more, leaving it uninitialized (no physp object for this port). >> >> The fix is to create physp for the newly discovered port of the known >> switch node, same way as it is done for HCAs. >> I also added one log message for the case that showed the problem - when >> one of the link sides is uninitialized (no valid ports check). Perhaps >> this log message should be an error message instead? >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/opensm/osm_node_info_rcv.c | 24 +++++++++++++++++++++++- >> 1 files changed, 23 insertions(+), 1 deletions(-) >> >> diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c >> index c52c0d5..7da3103 100644 >> --- a/opensm/opensm/osm_node_info_rcv.c >> +++ b/opensm/opensm/osm_node_info_rcv.c >> @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, >> */ >> if (!osm_node_link_has_valid_ports(p_node, port_num, >> p_neighbor_node, >> - p_ni_context->port_num)) >> + p_ni_context->port_num)) { >> + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> + "Link at node 0x%" PRIx64 ", port %u - no valid ports\n", >> + cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); >> goto _exit; >> + } >> >> if (osm_node_link_exists(p_node, port_num, >> p_neighbor_node, p_ni_context->port_num)) { >> @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm, >> IN osm_node_t * const p_node, >> IN const osm_madw_t * const p_madw) >> { >> + >> + ib_smp_t *p_smp; >> + ib_node_info_t *p_ni; >> + uint8_t port_num; >> + >> OSM_LOG_ENTER(sm->p_log); >> >> + p_smp = osm_madw_get_smp_ptr(p_madw); >> + p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); >> + port_num = ib_node_info_get_local_port_num(p_ni); >> + >> + if 
(!osm_node_get_physp_ptr(p_node, port_num)) { >> + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> + "Creating physp for node GUID:0x%" >> + PRIx64 ", port %u\n", >> + cl_ntoh64(osm_node_get_node_guid(p_node)), >> + port_num); >> + osm_node_init_physp(p_node, p_madw); >> + } >> + >> /* >> If this switch has already been probed during this sweep, >> then don't bother reprobing it. >> -- >> 1.5.1.4 >> > From sashak at voltaire.com Wed Feb 18 01:52:30 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 11:52:30 +0200 Subject: [ofa-general] Re: [PATCH 8/8] [ib-diags] smpquery: add support for WinOF In-Reply-To: <8B21199DAF6B4010B109838D36505522@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> <8B21199DAF6B4010B109838D36505522@amr.corp.intel.com> Message-ID: <20090218095230.GC7189@sashak.voltaire.com> Hi Sean, On 14:37 Tue 17 Feb , Sean Hefty wrote: > Allow smpquery to build and run on both Linux and Windows. Window > build files are maintained in the WinOF respository. These changes > allow dropping the infiniband-diags into the WinOF build environment. > > Signed-off-by: Sean Hefty > --- > > infiniband-diags/src/smpquery.c | 8 ++++---- > 1 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c > index 44280e1..2d3d91b 100644 > --- a/infiniband-diags/src/smpquery.c > +++ b/infiniband-diags/src/smpquery.c > @@ -47,7 +47,7 @@ > > #include > #include > -#include > +#include Is it needed? The rest of the tools use a similar path with a leading 'infiniband'. > > #include "ibdiag_common.h" > > @@ -191,7 +191,7 @@ pkey_table(ib_portid_t *dest, char **argv, int argc) > } else > mad_decode_field(data, IB_NODE_PARTITION_CAP_F, &n); > > - for (i = 0; i < (n + 31) / 32; i++) { > + for (i = 0; i < (uint32_t) ((n + 31) / 32); i++) { Wouldn't it be better to declare i, j, k as int? Width 32 doesn't make any sense here.
> mod = i | (portnum << 16); > if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, mod, 0)) > return "pkey table query failed"; > @@ -353,7 +353,7 @@ guid_info(ib_portid_t *dest, char **argv, int argc) > return "port info failed"; > mad_decode_field(data, IB_PORT_GUID_CAP_F, &n); > > - for (i = 0; i < (n + 7) / 8; i++) { > + for (i = 0; i < (uint32_t) ((n + 7) / 8); i++) { Ditto. Sasha > mod = i; > if (!smp_query(data, dest, IB_ATTR_GUID_INFO, mod, 0)) > return "guid info query failed"; > @@ -412,7 +412,7 @@ int main(int argc, char **argv) > const struct ibdiag_opt opts[] = { > { "combined", 'c', 0, NULL, "use Combined route address argument"}, > { "node-name-map", 1, 1, "", "node name map file"}, > - {} > + { 0 } > }; > const char *usage_examples[] = { > "portinfo 3 1\t\t\t\t# portinfo by lid, with port modifier", > > > From sashak at voltaire.com Wed Feb 18 01:54:15 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 11:54:15 +0200 Subject: [ofa-general] Re: [PATCH 7/8] [ib-diags] smpdump: add support for WinOF In-Reply-To: References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: <20090218095415.GD7189@sashak.voltaire.com> On 14:36 Tue 17 Feb , Sean Hefty wrote: > Allow smpdump to build and run on both Linux and Windows. Window > build files are maintained in the WinOF respository. These changes > allow dropping the infiniband-diags into the WinOF build environment. > > Signed-off-by: Sean Hefty Applied (patches 1-7). Thanks. Sasha From sashak at voltaire.com Wed Feb 18 02:00:08 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 12:00:08 +0200 Subject: [ofa-general] Re: [PATCH 9/8] [ib-diag] ibping: add support for WinOF In-Reply-To: References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: <20090218100008.GE7189@sashak.voltaire.com> On 16:05 Tue 17 Feb , Sean Hefty wrote: > Allow ibping to build and run on both Linux and Windows. 
Window > build files are maintained in the WinOF respository. These changes > allow dropping the infiniband-diags into the WinOF build environment. > > For portability, use complib to obtain time stamps. > > Signed-off-by: Sean Hefty Applied. Thanks. > --- > Converted another diag this afternoon. I was able to build and execute this, > but apparently I don't have anything on my fabric that responds to the pings. You need to run the ibping server ('ibping -S') on one side and then run ibping . Sasha From sashak at voltaire.com Wed Feb 18 02:17:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 12:17:21 +0200 Subject: [ofa-general] [PATCH] libibmad: remove functions which use pthread In-Reply-To: References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090218003957.GY7189@sashak.voltaire.com> Message-ID: <20090218101721.GF7189@sashak.voltaire.com> On 21:18 Tue 17 Feb , Hal Rosenstock wrote: > On Tue, Feb 17, 2009 at 7:39 PM, Sasha Khapyorsky wrote: > > On 09:52 Mon 16 Feb , Hal Rosenstock wrote: > >> > >> A first step would be removing the portid as static. If so, portid > >> would need to be a supplied parameter to various mad routines and the > >> existing ones relying on madrpc_portid would be deprecated. Does this > >> make sense to do ? > > > > A first step would be converting all clients and internal usage in > > libibmad (if any) to use a newer interface. If this will go smoothly > > and things will not become overcomplicated, we could move forward - > > to deprecate old interface... etc.. Nothing new. > > Why nothing new ? I think there are higher level support functions > which need to support the newer API. I meant "nothing new" in terms of the API replace/upgrade procedure.
Sasha From sashak at voltaire.com Wed Feb 18 02:30:18 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 12:30:18 +0200 Subject: [ofa-general] Re: [PATCH 9/8] [ib-diag] ibping: add support for WinOF In-Reply-To: References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> Message-ID: <20090218103018.GG7189@sashak.voltaire.com> On 16:36 Tue 17 Feb , Sean Hefty wrote: > > signal(SIGINT, report); > > signal(SIGTERM, report); > > Btw - I worked around adding cdecl before main by disabling the warning. Since > main must be cdecl by default, the compiler fixes it, but spits out a warning. > For some reason unknown to me, the warning only occurs when building 32-bit > apps. > > However, signal() requires that the function be cdecl as well. I guess this is about the report() function. Why not make everything cdecl (by using a compiler/linker flag or some super-#pragma in config.h or so)? > The above two > calls fail to compile on 32-bit Windows platforms, so I'm still working on this. > The simple approach of changing the compiler options doesn't work as easily as > it looks like it should. The WDK build environment is 'special'. Ugh, I really fail to understand why WinOF cannot evaluate the option of using less "special" build tools for WDK-insensitive code (such as user-space programs ported from Linux); it would solve all those issues just magically. And we haven't even entered the more complicated porting areas, such as pthreads, yet... Sasha From volker.jaenisch at inqbus.de Wed Feb 18 02:47:04 2009 From: volker.jaenisch at inqbus.de (Dr. Volker Jaenisch) Date: Wed, 18 Feb 2009 11:47:04 +0100 Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on 2.6.26 under Debian Lenny Message-ID: <499BE728.8080002@inqbus.de> Hello Ofa-List! Compiling the ofa-kernel modules from OFED-1.4 on Debian Lenny Kernel 2.6.26 (on amd64) gives me the following trace: [..]
/usr/bin/make -f scripts/Makefile.build obj=/usr/src/modules/ofa-kernel-source/drivers/scsi gcc-4.1 -Wp,-MD,/usr/src/modules/ofa-kernel-source/drivers/scsi/.scsi_transport_iscsi.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.1.3/include -D__KERNEL__ \ -include include/linux/autoconf.h \ -include /usr/src/modules/ofa-kernel-source/include/linux/autoconf.h \ -I/usr/src/modules/ofa-kernel-source/kernel_addons/backport/2.6.26/include/ \ \ \ -I/usr/src/modules/ofa-kernel-source/include \ -I/usr/src/modules/ofa-kernel-source/drivers/infiniband/debug \ -I/usr/local/include/scst \ -I/usr/src/modules/ofa-kernel-source/drivers/infiniband/ulp/srpt \ -I/usr/src/modules/ofa-kernel-source/drivers/net/cxgb3 \ -Iinclude \ \ -I/usr/src/linux-headers-2.6.26-1-amd64/arch/x86_64/include \ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Os -fno-stack-protector -m64 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Iinclude/asm-x86/mach-default -fomit-frame-pointer -g -Wdeclaration-after-statement -Wno-pointer-sign -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(scsi_transport_iscsi)" -D"KBUILD_MODNAME=KBUILD_STR(scsi_transport_iscsi)" -c -o /usr/src/modules/ofa-kernel-source/drivers/scsi/.tmp_scsi_transport_iscsi.o /usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c /usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c: In function ‘iscsi_create_endpoint’: /usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c:174: warning: passing argument 3 of ‘class_find_device’ from incompatible pointer type /usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c:174: error: too many arguments to function ‘class_find_device’ 
/usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c: In function ‘iscsi_lookup_endpoint’: /usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c:226: warning: passing argument 3 of ‘class_find_device’ from incompatible pointer type /usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.c:226: error: too many arguments to function ‘class_find_device’ make[5]: *** [/usr/src/modules/ofa-kernel-source/drivers/scsi/scsi_transport_iscsi.o] Fehler 1 make[4]: *** [/usr/src/modules/ofa-kernel-source/drivers/scsi] Fehler 2 make[3]: *** [_module_/usr/src/modules/ofa-kernel-source] Fehler 2 make[3]: Leaving directory `/usr/src/linux-headers-2.6.26-1-amd64' make[2]: *** [kernel] Fehler 2 make[2]: Leaving directory `/usr/src/modules/ofa-kernel-source' make[1]: *** [binary-modules] Fehler 2 make[1]: Leaving directory `/usr/src/modules/ofa-kernel-source' make: *** [kdist_build] Fehler 2 The code is backported correctly to 2.6.26 [..] for templ in `ls debian/*.modules.in` ; do \ test -e ${templ%.modules.in}.backup || cp ${templ%.modules.in} ${templ%.modules.in}.backup 2>/dev/null || true; \ sed -e 's/##KVERS##/2.6.26-1-amd64/g ;s/#KVERS#/2.6.26-1-amd64/g ; s/_KVERS_/2.6.26-1-amd64/g ; s/##KDREV##/2.6.26-13/g ; s/#KDREV#/2.6.26-13/g ; s/_KDREV_/2.6.26-13/g ' < $templ > ${templ%.modules.in}; \ done ./ofed_scripts/ofed_patch.sh --kernel-version=2.6.26 mkdir -p /usr/src/modules/ofa-kernel-source/patches [..] On Google I found this thread http://groups.google.com/group/open-iscsi/browse_thread/thread/9bdb0cf059c1b3d3 that describes a similar problem, but in that case there are too few parameters, not too many. You can find the complete trace at http://www.inqbus-hosting.de/ofa-kernel-source.buildlog.2.6.26-1-amd64.1234948099 Any help is welcome. Volker Jaenisch -- ==================================================== inqbus it-consulting +49 ( 341 ) 5643800 Dr. Volker Jaenisch http://www.inqbus.de Herloßsohnstr.
12 0 4 1 5 5 Leipzig N O T - F Ä L L E +49 ( 170 ) 3113748 ==================================================== From vlad at lists.openfabrics.org Wed Feb 18 03:17:29 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 18 Feb 2009 03:17:29 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090218-0200 daily build status Message-ID: <20090218111729.A7559E60E43@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed 
on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From ogerlitz at voltaire.com Wed Feb 18 05:35:41 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 18 Feb 2009 15:35:41 +0200 Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on 2.6.26 under Debian Lenny In-Reply-To: <499BE728.8080002@inqbus.de> References: <499BE728.8080002@inqbus.de> Message-ID: <499C0EAD.7040604@voltaire.com> Dr. Volker Jaenisch wrote: > Hello Ofa-List! Compiling the ofa-kernel modules from OFED-1.4 on > Debian Lenny Kernel 2.6.26 (on amd64) gives me the following trace: First, this list is for the development of the Linux RDMA stack, not OFED; please direct OFED issues to ewg at lists.openfabrics.org. Second, what makes you want to replace the IB stack that comes with Debian rather than update the distro? Or. From kovlensky at interia.pl Wed Feb 18 06:22:03 2009 From: kovlensky at interia.pl (kovlensky at interia.pl) Date: 18 Feb 2009 15:22:03 +0100 Subject: [ofa-general] ***SPAM*** ofed 1.2.5.5 for SLES10 SP2? Message-ID: <20090218142203.EB64D1A3E02@f05.poczta.interia.pl> Hi all, Are there any plans for making ofed 1.2.5.5 compile on SLES10 SP2? In the backport directory I can see only 2.6.16_sles10 and 2.6.16_sles10_sp1. Compiling the IB kernel modules from ofed 1.2.5.5 on SP2 makes the compilation process use the 2.6.16_sles10_sp1 directory, which disagrees on a few typedefs, so the compilation fails. The problem lies in the kernel version change: 2.6.16.46-0.12-smp from SP1 was changed to 2.6.16.60-0.21-smp in SP2, and the latter has a few typedefs changed.
Regards, Kovlensky Vladimir From tziporet at dev.mellanox.co.il Wed Feb 18 06:34:08 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 18 Feb 2009 16:34:08 +0200 Subject: [ofa-general] ***SPAM*** ofed 1.2.5.5 for SLES10 SP2? In-Reply-To: <20090218142203.EB64D1A3E02@f05.poczta.interia.pl> References: <20090218142203.EB64D1A3E02@f05.poczta.interia.pl> Message-ID: <499C1C60.3090501@mellanox.co.il> kovlensky at interia.pl wrote: > Hi all, > > Are there any plans for making ofed 1.2.5.5 compile on SLES10 SP2? In the backport directory I can see only 2.6.16_sles10 and 2.6.16_sles10_sp1. Compiling the IB kernel modules from ofed 1.2.5.5 on SP2 makes the compilation process use the 2.6.16_sles10_sp1 directory, which disagrees on a few typedefs, so the compilation fails. The problem lies in the kernel version change: 2.6.16.46-0.12-smp from SP1 was changed to 2.6.16.60-0.21-smp in SP2, and the latter has a few typedefs changed. > > No plan like this. Please use 1.3.1 or 1.4 for SLES 10 SP2. Of course, you can add the backports yourself. Tziporet From tziporet at dev.mellanox.co.il Wed Feb 18 06:40:26 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 18 Feb 2009 16:40:26 +0200 Subject: [ofa-general] IB function calls in kernel module fail In-Reply-To: <1234893143.21802.96.camel@pc.interlinx.bc.ca> References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com> <49994BB2.3010206@mellanox.co.il> <7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com> <499A8A20.1090507@mellanox.co.il> <1234893143.21802.96.camel@pc.interlinx.bc.ca> Message-ID: <499C1DDA.3060601@mellanox.co.il> Brian J. Murrell wrote: > Ahhh.
But should he just include /src/openib/include/ or > also > /src/openib/kernel_addons/backport//include/ > (as described in /src/openib/ofed_patch.mk as well? > > And in what order should these be specified in? > > You need both Order not important Tziporet From hnrose at comcast.net Wed Feb 18 07:10:15 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 18 Feb 2009 10:10:15 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_inform.c: Fix sense of zero GID compare in __match_inf_rec Message-ID: <20090218151015.GA6482@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_inform.c b/opensm/opensm/osm_inform.c index 4c773f6..6763a2a 100644 --- a/opensm/opensm/osm_inform.c +++ b/opensm/opensm/osm_inform.c @@ -121,7 +121,7 @@ __match_inf_rec(IN const cl_list_item_t * const p_list_item, IN void *context) memset(&all_zero_gid, 0, sizeof(ib_gid_t)); /* if inform_info.gid is not zero, ignore lid range */ - if (!memcmp(&p_infr_rec->inform_record.inform_info.gid, &all_zero_gid, + if (memcmp(&p_infr_rec->inform_record.inform_info.gid, &all_zero_gid, sizeof(p_infr_rec->inform_record.inform_info.gid))) { if (memcmp(&p_infr->inform_record.inform_info.gid, &p_infr_rec->inform_record.inform_info.gid, From swise at opengridcomputing.com Wed Feb 18 07:12:46 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 18 Feb 2009 09:12:46 -0600 Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: logical-/bit-or confusion? In-Reply-To: <499BD470.4080705@gmail.com> References: <499BD470.4080705@gmail.com> Message-ID: <499C256E.7050004@opengridcomputing.com> Roel Kluin wrote: > Please review. 
> --------------------------->8-------------8<------------------------------ > Logical-/bit-or typo > > Signed-off-by: Roel Kluin > --- > diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c > index 44e936e..61889e6 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c > +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c > @@ -890,7 +890,7 @@ static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb) > */ > state_set(&ep->com, FPDU_MODE); > ep->mpa_attr.initiator = 1; > - ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; > + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) || crc_enabled ? 1 : 0; > ep->mpa_attr.recv_marker_enabled = markers_enabled; > ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; > ep->mpa_attr.version = mpa_rev; > This is a typo, but the logic behaves the same either way, which is why it wasn't detected I guess. But it should really be ||. Reviewed-by: Steve Wise From hal.rosenstock at gmail.com Wed Feb 18 07:20:13 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 18 Feb 2009 10:20:13 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread In-Reply-To: <20090218003355.GX7189@sashak.voltaire.com> References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> <20090217142859.9e7a7e22.weiny2@llnl.gov> <20090218003355.GX7189@sashak.voltaire.com> Message-ID: On Tue, Feb 17, 2009 at 7:33 PM, Sasha Khapyorsky wrote: > On 18:21 Tue 17 Feb , Hal Rosenstock wrote: >> > >> > For utilities which run once through I think the old functions work just >> > fine. >> >> Well, sort of... Aren't mad_portid "collisions" possible when multiple >> programs are run concurrently ? > > No. With the old API, mad_portid can be overwritten by another process or thread. 
Another thread is not an expected use case but it is possible. >> > However, it is pretty confusing which interface to use... [or even that >> > there >> > are 2 interfaces, but I digress] (see below) >> >> I don't think the newer improved interfaces were ever documented. > > The old interfaces were not documented too. So it is at least consistent > :). There are no man pages but there is a doc (libibmad.txt) which is somewhat out of date as it was never updated for the new interfaces. -- Hal > Sasha From hnrose at comcast.net Wed Feb 18 07:29:13 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 18 Feb 2009 10:29:13 -0500 Subject: [ofa-general] [PATCH] opensm/man/opensm.8.in: Indicate ROUTER_EXP deprecated Message-ID: <20090218152913.GC8489@comcast.net> Signed-off-by: Hal Rosenstock --- opensm/man/opensm.8.in | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in index 7690980..6a5d833 100644 --- a/opensm/man/opensm.8.in +++ b/opensm/man/opensm.8.in @@ -569,8 +569,8 @@ opensm will return the path to the first available matching router. A configuration file with a single line where both prefix and GUID are wild-carded means that a path record query specifying any off-subnet DGID should return a path to the first available router. -This configuration yields the same behaviour formerly achieved by -compiling opensm with -DROUTER_EXP. +This configuration yields the same behavior formerly achieved by +compiling opensm with -DROUTER_EXP which has been deprecated. 
.SH ROUTING .PP -- 1.5.6.4 From hnrose at comcast.net Wed Feb 18 07:32:27 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 18 Feb 2009 10:32:27 -0500 Subject: [ofa-general] ***SPAM*** opensm/osm_console.c: Improve perfmgr print_counters error message Message-ID: <20090218153227.GF8489@comcast.net> Signed-off-by: Hal Rosenstock --- opensm/opensm/osm_console.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index 00e2a94..da66ee5 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -1158,7 +1158,7 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) p_cmd, out); } else { fprintf(out, - "print_counters requires a node name to be specified\n"); + "print_counters requires a node name or node GUID to be specified\n"); } } else if (strcmp(p_cmd, "sweep_time") == 0) { p_cmd = next_token(p_last); -- 1.5.6.4 From hnrose at comcast.net Wed Feb 18 07:30:16 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 18 Feb 2009 10:30:16 -0500 Subject: [ofa-general] [PATCH] Add pkey table support to osm_get_all_port_attrs Message-ID: <20090218153016.GD8489@comcast.net> Only supported in osm_vendor_ibumad.c (separate patch for other vendor layers) Also, update applications using this (osmtest, opensm) Signed-off-by: Hal Rosenstock --- opensm/libvendor/osm_vendor_ibumad.c | 24 +++++++++++++++++++----- opensm/opensm/main.c | 6 ++++++ opensm/osmtest/main.c | 11 +++++++++++ opensm/osmtest/osmtest.c | 7 +++++++ 4 files changed, 43 insertions(+), 5 deletions(-) diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c index 734a860..861bfbe 100644 --- a/opensm/libvendor/osm_vendor_ibumad.c +++ b/opensm/libvendor/osm_vendor_ibumad.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. 
* Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, umad_ca_t ca; ib_port_attr_t *attr = p_attr_array; unsigned done = 0; - int r, i, j; + int r, i, j, k; OSM_LOG_ENTER(p_vend->p_log); CL_ASSERT(p_vend && p_num_ports); + r = 0; if (!*p_num_ports) { r = IB_INVALID_PARAMETER; OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: " @@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, } for (i = 0; i < p_vend->ca_count && !done; i++) { - /* - * For each CA, retrieve the port guids - */ + /* For each CA, retrieve the port attributes */ if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) { if (ca.node_type < 1 || ca.node_type > 3) continue; @@ -590,6 +590,21 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, attr->port_num = ca.ports[j]->portnum; attr->sm_lid = ca.ports[j]->sm_lid; attr->link_state = ca.ports[j]->state; + attr->num_pkeys = ca.ports[j]->pkeys_size; + if (attr->num_pkeys && attr->p_pkey_table) { + if (attr->num_pkeys < ca.ports[j]->pkeys_size) { + r = IB_INSUFFICIENT_MEMORY; + OSM_LOG(p_vend->p_log, + OSM_LOG_ERROR, + "ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n", + j, + ca.ports[j]->pkeys_size); + goto Exit; + } + for (k = 0; k < attr->num_pkeys; k++) + attr->p_pkey_table[k] = + cl_hton16(ca.ports[j]->pkeys[k]); + } attr++; if (attr - p_attr_array > *p_num_ports) { done = 1; @@ -601,7 +616,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, } *p_num_ports = attr - p_attr_array; - r = 0; Exit: OSM_LOG_EXIT(p_vend->p_log); diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 73a6274..503d7fa 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -2,6 +2,7 @@ * Copyright 
(c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid) uint32_t i, choice = 0; ib_api_status_t status; + for (i = 0; i < num_ports; i++) { + attr_array[i].num_pkeys = 0; + attr_array[i].p_pkey_table = NULL; + } + /* Call the transport layer for a list of local port GUID values */ status = osm_vendor_get_all_port_attr(p_osm->p_vendor, attr_array, &num_ports); diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c index b360af6..83c1e13 100644 --- a/opensm/osmtest/main.c +++ b/opensm/osmtest/main.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt) ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; int i; + for (i = 0; i < num_ports; i++) { + attr_array[i].num_pkeys = 0; + attr_array[i].p_pkey_table = NULL; + } + /* Call the transport layer for a list of local port GUID values. @@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid) ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; int i; + for (i = 0; i < num_ports; i++) { + attr_array[i].num_pkeys = 0; + attr_array[i].p_pkey_table = NULL; + } + /* Call the transport layer for a list of local port GUID values. 
diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c index a7b343f..986a8d2 100644 --- a/opensm/osmtest/osmtest.c +++ b/opensm/osmtest/osmtest.c @@ -2,6 +2,7 @@ * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt, ib_api_status_t status; uint32_t num_ports = MAX_LOCAL_IBPORTS; ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; + int i; OSM_LOG_ENTER(&p_osmt->log); + for (i = 0; i < num_ports; i++) { + attr_array[i].num_pkeys = 0; + attr_array[i].p_pkey_table = NULL; + } + /* * Call the transport layer for a list of local port * GUID values. -- 1.5.6.4 From hnrose at comcast.net Wed Feb 18 07:31:32 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 18 Feb 2009 10:31:32 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/libvendor: Add pkey table request handling in osm_get_all_port_attrs Message-ID: <20090218153132.GE8489@comcast.net> in all other (than osm_vendor_ibumad) OpenSM vendor layers Done by code inspection; not even compile tested Signed-off-by: Hal Rosenstock --- opensm/libvendor/osm_vendor_al.c | 4 ++++ opensm/libvendor/osm_vendor_mlx_hca.c | 4 ++++ opensm/libvendor/osm_vendor_mlx_hca_anafa.c | 5 ++++- opensm/libvendor/osm_vendor_mlx_hca_pfs.c | 4 ++++ opensm/libvendor/osm_vendor_mlx_hca_sim.c | 4 ++++ opensm/libvendor/osm_vendor_mlx_sa.c | 7 +++++++ opensm/libvendor/osm_vendor_mtl_hca_guid.c | 9 +++++++++ 7 files changed, 36 insertions(+), 1 deletions(-) diff --git a/opensm/libvendor/osm_vendor_al.c b/opensm/libvendor/osm_vendor_al.c index d5d78c9..2bcbf9f 100644 --- a/opensm/libvendor/osm_vendor_al.c +++ 
b/opensm/libvendor/osm_vendor_al.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -670,6 +671,9 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, num_ports = osm_ca_info_get_num_ports(p_ca_info); for (port_num = 0; port_num < num_ports; port_num++) { + if (p_attr_array[port_count].num_pkeys && + p_attr_array[port_count].p_pkey_table) + status = IB_UNSUPPORTED; p_attr_array[port_count] = *__osm_ca_info_get_port_attr_ptr(p_ca_info, port_num); diff --git a/opensm/libvendor/osm_vendor_mlx_hca.c b/opensm/libvendor/osm_vendor_mlx_hca.c index e98e272..554fd87 100644 --- a/opensm/libvendor/osm_vendor_mlx_hca.c +++ b/opensm/libvendor/osm_vendor_mlx_hca.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -367,6 +368,9 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, num_ports = p_ca_infos[ca].p_attr->num_ports; for (port_num = 0; port_num < num_ports; port_num++) { + if (p_attr_array[port_count].num_pkeys && + p_attr_array[port_count].p_pkey_table) + status = IB_UNSUPPORTED; p_attr_array[port_count] = *__osm_ca_info_get_port_attr_ptr(&p_ca_infos [ca], diff --git a/opensm/libvendor/osm_vendor_mlx_hca_anafa.c b/opensm/libvendor/osm_vendor_mlx_hca_anafa.c index 81506e4..d1b11e5 100644 --- a/opensm/libvendor/osm_vendor_mlx_hca_anafa.c +++ b/opensm/libvendor/osm_vendor_mlx_hca_anafa.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -182,8 +183,10 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, *p_num_ports = 1; - p_attr_array[0] = ca_info.attr.p_port_attr[0]; /* anafa has only one port */ status = IB_SUCCESS; + if (p_attr_array[0].num_pkeys && p_attr_array[0].p_pkey_table) + status = IB_UNSUPPORTED; + p_attr_array[0] = ca_info.attr.p_port_attr[0]; /* anafa has only one port */ Exit: diff --git a/opensm/libvendor/osm_vendor_mlx_hca_pfs.c b/opensm/libvendor/osm_vendor_mlx_hca_pfs.c index 512b7bf..8c879a9 100644 --- a/opensm/libvendor/osm_vendor_mlx_hca_pfs.c +++ b/opensm/libvendor/osm_vendor_mlx_hca_pfs.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. 
* * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -649,6 +650,9 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, num_ports = p_ca_infos[caIdx - 1].p_attr->num_ports; for (port_num = 0; port_num < num_ports; port_num++) { + if (p_attr_array[port_count].num_pkeys && + p_attr_array[port_count].p_pkey_table) + status = IB_UNSUPPORTED; p_attr_array[port_count] = *__osm_ca_info_get_port_attr_ptr(&p_ca_infos [caIdx - diff --git a/opensm/libvendor/osm_vendor_mlx_hca_sim.c b/opensm/libvendor/osm_vendor_mlx_hca_sim.c index b6c0193..d46b869 100644 --- a/opensm/libvendor/osm_vendor_mlx_hca_sim.c +++ b/opensm/libvendor/osm_vendor_mlx_hca_sim.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -695,6 +696,9 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, num_ports = p_ca_infos[caIdx - 1].p_attr->num_ports; for (port_num = 0; port_num < num_ports; port_num++) { + if (p_attr_array[port_count].num_pkeys && + p_attr_array[port_count].p_pkey_table) + status = IB_UNSUPPORTED; p_attr_array[port_count] = *__osm_ca_info_get_port_attr_ptr(&p_ca_infos [caIdx - diff --git a/opensm/libvendor/osm_vendor_mlx_sa.c b/opensm/libvendor/osm_vendor_mlx_sa.c index 7bd5aea..a76c330 100644 --- a/opensm/libvendor/osm_vendor_mlx_sa.c +++ b/opensm/libvendor/osm_vendor_mlx_sa.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
+ * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -242,6 +243,7 @@ __osmv_get_lid_and_sm_lid_by_port_guid(IN osm_vendor_t * const p_vend, ib_port_attr_t *p_attr_array; uint32_t num_ports; uint32_t port_num; + int i; OSM_LOG_ENTER(p_vend->p_log); @@ -278,6 +280,11 @@ __osmv_get_lid_and_sm_lid_by_port_guid(IN osm_vendor_t * const p_vend, p_attr_array = (ib_port_attr_t *) malloc(sizeof(ib_port_attr_t) * num_ports); + for (i = 0; i < num_ports; i++) { + p_attr_array[i].num_pkeys = 0; + p_attr_array[i].p_pkey_table = NULL; + } + /* obtain the attributes */ status = osm_vendor_get_all_port_attr(p_vend, p_attr_array, &num_ports); if (status != IB_SUCCESS) { diff --git a/opensm/libvendor/osm_vendor_mtl_hca_guid.c b/opensm/libvendor/osm_vendor_mtl_hca_guid.c index 58d961a..c48d9db 100644 --- a/opensm/libvendor/osm_vendor_mtl_hca_guid.c +++ b/opensm/libvendor/osm_vendor_mtl_hca_guid.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -389,6 +390,9 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, num_ports = osm_ca_info_get_num_ports(p_ca_info); for (port_num = 0; port_num < num_ports; port_num++) { + if (p_attr_array[port_count].num_pkeys && + p_attr_array[port_count].p_pkey_table) + status = IB_UNSUPPORTED; p_attr_array[port_count] = *__osm_ca_info_get_port_attr_ptr(p_ca_info, port_num); @@ -571,6 +575,11 @@ ib_net64_t get_port_guid() p_vend = &vend; p_vend->p_log = p_osm_log; + for (i = 0; i < num_ports; i++) { + attr_array[i].num_pkeys = 0; + attr_array[i].p_pkey_table = NULL; + } + /* * Call the transport layer for a list of local port * GUID values. -- 1.5.6.4 From hnrose at comcast.net Wed Feb 18 07:28:16 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 18 Feb 2009 10:28:16 -0500 Subject: [ofa-general] [PATCH] infiniband-diags/smpdump.c: Free allocated umad prior to exit Message-ID: <20090218152816.GB8489@comcast.net> Signed-off-by: Hal Rosenstock --- infiniband-diags/src/smpdump.c | 5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c index 35fcb81..6731546 100644 --- a/infiniband-diags/src/smpdump.c +++ b/infiniband-diags/src/smpdump.c @@ -289,7 +289,7 @@ int main(int argc, char *argv[]) xdump(stdout, 0, smp->data, 64); if (smp->status) fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status)); - return 0; + goto Exit; } desc = smp->data; @@ -301,5 +301,8 @@ int main(int argc, char *argv[]) putchar('\n'); if (smp->status) fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status)); + +Exit: + umad_free(umad); return 0; } -- 1.5.6.4 From roel.kluin at gmail.com Wed Feb 18 01:27:12 2009 From: roel.kluin at gmail.com (Roel Kluin) Date: Wed, 18 Feb 2009 10:27:12 +0100 Subject: [ofa-general] ***SPAM*** [PATCH] RDMA/cxgb3: logical-/bit-or confusion? Message-ID: <499BD470.4080705@gmail.com> Please review. 
--------------------------->8-------------8<------------------------------ Logical-/bit-or typo Signed-off-by: Roel Kluin --- diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 44e936e..61889e6 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -890,7 +890,7 @@ static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb) */ state_set(&ep->com, FPDU_MODE); ep->mpa_attr.initiator = 1; - ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) || crc_enabled ? 1 : 0; ep->mpa_attr.recv_marker_enabled = markers_enabled; ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; ep->mpa_attr.version = mpa_rev; From leonid at mellanox.co.il Wed Feb 18 03:24:22 2009 From: leonid at mellanox.co.il (Leonid Keller) Date: Wed, 18 Feb 2009 13:24:22 +0200 Subject: [ofa-general] [ofw][patch][WinVerbs tests] fix IPv6 related connection problem Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01CDE206@mtlexch01.mtl.com> All WinVerbs tests (client part) require the host IP address as a parameter. We have been using IPv4 addresses, as they are more convenient. But if the IPv6 protocol is installed, which is the default on Win2008, the connection code in the tests does not work correctly. This patch suggests a fix that limits host addresses to IPv4 only. The same limitation already exists in tools\perftests.
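The filter this patch adds to each client and server loop can be sketched in portable POSIX C as below. This is an illustrative sketch, not code from the patch: the WinVerbs tests use WinSock, where the headers differ, and the helper names `count_ipv4_candidates` and `resolve_ipv4` are hypothetical. The first helper mirrors the patch's check (skip any `addrinfo` entry whose `ai_family` is not `AF_INET`); the second shows an alternative, setting `hints.ai_family = AF_INET` so that `getaddrinfo()` returns only IPv4 entries in the first place.

```c
#define _POSIX_C_SOURCE 200112L	/* expose getaddrinfo() under strict C modes */

#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Count the addrinfo entries a client loop would still try once it skips
 * everything that is not IPv4 (the same test the patch adds to each loop). */
static int count_ipv4_candidates(const struct addrinfo *res)
{
	const struct addrinfo *t;
	int n = 0;

	for (t = res; t; t = t->ai_next) {
		if (t->ai_family != AF_INET)
			continue;	/* skip IPv6 (and any other family) */
		n++;		/* a real client would call socket()/connect() here */
	}
	return n;
}

/* Alternative: restrict the lookup itself to IPv4 through the hints
 * argument, so the loop never sees an AF_INET6 entry.
 * Returns getaddrinfo()'s result code (0 on success). */
static int resolve_ipv4(const char *host, const char *port,
			struct addrinfo **res)
{
	struct addrinfo hints;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET;
	hints.ai_socktype = SOCK_STREAM;
	return getaddrinfo(host, port, &hints, res);
}
```

Doing the family check before `socket()` means no descriptor is opened for a skipped entry. In the send_bw.c hunk of this patch the check comes after the `socket()` call, so a socket created for a skipped entry appears never to be closed.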
Index: tests/perftest/rdma_bw/rdma_bw.c =================================================================== --- tests/perftest/rdma_bw/rdma_bw.c (revision 1976) +++ tests/perftest/rdma_bw/rdma_bw.c (working copy) @@ -215,6 +215,8 @@ rdma_ack_cm_event(event); } else { for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { @@ -382,6 +384,8 @@ rdma_ack_cm_event(event); } else { for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { n = 1; Index: tests/perftest/rdma_lat/rdma_lat.c =================================================================== --- tests/perftest/rdma_lat/rdma_lat.c (revision 1976) +++ tests/perftest/rdma_lat/rdma_lat.c (working copy) @@ -294,6 +294,8 @@ rdma_ack_cm_event(event); } else { for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { @@ -437,6 +439,8 @@ rdma_ack_cm_event(event); } else { for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { n = 1; Index: tests/perftest/read_bw/read_bw.c =================================================================== --- tests/perftest/read_bw/read_bw.c (revision 1976) +++ tests/perftest/read_bw/read_bw.c (working copy) @@ -126,6 +126,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) @@ -206,6 +208,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != 
INVALID_SOCKET) { n = 1; Index: tests/perftest/read_lat/read_lat.c =================================================================== --- tests/perftest/read_lat/read_lat.c (revision 1976) +++ tests/perftest/read_lat/read_lat.c (working copy) @@ -201,6 +201,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) @@ -250,6 +252,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { n = 1; Index: tests/perftest/send_bw/send_bw.c =================================================================== --- tests/perftest/send_bw/send_bw.c (revision 1976) +++ tests/perftest/send_bw/send_bw.c (working copy) @@ -142,6 +142,8 @@ for (t = res; t; t = t->ai_next) { sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (t->ai_family != AF_INET) + continue; if (sockfd != INVALID_SOCKET) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) break; @@ -221,6 +223,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { n = 1; Index: tests/perftest/send_lat/send_lat.c =================================================================== --- tests/perftest/send_lat/send_lat.c (revision 1976) +++ tests/perftest/send_lat/send_lat.c (working copy) @@ -212,6 +212,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) @@ -261,6 +263,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != 
INVALID_SOCKET) { n = 1; Index: tests/perftest/write_bw/write_bw.c =================================================================== --- tests/perftest/write_bw/write_bw.c (revision 1976) +++ tests/perftest/write_bw/write_bw.c (working copy) @@ -135,6 +135,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) @@ -215,6 +217,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { n = 1; Index: tests/perftest/write_bw_postlist/write_bw_postlist.c =================================================================== --- tests/perftest/write_bw_postlist/write_bw_postlist.c (revision 1976) +++ tests/perftest/write_bw_postlist/write_bw_postlist.c (working copy) @@ -138,6 +138,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd >= 0) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) @@ -218,6 +220,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd >= 0) { n = 1; Index: tests/perftest/write_lat/write_lat.c =================================================================== --- tests/perftest/write_lat/write_lat.c (revision 1976) +++ tests/perftest/write_lat/write_lat.c (working copy) @@ -198,6 +198,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) @@ -247,6 +249,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, 
t->ai_protocol); if (sockfd != INVALID_SOCKET) { n = 1; Index: ulp/libibverbs/examples/rc_pingpong/rc_pingpong.c =================================================================== --- ulp/libibverbs/examples/rc_pingpong/rc_pingpong.c (revision 1976) +++ ulp/libibverbs/examples/rc_pingpong/rc_pingpong.c (working copy) @@ -137,6 +137,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) @@ -205,6 +207,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { n = 1; Index: ulp/libibverbs/examples/srq_pingpong/srq_pingpong.c =================================================================== --- ulp/libibverbs/examples/srq_pingpong/srq_pingpong.c (revision 1976) +++ ulp/libibverbs/examples/srq_pingpong/srq_pingpong.c (working copy) @@ -162,6 +162,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd >= 0) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) @@ -246,6 +248,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd >= 0) { n = 1; Index: ulp/libibverbs/examples/uc_pingpong/uc_pingpong.c =================================================================== --- ulp/libibverbs/examples/uc_pingpong/uc_pingpong.c (revision 1976) +++ ulp/libibverbs/examples/uc_pingpong/uc_pingpong.c (working copy) @@ -124,6 +124,8 @@ } for (t = res; t; t = t->ai_next) { + if (t->ai_family != AF_INET) + continue; sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd != INVALID_SOCKET) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) @@ -192,6 +194,8 @@ } for (t = res; 
t; t = t->ai_next) {
+		if (t->ai_family != AF_INET)
+			continue;
 		sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
 		if (sockfd != INVALID_SOCKET) {
 			n = 1;
Index: ulp/libibverbs/examples/ud_pingpong/ud_pingpong.c
===================================================================
--- ulp/libibverbs/examples/ud_pingpong/ud_pingpong.c	(revision 1976)
+++ ulp/libibverbs/examples/ud_pingpong/ud_pingpong.c	(working copy)
@@ -126,6 +126,8 @@
 	}

 	for (t = res; t; t = t->ai_next) {
+		if (t->ai_family != AF_INET)
+			continue;
 		sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
 		if (sockfd != INVALID_SOCKET) {
 			if (!connect(sockfd, t->ai_addr, t->ai_addrlen))
@@ -193,6 +195,8 @@
 	}

 	for (t = res; t; t = t->ai_next) {
+		if (t->ai_family != AF_INET)
+			continue;
 		sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol);
 		if (sockfd != INVALID_SOCKET) {
 			n = 1;
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wv_tests.patch
Type: application/octet-stream
Size: 9663 bytes
Desc: wv_tests.patch

From hnrose at comcast.net  Wed Feb 18 07:27:28 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:27:28 -0500
Subject: [ofa-general] [PATCH] management/libibmad.txt: Remove madrpc_lock/unlock
Message-ID: <20090218152728.GA8489@comcast.net>

Signed-off-by: Hal Rosenstock
---
 doc/libibmad.txt |   20 --------------------
 1 files changed, 0 insertions(+), 20 deletions(-)

diff --git a/doc/libibmad.txt b/doc/libibmad.txt
index 42a61d4..9fb74c3 100644
--- a/doc/libibmad.txt
+++ b/doc/libibmad.txt
@@ -143,26 +143,6 @@ packets, this function has to be called repeatedly after each RPC operation.

 Bugs: Not applicable to mad_receive

-madrpc_lock:
-
-Synopsis:
-	void madrpc_lock(void);
-
-Description: Locks the mad RPC mechanism until madrpc_unlock() is called. Calls
-to this function while the RPC mechanism is already locked cause the calling
-process to be blocked until madrpc_unlock(). This function should be used
-only by multiple-threaded applications.
-
-See also:
-	madrpc_unlock
-
-madrpc_unlock:
-
-Synopsis:
-	void madrpc_unlock(void);
-
-Description: Unlock the mad RPC mechanism. See madrpc_lock() for details.
-
 madrpc_show_errors:

 Synopsis:
-- 
1.5.6.4

From hal.rosenstock at gmail.com  Wed Feb 18 07:40:34 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:40:34 -0500
Subject: Re: [ofa-general] [RFC] OpenSM vendor layer
In-Reply-To: <20090214152533.GG14416@sashak.voltaire.com>
References: <20090207123355.GP17713@sashak.voltaire.com>
	<20090212200025.GC14416@sashak.voltaire.com>
	<20090214152533.GG14416@sashak.voltaire.com>
Message-ID:

Sasha,

On 2/14/09, Sasha Khapyorsky wrote:
> Hi Hal,
>
> On 19:41 Thu 12 Feb , Hal Rosenstock wrote:
>> >
>> > It is already supplied by libibumad - by umad_get_ca()
>> > (ca.ports[i]->pkeys). I think you just need to copy this to
>> > ib_port_attr_t structure.
>>
>> Yes but rather than using supplied pointers (as inputs for the per
>> port pkey/guid tables), the other vendor layers require a large enough
>> buffer for these tables and set the port pointers appropriately (on
>> output) rather than supplying these pointers as input parameters. So
>> if we use these as input, then we definitely break the other vendor
>> layers.
>
> Ok, if you already have an usage example, this is even simpler - just
> alloc mem and copy pkey table.

I ended up going with the original approach.
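Sasha's suggestion (allocate memory and copy the P_Key table reported by umad_get_ca() into the vendor-layer port attributes) can be sketched as below. The two structures are hypothetical stand-ins, not the real libibumad umad_port_t or OpenSM ib_port_attr_t layouts; only the "copy into a buffer the attribute owns" pattern is being illustrated.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-port data libibumad reports
 * (the real umad_port_t carries many more fields). */
struct src_port {
	uint16_t *pkeys;     /* P_Key table owned by the source */
	unsigned pkeys_size; /* number of entries */
};

/* Hypothetical stand-in for a vendor-layer port attribute struct. */
struct port_attr {
	uint16_t *p_pkey_table; /* private copy owned by this struct */
	uint16_t num_pkeys;
};

/* Allocate a private buffer and copy the P_Key table into it, so the
 * attribute no longer depends on the lifetime of the source data.
 * Returns 0 on success, -1 if the allocation fails. */
static int copy_pkey_table(struct port_attr *attr, const struct src_port *src)
{
	uint16_t *tbl;

	attr->p_pkey_table = NULL;
	attr->num_pkeys = 0;
	if (src->pkeys_size == 0)
		return 0;

	tbl = malloc(src->pkeys_size * sizeof(*tbl));
	if (!tbl)
		return -1;
	memcpy(tbl, src->pkeys, src->pkeys_size * sizeof(*tbl));

	attr->p_pkey_table = tbl;
	attr->num_pkeys = (uint16_t)src->pkeys_size;
	return 0;
}

/* Release the private copy made by copy_pkey_table(). */
static void free_pkey_table(struct port_attr *attr)
{
	free(attr->p_pkey_table);
	attr->p_pkey_table = NULL;
	attr->num_pkeys = 0;
}
```

The trade-off Hal describes is visible here: the copy decouples the attribute from the source's lifetime, but it changes who owns the table, which is exactly what the other vendor layers (which fill caller-supplied buffers) do not expect.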
-- Hal

> Sasha
>

From hnrose at comcast.net  Wed Feb 18 07:55:37 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 10:55:37 -0500
Subject: [ofa-general] [PATCH] infiniband-diags/smpdump.c: Fix usage examples
Message-ID: <20090218155537.GA8762@comcast.net>

Signed-off-by: Hal Rosenstock
---
diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c
index e224909..7a8119f 100644
--- a/infiniband-diags/src/smpdump.c
+++ b/infiniband-diags/src/smpdump.c
@@ -226,11 +226,11 @@ int main(int argc, char *argv[])
 	char usage_args[] = " [mod]";
 	const char *usage_examples[] = {
 		" -- DR routed examples:",
-		"%s -D 0,1,2,3,5 16 # NODE DESC",
-		"%s -D 0,1,2 0x15 2 # PORT INFO, port 2",
+		"-D 0,1,2,3,5 16 # NODE DESC",
+		"-D 0,1,2 0x15 2 # PORT INFO, port 2",
 		" -- LID routed examples:",
-		"%s 3 0x15 2 # PORT INFO, lid 3 port 2",
-		"%s 0xa0 0x11 # NODE INFO, lid 0xa0",
+		"3 0x15 2 # PORT INFO, lid 3 port 2",
+		"0xa0 0x11 # NODE INFO, lid 0xa0",
 		NULL
 	};

From sean.hefty at intel.com  Wed Feb 18 08:50:33 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 08:50:33 -0800
Subject: [ofa-general] [PATCH 1/8] Clean up "new" interface
In-Reply-To: <20090217210642.41c64624.weiny2@llnl.gov>
References: <20090217210642.41c64624.weiny2@llnl.gov>
Message-ID: <65FCCB3936BC48DBBA5AAFAD1B4FA683@amr.corp.intel.com>

> type all "void *ibmad_port" and "void *srcport" with struct ibmad_port *
> Create new mad_rpc_portid(struct ibmad_port *srcport) function
> which mirrors madrpc_portid(void)

If you're planning on having someone use the new functions, they need to have
MAD_EXPORT added in front of them.  (Where MAD_EXPORT doesn't exist in mad.h
probably means that there isn't a user of that call, or we just haven't ported
the user that does use it to Windows yet.)

Do you have a published git tree with these patches?
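Sean's point is that a libibmad function is usable from the Windows build only if its prototype in mad.h carries MAD_EXPORT. The thread does not show how mad.h actually defines MAD_EXPORT, so the macro below (EXAMPLE_EXPORT) is a hypothetical illustration of the common pattern: mark the symbol `__declspec(dllexport)` on Windows and fall back to a plain `extern` declaration elsewhere, where shared-object symbols are visible by default.

```c
/* Hypothetical export macro in the style of MAD_EXPORT (not the real
 * mad.h definition). A consumer-side header would normally switch to
 * __declspec(dllimport); only the exporting side is shown here. */
#if defined(_WIN32)
#define EXAMPLE_EXPORT __declspec(dllexport)
#else
#define EXAMPLE_EXPORT extern
#endif

/* Header-style prototype. Without EXAMPLE_EXPORT in front, a Windows
 * DLL build would not export this symbol, so callers could not link it. */
EXAMPLE_EXPORT int example_rpc_portid(int port_id);

/* Definition, normally in the library's .c file; it inherits the
 * export marking from the prototype above. */
int example_rpc_portid(int port_id)
{
	return port_id;
}
```

Under this pattern, a new `*_via` function whose prototype lacks the macro still links fine on Linux, which is why such omissions tend to surface only when a tool is ported to Windows.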
- Sean

From sean.hefty at intel.com  Wed Feb 18 08:51:58 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 08:51:58 -0800
Subject: [ofa-general] [PATCH 3/8] Convert ibaddr to "new" ibmad interface
In-Reply-To: <20090217210646.5e74b9ed.weiny2@llnl.gov>
References: <20090217210646.5e74b9ed.weiny2@llnl.gov>
Message-ID:

>+	srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3);
>+	if (!srcport)
>+		IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
>
>+
>+	mad_rpc_close_port(srcport);
> 	exit(0);

need MAD_EXPORT

From sean.hefty at intel.com  Wed Feb 18 08:57:09 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 08:57:09 -0800
Subject: [ofa-general] [PATCH 5/8] Convert ibportstate to "new" ibmad interface
In-Reply-To: <20090217210650.3397dd72.weiny2@llnl.gov>
References: <20090217210650.3397dd72.weiny2@llnl.gov>
Message-ID: <6A0B953C20FB428691700974E2C86B0C@amr.corp.intel.com>

>-	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
>+	if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
>+				      ibd_sm_id, srcport) < 0)

needs MAD_EXPORT

>-			if (ib_resolve_self(&selfportid, &selfport, 0) < 0)
>+			if (ib_resolve_self_via(&selfportid,
>+						&selfport, 0, srcport) < 0)

ditto

From sean.hefty at intel.com  Wed Feb 18 09:06:10 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 09:06:10 -0800
Subject: [ofa-general] [PATCH] infiniband-diags/smpdump.c: Free allocated umad prior to exit
In-Reply-To: <20090218152816.GB8489@comcast.net>
References: <20090218152816.GB8489@comcast.net>
Message-ID: <0B9EDF52FC0F4125864FA7B968F9FDD3@amr.corp.intel.com>

>-		return 0;
>+		goto Exit;
> 	}
>
> 	desc = smp->data;
>@@ -301,5 +301,8 @@ int main(int argc, char *argv[])
> 	putchar('\n');
> 	if (smp->status)
> 		fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status));
>+
>+Exit:

nit: can we use all lowercase

From hal.rosenstock at gmail.com  Wed Feb 18 09:07:15 2009
From: hal.rosenstock at gmail.com
(Hal Rosenstock) Date: Wed, 18 Feb 2009 12:07:15 -0500 Subject: [ofa-general] ***SPAM*** Re: [PATCH 1/8] Clean up "new" interface In-Reply-To: <20090217210642.41c64624.weiny2@llnl.gov> References: <20090217210642.41c64624.weiny2@llnl.gov> Message-ID: On Wed, Feb 18, 2009 at 12:06 AM, Ira Weiny wrote: > > From bac9afe0da7772f97190b3ce758d3e5bfa1fcb65 Mon Sep 17 00:00:00 2001 > From: weiny2 at llnl.gov > Date: Tue, 17 Feb 2009 17:32:15 -0800 > Subject: [PATCH] Clean up "new" interface > > type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * > Create new mad_rpc_portid(struct ibmad_port *srcport) function > which mirrors madrpc_portid(void) > > Signed-off-by: weiny2 at llnl.gov > --- > libibmad/include/infiniband/mad.h | 58 ++++++++++++++++++++++-------------- > libibmad/src/gs.c | 19 ++++++------ > libibmad/src/libibmad.map | 1 + > libibmad/src/resolve.c | 10 ++++-- > libibmad/src/rpc.c | 29 +++++++++--------- > libibmad/src/sa.c | 4 +- > libibmad/src/smp.c | 4 +- > 7 files changed, 71 insertions(+), 54 deletions(-) > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > index 1aaaa1b..56b87e6 100644 > --- a/libibmad/include/infiniband/mad.h > +++ b/libibmad/include/infiniband/mad.h > @@ -724,42 +724,49 @@ static inline int mad_is_vendor_range2(int mgmt) > } > > /* rpc.c */ > +/* Depricated interface */ typo - Deprecated > MAD_EXPORT int madrpc_portid(void); > -MAD_EXPORT int madrpc_set_retries(int retries); > -MAD_EXPORT int madrpc_set_timeout(int timeout); I thought initially we weren't going to remove APIs but move over to the new ones ? A subsequent step would be to deprecate the old APIs and then eventually remove the old APIs. 
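The migration path Hal describes (keep the old calls, move users to the new per-port ones, deprecate, then remove) can be sketched as follows. In the patch, madrpc_portid(void) reports a process-wide port while the new mad_rpc_portid() takes an explicit struct ibmad_port *. The sketch assumes the old call is re-expressed as a thin wrapper over the new one and uses the GCC/Clang `deprecated` attribute for the deprecation step; every `example_*` name is hypothetical, not libibmad API.

```c
/* Hypothetical per-port handle, standing in for struct ibmad_port. */
struct example_port {
	int port_id;
};

/* New interface: each call names the port it operates on. */
int example_rpc_portid(const struct example_port *srcport)
{
	return srcport->port_id;
}

/* Process-wide default port used by the old, implicit-port interface. */
static struct example_port default_port = { -1 };

void example_rpc_init(int port_id)
{
	default_port.port_id = port_id;
}

/* Old interface, kept for existing callers. Marking it deprecated makes
 * the compiler warn at each remaining call site before it is removed. */
#if defined(__GNUC__)
__attribute__((deprecated("use example_rpc_portid()")))
#endif
int example_legacy_portid(void);

/* The old call is now a thin wrapper over the new one. */
int example_legacy_portid(void)
{
	return example_rpc_portid(&default_port);
}
```

Keeping the wrapper means old tools keep building unchanged, while the deprecation warning tells maintainers which call sites still need converting to the `_via`/per-port style.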
-- Hal > void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata); > void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, > void *data); > MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > int num_classes); > void madrpc_save_mad(void *madbuf, int len); > -MAD_EXPORT void madrpc_show_errors(int set); > > -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > +/* New interface */ > +MAD_EXPORT void madrpc_show_errors(int set); > +MAD_EXPORT int madrpc_set_retries(int retries); > +MAD_EXPORT int madrpc_set_timeout(int timeout); > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > int num_classes); > -void mad_rpc_close_port(void *ibmad_port); > -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, > +void mad_rpc_close_port(struct ibmad_port *srcport); > +void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > void *payload, void *rcvdata); > -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, > +void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > ib_rmpp_hdr_t * rmpp, void *data); > +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); > > /* smp.c */ > MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid, > unsigned mod, unsigned timeout); > MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid, > unsigned mod, unsigned timeout); > + > +/* smp.c new interface */ > MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, > - unsigned mod, unsigned timeout, const void *srcport); > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport); > uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, > - unsigned timeout, const void *srcport); > + unsigned timeout, const struct ibmad_port *srcport); > > /* sa.c */ > 
uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, > unsigned timeout); > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > - ib_sa_call_t * sa, unsigned timeout); > MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */ > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > + > +/* sa.c new interface */ > +uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid, > + ib_sa_call_t * sa, unsigned timeout); > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); > > /* resolve.c */ > @@ -771,14 +778,17 @@ MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, > ibmad_gid_t * gid); > > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); > +/* resolve.c new interface */ > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport); > int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > - ib_portid_t * sm_id, int timeout, const void *srcport); > + ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport); > int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > enum MAD_DEST dest, ib_portid_t * sm_id, > - const void *srcport); > + const struct ibmad_port *srcport); > int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > - const void *srcport); > + const struct ibmad_port *srcport); > > /* gs.c */ > MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, > @@ -798,26 +808,28 @@ MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, > MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout); > > +/* gs.c new 
interface */ > uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > + const struct ibmad_port *srcport); > uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > - unsigned timeout, const void *srcport); > + unsigned timeout, const struct ibmad_port *srcport); > uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > unsigned mask, unsigned timeout, > - const void *srcport); > + const struct ibmad_port *srcport); > uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > + const struct ibmad_port *srcport); > uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, > - unsigned timeout, const void *srcport); > + unsigned timeout, > + const struct ibmad_port *srcport); > uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > + const struct ibmad_port *srcport); > uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > + const struct ibmad_port *srcport); > /* dump.c */ > MAD_EXPORT ib_mad_dump_fn > mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, > diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c > index d2c4574..e302caf 100644 > --- a/libibmad/src/gs.c > +++ b/libibmad/src/gs.c > @@ -47,7 +47,7 @@ > > static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port, > unsigned timeout, unsigned id, > - const void *srcport) > + const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > int lid = dest->lid; > @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, > > uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const 
struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO, > srcport); > @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port, > } > > uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > - unsigned timeout, const void *srcport) > + unsigned timeout, const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_COUNTERS, srcport); > @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port, > > static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, unsigned timeout, > - unsigned id, const void *srcport) > + unsigned id, const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > int lid = dest->lid; > @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > unsigned mask, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return performance_reset_via(rcvbuf, dest, port, mask, timeout, > IB_GSI_PORT_COUNTERS, srcport); > @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_COUNTERS_EXT, srcport); > @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, > - unsigned timeout, const void *srcport) > + unsigned timeout, > + const struct ibmad_port *srcport) > { > return performance_reset_via(rcvbuf, dest, port, mask, timeout, > 
IB_GSI_PORT_COUNTERS_EXT, srcport); > @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_SAMPLES_CONTROL, srcport); > @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_SAMPLES_RESULT, srcport); > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > index f944d86..94d7762 100644 > --- a/libibmad/src/libibmad.map > +++ b/libibmad/src/libibmad.map > @@ -69,6 +69,7 @@ IBMAD_1.3 { > mad_rpc_close_port; > mad_rpc; > mad_rpc_rmpp; > + mad_rpc_portid; > madrpc; > madrpc_def_timeout; > madrpc_init; > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c > index 553949d..3291f43 100644 > --- a/libibmad/src/resolve.c > +++ b/libibmad/src/resolve.c > @@ -45,7 +45,8 @@ > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN > > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport) > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport) > { > ib_portid_t self = { 0 }; > uint8_t portinfo[64]; > @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) > } > > int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > - ib_portid_t * sm_id, int timeout, const void *srcport) > + ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport) > { > ib_portid_t sm_portid; > char buf[IB_SA_DATA_SIZE] = { 0 }; > @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > int 
ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > enum MAD_DEST dest_type, ib_portid_t * sm_id, > - const void *srcport) > + const struct ibmad_port *srcport) > { > uint64_t guid; > int lid; > @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > } > > int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > - const void *srcport) > + const struct ibmad_port *srcport) > { > ib_portid_t self = { 0 }; > uint8_t portinfo[64]; > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > index e811526..d47873b 100644 > --- a/libibmad/src/rpc.c > +++ b/libibmad/src/rpc.c > @@ -100,6 +100,11 @@ int madrpc_portid(void) > return mad_portid; > } > > +int mad_rpc_portid(struct ibmad_port *srcport) > +{ > + return (srcport->port_id); > +} > + > static int > _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > int timeout) > @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > return -1; > } > > -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, > void *payload, void *rcvdata) > { > - const struct ibmad_port *p = port_id; > int status, len; > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) > return 0; > > - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > - p->class_agents[rpc->mgtclass], > + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > + port->class_agents[rpc->mgtclass], > len, rpc->timeout)) < 0) { > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > return 0; > @@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > return rcvdata; > } > > -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > 
+void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, > ib_rmpp_hdr_t * rmpp, void *data) > { > - const struct ibmad_port *p = port_id; > int status, len; > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) > return 0; > > - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > - p->class_agents[rpc->mgtclass], > + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > + port->class_agents[rpc->mgtclass], > len, rpc->timeout)) < 0) { > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > return 0; > @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) > } > } > > -void *mad_rpc_open_port(char *dev_name, int dev_port, > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, > int *mgmt_classes, int num_classes) > { > struct ibmad_port *p; > @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port, > return p; > } > > -void mad_rpc_close_port(void *port_id) > +void mad_rpc_close_port(struct ibmad_port *port) > { > - struct ibmad_port *p = port_id; > - > - umad_close_port(p->port_id); > - free(p); > + umad_close_port(port->port_id); > + free(port); > } > > uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, > diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c > index 7403d4f..ddeb152 100644 > --- a/libibmad/src/sa.c > +++ b/libibmad/src/sa.c > @@ -44,7 +44,7 @@ > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN > > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid, > ib_sa_call_t * sa, unsigned timeout) > { > ib_rpc_t rpc = { 0 }; > @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > IB_PR_COMPMASK_SGID 
|\ > IB_PR_COMPMASK_NUMBPATH) > > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf) > { > int npath; > diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c > index fad263c..e5489b3 100644 > --- a/libibmad/src/smp.c > +++ b/libibmad/src/smp.c > @@ -45,7 +45,7 @@ > #define DEBUG if (ibdebug) IBWARN > > uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid, > - unsigned mod, unsigned timeout, const void *srcport) > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > > @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid, > } > > uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid, > - unsigned mod, unsigned timeout, const void *srcport) > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > > -- > 1.5.4.5 > > From sean.hefty at intel.com Wed Feb 18 09:17:31 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 18 Feb 2009 09:17:31 -0800 Subject: [ofa-general] ***SPAM*** Re: [PATCH 1/8] Clean up "new" interface In-Reply-To: References: <20090217210642.41c64624.weiny2@llnl.gov> Message-ID: <400686E659F44509B54DCF2CAF9732E0@amr.corp.intel.com> >> MAD_EXPORT int madrpc_portid(void); >> -MAD_EXPORT int madrpc_set_retries(int retries); >> -MAD_EXPORT int madrpc_set_timeout(int timeout); > >I thought initially we weren't going to remove APIs but move over to >the new ones ? A subsequent step would be to deprecate the old APIs >and then eventually remove the old APIs. He moved these down in the code >> +MAD_EXPORT int madrpc_set_retries(int retries); >> +MAD_EXPORT int madrpc_set_timeout(int timeout); probably so that they aren't listed under a 'deprecated' section. 
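The core of PATCH 1/8 is replacing `const void *srcport` with `const struct ibmad_port *srcport`. With `void *`, any object pointer converts silently, so passing, say, an `ib_portid_t *` where a port handle belongs compiles cleanly. With a typed pointer to an incomplete ("opaque") struct, the compiler diagnoses the mismatch while callers still cannot reach into the struct's fields. The sketch below shows that pattern with an open/close lifecycle loosely mirroring mad_rpc_open_port()/mad_rpc_close_port(); the `example_*` names and the single `port_id` field are hypothetical.

```c
#include <stdlib.h>

/* Header side: an incomplete type. Callers can hold and pass pointers
 * to it but cannot see or modify its fields. */
struct example_port;

struct example_port *example_open_port(int port_id);
void example_close_port(struct example_port *port);
int example_port_id(const struct example_port *port);

/* Library side: the full definition stays private to the implementation. */
struct example_port {
	int port_id;
};

/* Allocate and initialise a handle; returns NULL on allocation failure. */
struct example_port *example_open_port(int port_id)
{
	struct example_port *p = calloc(1, sizeof(*p));

	if (p)
		p->port_id = port_id;
	return p;
}

/* Release a handle obtained from example_open_port(). */
void example_close_port(struct example_port *port)
{
	free(port);
}

/* Typed accessor: passing any other pointer type here is diagnosed. */
int example_port_id(const struct example_port *port)
{
	return port->port_id;
}
```

Because the handle type is now spelled out in the prototypes, the conversions inside functions (the old `const struct ibmad_port *p = port_id;` lines removed by the patch) are no longer needed either.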
- Sean

From hal.rosenstock at gmail.com  Wed Feb 18 09:22:19 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Feb 2009 12:22:19 -0500
Subject: [ofa-general] [PATCH] infiniband-diags/smpdump.c: Free allocated umad prior to exit
In-Reply-To: <0B9EDF52FC0F4125864FA7B968F9FDD3@amr.corp.intel.com>
References: <20090218152816.GB8489@comcast.net>
	<0B9EDF52FC0F4125864FA7B968F9FDD3@amr.corp.intel.com>
Message-ID:

On Wed, Feb 18, 2009 at 12:06 PM, Sean Hefty wrote:
>>-		return 0;
>>+		goto Exit;
>> 	}
>>
>> 	desc = smp->data;
>>@@ -301,5 +301,8 @@ int main(int argc, char *argv[])
>> 	putchar('\n');
>> 	if (smp->status)
>> 		fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status));
>>+
>>+Exit:
>
> nit: can we use all lowercase

Sure; v2 patch shortly.

-- Hal

>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From hnrose at comcast.net  Wed Feb 18 09:19:32 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 18 Feb 2009 12:19:32 -0500
Subject: [ofa-general] [PATCHv2] infiniband-diags/smpdump.c: Release umad resources on exit
Message-ID: <20090218171932.GA15139@comcast.net>

Signed-off-by: Hal Rosenstock
---
diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c
index 6c7f84c..414975c 100644
--- a/infiniband-diags/src/smpdump.c
+++ b/infiniband-diags/src/smpdump.c
@@ -289,7 +289,7 @@ int main(int argc, char *argv[])
 		xdump(stdout, 0, smp->data, 64);
 		if (smp->status)
 			fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status));
-		return 0;
+		goto exit;
 	}

 	desc = smp->data;
@@ -301,5 +301,8 @@ int main(int argc, char *argv[])
 	putchar('\n');
 	if (smp->status)
 		fprintf(stdout, "SMP status: 0x%x\n", ntohs(smp->status));
+
+exit:
+	umad_free(umad);
 	return 0;
 }

From weiny2 at llnl.gov  Wed Feb 18 09:27:34 2009
From: weiny2 at llnl.gov
(Ira Weiny) Date: Wed, 18 Feb 2009 09:27:34 -0800 Subject: [ofa-general] Re: [PATCH 1/8] Clean up "new" interface In-Reply-To: References: <20090217210642.41c64624.weiny2@llnl.gov> Message-ID: <20090218092734.31ca1062.weiny2@llnl.gov> On Wed, 18 Feb 2009 12:07:15 -0500 Hal Rosenstock wrote: > On Wed, Feb 18, 2009 at 12:06 AM, Ira Weiny wrote: > > > > From bac9afe0da7772f97190b3ce758d3e5bfa1fcb65 Mon Sep 17 00:00:00 2001 > > From: weiny2 at llnl.gov > > Date: Tue, 17 Feb 2009 17:32:15 -0800 > > Subject: [PATCH] Clean up "new" interface > > > > type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * > > Create new mad_rpc_portid(struct ibmad_port *srcport) function > > which mirrors madrpc_portid(void) > > > > Signed-off-by: weiny2 at llnl.gov > > --- > > libibmad/include/infiniband/mad.h | 58 ++++++++++++++++++++++-------------- > > libibmad/src/gs.c | 19 ++++++------ > > libibmad/src/libibmad.map | 1 + > > libibmad/src/resolve.c | 10 ++++-- > > libibmad/src/rpc.c | 29 +++++++++--------- > > libibmad/src/sa.c | 4 +- > > libibmad/src/smp.c | 4 +- > > 7 files changed, 71 insertions(+), 54 deletions(-) > > > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > > index 1aaaa1b..56b87e6 100644 > > --- a/libibmad/include/infiniband/mad.h > > +++ b/libibmad/include/infiniband/mad.h > > @@ -724,42 +724,49 @@ static inline int mad_is_vendor_range2(int mgmt) > > } > > > > /* rpc.c */ > > +/* Depricated interface */ > > typo - Deprecated Some day I will learn to spell this... :-( > > > MAD_EXPORT int madrpc_portid(void); > > -MAD_EXPORT int madrpc_set_retries(int retries); > > -MAD_EXPORT int madrpc_set_timeout(int timeout); > > I thought initially we weren't going to remove APIs but move over to > the new ones ? A subsequent step would be to deprecate the old APIs > and then eventually remove the old APIs. They were not removed... 
[see below] > > -- Hal > > > void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata); > > void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, > > void *data); > > MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > > int num_classes); > > void madrpc_save_mad(void *madbuf, int len); > > -MAD_EXPORT void madrpc_show_errors(int set); > > > > -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > > +/* New interface */ > > +MAD_EXPORT void madrpc_show_errors(int set); > > +MAD_EXPORT int madrpc_set_retries(int retries); > > +MAD_EXPORT int madrpc_set_timeout(int timeout); ... but moved down here to indicate they were _not_ deprecated. We could deprecate them and make 'retries' and 'timeout' associated with each ibmad_port but I thought those were pretty global to the instance of the lib. Ira > > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > > int num_classes); > > -void mad_rpc_close_port(void *ibmad_port); > > -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, > > +void mad_rpc_close_port(struct ibmad_port *srcport); > > +void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > > void *payload, void *rcvdata); > > -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, > > +void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > > ib_rmpp_hdr_t * rmpp, void *data); > > +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); > > > > /* smp.c */ > > MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid, > > unsigned mod, unsigned timeout); > > MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid, > > unsigned mod, unsigned timeout); > > + > > +/* smp.c new interface */ > > MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, > > - unsigned mod, 
unsigned timeout, const void *srcport); > > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport); > > uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, > > - unsigned timeout, const void *srcport); > > + unsigned timeout, const struct ibmad_port *srcport); > > > > /* sa.c */ > > uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, > > unsigned timeout); > > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > > - ib_sa_call_t * sa, unsigned timeout); > > MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */ > > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > > + > > +/* sa.c new interface */ > > +uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid, > > + ib_sa_call_t * sa, unsigned timeout); > > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, > > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); > > > > /* resolve.c */ > > @@ -771,14 +778,17 @@ MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > > MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, > > ibmad_gid_t * gid); > > > > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); > > +/* resolve.c new interface */ > > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > > + const struct ibmad_port *srcport); > > int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > - ib_portid_t * sm_id, int timeout, const void *srcport); > > + ib_portid_t * sm_id, int timeout, > > + const struct ibmad_port *srcport); > > int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > > enum MAD_DEST dest, ib_portid_t * sm_id, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > > - const 
void *srcport); > > + const struct ibmad_port *srcport); > > > > /* gs.c */ > > MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, > > @@ -798,26 +808,28 @@ MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, > > MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout); > > > > +/* gs.c new interface */ > > uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > > - unsigned timeout, const void *srcport); > > + unsigned timeout, const struct ibmad_port *srcport); > > uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > > unsigned mask, unsigned timeout, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned mask, > > - unsigned timeout, const void *srcport); > > + unsigned timeout, > > + const struct ibmad_port *srcport); > > uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > /* dump.c */ > > MAD_EXPORT ib_mad_dump_fn > > mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, > > diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c > > index d2c4574..e302caf 100644 > > --- a/libibmad/src/gs.c > > +++ b/libibmad/src/gs.c > > @@ -47,7 +47,7 
@@ > > > > static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port, > > unsigned timeout, unsigned id, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > ib_rpc_t rpc = { 0 }; > > int lid = dest->lid; > > @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, > > > > uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO, > > srcport); > > @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port, > > } > > > > uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > > - unsigned timeout, const void *srcport) > > + unsigned timeout, const struct ibmad_port *srcport) > > { > > return pma_query_via(rcvbuf, dest, port, timeout, > > IB_GSI_PORT_COUNTERS, srcport); > > @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port, > > > > static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned mask, unsigned timeout, > > - unsigned id, const void *srcport) > > + unsigned id, const struct ibmad_port *srcport) > > { > > ib_rpc_t rpc = { 0 }; > > int lid = dest->lid; > > @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port, > > > > uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > > unsigned mask, unsigned timeout, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > return performance_reset_via(rcvbuf, dest, port, mask, timeout, > > IB_GSI_PORT_COUNTERS, srcport); > > @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port, > > > > uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned 
timeout, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > return pma_query_via(rcvbuf, dest, port, timeout, > > IB_GSI_PORT_COUNTERS_EXT, srcport); > > @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port, > > > > uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned mask, > > - unsigned timeout, const void *srcport) > > + unsigned timeout, > > + const struct ibmad_port *srcport) > > { > > return performance_reset_via(rcvbuf, dest, port, mask, timeout, > > IB_GSI_PORT_COUNTERS_EXT, srcport); > > @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port, > > > > uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > return pma_query_via(rcvbuf, dest, port, timeout, > > IB_GSI_PORT_SAMPLES_CONTROL, srcport); > > @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port, > > > > uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > return pma_query_via(rcvbuf, dest, port, timeout, > > IB_GSI_PORT_SAMPLES_RESULT, srcport); > > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > > index f944d86..94d7762 100644 > > --- a/libibmad/src/libibmad.map > > +++ b/libibmad/src/libibmad.map > > @@ -69,6 +69,7 @@ IBMAD_1.3 { > > mad_rpc_close_port; > > mad_rpc; > > mad_rpc_rmpp; > > + mad_rpc_portid; > > madrpc; > > madrpc_def_timeout; > > madrpc_init; > > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c > > index 553949d..3291f43 100644 > > --- a/libibmad/src/resolve.c > > +++ b/libibmad/src/resolve.c > > @@ -45,7 +45,8 @@ > > #undef DEBUG > > #define DEBUG if (ibdebug) IBWARN > > > > -int 
ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport) > > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > > + const struct ibmad_port *srcport) > > { > > ib_portid_t self = { 0 }; > > uint8_t portinfo[64]; > > @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) > > } > > > > int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > - ib_portid_t * sm_id, int timeout, const void *srcport) > > + ib_portid_t * sm_id, int timeout, > > + const struct ibmad_port *srcport) > > { > > ib_portid_t sm_portid; > > char buf[IB_SA_DATA_SIZE] = { 0 }; > > @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > > > int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > > enum MAD_DEST dest_type, ib_portid_t * sm_id, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > uint64_t guid; > > int lid; > > @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > > } > > > > int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > ib_portid_t self = { 0 }; > > uint8_t portinfo[64]; > > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > > index e811526..d47873b 100644 > > --- a/libibmad/src/rpc.c > > +++ b/libibmad/src/rpc.c > > @@ -100,6 +100,11 @@ int madrpc_portid(void) > > return mad_portid; > > } > > > > +int mad_rpc_portid(struct ibmad_port *srcport) > > +{ > > + return (srcport->port_id); > > +} > > + > > static int > > _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > > int timeout) > > @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > > return -1; > > } > > > > -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > > +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, > > void *payload, void 
*rcvdata) > > { > > - const struct ibmad_port *p = port_id; > > int status, len; > > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > > > @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > > if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) > > return 0; > > > > - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > > - p->class_agents[rpc->mgtclass], > > + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > > + port->class_agents[rpc->mgtclass], > > len, rpc->timeout)) < 0) { > > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > > return 0; > > @@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > > return rcvdata; > > } > > > > -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > > +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, > > ib_rmpp_hdr_t * rmpp, void *data) > > { > > - const struct ibmad_port *p = port_id; > > int status, len; > > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > > > @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > > if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) > > return 0; > > > > - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > > - p->class_agents[rpc->mgtclass], > > + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > > + port->class_agents[rpc->mgtclass], > > len, rpc->timeout)) < 0) { > > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > > return 0; > > @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) > > } > > } > > > > -void *mad_rpc_open_port(char *dev_name, int dev_port, > > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, > > int *mgmt_classes, int num_classes) > > { > > struct ibmad_port *p; > > @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port, > > return p; > > } > > > > -void 
mad_rpc_close_port(void *port_id) > > +void mad_rpc_close_port(struct ibmad_port *port) > > { > > - struct ibmad_port *p = port_id; > > - > > - umad_close_port(p->port_id); > > - free(p); > > + umad_close_port(port->port_id); > > + free(port); > > } > > > > uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, > > diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c > > index 7403d4f..ddeb152 100644 > > --- a/libibmad/src/sa.c > > +++ b/libibmad/src/sa.c > > @@ -44,7 +44,7 @@ > > #undef DEBUG > > #define DEBUG if (ibdebug) IBWARN > > > > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > > +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid, > > ib_sa_call_t * sa, unsigned timeout) > > { > > ib_rpc_t rpc = { 0 }; > > @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > > IB_PR_COMPMASK_SGID |\ > > IB_PR_COMPMASK_NUMBPATH) > > > > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, > > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf) > > { > > int npath; > > diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c > > index fad263c..e5489b3 100644 > > --- a/libibmad/src/smp.c > > +++ b/libibmad/src/smp.c > > @@ -45,7 +45,7 @@ > > #define DEBUG if (ibdebug) IBWARN > > > > uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid, > > - unsigned mod, unsigned timeout, const void *srcport) > > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) > > { > > ib_rpc_t rpc = { 0 }; > > > > @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid, > > } > > > > uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid, > > - unsigned mod, unsigned timeout, const void *srcport) > > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) > > { > > ib_rpc_t rpc = { 
0 }; > > > > -- > > 1.5.4.5 > > > > -- Ira Weiny From weiny2 at llnl.gov Wed Feb 18 09:28:18 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 18 Feb 2009 09:28:18 -0800 Subject: [ofa-general] [PATCH 1/8] Clean up "new" interface In-Reply-To: <65FCCB3936BC48DBBA5AAFAD1B4FA683@amr.corp.intel.com> References: <20090217210642.41c64624.weiny2@llnl.gov> <65FCCB3936BC48DBBA5AAFAD1B4FA683@amr.corp.intel.com> Message-ID: <20090218092818.3e931fe3.weiny2@llnl.gov> On Wed, 18 Feb 2009 08:50:33 -0800 "Sean Hefty" wrote: > > type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * > > Create new mad_rpc_portid(struct ibmad_port *srcport) function > > which mirrors madrpc_portid(void) > > If you're planning on having someone use the new functions, they need to have > MAD_EXPORT added in front of them. (Where MAD_EXPORT doesn't exist in mad.h > probably means that there isn't a user of that call, or we just haven't ported > the user that does use it to Windows yet.) > > Do you have a published git tree with these patches? Not published, no... I will clean up by adding MAD_EXPORT to all the new functions and fix the spelling errors from Hal's comment. Ira > > - Sean > -- Ira Weiny From hal.rosenstock at gmail.com Wed Feb 18 09:31:06 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 18 Feb 2009 12:31:06 -0500 Subject: [ofa-general] Re: [PATCH 1/8] Clean up "new" interface In-Reply-To: <400686E659F44509B54DCF2CAF9732E0@amr.corp.intel.com> References: <20090217210642.41c64624.weiny2@llnl.gov> <400686E659F44509B54DCF2CAF9732E0@amr.corp.intel.com> Message-ID: On Wed, Feb 18, 2009 at 12:17 PM, Sean Hefty wrote: >>> MAD_EXPORT int madrpc_portid(void); >>> -MAD_EXPORT int madrpc_set_retries(int retries); >>> -MAD_EXPORT int madrpc_set_timeout(int timeout); >> >>I thought initially we weren't going to remove APIs but move over to >>the new ones ? A subsequent step would be to deprecate the old APIs >>and then eventually remove the old APIs.
> > He moved these down in the code Missed that. It was a general comment. I think there are many (old) routines which end up in the 'to be deprecated' category. -- Hal >>> +MAD_EXPORT int madrpc_set_retries(int retries); >>> +MAD_EXPORT int madrpc_set_timeout(int timeout); > > probably so that they aren't listed under a 'deprecated' section. > > - Sean > > From sean.hefty at intel.com Wed Feb 18 09:32:30 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 18 Feb 2009 09:32:30 -0800 Subject: [ofa-general] RE: [PATCH 8/8] [ib-diags] smpquery: add support for WinOF In-Reply-To: <20090218095230.GC7189@sashak.voltaire.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> <8B21199DAF6B4010B109838D36505522@amr.corp.intel.com> <20090218095230.GC7189@sashak.voltaire.com> Message-ID: <3CFB22DFCDDD4172AC491FF23F4A3D74@amr.corp.intel.com> >> #include >> #include >> -#include >> +#include > >Is it needed? Rest tools use similar path with leading 'infiniband'. That directory path doesn't exist in Windows. I think this makes sense. Complib is a separate library, independent of infiniband. >> - for (i = 0; i < (n + 31) / 32; i++) { >> + for (i = 0; i < (uint32_t) ((n + 31) / 32); i++) { > >Wouldn't it be better to make declare i, j, k as int? Width 32 doesn't >make any sense here. 
> >> mod = i | (portnum << 16); >> if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, mod, 0)) >> return "pkey table query failed"; >> @@ -353,7 +353,7 @@ guid_info(ib_portid_t *dest, char **argv, int argc) >> return "port info failed"; >> mad_decode_field(data, IB_PORT_GUID_CAP_F, &n); >> >> - for (i = 0; i < (n + 7) / 8; i++) { >> + for (i = 0; i < (uint32_t) ((n + 7) / 8); i++) { fixed From sashak at voltaire.com Wed Feb 18 09:42:18 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 19:42:18 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node In-Reply-To: <499BD55B.3090606@dev.mellanox.co.il> References: <499AB068.2020205@dev.mellanox.co.il> <20090218010303.GZ7189@sashak.voltaire.com> <499BD55B.3090606@dev.mellanox.co.il> Message-ID: <20090218174218.GT5910@sashak.voltaire.com> Hi Yevgeny, On 11:31 Wed 18 Feb , Yevgeny Kliteynik wrote: > Hi Sasha, > > Sasha Khapyorsky wrote: >> Hi Yevgeny, >> On 14:41 Tue 17 Feb , Yevgeny Kliteynik wrote: >>> This patch fixes bugzilla issue #1515: >>> >>> Topology: >>> |---------------| >>> | SW2 | >>> |---------------| >>> |x |y |z |v >>> |----| | | |----| >>> | | | | >>> | |----| |----| | >>> | | | | >>> a| b| c| d| >>> |---------------| |---------------| >>> | SW1 | | SW3 | >>> |---------------| |---------------| >>> | | >>> | | >>> HCA with SM HCA >>> >>> During the discovery: >>> >>> SM sends NodeInfo request to SW1 >>> SM sends NodeInfo request to SW2 through link a->x >>> SM discovers new node SW2: >>> - updates DR to SW2 to go through link a->x >>> - creates physp x >> And requests SwitchInfo from SW2, and on response sends PortInfo to all >> switch ports. PortInfo receiver will initialize all switch ports. Isn't >> it? > > Links are created only by getting NodeInfo response. W/o the > fix, when SW1 gets NodeInfo from SW2 through link b->y, it > doesn't initialize physp for y, hence the link can't be created. 
> So the only chance for the link to be created is when > SW2 will send NodeInfo request to SW1 through link y->b. > But this isn't happening, because DR for SW2 is updated > to contain this link, so SM doesn't probe the remote side > of y to avoid a loop. OK, so the whole story should be caused by a race between receiving the SW2 SwitchInfo (using a->x) and the SW2 NodeInfo (using b->y). As far as I can see, only in this case will the SW2 port 0 path be altered (and PortInfo will be requested using the new path). Right? > BTW, the same thing happens with every other link that connects > the same nodes. In the example above, link v<->d will be > missing as well. Hmm, I was not able to reproduce this using a two-switch setup. But if it is caused by a race, it also should not be 100% reproducible. Basically I'm not against the proposed physp initialization, but I want to understand the problem better. Sasha > > -- Yevgeny > >> Sasha >>> SM sends NodeInfo request to SW2 through link b->y >>> SM discovers a known node SW2 >>> - DOES NOT create physp y >>> - updates DR to SW2 to go through link b->y >>> >>> From now on, the DR to SW2 is going through port y, so OpenSM won't deal >>> with >>> port y any more, leaving it uninitialized (no physp object for this >>> port). >>> >>> The fix is to create physp for the newly discovered port of the known >>> switch node, same way as it is done for HCAs. >>> I also added one log message for the case that showed the problem - when >>> one of the link sides is uninitialized (no valid ports check). Perhaps >>> this log message should be an error message instead?
>>> >>> Signed-off-by: Yevgeny Kliteynik >>> --- >>> opensm/opensm/osm_node_info_rcv.c | 24 +++++++++++++++++++++++- >>> 1 files changed, 23 insertions(+), 1 deletions(-) >>> >>> diff --git a/opensm/opensm/osm_node_info_rcv.c >>> b/opensm/opensm/osm_node_info_rcv.c >>> index c52c0d5..7da3103 100644 >>> --- a/opensm/opensm/osm_node_info_rcv.c >>> +++ b/opensm/opensm/osm_node_info_rcv.c >>> @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, >>> */ >>> if (!osm_node_link_has_valid_ports(p_node, port_num, >>> p_neighbor_node, >>> - p_ni_context->port_num)) >>> + p_ni_context->port_num)) { >>> + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >>> + "Link at node 0x%" PRIx64 ", port %u - no valid ports\n", >>> + cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); >>> goto _exit; >>> + } >>> >>> if (osm_node_link_exists(p_node, port_num, >>> p_neighbor_node, p_ni_context->port_num)) { >>> @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * >>> sm, >>> IN osm_node_t * const p_node, >>> IN const osm_madw_t * const p_madw) >>> { >>> + >>> + ib_smp_t *p_smp; >>> + ib_node_info_t *p_ni; >>> + uint8_t port_num; >>> + >>> OSM_LOG_ENTER(sm->p_log); >>> >>> + p_smp = osm_madw_get_smp_ptr(p_madw); >>> + p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); >>> + port_num = ib_node_info_get_local_port_num(p_ni); >>> + >>> + if (!osm_node_get_physp_ptr(p_node, port_num)) { >>> + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >>> + "Creating physp for node GUID:0x%" >>> + PRIx64 ", port %u\n", >>> + cl_ntoh64(osm_node_get_node_guid(p_node)), >>> + port_num); >>> + osm_node_init_physp(p_node, p_madw); >>> + } >>> + >>> /* >>> If this switch has already been probed during this sweep, >>> then don't bother reprobing it. 
>>> -- >>> 1.5.1.4 >>> > From sean.hefty at intel.com Wed Feb 18 09:50:21 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 18 Feb 2009 09:50:21 -0800 Subject: [ofa-general] RE: [PATCH 9/8] [ib-diag] ibping: add support for WinOF In-Reply-To: <20090218103018.GG7189@sashak.voltaire.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> <20090218103018.GG7189@sashak.voltaire.com> Message-ID: <8302DC6B01C6408D8EE72B0D10AFEDB4@amr.corp.intel.com> >Guess it is about report() function. Why to not make everything cdecl >(by using compiler/linker flag or some super-#pragma in config.h or so)? The WDK build environment uses stdcall by default. Visual Studio uses cdecl. I have not yet figured out how to override the WDK using stdcall. Simply adding a switch (/Gd or whatever it is) doesn't work, nor did the other 50 things that I tried. Top personnel are working on the issue. Please stand by. Thank you for your continued patience. We apologize for any inconvenience. *cue hold music* >Ugh, I really fail to understand why WinOF cannot evaluate an option of >using less "special" build tools for WDK insensitive code (such as >user-space programs ported from linux) - it would solve all those issues >just magically. And we are not entered yet a more complicated porting >areas such as pthreads... I have no problem with it. But it does require two build environments. The current WinOF setup uses a single build environment to build the drivers and related userspace libraries and applications. This is a fairly common practice. I don't know that this is all that different than how OFED packages everything together. My plan for more complicated porting areas is to use complib, and fix any issues that arise. That's what it was designed for. 
- Sean From sashak at voltaire.com Wed Feb 18 09:57:03 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 19:57:03 +0200 Subject: [ofa-general] Re: [PATCH 8/8] [ib-diags] smpquery: add support for WinOF In-Reply-To: <3CFB22DFCDDD4172AC491FF23F4A3D74@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> <8B21199DAF6B4010B109838D36505522@amr.corp.intel.com> <20090218095230.GC7189@sashak.voltaire.com> <3CFB22DFCDDD4172AC491FF23F4A3D74@amr.corp.intel.com> Message-ID: <20090218175703.GV5910@sashak.voltaire.com> On 09:32 Wed 18 Feb , Sean Hefty wrote: > >> #include > >> #include > >> -#include > >> +#include > > > >Is it needed? Rest tools use similar path with leading 'infiniband'. > > That directory path doesn't exist in Windows. I think this makes sense. > Complib is a separate library, independent of infiniband. This is not the case on Linux: complib headers are installed under infiniband (I don't know why, but historically it is so). Hmm, actually it doesn't really matter, since the complib headers themselves use things like #include . So OK, I think we can change it in the diag tools too. > > >> - for (i = 0; i < (n + 31) / 32; i++) { > >> + for (i = 0; i < (uint32_t) ((n + 31) / 32); i++) { > > > >Wouldn't it be better to make declare i, j, k as int? Width 32 doesn't > >make any sense here. > > > >> mod = i | (portnum << 16); > >> if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, mod, 0)) > >> return "pkey table query failed"; > >> @@ -353,7 +353,7 @@ guid_info(ib_portid_t *dest, char **argv, int argc) > >> return "port info failed"; > >> mad_decode_field(data, IB_PORT_GUID_CAP_F, &n); > >> > >> - for (i = 0; i < (n + 7) / 8; i++) { > >> + for (i = 0; i < (uint32_t) ((n + 7) / 8); i++) { > > fixed Thanks. Just repost the patch. I will apply.
Sasha From sean.hefty at intel.com Wed Feb 18 10:00:21 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 18 Feb 2009 10:00:21 -0800 Subject: [ofa-general] [PATCH v2] [ib-diags] smpquery: add support for WinOF In-Reply-To: <20090218175703.GV5910@sashak.voltaire.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> <8B21199DAF6B4010B109838D36505522@amr.corp.intel.com> <20090218095230.GC7189@sashak.voltaire.com> <3CFB22DFCDDD4172AC491FF23F4A3D74@amr.corp.intel.com> <20090218175703.GV5910@sashak.voltaire.com> Message-ID: <905F24B8D493487CB5E91C02E68E3799@amr.corp.intel.com> Allow smpquery to build and run on both Linux and Windows. Windows build files are maintained in the WinOF repository. These changes allow dropping the infiniband-diags into the WinOF build environment. Signed-off-by: Sean Hefty --- changes from v1: declared variables as int, versus casting expressions to (uint32_t) infiniband-diags/src/smpquery.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c index 44280e1..bf1626d 100644 --- a/infiniband-diags/src/smpquery.c +++ b/infiniband-diags/src/smpquery.c @@ -47,7 +47,7 @@ #include #include -#include +#include #include "ibdiag_common.h" @@ -166,7 +166,7 @@ static char * pkey_table(ib_portid_t *dest, char **argv, int argc) { uint8_t data[IB_SMP_DATA_SIZE]; - uint32_t i, j, k; + int i, j, k; uint16_t *p; unsigned mod; int n, t, phy_ports; @@ -343,7 +343,7 @@ static char * guid_info(ib_portid_t *dest, char **argv, int argc) { uint8_t data[IB_SMP_DATA_SIZE]; - uint32_t i, j, k; + int i, j, k; uint64_t *p; unsigned mod; int n; @@ -412,7 +412,7 @@ int main(int argc, char **argv) const struct ibdiag_opt opts[] = { { "combined", 'c', 0, NULL, "use Combined route address argument"}, { "node-name-map", 1, 1, "", "node name map file"}, - {} + { 0 } }; const char *usage_examples[] = { "portinfo 3 1\t\t\t\t# portinfo by lid, with port modifier", From
sashak at voltaire.com Wed Feb 18 10:15:56 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 20:15:56 +0200 Subject: [ofa-general] Re: [PATCH v2] [ib-diags] smpquery: add support for WinOF In-Reply-To: <905F24B8D493487CB5E91C02E68E3799@amr.corp.intel.com> References: <750A8E35453B4F91B3876971D99E1C3A@amr.corp.intel.com> <8B21199DAF6B4010B109838D36505522@amr.corp.intel.com> <20090218095230.GC7189@sashak.voltaire.com> <3CFB22DFCDDD4172AC491FF23F4A3D74@amr.corp.intel.com> <20090218175703.GV5910@sashak.voltaire.com> <905F24B8D493487CB5E91C02E68E3799@amr.corp.intel.com> Message-ID: <20090218181556.GW5910@sashak.voltaire.com> On 10:00 Wed 18 Feb , Sean Hefty wrote: > Allow smpquery to build and run on both Linux and Windows. Windows > build files are maintained in the WinOF repository. These changes > allow dropping the infiniband-diags into the WinOF build environment. > > Signed-off-by: Sean Hefty Applied. Thanks. Sasha From volker.jaenisch at inqbus.de Wed Feb 18 10:13:37 2009 From: volker.jaenisch at inqbus.de (Dr. Volker Jaenisch) Date: Wed, 18 Feb 2009 19:13:37 +0100 Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on 2.6.26 under Debian Lenny In-Reply-To: <499C0EAD.7040604@voltaire.com> References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com> Message-ID: <499C4FD1.7040200@inqbus.de> Dear Or! Or Gerlitz wrote: >> Hello Ofa-List! Compiling the ofa-kernel modules from OFED-1.4 on >> Debian Lenny Kernel 2.6.26 (on amd64) gives me the following trace: > First, this list is related to the development of the Linux RDMA stack > not, please refer with ofed issues to ewg at lists.openfabrics.org Sorry about that. But the description of the ofa-List, "OpenFabrics General Mailing List", does not identify it as a developer-only forum. And there are lots of postings quite similar to mine on this list.
The description of the ewg list, "OpenFabrics Enterprise Working Group Mailing List", where I find working group announcements like "Agenda for the OFED meeting today (Jan 5, 09)", did not look like a promising place to post my message. Maybe a dedicated OFED users list could be set up where I can post my stupid questions. :-) > Second, what makes you want to replace the IB stack that comes with > Debian and not update the distro? I never said anything about replacing it. But before I can contribute improvements to the Debian IB stack, I first need a working IB stack on Debian at all. iSER from the Debian IB stack does not work for me. Remember our discussion on the STGT list? http://lists.wpkg.org/pipermail/stgt/2009-February/002649.html So I looked for a working alternative to double-check my findings on the iSER read problems before posting a bug report. That is why I tried to install OFED 1.4 under Debian. So what's wrong with that? Several parts of OFED (for instance opensm and other user-space tools) are not yet available in Debian. The idea is to bring more consistent InfiniBand support to Debian. But this is not my project, so I would rather not discuss it over someone else's head. Here is the wishlist entry for OFED Debian support filed by Guy Coates: http://groups.google.com/group/linux.debian.bugs.dist/browse_thread/thread/b42e830ce29c641a Best regards, Volker -- ==================================================== inqbus it-consulting +49 ( 341 ) 5643800 Dr. Volker Jaenisch http://www.inqbus.de Herloßsohnstr.
12 0 4 1 5 5 Leipzig N O T - F Ä L L E +49 ( 170 ) 3113748 ==================================================== From sashak at voltaire.com Wed Feb 18 10:19:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Feb 2009 20:19:55 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node In-Reply-To: <499AB068.2020205@dev.mellanox.co.il> References: <499AB068.2020205@dev.mellanox.co.il> Message-ID: <20090218181955.GX5910@sashak.voltaire.com> On 14:41 Tue 17 Feb , Yevgeny Kliteynik wrote: > Hi Sasha, > > This patch fixes bugzilla issue #1515: > > Topology: > |---------------| > | SW2 | > |---------------| > |x |y |z |v > |----| | | |----| > | | | | > | |----| |----| | > | | | | > a| b| c| d| > |---------------| |---------------| > | SW1 | | SW3 | > |---------------| |---------------| > | | > | | > HCA with SM HCA > > During the discovery: > > SM sends NodeInfo request to SW1 > SM sends NodeInfo request to SW2 through link a->x > SM discovers new node SW2: > - updates DR to SW2 to go through link a->x > - creates physp x > SM sends NodeInfo request to SW2 through link b->y > SM discovers a known node SW2 > - DOES NOT create physp y > - updates DR to SW2 to go through link b->y > > From now on, the DR to SW2 is going through port y, so OpenSM won't deal with > port y any more, leaving it uninitialized (no physp object for this port). > > The fix is to create physp for the newly discovered port of the known > switch node, same way as it is done for HCAs. > I also added one log message for the case that showed the problem - when > one of the link sides is uninitialized (no valid ports check). Perhaps > this log message should be an error message instead? 
> > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/osm_node_info_rcv.c | 24 +++++++++++++++++++++++- > 1 files changed, 23 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c > index c52c0d5..7da3103 100644 > --- a/opensm/opensm/osm_node_info_rcv.c > +++ b/opensm/opensm/osm_node_info_rcv.c > @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, > */ > if (!osm_node_link_has_valid_ports(p_node, port_num, > p_neighbor_node, > - p_ni_context->port_num)) > + p_ni_context->port_num)) { Actually if port is initialized unconditionally on NodeInfo receiving this case becomes impossible. No? If yes, we probably need to put CL_ASSERT() there instead of run-time check. Sasha > + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > + "Link at node 0x%" PRIx64 ", port %u - no valid ports\n", > + cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); > goto _exit; > + } > > if (osm_node_link_exists(p_node, port_num, > p_neighbor_node, p_ni_context->port_num)) { > @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm, > IN osm_node_t * const p_node, > IN const osm_madw_t * const p_madw) > { > + > + ib_smp_t *p_smp; > + ib_node_info_t *p_ni; > + uint8_t port_num; > + > OSM_LOG_ENTER(sm->p_log); > > + p_smp = osm_madw_get_smp_ptr(p_madw); > + p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); > + port_num = ib_node_info_get_local_port_num(p_ni); > + > + if (!osm_node_get_physp_ptr(p_node, port_num)) { > + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > + "Creating physp for node GUID:0x%" > + PRIx64 ", port %u\n", > + cl_ntoh64(osm_node_get_node_guid(p_node)), > + port_num); > + osm_node_init_physp(p_node, p_madw); > + } > + > /* > If this switch has already been probed during this sweep, > then don't bother reprobing it. 
> -- > 1.5.1.4 > From rdreier at cisco.com Wed Feb 18 10:38:38 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Feb 2009 10:38:38 -0800 Subject: [ofa-general] Re: [PATCH] IPoIB: In unicast_arp, do path_free only for newly-created paths In-Reply-To: <200902180913.16171.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Wed, 18 Feb 2009 09:13:15 +0200") References: <200902171701.36107.jackm@dev.mellanox.co.il> <200902180913.16171.jackm@dev.mellanox.co.il> Message-ID: > Yossi identified the problem flow. I wrote and tested the actual patch. > Moni reviewed it, and I wrote the final version. I always thought that > the first s-o-b was for the patch writer. Next time, I'll do it right. Yes, first s-o-b should be for the patch writer. But since Moni wasn't involved in sending the patch out, there's no reason for his s-o-b and in fact it doesn't make sense. If he reviewed it, then "Reviewed-by:" is probably the right thing to include. - R. From rdreier at cisco.com Wed Feb 18 10:41:48 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Feb 2009 10:41:48 -0800 Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: logical-/bit-or confusion? In-Reply-To: <499C256E.7050004@opengridcomputing.com> (Steve Wise's message of "Wed, 18 Feb 2009 09:12:46 -0600") References: <499BD470.4080705@gmail.com> <499C256E.7050004@opengridcomputing.com> Message-ID: > > - ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; > > + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) || crc_enabled ? 1 : 0; I don't seem to have the original email for some reason. Has anyone looked at which way generates better/smaller code? Since || requires short-circuit evaluation it might be better to leave it as |. But maybe it's not worth being so tricky. If someone can resend the patch to me I'm happy to apply it. - R. 
From or.gerlitz at gmail.com Wed Feb 18 11:58:20 2009 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 18 Feb 2009 21:58:20 +0200 Subject: Re: [ofa-general] [PATCH 2 of 2 for 2.6.28] mlx4: Add Raw Ethertype QP support In-Reply-To: References: <200812151312.53603.jackm@dev.mellanox.co.il> Message-ID: <15ddcffd0902181158o54477d62kbb3798e3b3310fc9@mail.gmail.com> On Sat, Feb 7, 2009 at 12:05 AM, Roland Dreier wrote: > Seems we're at the point where mlx4 could use a "is_special_qpt()" > helper maybe? Jack, Igor: can you address Roland's comments? The 2.6.30 merge window is approaching, and I'd like to see this patch set in, to be used in a possible sniffer implementation. Or. From kliteyn at dev.mellanox.co.il Wed Feb 18 13:26:52 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 18 Feb 2009 23:26:52 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node In-Reply-To: <20090218174218.GT5910@sashak.voltaire.com> References: <499AB068.2020205@dev.mellanox.co.il> <20090218010303.GZ7189@sashak.voltaire.com> <499BD55B.3090606@dev.mellanox.co.il> <20090218174218.GT5910@sashak.voltaire.com> Message-ID: <499C7D1C.8070800@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: >>> On 14:41 Tue 17 Feb , Yevgeny Kliteynik wrote: >>>> This patch fixes bugzilla issue #1515: >>>> >>>> Topology: >>>> |---------------| >>>> | SW2 | >>>> |---------------| >>>> |x |y |z |v >>>> |----| | | |----| >>>> | | | | >>>> | |----| |----| | >>>> | | | | >>>> a| b| c| d| >>>> |---------------| |---------------| >>>> | SW1 | | SW3 | >>>> |---------------| |---------------| >>>> | | >>>> | | >>>> HCA with SM HCA >>>> >>>> During the discovery: >>>> >>>> SM sends NodeInfo request to SW1 >>>> SM sends NodeInfo request to SW2 through link a->x >>>> SM discovers new node SW2: >>>> - updates DR to SW2 to go through link a->x >>>> - creates physp x >>> And requests SwitchInfo from SW2, and on response
sends PortInfo to all >>> switch ports. PortInfo receiver will initialize all switch ports. Isn't >>> it? >> Links are created only by getting NodeInfo response. W/o the >> fix, when SW1 gets NodeInfo from SW2 through link b->y, it >> doesn't initialize physp for y, hence the link can't be created. >> So the only chance for the link to be created is when >> SW2 will send NodeInfo request to SW1 through link y->b. >> But this isn't happening, because DR for SW2 is updated >> to contain this link, so SM doesn't probe the remote side >> of y to avoid loop. > > Ok, so whole story should be caused by race between SW2 SwitchInfo > receiving (using a->x) and SW2 NodeInfo (using b->y). As far as I can > see only in this case SW2 port 0 path will be altered (and PortInfo will > be requested using new path). Right? Right. >> BTW, the same thing happens with every other link that connects >> the same nodes. In the example above, link v<->d will be >> missing as well. > > Hmm, I was not able to reproduce this using two switch setup. But if it > is resulted by race it also should not be 100% reproducible. Right again. Discovery shouldn't rely on the order of packets that it receives. I guess that on real hardware the packets are handled serially, so we need a more complex example to make this race more likely. I see the problem on the simple example using the simulator (ibmgtsim), which has several threads handling the packets, so the chances for OOO packets are much higher. -- Yevgeny > Basically I'm not against proposed physp initialization, but want to > understand the problem better. > > Sasha > >> -- Yevgeny >> >>> Sasha >>>> SM sends NodeInfo request to SW2 through link b->y >>>> SM discovers a known node SW2 >>>> - DOES NOT create physp y >>>> - updates DR to SW2 to go through link b->y >>>> >>>> From now on, the DR to SW2 is going through port y, so OpenSM won't deal >>>> with >>>> port y any more, leaving it uninitialized (no physp object for this >>>> port).
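The failure discussed in this thread can be reduced to a toy model: a node object is created on its first NodeInfo response, and unless a later response arriving on a different port of the same switch also creates a physp for that port, the link check fails. The structures and functions here are hypothetical simplifications (they ignore directed routes, SwitchInfo and PortInfo handling entirely) and are not OpenSM's `osm_node_t`/`osm_physp_t` API; `apply_fix` toggles the behaviour proposed in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_PORTS 8

/* Hypothetical stand-in for a node and its per-port physp objects. */
struct toy_node {
	bool discovered;               /* node object exists */
	bool physp[TOY_MAX_PORTS];     /* physp object exists for port i */
};

/* Handle a NodeInfo response received on local port 'port_num'.
 * With apply_fix, a newly seen port of an already-known node also
 * gets its physp, as the patch proposes for switches. */
static void toy_ni_rcv(struct toy_node *n, uint8_t port_num, bool apply_fix)
{
	if (port_num >= TOY_MAX_PORTS)
		return;
	if (!n->discovered) {
		n->discovered = true;
		n->physp[port_num] = true;     /* new node: physp created */
	} else if (apply_fix && !n->physp[port_num]) {
		n->physp[port_num] = true;     /* known node: create it now */
	}
}

/* Mirrors the idea behind osm_node_link_has_valid_ports(): a link can
 * only be recorded when both ends have a physp object. */
static bool toy_link_has_valid_ports(const struct toy_node *a, uint8_t pa,
				     const struct toy_node *b, uint8_t pb)
{
	return a->physp[pa] && b->physp[pb];
}
```

Feeding SW2 a response on x and then on y leaves `physp[y]` unset without the fix, so the b<->y link is rejected; with the fix it is accepted. Once every NodeInfo response creates the physp for its port, the "no valid ports" branch should be unreachable for such links, which is the reasoning behind Sasha's suggestion in this thread to use `CL_ASSERT()` there instead of a run-time check.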
>>>> >>>> The fix is to create physp for the newly discovered port of the known >>>> switch node, same way as it is done for HCAs. >>>> I also added one log message for the case that showed the problem - when >>>> one of the link sides is uninitialized (no valid ports check). Perhaps >>>> this log message should be an error message instead? >>>> >>>> Signed-off-by: Yevgeny Kliteynik >>>> --- >>>> opensm/opensm/osm_node_info_rcv.c | 24 +++++++++++++++++++++++- >>>> 1 files changed, 23 insertions(+), 1 deletions(-) >>>> >>>> diff --git a/opensm/opensm/osm_node_info_rcv.c >>>> b/opensm/opensm/osm_node_info_rcv.c >>>> index c52c0d5..7da3103 100644 >>>> --- a/opensm/opensm/osm_node_info_rcv.c >>>> +++ b/opensm/opensm/osm_node_info_rcv.c >>>> @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, >>>> */ >>>> if (!osm_node_link_has_valid_ports(p_node, port_num, >>>> p_neighbor_node, >>>> - p_ni_context->port_num)) >>>> + p_ni_context->port_num)) { >>>> + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >>>> + "Link at node 0x%" PRIx64 ", port %u - no valid ports\n", >>>> + cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); >>>> goto _exit; >>>> + } >>>> >>>> if (osm_node_link_exists(p_node, port_num, >>>> p_neighbor_node, p_ni_context->port_num)) { >>>> @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * >>>> sm, >>>> IN osm_node_t * const p_node, >>>> IN const osm_madw_t * const p_madw) >>>> { >>>> + >>>> + ib_smp_t *p_smp; >>>> + ib_node_info_t *p_ni; >>>> + uint8_t port_num; >>>> + >>>> OSM_LOG_ENTER(sm->p_log); >>>> >>>> + p_smp = osm_madw_get_smp_ptr(p_madw); >>>> + p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); >>>> + port_num = ib_node_info_get_local_port_num(p_ni); >>>> + >>>> + if (!osm_node_get_physp_ptr(p_node, port_num)) { >>>> + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >>>> + "Creating physp for node GUID:0x%" >>>> + PRIx64 ", port %u\n", >>>> + cl_ntoh64(osm_node_get_node_guid(p_node)), >>>> + port_num); >>>> + osm_node_init_physp(p_node, 
p_madw); >>>> + } >>>> + >>>> /* >>>> If this switch has already been probed during this sweep, >>>> then don't bother reprobing it. >>>> -- >>>> 1.5.1.4 >>>> > From kliteyn at dev.mellanox.co.il Wed Feb 18 13:31:25 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 18 Feb 2009 23:31:25 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node In-Reply-To: <20090218181955.GX5910@sashak.voltaire.com> References: <499AB068.2020205@dev.mellanox.co.il> <20090218181955.GX5910@sashak.voltaire.com> Message-ID: <499C7E2D.8050301@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 14:41 Tue 17 Feb , Yevgeny Kliteynik wrote: >> Hi Sasha, >> >> This patch fixes bugzilla issue #1515: >> >> Topology: >> |---------------| >> | SW2 | >> |---------------| >> |x |y |z |v >> |----| | | |----| >> | | | | >> | |----| |----| | >> | | | | >> a| b| c| d| >> |---------------| |---------------| >> | SW1 | | SW3 | >> |---------------| |---------------| >> | | >> | | >> HCA with SM HCA >> >> During the discovery: >> >> SM sends NodeInfo request to SW1 >> SM sends NodeInfo request to SW2 through link a->x >> SM discovers new node SW2: >> - updates DR to SW2 to go through link a->x >> - creates physp x >> SM sends NodeInfo request to SW2 through link b->y >> SM discovers a known node SW2 >> - DOES NOT create physp y >> - updates DR to SW2 to go through link b->y >> >> From now on, the DR to SW2 is going through port y, so OpenSM won't deal with >> port y any more, leaving it uninitialized (no physp object for this port). >> >> The fix is to create physp for the newly discovered port of the known >> switch node, same way as it is done for HCAs. >> I also added one log message for the case that showed the problem - when >> one of the link sides is uninitialized (no valid ports check). Perhaps >> this log message should be an error message instead? 
>> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/opensm/osm_node_info_rcv.c | 24 +++++++++++++++++++++++- >> 1 files changed, 23 insertions(+), 1 deletions(-) >> >> diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c >> index c52c0d5..7da3103 100644 >> --- a/opensm/opensm/osm_node_info_rcv.c >> +++ b/opensm/opensm/osm_node_info_rcv.c >> @@ -164,8 +164,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, >> */ >> if (!osm_node_link_has_valid_ports(p_node, port_num, >> p_neighbor_node, >> - p_ni_context->port_num)) >> + p_ni_context->port_num)) { > > Actually if port is initialized unconditionally on NodeInfo receiving > this case becomes impossible. No? > > If yes, we probably need to put CL_ASSERT() there instead of run-time > check. Good point. I'll repost the patch when we finish discussing it. -- Yevgeny > Sasha > >> + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> + "Link at node 0x%" PRIx64 ", port %u - no valid ports\n", >> + cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); >> goto _exit; >> + } >> >> if (osm_node_link_exists(p_node, port_num, >> p_neighbor_node, p_ni_context->port_num)) { >> @@ -537,8 +541,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm, >> IN osm_node_t * const p_node, >> IN const osm_madw_t * const p_madw) >> { >> + >> + ib_smp_t *p_smp; >> + ib_node_info_t *p_ni; >> + uint8_t port_num; >> + >> OSM_LOG_ENTER(sm->p_log); >> >> + p_smp = osm_madw_get_smp_ptr(p_madw); >> + p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); >> + port_num = ib_node_info_get_local_port_num(p_ni); >> + >> + if (!osm_node_get_physp_ptr(p_node, port_num)) { >> + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> + "Creating physp for node GUID:0x%" >> + PRIx64 ", port %u\n", >> + cl_ntoh64(osm_node_get_node_guid(p_node)), >> + port_num); >> + osm_node_init_physp(p_node, p_madw); >> + } >> + >> /* >> If this switch has already been probed during this sweep, >> then don't bother reprobing it. 
>> -- >> 1.5.1.4 >> > From weiny2 at llnl.gov Wed Feb 18 14:38:32 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 18 Feb 2009 14:38:32 -0800 Subject: [ofa-general] Re: [PATCH 1/8] Clean up "new" interface In-Reply-To: References: <20090217210642.41c64624.weiny2@llnl.gov> Message-ID: <20090218143832.c1a809ce.weiny2@llnl.gov> I will resend this whole series. Al also informed me that my signature/from is messed up. From: weiny2 at llnl.gov It looks like my .gitconfig is not right. Sorry, Ira On Wed, 18 Feb 2009 12:07:15 -0500 Hal Rosenstock wrote: > On Wed, Feb 18, 2009 at 12:06 AM, Ira Weiny wrote: > > > > From bac9afe0da7772f97190b3ce758d3e5bfa1fcb65 Mon Sep 17 00:00:00 2001 > > From: weiny2 at llnl.gov > > Date: Tue, 17 Feb 2009 17:32:15 -0800 > > Subject: [PATCH] Clean up "new" interface > > > > type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * > > Create new mad_rpc_portid(struct ibmad_port *srcport) function > > which mirrors madrpc_portid(void) > > > > Signed-off-by: weiny2 at llnl.gov > > --- > > libibmad/include/infiniband/mad.h | 58 ++++++++++++++++++++++-------------- > > libibmad/src/gs.c | 19 ++++++------ > > libibmad/src/libibmad.map | 1 + > > libibmad/src/resolve.c | 10 ++++-- > > libibmad/src/rpc.c | 29 +++++++++--------- > > libibmad/src/sa.c | 4 +- > > libibmad/src/smp.c | 4 +- > > 7 files changed, 71 insertions(+), 54 deletions(-) > > > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > > index 1aaaa1b..56b87e6 100644 > > --- a/libibmad/include/infiniband/mad.h > > +++ b/libibmad/include/infiniband/mad.h > > @@ -724,42 +724,49 @@ static inline int mad_is_vendor_range2(int mgmt) > > } > > > > /* rpc.c */ > > +/* Depricated interface */ > > typo - Deprecated > > > MAD_EXPORT int madrpc_portid(void); > > -MAD_EXPORT int madrpc_set_retries(int retries); > > -MAD_EXPORT int madrpc_set_timeout(int timeout); > > I thought initially we weren't going to remove APIs but move over to > the new 
ones ? A subsequent step would be to deprecate the old APIs > and then eventually remove the old APIs. > > -- Hal > > > void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata); > > void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, > > void *data); > > MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > > int num_classes); > > void madrpc_save_mad(void *madbuf, int len); > > -MAD_EXPORT void madrpc_show_errors(int set); > > > > -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > > +/* New interface */ > > +MAD_EXPORT void madrpc_show_errors(int set); > > +MAD_EXPORT int madrpc_set_retries(int retries); > > +MAD_EXPORT int madrpc_set_timeout(int timeout); > > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > > int num_classes); > > -void mad_rpc_close_port(void *ibmad_port); > > -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, > > +void mad_rpc_close_port(struct ibmad_port *srcport); > > +void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > > void *payload, void *rcvdata); > > -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, > > +void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > > ib_rmpp_hdr_t * rmpp, void *data); > > +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); > > > > /* smp.c */ > > MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid, > > unsigned mod, unsigned timeout); > > MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid, > > unsigned mod, unsigned timeout); > > + > > +/* smp.c new interface */ > > MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, > > - unsigned mod, unsigned timeout, const void *srcport); > > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport); > > uint8_t 
*smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, > > - unsigned timeout, const void *srcport); > > + unsigned timeout, const struct ibmad_port *srcport); > > > > /* sa.c */ > > uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, > > unsigned timeout); > > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > > - ib_sa_call_t * sa, unsigned timeout); > > MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */ > > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > > + > > +/* sa.c new interface */ > > +uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid, > > + ib_sa_call_t * sa, unsigned timeout); > > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, > > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); > > > > /* resolve.c */ > > @@ -771,14 +778,17 @@ MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > > MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, > > ibmad_gid_t * gid); > > > > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); > > +/* resolve.c new interface */ > > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > > + const struct ibmad_port *srcport); > > int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > - ib_portid_t * sm_id, int timeout, const void *srcport); > > + ib_portid_t * sm_id, int timeout, > > + const struct ibmad_port *srcport); > > int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > > enum MAD_DEST dest, ib_portid_t * sm_id, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > > > /* gs.c */ > > MAD_EXPORT uint8_t 
*perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, > > @@ -798,26 +808,28 @@ MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, > > MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout); > > > > +/* gs.c new interface */ > > uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > > - unsigned timeout, const void *srcport); > > + unsigned timeout, const struct ibmad_port *srcport); > > uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > > unsigned mask, unsigned timeout, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned mask, > > - unsigned timeout, const void *srcport); > > + unsigned timeout, > > + const struct ibmad_port *srcport); > > uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport); > > + const struct ibmad_port *srcport); > > /* dump.c */ > > MAD_EXPORT ib_mad_dump_fn > > mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, > > diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c > > index d2c4574..e302caf 100644 > > --- a/libibmad/src/gs.c > > +++ b/libibmad/src/gs.c > > @@ -47,7 +47,7 @@ > > > > static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port, > > unsigned 
timeout, unsigned id, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > ib_rpc_t rpc = { 0 }; > > int lid = dest->lid; > > @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, > > > > uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO, > > srcport); > > @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port, > > } > > > > uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > > - unsigned timeout, const void *srcport) > > + unsigned timeout, const struct ibmad_port *srcport) > > { > > return pma_query_via(rcvbuf, dest, port, timeout, > > IB_GSI_PORT_COUNTERS, srcport); > > @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port, > > > > static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned mask, unsigned timeout, > > - unsigned id, const void *srcport) > > + unsigned id, const struct ibmad_port *srcport) > > { > > ib_rpc_t rpc = { 0 }; > > int lid = dest->lid; > > @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port, > > > > uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > > unsigned mask, unsigned timeout, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > return performance_reset_via(rcvbuf, dest, port, mask, timeout, > > IB_GSI_PORT_COUNTERS, srcport); > > @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port, > > > > uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > return 
pma_query_via(rcvbuf, dest, port, timeout, > > IB_GSI_PORT_COUNTERS_EXT, srcport); > > @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port, > > > > uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned mask, > > - unsigned timeout, const void *srcport) > > + unsigned timeout, > > + const struct ibmad_port *srcport) > > { > > return performance_reset_via(rcvbuf, dest, port, mask, timeout, > > IB_GSI_PORT_COUNTERS_EXT, srcport); > > @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port, > > > > uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > return pma_query_via(rcvbuf, dest, port, timeout, > > IB_GSI_PORT_SAMPLES_CONTROL, srcport); > > @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port, > > > > uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > > int port, unsigned timeout, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > return pma_query_via(rcvbuf, dest, port, timeout, > > IB_GSI_PORT_SAMPLES_RESULT, srcport); > > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > > index f944d86..94d7762 100644 > > --- a/libibmad/src/libibmad.map > > +++ b/libibmad/src/libibmad.map > > @@ -69,6 +69,7 @@ IBMAD_1.3 { > > mad_rpc_close_port; > > mad_rpc; > > mad_rpc_rmpp; > > + mad_rpc_portid; > > madrpc; > > madrpc_def_timeout; > > madrpc_init; > > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c > > index 553949d..3291f43 100644 > > --- a/libibmad/src/resolve.c > > +++ b/libibmad/src/resolve.c > > @@ -45,7 +45,8 @@ > > #undef DEBUG > > #define DEBUG if (ibdebug) IBWARN > > > > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport) > > +int 
ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > > + const struct ibmad_port *srcport) > > { > > ib_portid_t self = { 0 }; > > uint8_t portinfo[64]; > > @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) > > } > > > > int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > - ib_portid_t * sm_id, int timeout, const void *srcport) > > + ib_portid_t * sm_id, int timeout, > > + const struct ibmad_port *srcport) > > { > > ib_portid_t sm_portid; > > char buf[IB_SA_DATA_SIZE] = { 0 }; > > @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > > > int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > > enum MAD_DEST dest_type, ib_portid_t * sm_id, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > uint64_t guid; > > int lid; > > @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > > } > > > > int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > > - const void *srcport) > > + const struct ibmad_port *srcport) > > { > > ib_portid_t self = { 0 }; > > uint8_t portinfo[64]; > > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > > index e811526..d47873b 100644 > > --- a/libibmad/src/rpc.c > > +++ b/libibmad/src/rpc.c > > @@ -100,6 +100,11 @@ int madrpc_portid(void) > > return mad_portid; > > } > > > > +int mad_rpc_portid(struct ibmad_port *srcport) > > +{ > > + return (srcport->port_id); > > +} > > + > > static int > > _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > > int timeout) > > @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > > return -1; > > } > > > > -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > > +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, > > void *payload, void *rcvdata) > > { > > - const struct ibmad_port *p = port_id; > > int status, len; > > 
uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > > > @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > > if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) > > return 0; > > > > - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > > - p->class_agents[rpc->mgtclass], > > + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > > + port->class_agents[rpc->mgtclass], > > len, rpc->timeout)) < 0) { > > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > > return 0; > > @@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > > return rcvdata; > > } > > > > -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > > +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, > > ib_rmpp_hdr_t * rmpp, void *data) > > { > > - const struct ibmad_port *p = port_id; > > int status, len; > > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > > > @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > > if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) > > return 0; > > > > - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > > - p->class_agents[rpc->mgtclass], > > + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > > + port->class_agents[rpc->mgtclass], > > len, rpc->timeout)) < 0) { > > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > > return 0; > > @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) > > } > > } > > > > -void *mad_rpc_open_port(char *dev_name, int dev_port, > > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, > > int *mgmt_classes, int num_classes) > > { > > struct ibmad_port *p; > > @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port, > > return p; > > } > > > > -void mad_rpc_close_port(void *port_id) > > +void mad_rpc_close_port(struct ibmad_port 
*port) > > { > > - struct ibmad_port *p = port_id; > > - > > - umad_close_port(p->port_id); > > - free(p); > > + umad_close_port(port->port_id); > > + free(port); > > } > > > > uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, > > diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c > > index 7403d4f..ddeb152 100644 > > --- a/libibmad/src/sa.c > > +++ b/libibmad/src/sa.c > > @@ -44,7 +44,7 @@ > > #undef DEBUG > > #define DEBUG if (ibdebug) IBWARN > > > > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > > +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid, > > ib_sa_call_t * sa, unsigned timeout) > > { > > ib_rpc_t rpc = { 0 }; > > @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > > IB_PR_COMPMASK_SGID |\ > > IB_PR_COMPMASK_NUMBPATH) > > > > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, > > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf) > > { > > int npath; > > diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c > > index fad263c..e5489b3 100644 > > --- a/libibmad/src/smp.c > > +++ b/libibmad/src/smp.c > > @@ -45,7 +45,7 @@ > > #define DEBUG if (ibdebug) IBWARN > > > > uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid, > > - unsigned mod, unsigned timeout, const void *srcport) > > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) > > { > > ib_rpc_t rpc = { 0 }; > > > > @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid, > > } > > > > uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid, > > - unsigned mod, unsigned timeout, const void *srcport) > > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) > > { > > ib_rpc_t rpc = { 0 }; > > > > -- > > 1.5.4.5 > > > > -- Ira Weiny Math Programer/Computer 
Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From rdreier at cisco.com Wed Feb 18 16:40:37 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Feb 2009 16:40:37 -0800 Subject: [ofa-general] ib_reg_phys_mr( ) results in crash In-Reply-To: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com> (neutron's message of "Tue, 17 Feb 2009 09:50:21 -0500") References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com> Message-ID: > Before calling ib_reg_phys_mr, printk() shows that all its arguments > are valid. But the system always crashes immediately after entering > the function ib_reg_phys_mr( ). Any possible reasons ? Thanks!! What do you mean by "immediately after entering ib_reg_phys_mr()"? Do you get an oops message? If so that would be very important info for debugging this. - R. From sean.hefty at intel.com Wed Feb 18 17:43:28 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 18 Feb 2009 17:43:28 -0800 Subject: [ofa-general] [PATCH 0/6] [ib-diag] add support to more diags for WinOF Message-ID: This series adds support to all remaining IB diagnostics utilities, except saquery. Signed-off-by: Sean Hefty From sean.hefty at intel.com Wed Feb 18 17:46:05 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 18 Feb 2009 17:46:05 -0800 Subject: [ofa-general] [PATCH 1/6] [ib-diag] ibnetdiscover: add support for WinOF In-Reply-To: References: Message-ID: <16F309DB95BC45BE90DE636AE675310C@amr.corp.intel.com> Mainly fixing datatypes to avoid type mismatches. Signed-off-by: Sean Hefty --- Also attaching patch in case my mailer wraps the lines.
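Two changes recur across this WinOF series: lookup tables and assignments are narrowed to `char`/`unsigned char` with explicit casts, and the empty `{ }` terminator of option arrays becomes `{ 0 }`. A minimal sketch of both in isolation: `line_slot_2_sfb4` copies the values from the grouping.c hunk (the `static const` qualifiers are an addition), while `toy_record`, `toy_option` and the helpers are hypothetical stand-ins, not the real ChassisRecord or ibdiag option types. ISO C before C23 has no empty initializer (GCC accepts `{ }` as an extension), which is presumably why a non-GCC Windows build needs `{ 0 }`.

```c
#include <assert.h>
#include <stddef.h>

/* Spine port number -> sFB4 line slot, values copied from the patch.
 * Every entry is in 0..4, so it fits in char whether plain char is
 * signed or unsigned on the target. */
static const char line_slot_2_sfb4[25] = {
	0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
	3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4
};

/* Hypothetical record with a narrow field, like chrecord->chassisnum. */
struct toy_record {
	unsigned char slotnum;
};

static void toy_set_slot(struct toy_record *r, int portnum)
{
	assert(portnum >= 0 && portnum < 25);
	/* The explicit cast documents an intended narrowing that some
	 * compilers would otherwise flag as possible loss of data. */
	r->slotnum = (unsigned char) line_slot_2_sfb4[portnum];
}

/* Hypothetical option entry; the real ibdiag type has more fields. */
struct toy_option {
	const char *name;
	int letter;
};

static const struct toy_option toy_opts[] = {
	{ "all", 'a' },
	{ "no_dests", 'n' },
	{ 0 }	/* portable all-zero terminator instead of { } */
};

/* Count entries up to the all-zero terminator. */
static size_t toy_count_opts(const struct toy_option *opts)
{
	size_t n = 0;

	while (opts[n].name != NULL)
		n++;
	return n;
}
```

`{ 0 }` zero-initializes every remaining member, so the terminator still has a NULL `name`, which is what a loop like `toy_count_opts()` relies on.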
 infiniband-diags/src/grouping.c | 28 ++++++++++++++--------------
 infiniband-diags/src/ibnetdiscover.c | 8 ++++----
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c
index 0ea139f..0266af4 100644
--- a/infiniband-diags/src/grouping.c
+++ b/infiniband-diags/src/grouping.c
@@ -265,20 +265,20 @@ int is_chassis_switch(Node *node)
 }

 /* these structs help find Line (Anafa) slot number while using spine portnum */
-int line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 };
-int anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 };
-int line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 };
-int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
+char line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 };
+char anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 };
+char line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 };
+char anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };

 /* IPR FCR modules connectivity while using sFB4 port as reference */
-int ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 };
+char ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 };

 /* these structs help find Spine (Anafa) slot number while using spine portnum */
-int spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-int spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */
+char spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+char anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+char spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24 }; */

 static void get_sfb_slot(Node *node, Port *lineport)
 {
@@ -309,7 +309,7 @@ static void get_sfb_slot(Node *node, Port *lineport)
 static void get_router_slot(Node *node, Port *spineport)
 {
 	ChassisRecord *ch = node->chrecord;
-	int guessnum = 0;
+	uint64_t guessnum = 0;

 	if (!ch) {
 		if (!(node->chrecord = calloc(1, sizeof(ChassisRecord))))
@@ -460,7 +460,7 @@ static void insert_line_router(Node *node, ChassisList *chassislist)
 		return;		/* already filled slot */

 	chassislist->linenode[i] = node;
-	node->chrecord->chassisnum = chassislist->chassisnum;
+	node->chrecord->chassisnum = (unsigned char) chassislist->chassisnum;
 }

 static void insert_spine(Node *node, ChassisList *chassislist)
@@ -471,7 +471,7 @@ static void insert_spine(Node *node, ChassisList *chassislist)
 		return;		/* already filled slot */

 	chassislist->spinenode[i] = node;
-	node->chrecord->chassisnum = chassislist->chassisnum;
+	node->chrecord->chassisnum = (unsigned char) chassislist->chassisnum;
 }

 static void pass_on_lines_catch_spines(ChassisList *chassislist)
@@ -770,7 +770,7 @@ ChassisList *group_nodes()
 			if (!node->chrecord) {
 				if (!(node->chrecord = calloc(1, sizeof(ChassisRecord))))
 					IBPANIC("out of mem");
-				node->chrecord->chassisnum = chassis->chassisnum;
+				node->chrecord->chassisnum = (unsigned char) chassis->chassisnum;
 			}
 		}
 	}
diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 466d522..27afd6a 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -47,7 +47,7 @@
 #include
 #include

-#include
+#include

 #include "ibnetdiscover.h"
 #include "grouping.h"
@@ -212,7 +212,7 @@ extend_dpath(ib_dr_path_t *path, int nextport)
 	++path->cnt;
 	if (path->cnt > maxhops_discovered)
 		maxhops_discovered = path->cnt;
-	path->p[path->cnt] = nextport;
+	path->p[path->cnt] = (uint8_t) nextport;
 	return path->cnt;
 }

@@ -517,7 +517,7 @@ out_chassis(int chassisnum)
 	uint64_t guid;

 	fprintf(f, "\nChassis %d", chassisnum);
-	guid = get_chassis_guid(chassisnum);
+	guid = get_chassis_guid((unsigned char) chassisnum);
 	if (guid)
 		fprintf(f, " (guid 0x%" PRIx64 ")", guid);
 	fprintf(f, "\n");
@@ -964,7 +964,7 @@ int main(int argc, char **argv)
 		{ "Router_list", 'R', 0, NULL, "list of connected routers" },
 		{ "node-name-map", 1, 1, "", "node name map file" },
 		{ "ports", 'p', 0, NULL, "obtain a ports report" },
-		{ }
+		{ 0 }
 	};

 	char usage_args[] = "[topology-file]";
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 01-win-ibnet
Type: application/octet-stream
Size: 5992 bytes
Desc: not available
URL: 

From sean.hefty at intel.com Wed Feb 18 17:46:38 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:46:38 -0800
Subject: [ofa-general] [PATCH 2/6] [ib-diag] ibroute: add support for WinOF
In-Reply-To: 
References: 
Message-ID: 

Signed-off-by: Sean Hefty
---
 infiniband-diags/src/ibroute.c | 6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
index 144d1b2..d1049ad 100644
--- a/infiniband-diags/src/ibroute.c
+++ b/infiniband-diags/src/ibroute.c
@@ -45,7 +45,7 @@
 #include
 #include

-#include
+#include

 #include "ibdiag_common.h"

@@ -327,7 +327,7 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid)
 		for (;i < e; i++) {
 			unsigned outport = lft[i % IB_SMP_DATA_SIZE];
-			unsigned valid = (outport <= nports);
+			unsigned valid = (outport <= (unsigned) nports);

 			if (!valid && !dump_all)
 				continue;
@@ -370,7 +370,7 @@ int main(int argc, char **argv)
 		{ "all", 'a', 0, NULL, "show all lids, even invalid entries" },
 		{ "no_dests", 'n', 0, NULL, "do not try to resolve destinations" },
 		{ "Multicast", 'M', 0, NULL, "show multicast forwarding tables" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = "[ [ []]]";
 	const char *usage_examples[] = {

From sean.hefty at intel.com Wed Feb 18 17:47:10 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:47:10 -0800
Subject: [ofa-general] [PATCH 3/6] [ib-diag] ibtracert: add support for WinOF
In-Reply-To: 
References: 
Message-ID: <05EDF7233B20414B821BCFF5B9938F44@amr.corp.intel.com>

Signed-off-by: Sean Hefty
---
 infiniband-diags/src/ibtracert.c | 6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c
index ea5662b..db3b906 100644
--- a/infiniband-diags/src/ibtracert.c
+++ b/infiniband-diags/src/ibtracert.c
@@ -46,7 +46,7 @@
 #include
 #include

-#include
+#include

 #include "ibdiag_common.h"

@@ -180,7 +180,7 @@ extend_dpath(ib_dr_path_t *path, int nextport)
 	if (path->cnt+2 >= sizeof(path->p))
 		return -1;
 	++path->cnt;
-	path->p[path->cnt] = nextport;
+	path->p[path->cnt] = (uint8_t) nextport;
 	return path->cnt;
 }

@@ -718,7 +718,7 @@ int main(int argc, char **argv)
 		{ "no_info", 'n', 0, NULL, "simple format" },
 		{ "mlid", 'm', 1, "", "multicast trace of the mlid" },
 		{ "node-name-map", 1, 1, "", "node name map file" },
-		{ }
+		{ 0 }
 	};
 	char usage_args[] = " ";
 	const char *usage_examples[] = {

From sean.hefty at intel.com Wed Feb 18 17:48:44 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:48:44 -0800
Subject: [ofa-general] [PATCH 4/6] [ib-diag] ibsysstat: add support for WinOF
In-Reply-To: 
References: 
Message-ID: 

Use char* pointers to obtain offsets, in place of void*.

Signed-off-by: Sean Hefty
---
 infiniband-diags/src/ibsysstat.c | 6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
index cc1418d..b9f2f85 100644
--- a/infiniband-diags/src/ibsysstat.c
+++ b/infiniband-diags/src/ibsysstat.c
@@ -183,7 +183,7 @@ static char *ibsystat_serv(void)

 		DEBUG("got packet: attr 0x%x mod 0x%x", attr, mod);

-		size = mk_reply(attr, mad + IB_VENDOR_RANGE2_DATA_OFFS,
+		size = mk_reply(attr, (char *) mad + IB_VENDOR_RANGE2_DATA_OFFS,
 				sizeof(buf) - umad_size() - IB_VENDOR_RANGE2_DATA_OFFS);

 		if (server_respond(umad, IB_VENDOR_RANGE2_DATA_OFFS + size) < 0)
@@ -210,7 +210,7 @@ static char *ibsystat(ib_portid_t *portid, int attr)
 {
 	ib_rpc_t rpc = { 0 };
 	int fd, agent, timeout, len;
-	void *data = umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS;
+	void *data = (char *) umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS;

 	DEBUG("Sysstat ping..");

@@ -318,7 +318,7 @@ int main(int argc, char **argv)
 	const struct ibdiag_opt opts[] = {
 		{ "oui", 'o', 1, NULL, "use specified OUI number" },
 		{ "Server", 'S', 0, NULL, "start in server mode" },
-		{ }
+		{ 0 }
 	};

 	char usage_args[] = " []";

From sean.hefty at intel.com Wed Feb 18 17:49:09 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:49:09 -0800
Subject: [ofa-general] [PATCH 5/6] [ib-diag] ibsendtrap: add support for WinOF
In-Reply-To: 
References: 
Message-ID: <0BC5E717DDC24248A6A7515FFAC7225D@amr.corp.intel.com>

Signed-off-by: Sean Hefty
---
 infiniband-diags/src/ibsendtrap.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index ba6aa8b..ba6f86a 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -43,7 +43,7 @@
 #include
 #include

-#include
+#include

 #include "ibdiag_common.h"


From sean.hefty at intel.com Wed Feb 18 17:50:15 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Feb 2009 17:50:15 -0800
Subject: [ofa-general] [PATCH 6/6] [ib-diag] mcm_rereg_test: add support for WinOF
In-Reply-To: 
References: 
Message-ID: <46D52E76EEAC43519FC7D536301C69CE@amr.corp.intel.com>

Fix some typecasts and variable argument function macro definitions

Signed-off-by: Sean Hefty
---
 infiniband-diags/src/mcm_rereg_test.c | 24 +++++++++++++++---------
 1 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/infiniband-diags/src/mcm_rereg_test.c b/infiniband-diags/src/mcm_rereg_test.c
index 9285b95..5252459 100644
--- a/infiniband-diags/src/mcm_rereg_test.c
+++ b/infiniband-diags/src/mcm_rereg_test.c
@@ -31,6 +31,10 @@
  *
  */

+#if HAVE_CONFIG_H
+# include
+#endif /* HAVE_CONFIG_H */
+
 #include
 #include
 #include
@@ -39,12 +43,12 @@
 #include
 #include

-#define info(fmt, arg...) fprintf(stderr, "INFO: " fmt, ##arg )
-#define err(fmt, arg...) fprintf(stderr, "ERR: " fmt, ##arg )
+#define info(fmt, ...) fprintf(stderr, "INFO: " fmt, ## __VA_ARGS__ )
+#define err(fmt, ...) fprintf(stderr, "ERR: " fmt, ## __VA_ARGS__ )
 #ifdef NOISY_DEBUG
-#define dbg(fmt, arg...) fprintf(stderr, "DBG: " fmt, ##arg )
+#define dbg(fmt, ...) fprintf(stderr, "DBG: " fmt, ## __VA_ARGS__ )
 #else
-#define dbg(fmt, arg...)
+#define dbg(fmt, ...)
 #endif

 #define TMO 100
@@ -161,7 +165,8 @@ static int rereg_send_all(int port, int agent, ib_portid_t *dport,
 {
 	uint8_t *umad;
 	int len = umad_size() + 256;
-	int i, ret;
+	unsigned i;
+	int ret;

 	info("rereg_send_all... cnt = %u\n", cnt);

@@ -247,7 +252,7 @@ static int rereg_recv_all(int port, int agent, ib_portid_t *dport,
 	int len = umad_size() + 256;
 	uint64_t trid;
 	unsigned n, method, status;
-	int i;
+	unsigned i;

 	info("rereg_recv_all...\n");

@@ -301,7 +306,8 @@ static int rereg_query_all(int port, int agent, ib_portid_t *dport,
 	uint8_t *umad, *mad;
 	int len = umad_size() + 256;
 	unsigned method, status;
-	int i, ret;
+	unsigned i;
+	int ret;

 	info("rereg_query_all...\n");

@@ -384,8 +390,8 @@ static int rereg_and_test_port(char *guid_file, int port, int agent, ib_portid_t
 	char line[256];
 	FILE *f;
 	ibmad_gid_t port_gid;
-	uint64_t prefix = htonll(0xfe80000000000000llu);
-	uint64_t guid = htonll(0x0002c90200223825llu);
+	uint64_t prefix = htonll(0xfe80000000000000ull);
+	uint64_t guid = htonll(0x0002c90200223825ull);
 	struct guid_trid *list;
 	int i = 0;


From Jie.Cai at cs.anu.edu.au Wed Feb 18 18:07:46 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Thu, 19 Feb 2009 13:07:46 +1100
Subject: [ofa-general] RDMA write with immediate data.
Message-ID: <499CBEF2.2010909@cs.anu.edu.au>

I am currently facing a problem that I let an initiator to RDMA write data to the remote side with immediate data.

if (initiator) {
	ret = dat_ib_post_rdma_write_immed(
		h_ep,		// ep_handle
		1,		// num_segments
		&l_iov,		// LMR
		cookie,		// user_cookie
		&r_iov,		// RMR
		immed_data,
		DAT_COMPLETION_DEFAULT_FLAG);
	ret = dat_evd_wait(h_dto_req_evd, DTO_TIMEOUT, 1, &event, &nmore);
} else {
	ret = dat_evd_wait(h_dto_rcv_evd, DTO_TIMEOUT, 1, &event, &nmore);
}

However, at remote side I got the following error message indicates that no event coming through.
5217 ERROR: DTO dat_evd_wait() DAT_TIMEOUT_EXPIRED
5217 Error do_rdmw_write_with_immd: DAT_TIMEOUT_EXPIRED

The return of dat_evd_wait is DAT_TIMEOUT_EXPIRED.

Would anyone help with this?

-- 
Mr. Jie Cai
Department of Computer Science
Faculty of Engineering and Information Technology
College of Engineering & Computer Science
CSIT Building (108), North Road
The Australian National University
Canberra ACT 0200 Australia
Email: Jie.Cai at cs.anu.edu.au
Tel: +61-2-61251451
Fax: +61-2-61250010
Web: http://cs.anu.edu.au/~Jie.Cai
Mobile: 0433992958

From arlin.r.davis at intel.com Thu Feb 19 00:19:29 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Thu, 19 Feb 2009 00:19:29 -0800
Subject: [ofa-general] RDMA write with immediate data.
In-Reply-To: <499CBEF2.2010909@cs.anu.edu.au>
References: <499CBEF2.2010909@cs.anu.edu.au>
Message-ID: 

>
>if (initiator) {
>	ret = dat_ib_post_rdma_write_immed( h_ep, //
>
>However, at remote side I got the following error message
>indicates that
>no event coming through.
>
>5217 ERROR: DTO dat_evd_wait() DAT_TIMEOUT_EXPIRED
>5217 Error do_rdmw_write_with_immd: DAT_TIMEOUT_EXPIRED
>
>The return of dat_evd_wait is DAT_TIMEOUT_EXPIRED.
>

Does the initiator side complete successfully?

Do you have receives posted at the remote side for immed data?

You can look at dtestx source for an immed data example.
-arlin From vlad at lists.openfabrics.org Thu Feb 19 03:21:00 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 19 Feb 2009 03:21:00 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090219-0200 daily build status Message-ID: <20090219112101.10361E28155@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on 
ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From tziporet at dev.mellanox.co.il Thu Feb 19 03:32:21 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 19 Feb 2009 13:32:21 +0200 Subject: [ofa-general] ib_reg_phys_mr( ) results in crash In-Reply-To: References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com> Message-ID: <499D4345.1010007@mellanox.co.il> Roland Dreier wrote: > > Before calling ib_reg_phys_mr, printk() shows that all its arguments > > are valid. But the system always crashes immediately after entering > > the function ib_reg_phys_mr( ). Any possible reasons ? Thanks!! > > What do you mean by "immediately after entering ib_reg_phys_mr()"? Do > you get an oops message? If so that would be very important info for > debugging this. > > Also HCA used and other system info can help us Tziporet From Zhen.Liang at Sun.COM Thu Feb 19 03:39:13 2009 From: Zhen.Liang at Sun.COM (Liang Zhen) Date: Thu, 19 Feb 2009 19:39:13 +0800 Subject: [ofa-general] ib_reg_phys_mr( ) results in crash In-Reply-To: References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com> Message-ID: <499D44E1.3010809@sun.com> Roland Dreier : > > Before calling ib_reg_phys_mr, printk() shows that all its arguments > > are valid. But the system always crashes immediately after entering > > the function ib_reg_phys_mr( ). Any possible reasons ? Thanks!! > > What do you mean by "immediately after entering ib_reg_phys_mr()"? Do > you get an oops message? If so that would be very important info for > debugging this. > Also, what kind of address did you pass into ib_reg_phys_mr? 
a little context of your calling is helpful Regards Liang From Line.Holen at Sun.COM Thu Feb 19 03:42:00 2009 From: Line.Holen at Sun.COM (Line.Holen at Sun.COM) Date: Thu, 19 Feb 2009 12:42:00 +0100 Subject: [ofa-general] opensm logoutput In-Reply-To: References: Message-ID: <499D4588.1030702@Sun.COM> Hi Bert, most of these messages indicates that you do have unstable links in your system. But there is one message that can indicate that you've hit a newly discovered SM bug: __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies) po If you do have NEM switches in your system, then you are exposed to this bug. I hit it quite easily. Yevgeny Kliteynik posted a patch for this bug just a few minutes after you sent your email. (If you are interested look for the email thread "create physp for the newly discovered port of the known node"). Line On 02/17/09 01:23 PM, Wiegers, Bert wrote: > Hi, > > we are using the ofed 1.4 /w OpenSM 3.2.5_20081207 with a Switch from > SUN. > As we are debugging our System I'm trying to understand the > opensm.log's. > (Where can I find any documentation to that?) 
> > > We see frequent messages as follows: > > Feb 17 10:25:34 134964 [41802940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128 > (Link state change) Producer:2 (Switch) from LID:111 > TID:0x000000000000006e > Feb 17 10:25:34 169578 [41802940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:1 num:128 (Link state change) from LID:111 > GID:fe80::14:4fa4:cff8:50 > Feb 17 10:25:39 088014 [43806940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:65 (GID out of service) from LID:336 > GID:fe80::3:ba00:100:3341 > Feb 17 10:25:39 088030 [43806940] 0x02 -> __osm_drop_mgr_remove_port: > Removed port with GUID:0x00144fa4cff8000d LID range [1047, 1047] of > node:MT25408 ConnectX Mellanox Technologies > Feb 17 10:25:39 614565 [43806940] 0x02 -> osm_ucast_mgr_process: minhop > tables configured on all switches > Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP > Feb 17 10:25:46 662611 [41802940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128 > (Link state change) Producer:2 (Switch) from LID:111 > TID:0x000000000000006f > Feb 17 10:25:46 662703 [41802940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:1 num:128 (Link state change) from LID:111 > GID:fe80::14:4fa4:cff8:50 > Feb 17 10:25:48 097096 [43806940] 0x02 -> osm_ucast_mgr_process: minhop > tables configured on all switches > Feb 17 10:25:52 476653 [44007940] 0x01 -> > __osm_sm_mad_ctrl_rcv_callback: ERR 3111: Error status = 0x1C00 > Feb 17 10:25:52 476729 [44007940] 0x01 -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x81 > (SubnGetResp) > D bit...................0x1 > status..................0x1C00 > hop_ptr.................0x0 > hop_count...............0x4 > trans_id................0x18c08de > attr_id.................0x15 (PortInfo) > resv....................0x0 > attr_mod................0x6 > > 
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1,10,15,23
> Return path: 0,23,20,12,17
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00 11 03 03 02
>
> 34 52 00 23 40 40 00 08 08 04 F0 4C 00 00 00 00
>
> 00 00 00 00 00 88 00 00 00 00 00 00 00 00 00 00
>
>
> Other issues I see with messages similar to the following ones:
>
> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
> node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)
> po
>
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
>
> osm_vendor_send: ERR 5430: Send p_madw = 0x116d320 of size 256 failed -5
> (Invalid argument)
>
>
> I'm still googleing, but hopefully someone can give me some answers.
>
>
>
> Thanks and best regards
> Bert
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From subbukl at gmail.com Thu Feb 19 04:55:55 2009
From: subbukl at gmail.com (subbu kl)
Date: Thu, 19 Feb 2009 18:25:55 +0530
Subject: [ofa-general] ***SPAM*** INT-X fallback in mthca driver
Message-ID: 

I am trying PCI passthrough of Mellanox Infinihost III Lx chip based Infiniband and TG3 ethernet PCIe cards on a Centos 5.2 fully virtualized guest with Xen 3.3.0.

The ib_mthca driver fails with "QUERY_FW failed", and the probe fails with error -11.

But interestingly, the tg3 driver says "Could not get MSI interrupts falling back to INTx" and works fine.

So:
1) Why could Xen not get the MSI interrupts working?
2) Should we have an INTx fallback method for the Mellanox driver too, if it's needed?

~subbu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From kliteyn at dev.mellanox.co.il Thu Feb 19 05:28:06 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 19 Feb 2009 15:28:06 +0200 Subject: [ofa-general] opensm logoutput In-Reply-To: <499D4588.1030702@Sun.COM> References: <499D4588.1030702@Sun.COM> Message-ID: <499D5E66.3010600@dev.mellanox.co.il> Bert, Line.Holen at Sun.COM wrote: > Hi Bert, > > most of these messages indicates that you do have unstable links in your > system. > But there is one message that can indicate that you've hit a newly > discovered SM bug: > > __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for > node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies) This message is probably also related to the unstable links (or nodes). Some port didn't answer a query from the SM (see below), so SM warns that there is a port that is physically not down, but the other side of the link couldn't be probed. > If you do have NEM switches in your system, then you are exposed to this > bug. > I hit it quite easily. > > Yevgeny Kliteynik posted a patch for this bug just a few minutes after > you sent > your email. (If you are interested look for the email thread "create > physp for the > newly discovered port of the known node"). Of course, using the patch wouldn't hurt :) > Line > > On 02/17/09 01:23 PM, Wiegers, Bert wrote: >> Hi, >> >> we are using the ofed 1.4 /w OpenSM 3.2.5_20081207 with a Switch from >> SUN. >> As we are debugging our System I'm trying to understand the >> opensm.log's. >> (Where can I find any documentation to that?) 
>>
>>
>> We see frequent messages as follows:
>>
>> Feb 17 10:25:34 134964 [41802940] 0x01 ->
>> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
>> (Link state change) Producer:2 (Switch) from LID:111
>> TID:0x000000000000006e
>> Feb 17 10:25:34 169578 [41802940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:1 num:128 (Link state change) from LID:111
>> GID:fe80::14:4fa4:cff8:50

Generic notice num. 128 (trap 128) is issued by the switch (LID 111) because it detected a port state change on one of its ports. This could be caused by an unstable link, or by something else. The SM logs that it got this trap from the switch.

>> Feb 17 10:25:39 088014 [43806940] 0x02 -> osm_report_notice: Reporting
>> Generic Notice type:3 num:65 (GID out of service) from LID:336
>> GID:fe80::3:ba00:100:3341

The SM can't find some port any more, so it informs the fabric that this GID is "out of service" by sending notice num. 65.

>> Feb 17 10:25:39 088030 [43806940] 0x02 -> __osm_drop_mgr_remove_port:
>> Removed port with GUID:0x00144fa4cff8000d LID range [1047, 1047] of
>> node:MT25408 ConnectX Mellanox Technologies

LID 1047 is no longer reachable and has been removed from the SM's DB.
>> Feb 17 10:25:39 614565 [43806940] 0x02 -> osm_ucast_mgr_process: minhop >> tables configured on all switches >> Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP >> Feb 17 10:25:46 662611 [41802940] 0x01 -> >> __osm_trap_rcv_process_request: Received Generic Notice type:1 num:128 >> (Link state change) Producer:2 (Switch) from LID:111 >> TID:0x000000000000006f >> Feb 17 10:25:46 662703 [41802940] 0x02 -> osm_report_notice: Reporting >> Generic Notice type:1 num:128 (Link state change) from LID:111 >> GID:fe80::14:4fa4:cff8:50 >> Feb 17 10:25:48 097096 [43806940] 0x02 -> osm_ucast_mgr_process: minhop >> tables configured on all switches >> Feb 17 10:25:52 476653 [44007940] 0x01 -> >> __osm_sm_mad_ctrl_rcv_callback: ERR 3111: Error status = 0x1C00 >> Feb 17 10:25:52 476729 [44007940] 0x01 -> SMP dump: >> base_ver................0x1 >> mgmt_class..............0x81 >> class_ver...............0x1 >> method..................0x81 >> (SubnGetResp) >> D bit...................0x1 >> status..................0x1C00 >> hop_ptr.................0x0 >> hop_count...............0x4 >> trans_id................0x18c08de >> attr_id.................0x15 (PortInfo) >> resv....................0x0 >> attr_mod................0x6 >> >> m_key...................0x0000000000000000 >> dr_slid.................65535 >> dr_dlid.................65535 >> >> Initial path: 0,1,10,15,23 >> Return path: 0,23,20,12,17 >> Reserved: [0][0][0][0][0][0][0] >> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 >> 00 00 00 >> >> 00 00 00 00 00 00 00 00 00 00 00 00 11 >> 03 03 02 >> >> 34 52 00 23 40 40 00 08 08 04 F0 4C 00 >> 00 00 00 >> >> 00 00 00 00 00 88 00 00 00 00 00 00 00 >> 00 00 00 >> >> >> >> >> Other issues I see with messages similar to the following ones: >> >> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for >> node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies) >> po >> >> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error >> (IB_TIMEOUT) The above 
two messages are related. The IB_TIMEOUT says that some MAD was sent, but no response was received. This, in turn, would cause the "unknown remote side" message.

Bottom line: there might be unstable ports/links in the fabric. Check all the links that were reported by the SM as having an unknown remote side.

-- Yevgeny

>> osm_vendor_send: ERR 5430: Send p_madw = 0x116d320 of size 256 failed -5
>> (Invalid argument)
>>
>> I'm still googleing, but hopefully someone can give me some answers.
>>
>> Thanks and best regards
>> Bert

From hnrose at comcast.net Thu Feb 19 05:06:53 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 19 Feb 2009 08:06:53 -0500
Subject: [ofa-general] [PATCH] opensm/console: Enhance perfmgr print_counters for better nodenames
Message-ID: <20090219130653.GA29318@comcast.net>

nodenames can have spaces in them
Also, no need for next_token being inlined

Signed-off-by: Hal Rosenstock
---
diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 00e2a94..9cad594 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -73,11 +73,16 @@ on: 0, delay_s: 2, loop_function:NULL};

 static const struct command console_cmds[];

-static inline char *next_token(char **p_last)
+static char *next_token(char **p_last)
 {
 	return strtok_r(NULL, " \t\n\r", p_last);
 }

+static char *name_token(char **p_last)
+{
+	return strtok_r(NULL, "\t\n\r", p_last);
+}
+
 static void help_command(FILE * out, int detail)
 {
 	int i;
@@ -1152,7 +1157,7 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 					PERFMGR_EVENT_DB_DUMP_HR);
 		}
 	} else if (strcmp(p_cmd, "print_counters") == 0) {
-		p_cmd = next_token(p_last);
+		p_cmd = name_token(p_last);
 		if (p_cmd) {
 			osm_perfmgr_print_counters(&p_osm->perfmgr, p_cmd,
 						   out);
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index 3babe3a..8766f93 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -1304,9 +1304,9 @@ void osm_perfmgr_print_counters(osm_perfmgr_t *pm, char *nodename, FILE *fp)
 {
 	uint64_t guid = strtoull(nodename, NULL, 0);

-	if (guid == 0 && errno == EINVAL)
+	if (guid == 0 && errno) // name
 		perfmgr_db_print_by_name(pm->db, nodename, fp);
-	else
+	else // guid
 		perfmgr_db_print_by_guid(pm->db, guid, fp);
 }


From ogerlitz at Voltaire.com Thu Feb 19 06:52:41 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Thu, 19 Feb 2009 16:52:41 +0200
Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on 2.6.26 under Debian Lenny
In-Reply-To: <499C4FD1.7040200@inqbus.de>
References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com> <499C4FD1.7040200@inqbus.de>
Message-ID: <499D7239.5060502@Voltaire.com>

Dr. Volker Jaenisch wrote:
> Or Gerlitz wrote:
>>> Hello Ofa-List! Compiling the ofa-kernel modules from OFED-1.4 on
>>> Debian Lenny Kernel 2.6.26 (on amd64) gives me the following trace:
>> Second, what makes you want to replace the IB stack that comes with
>> Debian and not update the distro?
> I never said nothing about replacing. But before I can bring in some improvement
> to the Debian IB stack firstly I like to have a running IB Stack on Debian at all.

Hi,

The Linux kernel Infiniband maintainer Roland Dreier made the following comment @ http://lists.openfabrics.org/pipermail/general/2008-July/052824.html

> I use Debian for pretty much all my development. However I haven't
> tried to use OFED -- rather, I have just gotten all the support that I
> use into the main Debian archive. I'm not sure how much is in Etch but
> Lenny should be pretty good: there are libibverbs, librdmacm, libmthca,
> libmlx4, libcxgb3, and libipathverbs packages in the main archive, along
> with Open MPI 1.2.6 built with IB support. And the 2.6.25 kernel in the
> archive should have all the kernel drivers you need.

So IB comes with Debian out of the box, and if it's broken, please report it and I'm sure Roland will act to fix things.

> There are several parts of the OFED (for instance opensm and other user
> space tools) that are not avaible in debian, yet. The idea is to bring a more
> consistent Infiniband support to Debian. But this is not my project, so I do
> not like to discuss over the head of
> someone other. Here the wishlist entry for OFED Debian support issued by Guy Coates.
> http://groups.google.com/group/linux.debian.bugs.dist/browse_thread/thread/b42e830ce29c641a

The OFED packages by no means provide "more consistent Infiniband support to Debian". If some packages are missing, I would recommend acting to add them in native Debian form, as Roland suggested in his postings on the other thread, rather than trying to "port ofed to debian"; that is useless and in the end will get you nothing.

If you need help pushing the management libraries (libibmad, libibumad, opensm, diags) into Debian, there are people on this list who might be able to help you with that.

Or.
From gmpc at sanger.ac.uk Thu Feb 19 07:31:58 2009 From: gmpc at sanger.ac.uk (Guy Coates) Date: Thu, 19 Feb 2009 15:31:58 +0000 Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on 2.6.26 under Debian Lenny In-Reply-To: <499D7239.5060502@Voltaire.com> References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com> <499C4FD1.7040200@inqbus.de> <499D7239.5060502@Voltaire.com> Message-ID: <499D7B6E.3050206@sanger.ac.uk> > The OFED packages by no means provide "more consistent Infiniband support to Debian". >If some packages are missing, I would recommend to act and add them in the native Debian forms >as Roland suggested in his postings over the other thread and not to go and "port ofed to debian" >- its useless and end-in-mind will get you nothing. > If you need help with the management libraries (libibmad, libibumad, opensm, diags) push info Debian, > there are people on this list who might be able to help you with that. Hi all, A bit of historical background; I started packaging the missing bits of OFED 1.3 for debian etch for my own private use, as I needed some bits that were not present. (openSM, srp-tools, and a set of OFED 1.3 kernel modules+headers that I could build lustre against, which was the ultimate aim of the exercise). Now that OFED 1.4 + lenny has been released, my aim is to build on that work and push the remaining unpackaged bits of OFED 1.4 bits upstream. I am not repackaging any of the bits that Roland has already done. (in fact, my packages depend on them). (for the record, the unpackaged bits of OFED which I now have packages for are below) dapl_2.0.15-1 ibutils_1.2-1 infiniband-diags_1.4.4-1 libibcm_1.0.4 libibcommon_1.1.2 libibmad_1.2.3-1 libibumad_1.2.3-1 libnes_0.5 libsdp_1.1.99 mstflint_1.4 ofa-kernel_1.4 ofed_1.4 ofed-docs opensm_3.2.5 perftest_1.2 qlvnictools_0.0.1 rds-tools-1.4 sdpnetstat_1.60 srptools_0.0.4 tvflash_0.9.0 Cheers, Guy -- Dr. 
Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From brian at sun.com Thu Feb 19 07:41:00 2009 From: brian at sun.com (Brian J. Murrell) Date: Thu, 19 Feb 2009 10:41:00 -0500 Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on 2.6.26 under Debian Lenny In-Reply-To: <499D7B6E.3050206@sanger.ac.uk> References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com> <499C4FD1.7040200@inqbus.de> <499D7239.5060502@Voltaire.com> <499D7B6E.3050206@sanger.ac.uk> Message-ID: <1235058060.28114.307.camel@pc.interlinx.bc.ca> On Thu, 2009-02-19 at 15:31 +0000, Guy Coates wrote: > Hi Guy, > I started packaging the missing bits of OFED 1.3 for debian etch for my own > private use, as I needed some bits that were not present. (openSM, srp-tools, > and a set of OFED 1.3 kernel modules+headers that I could build lustre against, /me waves. > Now that OFED 1.4 + lenny has been released, my aim is to build on that work and > push the remaining unpackaged bits of OFED 1.4 bits upstream. Out of interest, what kernel and OFED backports is that using? b. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From eli at dev.mellanox.co.il Thu Feb 19 08:55:05 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 19 Feb 2009 18:55:05 +0200 Subject: [ofa-general] iscsi initiator ipoib+lro crash on upstream kernel Message-ID: <20090219165505.GA13617@mtls03> Hi, I have encountered a kernel crash when running an iSCSI initiator on IPoIB configured with LRO (if LRO is off, it does not happen). This was seen first on Sles10sp2, but I then verified it happens on 2.6.28.2 too. Below is a dump of the crash info from 2.6.28.2: sd 2:0:0:1: Attached scsi generic sg3 type 0 BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 IP: [] skb_seq_read+0xfb/0x1a1 PGD 227115067 PUD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/platform/host2/session2/target2:0:0/2:0:0:1/type CPU 2 Modules linked in: ib_uverbs ib_umad mlx4_ib nfs lockd nfs_acl mlx4_core sunrpc ib_mthca ib_ipoib ib_cm ib_sa ib_mad ib_core inet_lro ipv6 button battery a] Pid: 0, comm: swapper Not tainted 2.6.28.2-debug #3 RIP: 0010:[] [] skb_seq_read+0xfb/0x1a1 RSP: 0018:ffff88022f0e3b00 EFLAGS: 00010246 RAX: ffff88022dd44f38 RBX: ffff88022f0e3b30 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88022f0e3b88 RDI: 00000000000007d4 RBP: 00000000000007d4 R08: ffff880220476d30 R09: 000000000000085c R10: 00000000000b0038 R11: ffffffffa0126115 R12: ffff88022f0e3b88 R13: ffff88022d974d38 R14: 00000000000007d4 R15: 00000000000007d4 FS: 0000000000000000(0000) GS:ffff88022f07bb50(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000004 CR3: 00000002271c2000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffff88022f0da000, task ffff88022f0a4050) Stack: ffff88022d974fa0 ffff88022f0e3b30 00000000000007d4
ffffffffa01261fe ffff88022d974f80 ffff880220418068 0000085c00000000 000007d400000000 ffff880220476d30 ffff88022dd44f38 0000000000000000 ffff88022dd44e58 Call Trace: <0> [] ? iscsi_tcp_recv+0x64/0x39b [iscsi_tcp] [] ? ip_queue_xmit+0x2aa/0x2fd [] ? tcp_read_sock+0x97/0x212 [] ? iscsi_tcp_recv+0x0/0x39b [iscsi_tcp] [] ? iscsi_tcp_data_ready+0x48/0x85 [iscsi_tcp] [] ? tcp_rcv_established+0x4c0/0x567 [] ? tcp_v4_do_rcv+0x2c/0x1c8 [] ? tcp_v4_rcv+0x630/0x683 [] ? skb_release_head_state+0x60/0x8f [] ? ip_local_deliver_finish+0xda/0x197 [] ? ip_rcv_finish+0x32f/0x349 [] ? lro_flush+0x159/0x17e [inet_lro] [] ? __lro_proc_skb+0x1ca/0x1ed [inet_lro] [] ? swiotlb_map_single_phys+0x0/0x12 [] ? lro_receive_skb+0x18/0x3e [inet_lro] [] ? ipoib_ib_handle_rx_wc+0x1ed/0x22b [ib_ipoib] [] ? ipoib_poll+0x9c/0x173 [ib_ipoib] [] ? net_rx_action+0x9d/0x175 [] ? __do_softirq+0x7a/0x13d [] ? call_softirq+0x1c/0x28 [] ? do_softirq+0x2c/0x68 [] ? do_IRQ+0xc2/0xdf [] ? ret_from_intr+0x0/0xa <0> [] ? mwait_idle+0x41/0x44 [] ? cpu_idle+0x40/0x5e Code: ff 88 48 e0 ff ff 48 c7 43 20 00 00 00 00 ff 43 08 8b 46 0c 01 43 0c 48 8b 43 18 8b 4b 08 8b 90 b4 00 00 00 48 03 90 b8 00 00 00 <0f> b7 42 04 39 c1 RIP [] skb_seq_read+0xfb/0x1a1 RSP CR2: 0000000000000004 Kernel panic - not syncing: Fatal exception in interrupt When I looked at this on sles10, I was able to verify that the problem is that st->cur_skb->next equals 0xffffffff (see below for where this comes from): if (st->cur_skb->next) { st->cur_skb = st->cur_skb->next; <<<=== this is where I see the problem st->frag_idx = 0; goto next_skb; } else if (st->root_skb == st->cur_skb && From brian at sun.com Thu Feb 19 09:42:54 2009 From: brian at sun.com (Brian J.
Murrell) Date: Thu, 19 Feb 2009 12:42:54 -0500 Subject: [ofa-general] IB function calls in kernel module fail In-Reply-To: <499C1DDA.3060601@mellanox.co.il> References: <7d5928b30902151440q4015ea1as76167b50c597c393@mail.gmail.com> <49994BB2.3010206@mellanox.co.il> <7d5928b30902160732t2bc1b36dud5282205786b13e6@mail.gmail.com> <499A8A20.1090507@mellanox.co.il> <1234893143.21802.96.camel@pc.interlinx.bc.ca> <499C1DDA.3060601@mellanox.co.il> Message-ID: <1235065374.28114.463.camel@pc.interlinx.bc.ca> On Wed, 2009-02-18 at 16:40 +0200, Tziporet Koren wrote: > Brian J. Murrell wrote: > > Ahhh. But should he just include /src/openib/include/ or > > also > > /src/openib/kernel_addons/backport//include/ > > (as described in /src/openib/ofed_patch.mk) as well? > > > > And in what order should these be specified? > > > > > You need both > Order not important Are you sure about this? I have been unsure about this ordering in the past too, but I have been seeing evidence that order is important. Take, for example, the current ~vlad/ofed_kernel-1.4 tree: there is an exportfs.h in both /src/openib/include/linux and /src/openib/kernel_addons/backport/2.6.16_sles10_sp2/include/linux. Having discussed the presence of this (newish) header in the SLES10 SP2 backports tree with Jeff, it's clear that it should be used in preference to the one in the general include tree. Therefore, if one is not careful to order the include paths so that the backport headers take precedence over the general ones, one would get the wrong exportfs.h header for SLES10 SP2 builds. But then there is the question of the kernel headers and ordering. Many of the backport headers use #include_next to get the next instance of a header found on the search path. I have always assumed that was to get the kernel's version of a header included in the backport header. But if the order is: 1. backports headers 2. ofa general headers 3.
kernel headers, then an "#include_next " in /src/openib/kernel_addons/backport//include/linux/foo.h could potentially pick up /src/openib/include/linux/foo.h rather than /include/linux/foo.h (the kernel's copy), which I think is what is intended in most cases. But if the ordering is changed to: 1. backports headers 2. kernel headers 3. ofa general headers then the desired preference for /src/openib/include/{rdma,scsi}/* headers over the ones included in the kernel will be lost. How can we reconcile this? b. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From hnrose at comcast.net Thu Feb 19 09:44:13 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Thu, 19 Feb 2009 12:44:13 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] ibsim/umad2sim.c: Eliminate unneeded umad2sim_dev num Message-ID: <20090219174413.GA29805@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c index e13e30a..aaa6260 100644 --- a/umad2sim/umad2sim.c +++ b/umad2sim/umad2sim.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This file is part of ibsim.
* @@ -77,7 +78,6 @@ struct ib_user_mad_reg_req { struct umad2sim_dev { int fd; - unsigned num; char name[32]; uint8_t port; struct sim_client sim_client; @@ -351,15 +351,13 @@ static int dev_sysfs_create(struct umad2sim_dev *dev) *str = '\0'; /* /sys/class/infiniband_mad/umad0/ */ - snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir, - dev->num); + snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir, 0); make_path(path); file_printf(path, SYS_IB_MAD_DEV, "%s\n", dev->name); file_printf(path, SYS_IB_MAD_PORT, "%d\n", dev->port); /* /sys/class/infiniband_mad/issm0/ */ - snprintf(path, sizeof(path), "%s/issm%u", sysfs_infiniband_mad_dir, - dev->num); + snprintf(path, sizeof(path), "%s/issm%u", sysfs_infiniband_mad_dir, 0); make_path(path); file_printf(path, SYS_IB_MAD_DEV, "%s\n", dev->name); file_printf(path, SYS_IB_MAD_PORT, "%d\n", dev->port); @@ -546,7 +544,7 @@ static int umad2sim_ioctl(struct umad2sim_dev *dev, unsigned long request, return -1; } -static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) +static struct umad2sim_dev *umad2sim_dev_create(const char *name) { struct umad2sim_dev *dev; unsigned i; @@ -558,7 +556,6 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) return NULL; memset(dev, 0, sizeof(*dev)); - dev->num = num; strncpy(dev->name, name, sizeof(dev->name) - 1); if (sim_client_init(&dev->sim_client) < 0) @@ -574,9 +571,9 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) dev_sysfs_create(dev); snprintf(dev->umad_path, sizeof(dev->umad_path), "%s/%s%u", - umad_dev_dir, "umad", num); + umad_dev_dir, "umad", 0); snprintf(dev->issm_path, sizeof(dev->issm_path), "%s/%s%u", - umad_dev_dir, "issm", num); + umad_dev_dir, "issm", 0); return dev; @@ -646,7 +643,7 @@ static void umad2sim_init(void) DEBUG("umad2sim_init...\n"); snprintf(umad2sim_sysfs_prefix, sizeof(umad2sim_sysfs_prefix), "./sys-%d", getpid()); - devices[0] = 
umad2sim_dev_create(0, "ibsim0"); + devices[0] = umad2sim_dev_create("ibsim0"); if (!devices[0]) { ERROR("cannot init umad2sim. Exit.\n"); exit(-1); From gmpc at sanger.ac.uk Thu Feb 19 10:28:52 2009 From: gmpc at sanger.ac.uk (Guy Coates) Date: Thu, 19 Feb 2009 18:28:52 +0000 Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on 2.6.26 under Debian Lenny In-Reply-To: <1235058060.28114.307.camel@pc.interlinx.bc.ca> References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com> <499C4FD1.7040200@inqbus.de> <499D7239.5060502@Voltaire.com> <499D7B6E.3050206@sanger.ac.uk> <1235058060.28114.307.camel@pc.interlinx.bc.ca> Message-ID: <499DA4E4.5020904@sanger.ac.uk> Brian J. Murrell wrote: > On Thu, 2009-02-19 at 15:31 +0000, Guy Coates wrote: >> > > Hi Guy, > >> I started packaging the missing bits of OFED 1.3 for debian etch for my own >> private use, as I needed some bits that were not present. (openSM, srp-tools, >> and a set of OFED 1.3 kernel modules+headers that I could build lustre against, > > /me waves. Lenny comes with 2.6.26 and I've been using the 2.6.26 backport to build the OFED 1.4 kernel modules. The modules all build except for ipath_inf-mod, iser-mod and ehca. I think I have fixed the iser module build issue; I'll generate some patches. Following your comments on the lustre mailing list, I haven't attempted to build lustre against OFED 1.4. Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
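Brian Murrell's question above about include-path ordering (in the "IB function calls in kernel module fail" thread) hinges on how GCC resolves #include_next: according to the GCC preprocessor documentation, it continues the search in the directories that follow the one where the current header was found. The sketch below is a simplified C model of that lookup, not a real preprocessor. It ignores the current-directory step for quoted includes and the system directories, and the inc_dir type, the resolve_from() helper, and the header contents assigned to each tree are hypothetical stand-ins for the backport, OFA and kernel include trees Brian lists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one -I directory: its name and the headers it
 * contains (NULL-terminated list, at most three entries). */
struct inc_dir {
    const char *name;
    const char *headers[4];
};

/* Return 1 if directory d contains header hdr, else 0. */
static int dir_has(const struct inc_dir *d, const char *hdr)
{
    for (size_t i = 0; i < 4 && d->headers[i] != NULL; i++)
        if (strcmp(d->headers[i], hdr) == 0)
            return 1;
    return 0;
}

/* Search path[start..ndirs) in order and return the index of the first
 * directory that contains hdr, or -1 if none does. In this model:
 *   a plain include      starts at index 0;
 *   an include_next      starts at (index of the directory holding the
 *                        current file) + 1. */
static int resolve_from(const struct inc_dir *path, size_t ndirs,
                        const char *hdr, size_t start)
{
    assert(path != NULL && hdr != NULL);
    for (size_t i = start; i < ndirs; i++)
        if (dir_has(&path[i], hdr))
            return (int)i;
    return -1;
}
```

With the ordering backport, ofa, kernel, an include_next of linux/foo.h issued from the backport copy (index 0) resumes at index 1 and lands on the OFA tree's copy rather than the kernel's, which is the problem Brian describes. With backport, kernel, ofa, the include_next reaches the kernel copy, but a plain include of rdma/ib_verbs.h now also finds the kernel's copy before OFED's. The model only illustrates why a single linear search path cannot satisfy both preferences; it does not propose a fix.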
From hnrose at comcast.net Thu Feb 19 10:44:15 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Thu, 19 Feb 2009 13:44:15 -0500 Subject: [ofa-general] ***SPAM*** [PATCHv2] opensm/man/opensm.8.in: Indicate ROUTER_EXP obsoleted Message-ID: <20090219184415.GA29943@comcast.net> Pointed out by Rolf Signed-off-by: Hal Rosenstock --- diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in index 7690980..f9f30d6 100644 --- a/opensm/man/opensm.8.in +++ b/opensm/man/opensm.8.in @@ -569,8 +569,8 @@ opensm will return the path to the first available matching router. A configuration file with a single line where both prefix and GUID are wild-carded means that a path record query specifying any off-subnet DGID should return a path to the first available router. -This configuration yields the same behaviour formerly achieved by -compiling opensm with -DROUTER_EXP. +This configuration yields the same behavior formerly achieved by +compiling opensm with -DROUTER_EXP which has been obsoleted. .SH ROUTING .PP From neutronsharc at gmail.com Thu Feb 19 10:47:27 2009 From: neutronsharc at gmail.com (neutron) Date: Thu, 19 Feb 2009 13:47:27 -0500 Subject: ***SPAM*** Re: [ofa-general] ib_reg_phys_mr( ) results in crash In-Reply-To: References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com> Message-ID: <7d5928b30902191047o25c34462w4cc51d7b88b888c6@mail.gmail.com> I'm using Mellanox HCA 'mthca0' type: MT25208, kernel version: 2.6.18-53.1.14.el5, ofed 1.3.1. 
The failed function call is like: { ctx->send_buf = dma_alloc_coherent(ctx->ib_dev->dma_device, MAX_SIZE, &dma_addr, GFP_KERNEL); ctx->phy_buf[0].addr = dma_addr; ctx->phy_buf[0].size = MAX_SIZE; ctx->iovstart = (u64) ctx->send_buf; printk("pd=%p, phy_buf[0].addr=%p,size=%d, iovstart=%llx\n", ctx->pd, ctx->phy_buf[0].addr, ctx->phy_buf[0].size, ctx->iovstart ); send_mr = ib_reg_phys_mr( ctx->pd, &ctx->phy_buf[0], 1, IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_LOCAL_WRITE, &(ctx->iovstart)); } The phy_buf[0] is a "ib_phys_buf" corresponding to "ctx->send_buf". Below is /var/log/messages output around the crash. ---------------- Feb 19 12:50:22 wci30 kernel: pd=ffff8101da3ddce0, phy_buf[0].addr=00000001bbe4b000,size=1024, iovstart=ffff8101bbe4b000 Feb 19 12:50:22 wci30 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: Feb 19 12:50:22 wci30 kernel: [<0000000000000000>] _stext+0x7ffff000/0x1000 Feb 19 12:50:22 wci30 kernel: PGD 1c06d5067 PUD 1c9dcd067 PMD 0 Feb 19 12:50:22 wci30 kernel: Oops: 0010 [1] SMP Feb 19 12:50:22 wci30 kernel: last sysfs file: /module/libata/version Feb 19 12:50:22 wci30 kernel: CPU 0 Feb 19 12:54:05 wci30 syslogd 1.4.1: restart. Feb 19 12:54:05 wci30 kernel: klogd 1.4.1, log source = /proc/kmsg started. Feb 19 12:54:05 wci30 kernel: Linux version 2.6.18-53.1.14.el5 (brewbuilder at hs20-bc2-3.build.redha t.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Tue Feb 19 07:18:46 EST 2008 Feb 19 12:54:05 wci30 kernel: Command line: ro root=LABEL=/ rhgb quiet ==================== It's strange that the kernel doesn't print out the function call stack before crashing. Any hints? Thanks a lot! On Wed, Feb 18, 2009 at 7:40 PM, Roland Dreier wrote: > > Before calling ib_reg_phys_mr, printk() shows that all its arguments > > are valid. But the system always crashes immediately after entering > > the function ib_reg_phys_mr( ). Any possible reasons ? Thanks!! 
> > What do you mean by "immediately after entering ib_reg_phys_mr()"? Do > you get an oops message? If so that would be very important info for > debugging this. > > - R. > From brian at sun.com Thu Feb 19 11:00:45 2009 From: brian at sun.com (Brian J. Murrell) Date: Thu, 19 Feb 2009 14:00:45 -0500 Subject: [ofa-general] OFED-1.4: ofa-kernel modules do not compile on 2.6.26 under Debian Lenny In-Reply-To: <499DA4E4.5020904@sanger.ac.uk> References: <499BE728.8080002@inqbus.de> <499C0EAD.7040604@voltaire.com> <499C4FD1.7040200@inqbus.de> <499D7239.5060502@Voltaire.com> <499D7B6E.3050206@sanger.ac.uk> <1235058060.28114.307.camel@pc.interlinx.bc.ca> <499DA4E4.5020904@sanger.ac.uk> Message-ID: <1235070045.28114.472.camel@pc.interlinx.bc.ca> On Thu, 2009-02-19 at 18:28 +0000, Guy Coates wrote: > > Lenny comes with 2.6.26 and I've been using the 2.6.26 backport to build the > OFED 1.4 kernel modules. Without actually looking, I'd guess the backport for 2.6.26 is a lot smaller and less intrusive than it is for the SLES10 and RHEL5 (2.6.16 and 2.6.18 respectively) kernels. > Following your comments on the lustre mailing list, I haven't attempted to build > lustre against OFED 1.4. Chances are good that it will be a lot more successful than RHEL5 or SLES10 I'd guess, so don't let comments regarding those OSes hold you back. b. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From or.gerlitz at gmail.com Thu Feb 19 11:40:49 2009 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Thu, 19 Feb 2009 21:40:49 +0200 Subject: [ofa-general] ***SPAM*** Re: [ewg] iscsi initiator ipoib+lro crash on upstream kernel In-Reply-To: <20090219165505.GA13617@mtls03> References: <20090219165505.GA13617@mtls03> Message-ID: <15ddcffd0902191140p3a72c1b4p2bab0aa7f0aef87a@mail.gmail.com> On Thu, Feb 19, 2009 at 6:55 PM, Eli Cohen wrote: > I have encountered a kernel crash when running a iSCSI initiator on > IPoIB configured with LRO (if LRO is off it does not happen). This > was seen first on Sles10sp2 but then I verified it happens on 2.6.28.2 too. Eli, this is a known issue (http://bugzilla.kernel.org/show_bug.cgi?id=11804); a fix was submitted upstream and should be included in the next kernel. Or. From sean.hefty at intel.com Thu Feb 19 12:48:32 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 19 Feb 2009 12:48:32 -0800 Subject: [ofa-general] [PATCH 5/6 v2] [ib-diag] ibsendtrap: add support for WinOF In-Reply-To: <0BC5E717DDC24248A6A7515FFAC7225D@amr.corp.intel.com> References: <0BC5E717DDC24248A6A7515FFAC7225D@amr.corp.intel.com> Message-ID: Add typecasts and modify include path. Signed-off-by: Sean Hefty --- Update from v1: need casts from int to uint16. One of the include files in the winof tree disables certain build warnings for the caller's convenience...
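Sean's note says the WinOF build needs explicit casts from int to a 16-bit type before the LID is byte-swapped with cl_hton16(). The sketch below shows the same pattern in portable C, assuming POSIX htons()/ntohs() as a stand-in for complib's cl_hton16(). The lid_to_net16() helper and its range check are hypothetical illustrations, not part of the patch; the LID is held in an int to match the `int lid` locals in the libibmad diff below.

```c
#include <arpa/inet.h> /* htons, ntohs (POSIX) */
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: a LID held in an int is range-checked, then
 * explicitly narrowed to uint16_t before the host-to-network byte swap,
 * so the conversion is visible rather than implicit. */
static uint16_t lid_to_net16(int lid)
{
    assert(lid >= 0 && lid <= 0xFFFF); /* LIDs are 16-bit values */
    return htons((uint16_t)lid);
}
```

A call such as lid_to_net16(selfportid.lid) would keep the narrowing and its precondition in one place; the patch itself writes the cast inline, which is the smaller change.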
infiniband-diags/src/ibsendtrap.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index ba6aa8b..92b72f1 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -43,7 +43,7 @@ #include #include -#include +#include #include "ibdiag_common.h" @@ -73,8 +73,8 @@ static int send_144_node_desc_update(void) notice.generic_type = 0x80 | IB_NOTICE_TYPE_INFO; notice.g_or_v.generic.prod_type_lsb = cl_hton16(IB_NODE_TYPE_CA); notice.g_or_v.generic.trap_num = cl_hton16(144); - notice.issuer_lid = cl_hton16(selfportid.lid); - notice.data_details.ntc_144.lid = cl_hton16(selfportid.lid); + notice.issuer_lid = cl_hton16((uint16_t) selfportid.lid); + notice.data_details.ntc_144.lid = cl_hton16((uint16_t) selfportid.lid); notice.data_details.ntc_144.local_changes = TRAP_144_MASK_OTHER_LOCAL_CHANGES; notice.data_details.ntc_144.change_flgs = From weiny2 at llnl.gov Thu Feb 19 19:05:20 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 19 Feb 2009 19:05:20 -0800 Subject: [ofa-general] [PATCH 0/10 libibmad/infiniband-diags -- converting to "new" interface. Message-ID: <20090219190520.c18280e1.weiny2@llnl.gov> Here is v2 of the patch series. I used __attribute__ ((deprecated)) on the old functions, which should help others realize that these functions will go away. (It sure helped me convert all the diags.) Also, I did _not_ convert ibnetdiscover, as my new libibnetdisc already uses the new interface and I am hoping it will be accepted soon. The final patch converts perfquery, saquery, sminfo, smpquery, and vendstat because they were all simple to convert and the patch series was getting ridiculous.
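Ira's technique of tagging the old entry points with __attribute__ ((deprecated)) can be shown in isolation. The names below (struct demo_port, legacy_portid(), demo_rpc_portid()) are hypothetical; they only mirror the shape of madrpc_portid(void) versus the new mad_rpc_portid(struct ibmad_port *srcport) and are not libibmad code. The attribute is a GCC extension that Clang also accepts: the declaration and definition compile silently, and only call sites trigger a -Wdeprecated-declarations warning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-port handle. */
struct demo_port {
    int port_id;
};

/* Old style: the port id lives in hidden global state. */
static int demo_global_port_id = -1;

/* Deprecated prototype, in the same style as the mad.h hunk: callers of
 * legacy_portid() get a compile-time warning, the definition does not. */
int legacy_portid(void) __attribute__ ((deprecated));

int legacy_portid(void)
{
    return demo_global_port_id;
}

/* New style: the caller passes the port handle explicitly. */
static int demo_rpc_portid(const struct demo_port *srcport)
{
    assert(srcport != NULL);
    return srcport->port_id;
}
```

A client that still calls legacy_portid() builds as before but gets a warning pointing at each call, which makes the remaining conversions easy to find; the replacement takes the port handle as an argument instead of relying on global state.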
Thanks, Ira -- Ira Weiny Math Programer/Computer Scientist Larence Livermore National Lab weiny2 at llnl.gov From weiny2 at llnl.gov Thu Feb 19 19:05:25 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 19 Feb 2009 19:05:25 -0800 Subject: [ofa-general] [PATCH 1/10] libibmad: Clean up "new" interface Message-ID: <20090219190525.322681b8.weiny2@llnl.gov> >From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Wed, 18 Feb 2009 16:37:36 -0800 Subject: [PATCH] libibmad: Clean up "new" interface type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * Create new mad_rpc_portid(struct ibmad_port *srcport) function which mirrors madrpc_portid(void) Mark all "old" functions with __attribute__ ((deprecated)) Signed-off-by: Ira Weiny --- libibmad/include/infiniband/mad.h | 139 ++++++++++++++++++++++--------------- libibmad/src/gs.c | 19 +++--- libibmad/src/libibmad.map | 1 + libibmad/src/resolve.c | 10 ++- libibmad/src/rpc.c | 29 ++++---- libibmad/src/sa.c | 4 +- libibmad/src/smp.c | 4 +- 7 files changed, 118 insertions(+), 88 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 1aaaa1b..80e38be 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt) } /* rpc.c */ -MAD_EXPORT int madrpc_portid(void); -MAD_EXPORT int madrpc_set_retries(int retries); -MAD_EXPORT int madrpc_set_timeout(int timeout); -void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata); -void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, - void *data); +MAD_EXPORT int madrpc_portid(void) __attribute__ ((deprecated)); +void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata) + __attribute__ ((deprecated)); +void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) + __attribute__ ((deprecated)); 
MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, - int num_classes); -void madrpc_save_mad(void *madbuf, int len); -MAD_EXPORT void madrpc_show_errors(int set); + int num_classes) __attribute__ ((deprecated)); +void madrpc_save_mad(void *madbuf, int len) __attribute__ ((deprecated)); -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, +/* New interface */ +MAD_EXPORT void madrpc_show_errors(int set); +MAD_EXPORT int madrpc_set_retries(int retries); +MAD_EXPORT int madrpc_set_timeout(int timeout); +MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, int num_classes); -void mad_rpc_close_port(void *ibmad_port); -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, - void *payload, void *rcvdata); -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, - ib_rmpp_hdr_t * rmpp, void *data); +MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport); +MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, + void *payload, void *rcvdata); +MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, + ib_rmpp_hdr_t * rmpp, void *data); +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); /* smp.c */ MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid, - unsigned mod, unsigned timeout); + unsigned mod, unsigned timeout) __attribute__ ((deprecated)); MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid, - unsigned mod, unsigned timeout); + unsigned mod, unsigned timeout) __attribute__ ((deprecated)); + +/* smp.c new interface */ MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, - unsigned mod, unsigned timeout, const void *srcport); -uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, - unsigned timeout, const void *srcport); + unsigned mod, 
unsigned timeout, const struct ibmad_port *srcport); +MAD_EXPORT uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, + unsigned timeout, const struct ibmad_port *srcport); /* sa.c */ uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, - unsigned timeout); -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, + unsigned timeout) __attribute__ ((deprecated)); +MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, + void *buf) __attribute__ ((deprecated)); + +/* sa.c new interface */ +MAD_EXPORT uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, unsigned timeout); -MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */ -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, +MAD_EXPORT int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); + /* returns lid */ /* resolve.c */ -MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout); +MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) + __attribute__ ((deprecated)); MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid, - ib_portid_t * sm_id, int timeout); + ib_portid_t * sm_id, int timeout) + __attribute__ ((deprecated)); MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, - enum MAD_DEST dest, ib_portid_t * sm_id); + enum MAD_DEST dest, ib_portid_t * sm_id) + __attribute__ ((deprecated)); MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, - ibmad_gid_t * gid); - -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); -int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, - ib_portid_t * sm_id, int timeout, const void *srcport); -int ib_resolve_portid_str_via(ib_portid_t * portid, char 
*addr_str, + ibmad_gid_t * gid) + __attribute__ ((deprecated)); + +/* resolve.c new interface */ +MAD_EXPORT int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, + const struct ibmad_port *srcport); +MAD_EXPORT int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, + ib_portid_t * sm_id, int timeout, + const struct ibmad_port *srcport); +MAD_EXPORT int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, enum MAD_DEST dest, ib_portid_t * sm_id, - const void *srcport); -int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, - const void *srcport); + const struct ibmad_port *srcport); +MAD_EXPORT int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, + const struct ibmad_port *srcport); /* gs.c */ MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, - int port, unsigned timeout); + int port, unsigned timeout) + __attribute__ ((deprecated)); MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, - int port, unsigned timeout); + int port, unsigned timeout) + __attribute__ ((deprecated)); MAD_EXPORT uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, - unsigned timeout); + unsigned timeout) + __attribute__ ((deprecated)); MAD_EXPORT uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, - int port, unsigned timeout); + int port, unsigned timeout) + __attribute__ ((deprecated)); MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, - unsigned timeout); + unsigned timeout) + __attribute__ ((deprecated)); MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, - int port, unsigned timeout); + int port, unsigned timeout) + __attribute__ ((deprecated)); MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest, - int port, unsigned timeout); + int port, unsigned timeout) + __attribute__ ((deprecated)); 
-uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, +/* gs.c new interface */ +MAD_EXPORT uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport); -uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, - unsigned timeout, const void *srcport); -uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, + const struct ibmad_port *srcport); +MAD_EXPORT uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, + unsigned timeout, const struct ibmad_port *srcport); +MAD_EXPORT uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, unsigned timeout, - const void *srcport); -uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, + const struct ibmad_port *srcport); +MAD_EXPORT uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport); -uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, + const struct ibmad_port *srcport); +MAD_EXPORT uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, - unsigned timeout, const void *srcport); -uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, + unsigned timeout, + const struct ibmad_port *srcport); +MAD_EXPORT uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport); -uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, + const struct ibmad_port *srcport); +MAD_EXPORT uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport); + const struct ibmad_port *srcport); /* dump.c */ MAD_EXPORT ib_mad_dump_fn mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c index 
d2c4574..e302caf 100644 --- a/libibmad/src/gs.c +++ b/libibmad/src/gs.c @@ -47,7 +47,7 @@ static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, unsigned id, - const void *srcport) + const struct ibmad_port *srcport) { ib_rpc_t rpc = { 0 }; int lid = dest->lid; @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport) + const struct ibmad_port *srcport) { return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO, srcport); @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port, } uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, - unsigned timeout, const void *srcport) + unsigned timeout, const struct ibmad_port *srcport) { return pma_query_via(rcvbuf, dest, port, timeout, IB_GSI_PORT_COUNTERS, srcport); @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port, static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, unsigned timeout, - unsigned id, const void *srcport) + unsigned id, const struct ibmad_port *srcport) { ib_rpc_t rpc = { 0 }; int lid = dest->lid; @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port, uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, unsigned timeout, - const void *srcport) + const struct ibmad_port *srcport) { return performance_reset_via(rcvbuf, dest, port, mask, timeout, IB_GSI_PORT_COUNTERS, srcport); @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port, uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport) + const struct ibmad_port *srcport) { return pma_query_via(rcvbuf, dest, 
port, timeout, IB_GSI_PORT_COUNTERS_EXT, srcport); @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port, uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, - unsigned timeout, const void *srcport) + unsigned timeout, + const struct ibmad_port *srcport) { return performance_reset_via(rcvbuf, dest, port, mask, timeout, IB_GSI_PORT_COUNTERS_EXT, srcport); @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port, uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport) + const struct ibmad_port *srcport) { return pma_query_via(rcvbuf, dest, port, timeout, IB_GSI_PORT_SAMPLES_CONTROL, srcport); @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port, uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, - const void *srcport) + const struct ibmad_port *srcport) { return pma_query_via(rcvbuf, dest, port, timeout, IB_GSI_PORT_SAMPLES_RESULT, srcport); diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index f944d86..94d7762 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -69,6 +69,7 @@ IBMAD_1.3 { mad_rpc_close_port; mad_rpc; mad_rpc_rmpp; + mad_rpc_portid; madrpc; madrpc_def_timeout; madrpc_init; diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index 553949d..3291f43 100644 --- a/libibmad/src/resolve.c +++ b/libibmad/src/resolve.c @@ -45,7 +45,8 @@ #undef DEBUG #define DEBUG if (ibdebug) IBWARN -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport) +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, + const struct ibmad_port *srcport) { ib_portid_t self = { 0 }; uint8_t portinfo[64]; @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) } int ib_resolve_guid_via(ib_portid_t * 
portid, uint64_t * guid, - ib_portid_t * sm_id, int timeout, const void *srcport) + ib_portid_t * sm_id, int timeout, + const struct ibmad_port *srcport) { ib_portid_t sm_portid; char buf[IB_SA_DATA_SIZE] = { 0 }; @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, enum MAD_DEST dest_type, ib_portid_t * sm_id, - const void *srcport) + const struct ibmad_port *srcport) { uint64_t guid; int lid; @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, } int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, - const void *srcport) + const struct ibmad_port *srcport) { ib_portid_t self = { 0 }; uint8_t portinfo[64]; diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index e811526..d47873b 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -100,6 +100,11 @@ int madrpc_portid(void) return mad_portid; } +int mad_rpc_portid(struct ibmad_port *srcport) +{ + return (srcport->port_id); +} + static int _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, int timeout) @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, return -1; } -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata) { - const struct ibmad_port *p = port_id; int status, len; uint8_t sndbuf[1024], rcvbuf[1024], *mad; @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) return 0; - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, - p->class_agents[rpc->mgtclass], + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, + port->class_agents[rpc->mgtclass], len, rpc->timeout)) < 0) { IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); return 0; @@ -203,10 
+207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, return rcvdata; } -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) { - const struct ibmad_port *p = port_id; int status, len; uint8_t sndbuf[1024], rcvbuf[1024], *mad; @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) return 0; - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, - p->class_agents[rpc->mgtclass], + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, + port->class_agents[rpc->mgtclass], len, rpc->timeout)) < 0) { IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); return 0; @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) } } -void *mad_rpc_open_port(char *dev_name, int dev_port, +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) { struct ibmad_port *p; @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port, return p; } -void mad_rpc_close_port(void *port_id) +void mad_rpc_close_port(struct ibmad_port *port) { - struct ibmad_port *p = port_id; - - umad_close_port(p->port_id); - free(p); + umad_close_port(port->port_id); + free(port); } uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c index 7403d4f..ddeb152 100644 --- a/libibmad/src/sa.c +++ b/libibmad/src/sa.c @@ -44,7 +44,7 @@ #undef DEBUG #define DEBUG if (ibdebug) IBWARN -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, unsigned timeout) { ib_rpc_t rpc = { 0 }; @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void 
*ibmad_port, void *rcvbuf, ib_portid_t * portid, IB_PR_COMPMASK_SGID |\ IB_PR_COMPMASK_NUMBPATH) -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf) { int npath; diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c index fad263c..e5489b3 100644 --- a/libibmad/src/smp.c +++ b/libibmad/src/smp.c @@ -45,7 +45,7 @@ #define DEBUG if (ibdebug) IBWARN uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid, - unsigned mod, unsigned timeout, const void *srcport) + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) { ib_rpc_t rpc = { 0 }; @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid, } uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid, - unsigned mod, unsigned timeout, const void *srcport) + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) { ib_rpc_t rpc = { 0 }; -- 1.5.4.5 From weiny2 at llnl.gov Thu Feb 19 19:05:28 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 19 Feb 2009 19:05:28 -0800 Subject: [ofa-general] [PATCH 2/10] infiniband-diags: Convert ibaddr to "new" ibmad interface Message-ID: <20090219190528.11c080f8.weiny2@llnl.gov> >From 1ead0cdb05b159dbd3a89d2030870fc7326ec84d Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 14:47:05 -0800 Subject: [PATCH] infiniband-diags: Convert ibaddr to "new" ibmad interface Signed-off-by: Ira Weiny --- infiniband-diags/src/ibaddr.c | 17 ++++++++++++----- infiniband-diags/src/ibdiag_common.c | 3 ++- 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c index 9098699..bb22be9 100644 --- a/infiniband-diags/src/ibaddr.c +++ b/infiniband-diags/src/ibaddr.c @@ -45,6 +45,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + static int ib_resolve_addr(ib_portid_t *portid, int portnum, int 
show_lid, int show_gid) { @@ -55,10 +57,10 @@ ib_resolve_addr(ib_portid_t *portid, int portnum, int show_lid, int show_gid) ibmad_gid_t gid; int lmc; - if (!smp_query(nodeinfo, portid, IB_ATTR_NODE_INFO, 0, 0)) + if (!smp_query_via(nodeinfo, portid, IB_ATTR_NODE_INFO, 0, 0, srcport)) return -1; - if (!smp_query(portinfo, portid, IB_ATTR_PORT_INFO, portnum, 0)) + if (!smp_query_via(portinfo, portid, IB_ATTR_PORT_INFO, portnum, 0, srcport)) return -1; mad_decode_field(portinfo, IB_PORT_LID_F, &portid->lid); @@ -137,17 +139,22 @@ int main(int argc, char **argv) if (!show_lid && !show_gid) show_lid = show_gid = 1; - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); if (argc) { - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[0]); } else { - if (ib_resolve_self(&portid, &port, 0) < 0) + if (ib_resolve_self_via(&portid, &port, 0, srcport) < 0) IBERROR("can't resolve self port %s", argv[0]); } if (ib_resolve_addr(&portid, port, show_lid, show_gid) < 0) IBERROR("can't resolve requested address"); + + mad_rpc_close_port(srcport); exit(0); } diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index 5f2472d..609df69 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -179,7 +179,8 @@ static int process_opt(int ch, char *optarg) ibd_timeout = val; break; case 's': - if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) + if (ib_resolve_portid_str_via(&sm_portid, optarg, IB_DEST_LID, + 0, NULL) < 0) IBERROR("cannot resolve SM destination port %s", optarg); ibd_sm_id = &sm_portid; break; -- 1.5.4.5 From weiny2 at llnl.gov Thu Feb 19 19:05:36 2009 From: weiny2 at llnl.gov 
(Ira Weiny) Date: Thu, 19 Feb 2009 19:05:36 -0800 Subject: [ofa-general] [PATCH 4/10] infiniband-diags: Convert ibportstate to "new" ibmad interface Message-ID: <20090219190536.f96edca7.weiny2@llnl.gov> >From 9ae029eec58963629f4713868f383c6dd651448d Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 17:27:21 -0800 Subject: [PATCH] infiniband-diags: Convert ibportstate to "new" ibmad interface Signed-off-by: Ira Weiny --- infiniband-diags/src/ibportstate.c | 18 ++++++++++++------ 1 files changed, 12 insertions(+), 6 deletions(-) diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c index c0b9b34..ca72bda 100644 --- a/infiniband-diags/src/ibportstate.c +++ b/infiniband-diags/src/ibportstate.c @@ -46,6 +46,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + /*******************************************/ static int @@ -53,7 +55,7 @@ get_node_info(ib_portid_t *dest, uint8_t *data) { int node_type; - if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport)) return -1; node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); @@ -69,7 +71,7 @@ get_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op) char buf[2048]; char val[64]; - if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) + if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) return -1; if (port_op != 4) { @@ -108,7 +110,7 @@ set_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op) char buf[2048]; char val[64]; - if (!smp_set(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) + if (!smp_set_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) return -1; if (port_op != 4) @@ -223,9 +225,12 @@ int main(int argc, char **argv) if (argc < 2) ibdiag_show_usage(); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, 
ibd_ca_port); - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[0]); /* First, make sure it is a switch port if it is a "set" */ @@ -314,7 +319,8 @@ int main(int argc, char **argv) peerportid.drpath.p[1] = (uint8_t) portnum; /* Set DrSLID to local lid */ - if (ib_resolve_self(&selfportid, &selfport, 0) < 0) + if (ib_resolve_self_via(&selfportid, + &selfport, 0, srcport) < 0) IBERROR("could not resolve self"); peerportid.drpath.drslid = (uint16_t) selfportid.lid; peerportid.drpath.drdlid = 0xffff; -- 1.5.4.5 From weiny2 at llnl.gov Thu Feb 19 19:05:32 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 19 Feb 2009 19:05:32 -0800 Subject: [ofa-general] [PATCH 3/10] infiniband-diags: convert ibping to "new" ibmad interface Message-ID: <20090219190532.faf400f5.weiny2@llnl.gov> >From 039b42d9df09598d146d47d5d2adc1a13d952999 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 16:57:55 -0800 Subject: [PATCH] infiniband-diags: convert ibping to "new" ibmad interface To do this I needed the following additional functions: mad_register_client_via mad_register_server_via mad_send_via mad_receive_via mad_respond_via ib_vendor_call_via And I marked their counterparts as deprecated and cleaned up the interface a bit more. 
Note I moved some of the "new" interface declarations higher in mad.h Signed-off-by: Ira Weiny --- infiniband-diags/src/ibping.c | 21 ++++++++---- libibmad/include/infiniband/mad.h | 66 +++++++++++++++++++++++++------------ libibmad/src/libibmad.map | 5 +++ libibmad/src/mad_internal.h | 44 ++++++++++++++++++++++++ libibmad/src/register.c | 58 ++++++++++++++++++++++++++------ libibmad/src/rpc.c | 8 +--- libibmad/src/serv.c | 39 ++++++++++++++++++++-- libibmad/src/vendor.c | 15 +++++++- 8 files changed, 206 insertions(+), 50 deletions(-) create mode 100644 libibmad/src/mad_internal.h diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c index 1994eba..901079f 100644 --- a/infiniband-diags/src/ibping.c +++ b/infiniband-diags/src/ibping.c @@ -48,6 +48,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + static char host_and_domain[IB_VENDOR_RANGE2_DATA_SIZE]; static char last_host[IB_VENDOR_RANGE2_DATA_SIZE]; @@ -82,7 +84,7 @@ ibping_serv(void) DEBUG("starting to serve..."); - while ((umad = mad_receive(0, -1))) { + while ((umad = mad_receive_via(0, -1, srcport))) { mad = umad_get_mad(umad); data = (char *)mad + IB_VENDOR_RANGE2_DATA_OFFS; @@ -91,7 +93,7 @@ ibping_serv(void) DEBUG("Pong: %s", data); - if (mad_respond(umad, 0, 0) < 0) + if (mad_respond_via(umad, 0, 0, srcport) < 0) DEBUG("respond failed"); mad_free(umad); @@ -120,7 +122,7 @@ ibping(ib_portid_t *portid, int quiet) call.timeout = 0; memset(&call.rmpp, 0, sizeof call.rmpp); - if (!ib_vendor_call(data, portid, &call)) + if (!ib_vendor_call_via(data, portid, &call, srcport)) return ~0ull; rtt = cl_get_time_stamp() - start; @@ -208,10 +210,12 @@ int main(int argc, char **argv) if (!argc && !server) ibdiag_show_usage(); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); if (server) { - if (mad_register_server(ping_class, 0, 0, 
oui) < 0) + if (mad_register_server_via(ping_class, 0, 0, oui, srcport) < 0) IBERROR("can't serve class %d on this port", ping_class); get_host_and_domain(host_and_domain, sizeof host_and_domain); @@ -221,10 +225,11 @@ int main(int argc, char **argv) exit(0); } - if (mad_register_client(ping_class, 0) < 0) + if (mad_register_client_via(ping_class, 0, srcport) < 0) IBERROR("can't register ping class %d on this port", ping_class); - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[0]); signal(SIGINT, report); @@ -252,5 +257,7 @@ int main(int argc, char **argv) report(0); + mad_rpc_close_port(srcport); + exit(-1); } diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 80e38be..5cf135e 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -691,27 +691,64 @@ MAD_EXPORT uint64_t mad_trid(void); MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data); +/* New interface */ +MAD_EXPORT void madrpc_show_errors(int set); +MAD_EXPORT int madrpc_set_retries(int retries); +MAD_EXPORT int madrpc_set_timeout(int timeout); +MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, + int num_classes); +MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport); +MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, + void *payload, void *rcvdata); +MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, + ib_rmpp_hdr_t * rmpp, void *data); +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); + /* register.c */ MAD_EXPORT int mad_register_port_client(int port_id, int mgmt, - uint8_t rmpp_version); -MAD_EXPORT int mad_register_client(int mgmt, uint8_t 
rmpp_version); + uint8_t rmpp_version) __attribute__ ((deprecated)); +MAD_EXPORT int mad_register_client(int mgmt, uint8_t rmpp_version) + __attribute__ ((deprecated)); MAD_EXPORT int mad_register_server(int mgmt, uint8_t rmpp_version, - long method_mask[16 / sizeof(long)], - uint32_t class_oui); + long method_mask[16 / sizeof(long)], + uint32_t class_oui) __attribute__ ((deprecated)); +/* register.c new interface */ +MAD_EXPORT int mad_register_client_via(int mgmt, uint8_t rmpp_version, + struct ibmad_port *srcport); +MAD_EXPORT int mad_register_server_via(int mgmt, uint8_t rmpp_version, + long method_mask[16 / sizeof(long)], + uint32_t class_oui, + struct ibmad_port *srcport); MAD_EXPORT int mad_class_agent(int mgmt); MAD_EXPORT int mad_agent_class(int agent); /* serv.c */ MAD_EXPORT int mad_send(ib_rpc_t * rpc, ib_portid_t * dport, - ib_rmpp_hdr_t * rmpp, void *data); -MAD_EXPORT void *mad_receive(void *umad, int timeout); -MAD_EXPORT int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus); + ib_rmpp_hdr_t * rmpp, void *data) __attribute__ ((deprecated)); +MAD_EXPORT void *mad_receive(void *umad, int timeout) + __attribute__ ((deprecated)); +MAD_EXPORT int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus) + __attribute__ ((deprecated)); + +/* serv.c new interface */ +MAD_EXPORT int mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport, + ib_rmpp_hdr_t * rmpp, void *data, + struct ibmad_port *srcport); +MAD_EXPORT void *mad_receive_via(void *umad, int timeout, + struct ibmad_port *srcport); +MAD_EXPORT int mad_respond_via(void *umad, ib_portid_t * portid, uint32_t rstatus, + struct ibmad_port *srcport); MAD_EXPORT void *mad_alloc(void); MAD_EXPORT void mad_free(void *umad); /* vendor.c */ MAD_EXPORT uint8_t *ib_vendor_call(void *data, ib_portid_t * portid, - ib_vendor_call_t * call); + ib_vendor_call_t * call) __attribute__ ((deprecated)); + +/* vendor.c new interface */ +MAD_EXPORT uint8_t *ib_vendor_call_via(void *data, ib_portid_t * 
portid, + ib_vendor_call_t * call, + struct ibmad_port *srcport); static inline int mad_is_vendor_range1(int mgmt) { @@ -733,19 +770,6 @@ MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) __attribute__ ((deprecated)); void madrpc_save_mad(void *madbuf, int len) __attribute__ ((deprecated)); -/* New interface */ -MAD_EXPORT void madrpc_show_errors(int set); -MAD_EXPORT int madrpc_set_retries(int retries); -MAD_EXPORT int madrpc_set_timeout(int timeout); -MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, - int num_classes); -MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport); -MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, - void *payload, void *rcvdata); -MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, - ib_rmpp_hdr_t * rmpp, void *data); -MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); - /* smp.c */ MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, unsigned timeout) __attribute__ ((deprecated)); diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 94d7762..bac74a9 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -60,6 +60,8 @@ IBMAD_1.3 { mad_class_agent; mad_register_client; mad_register_server; + mad_register_client_via; + mad_register_server_via; ib_resolve_guid; ib_resolve_portid_str; ib_resolve_self; @@ -86,10 +88,13 @@ IBMAD_1.3 { mad_free; mad_receive; mad_respond; + mad_receive_via; + mad_respond_via; mad_send; smp_query; smp_set; ib_vendor_call; + ib_vendor_call_via; smp_query_via; smp_set_via; ib_path_query_via; diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h new file mode 100644 index 0000000..9afe7a9 --- /dev/null +++ b/libibmad/src/mad_internal.h @@ -0,0 +1,44 @@ +/* + * Copyright (c) 2004-2006 Voltaire Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#ifndef _MAD_INTERNAL_H_ +#define _MAD_INTERNAL_H_ + +#define MAX_CLASS 256 + +struct ibmad_port { + int port_id; /* file descriptor returned by umad_open() */ + int class_agents[MAX_CLASS]; /* class2agent mapper */ +}; + +#endif /* _MAD_INTERNAL_H_ */ diff --git a/libibmad/src/register.c b/libibmad/src/register.c index 4d91ff8..4aabd7c 100644 --- a/libibmad/src/register.c +++ b/libibmad/src/register.c @@ -43,10 +43,11 @@ #include #include +#include "mad_internal.h" + #undef DEBUG #define DEBUG if (ibdebug) IBWARN -#define MAX_CLASS 256 #define MAX_AGENTS 256 static int class_agent[MAX_CLASS]; @@ -136,22 +137,57 @@ int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version) int mad_register_client(int mgmt, uint8_t rmpp_version) { + int rc = 0; + struct ibmad_port port; + + port.port_id = madrpc_portid(); + rc = mad_register_client_via(mgmt, rmpp_version, &port); + if (rc < 0) + return rc; + return register_agent(port.class_agents[mgmt], mgmt); +} + +int mad_register_client_via(int mgmt, uint8_t rmpp_version, + struct ibmad_port *srcport) +{ int agent; - agent = mad_register_port_client(madrpc_portid(), mgmt, rmpp_version); + if (!srcport) + return -1; + + agent = mad_register_port_client(mad_rpc_portid(srcport), mgmt, rmpp_version); if (agent < 0) return agent; - return register_agent(agent, mgmt); + srcport->class_agents[mgmt] = agent; + return 0; } int mad_register_server(int mgmt, uint8_t rmpp_version, long method_mask[], uint32_t class_oui) { + int rc = 0; + struct ibmad_port port; + + port.port_id = madrpc_portid(); + port.class_agents[mgmt] = class_agent[mgmt]; + rc = mad_register_server_via(mgmt, rmpp_version, + method_mask, class_oui, + &port); + if (rc < 0) + return rc; + return register_agent(port.class_agents[mgmt], mgmt); +} + +int +mad_register_server_via(int mgmt, uint8_t rmpp_version, + long method_mask[], uint32_t class_oui, + struct ibmad_port *srcport) +{ long class_method_mask[16 / sizeof(long)]; uint8_t oui[3]; - int 
agent, vers, mad_portid; + int agent, vers; if (method_mask) memcpy(class_method_mask, method_mask, @@ -159,11 +195,12 @@ mad_register_server(int mgmt, uint8_t rmpp_version, else memset(class_method_mask, 0xff, sizeof(class_method_mask)); - if ((mad_portid = madrpc_portid()) < 0) + if (!srcport) return -1; - if (class_agent[mgmt] >= 0) { - DEBUG("Class 0x%x already registered", mgmt); + if (srcport->class_agents[mgmt] >= 0) { + DEBUG("Class 0x%x already registered %d", + mgmt, srcport->class_agents[mgmt]); return -1; } if ((vers = mgmt_class_vers(mgmt)) <= 0) { @@ -175,19 +212,18 @@ mad_register_server(int mgmt, uint8_t rmpp_version, oui[0] = (class_oui >> 16) & 0xff; oui[1] = (class_oui >> 8) & 0xff; oui[2] = class_oui & 0xff; - if ((agent = umad_register_oui(mad_portid, mgmt, rmpp_version, + if ((agent = umad_register_oui(srcport->port_id, mgmt, rmpp_version, oui, class_method_mask)) < 0) { DEBUG("Can't register agent for class %d", mgmt); return -1; } - } else if ((agent = umad_register(mad_portid, mgmt, vers, rmpp_version, + } else if ((agent = umad_register(srcport->port_id, mgmt, vers, rmpp_version, class_method_mask)) < 0) { DEBUG("Can't register agent for class %d", mgmt); return -1; } - if (register_agent(agent, mgmt) < 0) - return -1; + srcport->class_agents[mgmt] = agent; return agent; } diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index d47873b..210f0c2 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -43,12 +43,7 @@ #include #include -#define MAX_CLASS 256 - -struct ibmad_port { - int port_id; /* file descriptor returned by umad_open() */ - int class_agents[MAX_CLASS]; /* class2agent mapper */ -}; +#include "mad_internal.h" int ibdebug; @@ -339,6 +334,7 @@ struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, return NULL; } + memset(p->class_agents, 0xff, sizeof p->class_agents); while (num_classes--) { uint8_t rmpp_version = 0; int mgmt = *mgmt_classes++; diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c index 
c7631bb..0ce1660 100644 --- a/libibmad/src/serv.c +++ b/libibmad/src/serv.c @@ -42,12 +42,25 @@ #include #include +#include "mad_internal.h" + #undef DEBUG #define DEBUG if (ibdebug) IBWARN int mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) { + struct ibmad_port port; + + port.port_id = madrpc_portid(); + port.class_agents[rpc->mgtclass] = mad_class_agent(rpc->mgtclass); + return mad_send_via(rpc, dport, rmpp, data, &port); +} + +int +mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data, + struct ibmad_port *srcport) +{ uint8_t pktbuf[1024]; void *umad = pktbuf; @@ -64,7 +77,7 @@ mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) (char *)umad_get_mad(umad) + rpc->dataoffs, rpc->datasz); } - if (umad_send(madrpc_portid(), mad_class_agent(rpc->mgtclass), + if (umad_send(srcport->port_id, srcport->class_agents[rpc->mgtclass], umad, IB_MAD_SIZE, rpc->timeout, 0) < 0) { IBWARN("send failed; %m"); return -1; @@ -75,6 +88,18 @@ mad_send(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus) { + int i = 0; + struct ibmad_port port; + + port.port_id = madrpc_portid(); + for (i = 1; i < MAX_CLASS; i++) + port.class_agents[i] = mad_class_agent(i); + return mad_respond_via(umad, portid, rstatus, &port); +} + +int mad_respond_via(void *umad, ib_portid_t * portid, uint32_t rstatus, + struct ibmad_port *srcport) +{ uint8_t *mad = umad_get_mad(umad); ib_mad_addr_t *mad_addr; ib_rpc_t rpc = { 0 }; @@ -138,7 +163,7 @@ int mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus) if (ibdebug > 1) xdump(stderr, "mad respond pkt\n", mad, IB_MAD_SIZE); - if (umad_send(madrpc_portid(), mad_class_agent(rpc.mgtclass), umad, + if (umad_send(srcport->port_id, srcport->class_agents[rpc.mgtclass], umad, IB_MAD_SIZE, rpc.timeout, 0) < 0) { DEBUG("send failed; %m"); return -1; @@ -149,11 +174,19 @@ int 
mad_respond(void *umad, ib_portid_t * portid, uint32_t rstatus) void *mad_receive(void *umad, int timeout) { + struct ibmad_port port; + + port.port_id = madrpc_portid(); + return mad_receive_via(umad, timeout, &port); +} + +void *mad_receive_via(void *umad, int timeout, struct ibmad_port *srcport) +{ void *mad = umad ? umad : umad_alloc(1, umad_size() + IB_MAD_SIZE); int agent; int length = IB_MAD_SIZE; - if ((agent = umad_recv(madrpc_portid(), mad, &length, timeout)) < 0) { + if ((agent = umad_recv(srcport->port_id, mad, &length, timeout)) < 0) { if (!umad) umad_free(mad); DEBUG("recv failed: %m"); diff --git a/libibmad/src/vendor.c b/libibmad/src/vendor.c index 50a878e..1a129e5 100644 --- a/libibmad/src/vendor.c +++ b/libibmad/src/vendor.c @@ -40,6 +40,7 @@ #include #include +#include "mad_internal.h" #undef DEBUG #define DEBUG if (ibdebug) IBWARN @@ -53,6 +54,16 @@ static inline int response_expected(int method) uint8_t *ib_vendor_call(void *data, ib_portid_t * portid, ib_vendor_call_t * call) { + struct ibmad_port port; + + port.port_id = madrpc_portid(); + return ib_vendor_call_via(data, portid, call, &port); +} + +uint8_t *ib_vendor_call_via(void *data, ib_portid_t * portid, + ib_vendor_call_t * call, + struct ibmad_port *srcport) +{ ib_rpc_t rpc = { 0 }; int range1 = 0, resp_expected; @@ -90,7 +101,7 @@ uint8_t *ib_vendor_call(void *data, ib_portid_t * portid, portid->qkey = IB_DEFAULT_QP1_QKEY; if (resp_expected) - return madrpc_rmpp(&rpc, portid, 0, data); /* FIXME: no RMPP for now */ + return mad_rpc_rmpp(srcport, &rpc, portid, 0, data); /* FIXME: no RMPP for now */ - return mad_send(&rpc, portid, 0, data) < 0 ? 0 : data; /* FIXME: no RMPP for now */ + return mad_send_via(&rpc, portid, 0, data, srcport) < 0 ? 
0 : data; /* FIXME: no RMPP for now */ } -- 1.5.4.5 From weiny2 at llnl.gov Thu Feb 19 19:05:41 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 19 Feb 2009 19:05:41 -0800 Subject: [ofa-general] [PATCH 5/10] infiniband-diags: Convert ibroute to "new" ibmad interface Message-ID: <20090219190541.f4a50fdc.weiny2@llnl.gov> >From 5b66b604de9bc43458ca4d295c5ab14cf2c6df10 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 17:30:14 -0800 Subject: [PATCH] infiniband-diags: Convert ibroute to "new" ibmad interface Signed-off-by: Ira Weiny --- infiniband-diags/src/ibroute.c | 30 +++++++++++++++++++----------- 1 files changed, 19 insertions(+), 11 deletions(-) diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c index 144d1b2..60bfdd8 100644 --- a/infiniband-diags/src/ibroute.c +++ b/infiniband-diags/src/ibroute.c @@ -49,6 +49,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + static int brief, dump_all, multicast; /*******************************************/ @@ -61,12 +63,12 @@ check_switch(ib_portid_t *portid, int *nports, uint64_t *guid, int type; DEBUG("checking node type"); - if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, 0)) { + if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, 0, srcport)) { xdump(stderr, "nodeinfo\n", ni, sizeof ni); return "node info failed: valid addr?"; } - if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, 0)) + if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, 0, srcport)) return "node desc failed"; mad_decode_field(ni, IB_NODE_TYPE_F, &type); @@ -77,7 +79,7 @@ check_switch(ib_portid_t *portid, int *nports, uint64_t *guid, mad_decode_field(ni, IB_NODE_NPORTS_F, nports); mad_decode_field(ni, IB_NODE_GUID_F, guid); - if (!smp_query(sw, portid, IB_ATTR_SWITCH_INFO, 0, 0)) + if (!smp_query_via(sw, portid, IB_ATTR_SWITCH_INFO, 0, 0, srcport)) return "switch info failed: is a switch node?"; return 0; @@ -195,7 +197,8 @@ dump_multicast_tables(ib_portid_t *portid, int startlid, int endlid) mod = 
(block - IB_MIN_MCAST_LID/IB_MLIDS_IN_BLOCK) | (j << 28); DEBUG("reading block %x chunk %d mod %x", block, j, mod); - if (!smp_query(mft + j, portid, IB_ATTR_MULTICASTFORWTBL, mod, 0)) + if (!smp_query_via(mft + j, portid, + IB_ATTR_MULTICASTFORWTBL, mod, 0, srcport)) return "multicast forwarding table get failed"; } @@ -259,9 +262,9 @@ dump_lid(char *str, int strlen, int lid, int valid) portguid = 0; lidport.lid = lid; - if (!smp_query(nd, &lidport, IB_ATTR_NODE_DESC, 0, 100) || - !smp_query(pi, &lidport, IB_ATTR_PORT_INFO, 0, 100) || - !smp_query(ni, &lidport, IB_ATTR_NODE_INFO, 0, 100)) + if (!smp_query_via(nd, &lidport, IB_ATTR_NODE_DESC, 0, 100, srcport) || + !smp_query_via(pi, &lidport, IB_ATTR_PORT_INFO, 0, 100, srcport) || + !smp_query_via(ni, &lidport, IB_ATTR_NODE_INFO, 0, 100, srcport)) return snprintf(str, strlen, ": (unknown node and type)"); mad_decode_field(ni, IB_NODE_PORT_GUID_F, &portguid); @@ -316,7 +319,8 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid) endblock = ALIGN(endlid, IB_SMP_DATA_SIZE) / IB_SMP_DATA_SIZE; for (block = startblock; block <= endblock; block++) { DEBUG("reading block %d", block); - if (!smp_query(lft, portid, IB_ATTR_LINEARFORWTBL, block, 0)) + if (!smp_query_via(lft, portid, IB_ATTR_LINEARFORWTBL, block, + 0, srcport)) return "linear forwarding table get failed"; i = block * IB_SMP_DATA_SIZE; e = i + IB_SMP_DATA_SIZE; @@ -403,12 +407,15 @@ int main(int argc, char **argv) if (argc > 2) endlid = strtoul(argv[2], 0, 0); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); if (!argc) { - if (ib_resolve_self(&portid, 0, 0) < 0) + if (ib_resolve_self_via(&portid, 0, 0, srcport) < 0) IBERROR("can't resolve self addr"); - } else if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + } else if (ib_resolve_portid_str_via(&portid, argv[0], 
ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[1]); if (multicast) @@ -419,5 +426,6 @@ int main(int argc, char **argv) if (err) IBERROR("dump tables: %s", err); + mad_rpc_close_port(srcport); exit(0); } -- 1.5.4.5 From weiny2 at llnl.gov Thu Feb 19 19:05:46 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 19 Feb 2009 19:05:46 -0800 Subject: [ofa-general] [PATCH 6/10] infiniband-diags: Convert ibsendtrap to "new" ibmad interface Message-ID: <20090219190546.4fcaa158.weiny2@llnl.gov> >From 9fcd0a9ec62fff981770e823281c660089b22d91 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 17:53:30 -0800 Subject: [PATCH] infiniband-diags: Convert ibsendtrap to "new" ibmad interface also make mad_send_via public to do the conversion Signed-off-by: Ira Weiny --- infiniband-diags/src/ibsendtrap.c | 13 +++++++++---- libibmad/src/libibmad.map | 1 + 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index ba6aa8b..d038dff 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -47,6 +47,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + static int send_144_node_desc_update(void) { ib_portid_t sm_port; @@ -55,10 +57,10 @@ static int send_144_node_desc_update(void) ib_rpc_t trap_rpc; ib_mad_notice_attr_t notice; - if (ib_resolve_self(&selfportid, &selfport, NULL)) + if (ib_resolve_self_via(&selfportid, &selfport, NULL, srcport)) IBERROR("can't resolve self"); - if (ib_resolve_smlid(&sm_port, 0)) + if (ib_resolve_smlid_via(&sm_port, 0, srcport)) IBERROR("can't resolve SM destination port"); memset(&trap_rpc, 0, sizeof(trap_rpc)); @@ -80,7 +82,7 @@ static int send_144_node_desc_update(void) notice.data_details.ntc_144.change_flgs = TRAP_144_MASK_NODE_DESCRIPTION_CHANGE; - return (mad_send(&trap_rpc, &sm_port, NULL, ¬ice)); + return (mad_send_via(&trap_rpc, &sm_port, NULL, ¬ice, srcport)); } typedef 
struct _trap_def { @@ -137,7 +139,10 @@ int main(int argc, char **argv) } madrpc_show_errors(1); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2); + + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); return (send_trap(trap_name)); } diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index bac74a9..0412027 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -91,6 +91,7 @@ IBMAD_1.3 { mad_receive_via; mad_respond_via; mad_send; + mad_send_via; smp_query; smp_set; ib_vendor_call; -- 1.5.4.5 From weiny2 at llnl.gov Thu Feb 19 19:05:51 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 19 Feb 2009 19:05:51 -0800 Subject: [ofa-general] [PATCH 7/10] infiniband-diags: Convert ibtracert to "new" ibmad interface Message-ID: <20090219190551.346fccb4.weiny2@llnl.gov> >From 0961e0ce048950e65bb78578538cff38b2c8332d Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 17:58:36 -0800 Subject: [PATCH] infiniband-diags: Convert ibtracert to "new" ibmad interface Signed-off-by: Ira Weiny --- infiniband-diags/src/ibtracert.c | 36 ++++++++++++++++++++++++------------ 1 files changed, 24 insertions(+), 12 deletions(-) diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c index ea5662b..1965aa0 100644 --- a/infiniband-diags/src/ibtracert.c +++ b/infiniband-diags/src/ibtracert.c @@ -50,6 +50,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + #define MAXHOPS 63 static char *node_type_str[] = { @@ -116,10 +118,10 @@ get_node(Node *node, Port *port, ib_portid_t *portid) void *pi = port->portinfo, *ni = node->nodeinfo, *nd = node->nodedesc; char *s, *e; - if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout)) + if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout, srcport)) return -1; - if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, timeout)) + if (!smp_query_via(nd, portid, 
IB_ATTR_NODE_DESC, 0, timeout, srcport)) return -1; for (s = nd, e = s + 64; s < e; s++) { @@ -129,7 +131,7 @@ get_node(Node *node, Port *port, ib_portid_t *portid) *s = ' '; } - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, 0, timeout)) + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout, srcport)) return -1; mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid); @@ -151,7 +153,7 @@ switch_lookup(Switch *sw, ib_portid_t *portid, int lid) { void *si = sw->switchinfo, *fdb = sw->fdb; - if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout)) + if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport)) return -1; mad_decode_field(si, IB_SW_LINEAR_FDB_CAP_F, &sw->linearcap); @@ -160,7 +162,8 @@ switch_lookup(Switch *sw, ib_portid_t *portid, int lid) if (lid > sw->linearcap && lid > sw->linearFDBtop) return -1; - if (!smp_query(fdb, portid, IB_ATTR_LINEARFORWTBL, lid / 64, timeout)) + if (!smp_query_via(fdb, portid, IB_ATTR_LINEARFORWTBL, lid / 64, + timeout, srcport)) return -1; DEBUG("portid %s: forward lid %d to port %d", @@ -382,7 +385,8 @@ get_port(Port *port, int portnum, ib_portid_t *portid) port->portnum = portnum; - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout)) + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout, + srcport)) return -1; mad_decode_field(pi, IB_PORT_LID_F, &port->lid); @@ -439,7 +443,7 @@ switch_mclookup(Node *node, ib_portid_t *portid, int mlid, char *map) memset(map, 0, 256); - if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout)) + if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport)) return -1; mlid -= 0xc000; @@ -453,8 +457,8 @@ switch_mclookup(Node *node, ib_portid_t *portid, int mlid, char *map) maxsets = (node->numports + 15) / 16; /* round up */ for (set = 0; set < maxsets; set++) { - if (!smp_query(mdb, portid, IB_ATTR_MULTICASTFORWTBL, - block | (set << 28), timeout)) + if (!smp_query_via(mdb, portid, IB_ATTR_MULTICASTFORWTBL, + block | (set 
<< 28), timeout, srcport)) return -1; for (i = 0; i < 16; i++, map++) { @@ -746,13 +750,18 @@ int main(int argc, char **argv) if (ibd_timeout) timeout = ibd_timeout; - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + node_name_map = open_node_name_map(node_name_map_file); - if (ib_resolve_portid_str(&src_portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&src_portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve source port %s", argv[0]); - if (ib_resolve_portid_str(&dest_portid, argv[1], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&dest_portid, argv[1], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[1]); if (ibd_dest_type == IB_DEST_DRPATH) { @@ -796,5 +805,8 @@ int main(int argc, char **argv) dump_mcpath(endnode, dumplevel); close_node_name_map(node_name_map); + + mad_rpc_close_port(srcport); + exit(0); } -- 1.5.4.5 From weiny2 at llnl.gov Thu Feb 19 19:05:56 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 19 Feb 2009 19:05:56 -0800 Subject: [ofa-general] [PATCH 8/10] infiniband-diags: Convert ibsysstat to "new" ibmad interface Message-ID: <20090219190556.a831f6d3.weiny2@llnl.gov> >From 1c19e419e04a98bcfe10b1c597856f43ea36668a Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 18:14:49 -0800 Subject: [PATCH] infiniband-diags: Convert ibsysstat to "new" ibmad interface Signed-off-by: Ira Weiny --- infiniband-diags/src/ibsysstat.c | 20 +++++++++++++------- 1 files changed, 13 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c index cc1418d..d7daa37 100644 --- a/infiniband-diags/src/ibsysstat.c +++ b/infiniband-diags/src/ibsysstat.c @@ -48,6 +48,8 @@ #define MAX_CPUS 8 +struct ibmad_port *srcport; + enum 
ib_sysstat_attr_t { IB_PING_ATTR = 0x10, IB_HOSTINFO_ATTR = 0x11, @@ -101,7 +103,7 @@ static int server_respond(void *umad, int size) if (ibdebug > 1) xdump(stderr, "mad respond pkt\n", mad, IB_MAD_SIZE); - if (umad_send(madrpc_portid(), mad_class_agent(rpc.mgtclass), umad, + if (umad_send(mad_rpc_portid(srcport), mad_class_agent(rpc.mgtclass), umad, size, rpc.timeout, 0) < 0) { DEBUG("send failed; %m"); return -1; @@ -169,7 +171,7 @@ static char *ibsystat_serv(void) DEBUG("starting to serve..."); - while ((umad = mad_receive(buf, -1))) { + while ((umad = mad_receive_via(buf, -1, srcport))) { if (umad_status(buf)) { DEBUG("drop mad with status %x: %s", umad_status(buf), strerror(umad_status(buf))); @@ -230,7 +232,7 @@ static char *ibsystat(ib_portid_t *portid, int attr) if ((len = mad_build_pkt(buf, &rpc, portid, NULL, NULL)) < 0) IBPANIC("cannot build packet."); - fd = madrpc_portid(); + fd = mad_rpc_portid(srcport); agent = mad_class_agent(rpc.mgtclass); timeout = ibd_timeout ? ibd_timeout : MAD_DEF_TIMEOUT_MS; @@ -334,10 +336,12 @@ int main(int argc, char **argv) if (argc > 1 && (attr = match_attr(argv[1])) < 0) ibdiag_show_usage(); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); if (server) { - if (mad_register_server(sysstat_class, 1, 0, oui) < 0) + if (mad_register_server_via(sysstat_class, 1, 0, oui, srcport) < 0) IBERROR("can't serve class %d", sysstat_class); host_ncpu = build_cpuinfo(); @@ -347,14 +351,16 @@ int main(int argc, char **argv) exit(0); } - if (mad_register_client(sysstat_class, 1) < 0) + if (mad_register_client_via(sysstat_class, 1, srcport) < 0) IBERROR("can't register to sysstat class %d", sysstat_class); - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't 
resolve destination port %s", argv[0]); if ((err = ibsystat(&portid, attr))) IBERROR("ibsystat to %s: %s", portid2str(&portid), err); + mad_rpc_close_port(srcport); exit(0); } -- 1.5.4.5 From weiny2 at llnl.gov Thu Feb 19 19:06:02 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 19 Feb 2009 19:06:02 -0800 Subject: [ofa-general] [PATCH 9/10] infiniband-diags: Convert mcm_rereg_test to "new" ibmad interface Message-ID: <20090219190602.2522876e.weiny2@llnl.gov> >From 4dcd4839baaa7f3bc31d01d5e695fced36b53533 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 18:24:56 -0800 Subject: [PATCH] infiniband-diags: Convert mcm_rereg_test to "new" ibmad interface Signed-off-by: Ira Weiny --- infiniband-diags/src/mcm_rereg_test.c | 11 ++++++++--- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/infiniband-diags/src/mcm_rereg_test.c b/infiniband-diags/src/mcm_rereg_test.c index 9285b95..b9d18a4 100644 --- a/infiniband-diags/src/mcm_rereg_test.c +++ b/infiniband-diags/src/mcm_rereg_test.c @@ -74,6 +74,8 @@ static ibmad_gid_t mgid_ipoib = { 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff }; +struct ibmad_port *srcport; + uint64_t build_mcm_rec(uint8_t *data, ibmad_gid_t mgid, ibmad_gid_t port_gid) { memset(data, 0, IB_SA_DATA_SIZE); @@ -436,10 +438,13 @@ int main(int argc, char **argv) if (argc > 1) guid_file = argv[1]; - madrpc_init(NULL, 0, mgmt_classes, 2); + srcport = mad_rpc_open_port(NULL, 0, mgmt_classes, 2); + if (!srcport) + err("Failed to open port"); + #if 1 - ib_resolve_smlid(&dport_id, TMO); + ib_resolve_smlid_via(&dport_id, TMO, srcport); #else memset(&dport_id, 0, sizeof(dport_id)); dport_id.lid = 1; @@ -457,7 +462,7 @@ int main(int argc, char **argv) } #if 1 - port = madrpc_portid(); + port = mad_rpc_portid(srcport); #else ret = umad_init(); -- 1.5.4.5 From weiny2 at llnl.gov Thu Feb 19 19:06:08 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 19 Feb 2009 19:06:08 -0800 Subject: [ofa-general] [PATCH 10/10] infiniband-diags: Convert 
perfquery, saquery, sminfo, smpquery, and vendstat to "new" ibmad interface Message-ID: <20090219190608.f8fd4a02.weiny2@llnl.gov> >From e809dfacb08e6c2237ad2d0f197d1227654dde87 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 18:53:10 -0800 Subject: [PATCH] infiniband-diags: Convert perfquery, saquery, sminfo, smpquery, and vendstat to "new" ibmad interface Signed-off-by: Ira Weiny --- infiniband-diags/src/perfquery.c | 35 +++++++++++++++++-------- infiniband-diags/src/saquery.c | 9 ++++-- infiniband-diags/src/sminfo.c | 18 +++++++++--- infiniband-diags/src/smpquery.c | 53 +++++++++++++++++++++++-------------- infiniband-diags/src/vendstat.c | 19 ++++++++----- 5 files changed, 88 insertions(+), 46 deletions(-) diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 6292743..2f104b8 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -47,6 +47,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + struct perf_count { uint32_t portselect; uint32_t counterselect; @@ -269,7 +271,7 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, char buf[1024]; if (extended != 1) { - if (!port_performance_query(pc, portid, port, timeout)) + if (!port_performance_query_via(pc, portid, port, timeout, srcport)) IBERROR("perfquery"); if (!(cap_mask & 0x1000)) { /* if PortCounters:PortXmitWait not suppported clear this counter */ @@ -284,7 +286,7 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not indicated\n", cap_mask); - if (!port_performance_ext_query(pc, portid, port, timeout)) + if (!port_performance_ext_query_via(pc, portid, port, timeout, srcport)) IBERROR("perfextquery"); if (aggregate) aggregate_perfcounters_ext(); @@ -299,10 +301,12 @@ static void dump_perfcounters(int extended, int timeout, uint16_t 
cap_mask, static void reset_counters(int extended, int timeout, int mask, ib_portid_t *portid, int port) { if (extended != 1) { - if (!port_performance_reset(pc, portid, port, mask, timeout)) + if (!port_performance_reset_via(pc, portid, port, mask, + timeout, srcport)) IBERROR("perf reset"); } else { - if (!port_performance_ext_reset(pc, portid, port, mask, timeout)) + if (!port_performance_ext_reset_via(pc, portid, port, mask, + timeout, srcport)) IBERROR("perf ext reset"); } } @@ -382,18 +386,22 @@ int main(int argc, char **argv) if (argc > 2) mask = strtoul(argv[2], 0, 0); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 4); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); if (argc) { - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[0]); } else { - if (ib_resolve_self(&portid, &port, 0) < 0) + if (ib_resolve_self_via(&portid, &port, 0, srcport) < 0) IBERROR("can't resolve self port %s", argv[0]); } /* PerfMgt ClassPortInfo is a required attribute */ - if (!perf_classportinfo_query(pc, &portid, port, ibd_timeout)) + if (!perf_classportinfo_query_via(pc, &portid, port, + ibd_timeout, srcport)) IBERROR("classportinfo query"); /* ClassPortInfo should be supported as part of libibmad */ memcpy(&cap_mask, pc + 2, sizeof(cap_mask)); /* CapabilityMask */ @@ -406,7 +414,8 @@ int main(int argc, char **argv) } if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) { - if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) + if (smp_query_via(data, &portid, IB_ATTR_NODE_INFO, 0, 0, + srcport) < 0) IBERROR("smp query nodeinfo failed"); node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); mad_decode_field(data, IB_NODE_NPORTS_F, &num_ports); @@ -414,7 +423,8 @@ int main(int argc, char 
**argv) IBERROR("smp query nodeinfo: num ports invalid"); if (node_type == IB_NODE_SWITCH) { - if (smp_query(data, &portid, IB_ATTR_SWITCH_INFO, 0, 0) < 0) + if (smp_query_via(data, &portid, IB_ATTR_SWITCH_INFO, + 0, 0, srcport) < 0) IBERROR("smp query nodeinfo failed"); enhancedport0 = mad_get_field(data, 0, IB_SW_ENHANCED_PORT0_F); if (enhancedport0) @@ -441,8 +451,10 @@ int main(int argc, char **argv) else dump_perfcounters(extended, ibd_timeout, cap_mask, &portid, port, 0); - if (!reset) + if (!reset) { + mad_rpc_close_port(srcport); exit(0); + } do_reset: @@ -456,5 +468,6 @@ do_reset: else reset_counters(extended, ibd_timeout, mask, &portid, port); + mad_rpc_close_port(srcport); exit(0); } diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 9726d22..e6cbe50 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -1316,12 +1316,15 @@ static int query_mft_records(const struct query_cmd *q, bind_handle_t h, static bind_handle_t get_bind_handle(void) { + static struct ibmad_port *srcport; static struct bind_handle handle; int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS }; - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); - ib_resolve_smlid(&handle.dport, ibd_timeout); + ib_resolve_smlid_via(&handle.dport, ibd_timeout, srcport); if (!handle.dport.lid) IBPANIC("No SM found."); @@ -1329,7 +1332,7 @@ static bind_handle_t get_bind_handle(void) if (!handle.dport.qkey) handle.dport.qkey = IB_DEFAULT_QP1_QKEY; - handle.fd = madrpc_portid(); + handle.fd = mad_rpc_portid(srcport); handle.agent = umad_register(handle.fd, IB_SA_CLASS, 2, 1, NULL); return &handle; diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c index 549cb81..ebf6a47 100644 --- a/infiniband-diags/src/sminfo.c +++ b/infiniband-diags/src/sminfo.c @@ -48,6 +48,8 @@ 
static uint8_t sminfo[1024]; +struct ibmad_port *srcport; + int strdata, xdata=1, bindata; enum { SMINFO_NOTACT, @@ -113,13 +115,16 @@ int main(int argc, char **argv) if (argc > 1) mod = atoi(argv[1]); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); if (argc) { - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, 0) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + 0, srcport) < 0) IBERROR("can't resolve destination port %s", argv[0]); } else { - if (ib_resolve_smlid(&portid, ibd_timeout) < 0) + if (ib_resolve_smlid_via(&portid, ibd_timeout, srcport) < 0) IBERROR("can't resolve sm port %s", argv[0]); } @@ -130,10 +135,12 @@ int main(int argc, char **argv) mad_encode_field(sminfo, IB_SMINFO_STATE_F, &state); if (mod) { - if (!(p = smp_set(sminfo, &portid, IB_ATTR_SMINFO, mod, ibd_timeout))) + if (!(p = smp_set_via(sminfo, &portid, IB_ATTR_SMINFO, mod, + ibd_timeout, srcport))) IBERROR("query"); } else - if (!(p = smp_query(sminfo, &portid, IB_ATTR_SMINFO, 0, ibd_timeout))) + if (!(p = smp_query_via(sminfo, &portid, IB_ATTR_SMINFO, 0, + ibd_timeout, srcport))) IBERROR("query"); mad_decode_field(sminfo, IB_SMINFO_GUID_F, &guid); @@ -145,5 +152,6 @@ int main(int argc, char **argv) printf("sminfo: sm lid %d sm guid 0x%" PRIx64 ", activity count %u priority %d state %d %s\n", portid.lid, guid, act, prio, state, STATESTR(state)); + mad_rpc_close_port(srcport); exit(0); } diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c index bf1626d..2ed1e65 100644 --- a/infiniband-diags/src/smpquery.c +++ b/infiniband-diags/src/smpquery.c @@ -51,6 +51,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + typedef char *(op_fn_t)(ib_portid_t *dest, char **argv, int argc); typedef struct match_rec { @@ -88,13 +90,13 @@ node_desc(ib_portid_t *dest, char **argv, int argc) 
char dots[128]; char *nodename = NULL; - if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport)) return "node info query failed"; mad_decode_field(data, IB_NODE_TYPE_F, &node_type); mad_decode_field(data, IB_NODE_GUID_F, &node_guid); - if (!smp_query(nd, dest, IB_ATTR_NODE_DESC, 0, 0)) + if (!smp_query_via(nd, dest, IB_ATTR_NODE_DESC, 0, 0, srcport)) return "node desc query failed"; nodename = remap_node_name(node_name_map, node_guid, nd); @@ -119,7 +121,7 @@ node_info(ib_portid_t *dest, char **argv, int argc) char buf[2048]; char data[IB_SMP_DATA_SIZE]; - if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport)) return "node info query failed"; mad_dump_nodeinfo(buf, sizeof buf, data, sizeof data); @@ -138,7 +140,7 @@ port_info(ib_portid_t *dest, char **argv, int argc) if (argc > 0) portnum = strtol(argv[0], 0, 0); - if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) + if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) return "port info query failed"; mad_dump_portinfo(buf, sizeof buf, data, sizeof data); @@ -153,7 +155,7 @@ switch_info(ib_portid_t *dest, char **argv, int argc) char buf[2048]; char data[IB_SMP_DATA_SIZE]; - if (!smp_query(data, dest, IB_ATTR_SWITCH_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_SWITCH_INFO, 0, 0, srcport)) return "switch info query failed"; mad_dump_switchinfo(buf, sizeof buf, data, sizeof data); @@ -176,7 +178,7 @@ pkey_table(ib_portid_t *dest, char **argv, int argc) portnum = strtol(argv[0], 0, 0); /* Get the partition capacity */ - if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport)) return "node info query failed"; mad_decode_field(data, IB_NODE_TYPE_F, &t); @@ -185,7 +187,8 @@ pkey_table(ib_portid_t *dest, char **argv, int argc) return "invalid port number"; if ((t == IB_NODE_SWITCH) && (portnum != 
0)) { - if (!smp_query(data, dest, IB_ATTR_SWITCH_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_SWITCH_INFO, 0, 0, + srcport)) return "switch info failed"; mad_decode_field(data, IB_SW_PARTITION_ENFORCE_CAP_F, &n); } else @@ -193,7 +196,8 @@ pkey_table(ib_portid_t *dest, char **argv, int argc) for (i = 0; i < (n + 31) / 32; i++) { mod = i | (portnum << 16); - if (!smp_query(data, dest, IB_ATTR_PKEY_TBL, mod, 0)) + if (!smp_query_via(data, dest, IB_ATTR_PKEY_TBL, mod, 0, + srcport)) return "pkey table query failed"; if (i + 1 == (n + 31) / 32) k = ((n + 7 - i * 32) / 8) * 8; @@ -220,7 +224,7 @@ static char *sl2vl_dump_table_entry(ib_portid_t *dest, int in, int out) char data[IB_SMP_DATA_SIZE]; int portnum = (in << 8) | out; - if (!smp_query(data, dest, IB_ATTR_SLVL_TABLE, portnum, 0)) + if (!smp_query_via(data, dest, IB_ATTR_SLVL_TABLE, portnum, 0, srcport)) return "slvl query failed"; mad_dump_sltovl(buf, sizeof buf, data, sizeof data); @@ -240,7 +244,7 @@ sl2vl_table(ib_portid_t *dest, char **argv, int argc) if (argc > 0) portnum = strtol(argv[0], 0, 0); - if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport)) return "node info query failed"; mad_decode_field(data, IB_NODE_TYPE_F, &type); @@ -270,8 +274,8 @@ static char *vlarb_dump_table_entry(ib_portid_t *dest, int portnum, int offset, char buf[2048]; char data[IB_SMP_DATA_SIZE]; - if (!smp_query(data, dest, IB_ATTR_VL_ARBITRATION, - (offset << 16) | portnum, 0)) + if (!smp_query_via(data, dest, IB_ATTR_VL_ARBITRATION, + (offset << 16) | portnum, 0, srcport)) return "vl arb query failed"; mad_dump_vlarbitration(buf, sizeof(buf), data, cap * 2); printf("%s", buf); @@ -305,12 +309,14 @@ vlarb_table(ib_portid_t *dest, char **argv, int argc) /* port number of 0 could mean SP0 or port MAD arrives on */ if (portnum == 0) { - if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, + srcport)) 
return "node info query failed"; mad_decode_field(data, IB_NODE_TYPE_F, &type); if (type == IB_NODE_SWITCH) { - if (!smp_query(data, dest, IB_ATTR_SWITCH_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_SWITCH_INFO, 0, + 0, srcport)) return "switch info query failed"; mad_decode_field(data, IB_SW_ENHANCED_PORT0_F, &enhsp0); if (!enhsp0) { @@ -321,7 +327,7 @@ vlarb_table(ib_portid_t *dest, char **argv, int argc) } } - if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) + if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) return "port info query failed"; mad_decode_field(data, IB_PORT_VL_ARBITRATION_LOW_CAP_F, &lowcap); @@ -349,13 +355,14 @@ guid_info(ib_portid_t *dest, char **argv, int argc) int n; /* Get the guid capacity */ - if (!smp_query(data, dest, IB_ATTR_PORT_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, 0, 0, srcport)) return "port info failed"; mad_decode_field(data, IB_PORT_GUID_CAP_F, &n); for (i = 0; i < (n + 7) / 8; i++) { mod = i; - if (!smp_query(data, dest, IB_ATTR_GUID_INFO, mod, 0)) + if (!smp_query_via(data, dest, IB_ATTR_GUID_INFO, mod, 0, + srcport)) return "guid info query failed"; if (i + 1 == (n + 7) / 8) k = ((n + 1 - i * 8) / 2) * 2; @@ -445,11 +452,15 @@ int main(int argc, char **argv) if (!(fn = match_op(argv[0]))) IBERROR("operation '%s' not supported", argv[0]); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + node_name_map = open_node_name_map(node_name_map_file); if (ibd_dest_type != IB_DEST_DRSLID) { - if (ib_resolve_portid_str(&portid, argv[1], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[1], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[1]); if ((err = fn(&portid, argv+2, argc-2))) IBERROR("operation %s: %s", argv[0], err); @@ -458,11 +469,13 @@ int 
main(int argc, char **argv) memset(concat, 0, 64); snprintf(concat, sizeof(concat), "%s %s", argv[1], argv[2]); - if (ib_resolve_portid_str(&portid, concat, ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, concat, ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", concat); if ((err = fn(&portid, argv+3, argc-3))) IBERROR("operation %s: %s", argv[0], err); } close_node_name_map(node_name_map); + mad_rpc_close_port(srcport); exit(0); } diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c index db87e38..d001a01 100644 --- a/infiniband-diags/src/vendstat.c +++ b/infiniband-diags/src/vendstat.c @@ -55,6 +55,8 @@ /* Config space addresses */ #define IB_MLX_IS3_PORT_XMIT_WAIT 0x10013C +struct ibmad_port *srcport; + typedef struct { uint16_t hw_revision; uint16_t device_id; @@ -152,13 +154,16 @@ int main(int argc, char **argv) if (argc > 1) port = strtoul(argv[1], 0, 0); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 4); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); if (argc) { - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[0]); } else { - if (ib_resolve_self(&portid, &port, 0) < 0) + if (ib_resolve_self_via(&portid, &port, 0, srcport) < 0) IBERROR("can't resolve self port %s", argv[0]); } @@ -180,12 +185,12 @@ int main(int argc, char **argv) memset(&buf, 0, sizeof(buf)); /* vendor ClassPortInfo is required attribute if class supported */ call.attrid = CLASS_PORT_INFO; - if (!ib_vendor_call(&buf, &portid, &call)) + if (!ib_vendor_call_via(&buf, &portid, &call, srcport)) IBERROR("classportinfo query"); memset(&buf, 0, sizeof(buf)); call.attrid = IB_MLX_IS3_GENERAL_INFO; - if (!ib_vendor_call(&buf, &portid, &call)) + 
if (!ib_vendor_call_via(&buf, &portid, &call, srcport)) IBERROR("vendstat"); gi = (is3_general_info_t *)&buf; @@ -217,7 +222,7 @@ int main(int argc, char **argv) cs = (is3_config_space_t *)&buf; for (i = 0; i < 16; i++) cs->record[i].address = htonl(IB_MLX_IS3_PORT_XMIT_WAIT + ((i + 1) << 12)); - if (!ib_vendor_call(&buf, &portid, &call)) + if (!ib_vendor_call_via(&buf, &portid, &call, srcport)) IBERROR("vendstat"); for (i = 0; i < 16; i++) @@ -232,7 +237,7 @@ int main(int argc, char **argv) cs = (is3_config_space_t *)&buf; for (i = 0; i < 8; i++) cs->record[i].address = htonl(IB_MLX_IS3_PORT_XMIT_WAIT + ((i + 17) << 12)); - if (!ib_vendor_call(&buf, &portid, &call)) + if (!ib_vendor_call_via(&buf, &portid, &call, srcport)) IBERROR("vendstat"); for (i = 0; i < 8; i++) -- 1.5.4.5

From Jie.Cai at cs.anu.edu.au  Thu Feb 19 19:39:10 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Fri, 20 Feb 2009 14:39:10 +1100
Subject: [ofa-general] RDMA write with immediate data.
In-Reply-To: 
References: <499CBEF2.2010909@cs.anu.edu.au>
Message-ID: <499E25DE.5020703@cs.anu.edu.au>

Davis, Arlin R wrote:
>
>
>> if (initiator) {
>> ret = dat_ib_post_rdma_write_immed( h_ep, //
>>
>> However, at remote side I got the following error message
>> indicates that
>> no event coming through.
>>
>> 5217 ERROR: DTO dat_evd_wait() DAT_TIMEOUT_EXPIRED
>> 5217 Error do_rdmw_write_with_immd: DAT_TIMEOUT_EXPIRED
>>
>> The return of dat_evd_wait is DAT_TIMEOUT_EXPIRED.
>>
>>
>
> Does the initiator side complete successfully?
>
Yes, the initiator completes successfully.
> Do you have receive's posted at the remote side for immed data?
>
Nope, the remote side didn't get an event (dat_evd_wait timed out). The way to find the immed data is to check the outgoing parameter &event of the dat_evd_wait function. &event.event_extension_data[0]->val.immed.data does not have a value yet.
> You can look at dtestx source for an immed data example.
>
Yes, I did check this test program.
The way to use dat_ib_post_rdma_write_immed is the same as in dtestx.

Thanks,
Jie

> -arlin
>
>
>
>

From rdreier at cisco.com  Thu Feb 19 22:50:21 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 19 Feb 2009 22:50:21 -0800
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver
In-Reply-To: <200902171742.38223.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 17 Feb 2009 17:42:38 +0200")
References: <200902171742.38223.jackm@dev.mellanox.co.il>
Message-ID: 

 > We have found a race condition in sysfs.c which occurs when unloading low-level modules
 > (e.g., mlx4_ib) in the driver.
 > Specifically:
 >
 > Although the kernel takes reference counts on sysfs files, it does not take such counts
 > on modules which implement attribute reads.
 >
 > For example, we have:
 > static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
 > 			      char *buf)
 > {
 > 	struct port_table_attribute *tab_attr =
 > 		container_of(attr, struct port_table_attribute, attr);
 > 	u16 pkey;
 > 	ssize_t ret;
 > ====>race condition HERE *****
 > 	ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
 > 	if (ret)
 > 		return ret;
 >
 > 	return sprintf(buf, "0x%04x\n", pkey);
 > }

I've not been able to reproduce this on a current kernel. I tried adding the patch below to make the race window very big:

--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -273,6 +273,9 @@ static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
 	u16 pkey;
 	ssize_t ret;
 
+	printk(KERN_ERR "enter show_port_pkey\n");
+	msleep(10000);
+	printk(KERN_ERR "call ib_query_pkey\n");
 	ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
 	if (ret)
 		return ret;

so show_port_pkey() waits 10 seconds before actually calling ib_query_pkey().
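If you would rather script the reader side of this experiment than use `cat`, the C sketch below is one way to do it. It is a hypothetical test helper (`read_pkey_file` is not part of OFED, libibumad, or libibmad). It opens a P_Key sysfs file such as /sys/class/infiniband/mlx4_0/ports/1/pkeys/0, optionally keeps the descriptor open for a few seconds, and then parses the single value that show_port_pkey() formats with `"0x%04x\n"`. How an open descriptor interacts with `modprobe -r` depends on the kernel's sysfs implementation, so this is a test aid rather than evidence either way.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical test helper (not part of OFED): read one P_Key entry from a
 * sysfs file whose contents are formatted by show_port_pkey() as "0x%04x\n".
 * If delay_s is non-zero, the file is kept open for that many seconds before
 * it is read, so "modprobe -r mlx4_ib" can be started from another shell
 * while the descriptor is held. Returns 0 on success, -1 if the open fails
 * or the contents cannot be parsed as a hexadecimal number.
 */
static int read_pkey_file(const char *path, unsigned int delay_s,
                          unsigned int *pkey)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    if (delay_s)
        sleep(delay_s);
    /* %x accepts an optional "0x" prefix, like strtoul() with base 16. */
    ok = (fscanf(f, "%x", pkey) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

For example, `read_pkey_file("/sys/class/infiniband/mlx4_0/ports/1/pkeys/0", 10, &pkey)` opens the file, waits ten seconds, and then reads it. A failed open or unparsable contents yields -1 instead of a value.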
Then I do something like

    cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/0

in one shell and immediately (during the 10 second window before ib_query_pkey() is called):

    modprobe -r mlx4_ib

in another shell. And I see that the mlx4_ib module is not removed until the read of the pkey file completes; this is as I would expect, since the sysfs delete of the pkey file should wait until there are no open fds for that file.

What test are you using to hit this race? Are you using a distro kernel with OFED?

 - R.

From Zhen.Liang at Sun.COM  Fri Feb 20 00:21:58 2009
From: Zhen.Liang at Sun.COM (Liang Zhen)
Date: Fri, 20 Feb 2009 16:21:58 +0800
Subject: Re: [ofa-general] ib_reg_phys_mr( ) results in crash
In-Reply-To: <7d5928b30902191047o25c34462w4cc51d7b88b888c6@mail.gmail.com>
References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com> <7d5928b30902191047o25c34462w4cc51d7b88b888c6@mail.gmail.com>
Message-ID: <499E6826.704@sun.com>

Hmm, I didn't see any problem in your code. Have you installed ofa_kernel_devel (kernel headers of OFED) after installation of ofa_kernel_1_3_1?

Regards
Liang

neutron:
> I'm using Mellanox HCA 'mthca0' type: MT25208, kernel version:
> 2.6.18-53.1.14.el5, ofed 1.3.1.
>
> The failed function call is like:
>
> {
>
> ctx->send_buf = dma_alloc_coherent(ctx->ib_dev->dma_device, MAX_SIZE,
> &dma_addr, GFP_KERNEL);
>
> ctx->phy_buf[0].addr = dma_addr;
> ctx->phy_buf[0].size = MAX_SIZE;
> ctx->iovstart = (u64) ctx->send_buf;
>
> printk("pd=%p, phy_buf[0].addr=%p,size=%d, iovstart=%llx\n",
> ctx->pd, ctx->phy_buf[0].addr, ctx->phy_buf[0].size, ctx->iovstart );
>
> send_mr = ib_reg_phys_mr( ctx->pd, &ctx->phy_buf[0], 1,
> IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ
> | IB_ACCESS_LOCAL_WRITE, &(ctx->iovstart));
> }
>
> The phy_buf[0] is a "ib_phys_buf" corresponding to "ctx->send_buf".
>
> Below is /var/log/messages output around the crash.
> ---------------- > Feb 19 12:50:22 wci30 kernel: pd=ffff8101da3ddce0, > phy_buf[0].addr=00000001bbe4b000,size=1024, iovstart=ffff8101bbe4b000 > > Feb 19 12:50:22 wci30 kernel: Unable to handle kernel NULL pointer > dereference at 0000000000000000 > RIP: > Feb 19 12:50:22 wci30 kernel: [<0000000000000000>] _stext+0x7ffff000/0x1000 > Feb 19 12:50:22 wci30 kernel: PGD 1c06d5067 PUD 1c9dcd067 PMD 0 > Feb 19 12:50:22 wci30 kernel: Oops: 0010 [1] SMP > Feb 19 12:50:22 wci30 kernel: last sysfs file: /module/libata/version > Feb 19 12:50:22 wci30 kernel: CPU 0 > Feb 19 12:54:05 wci30 syslogd 1.4.1: restart. > Feb 19 12:54:05 wci30 kernel: klogd 1.4.1, log source = /proc/kmsg started. > Feb 19 12:54:05 wci30 kernel: Linux version 2.6.18-53.1.14.el5 > (brewbuilder at hs20-bc2-3.build.redha > t.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Tue Feb > 19 07:18:46 EST 2008 > Feb 19 12:54:05 wci30 kernel: Command line: ro root=LABEL=/ rhgb quiet > > ==================== > It's strange that the kernel doesn't print out the function call stack > before crashing. > > Any hints? Thanks a lot! > > On Wed, Feb 18, 2009 at 7:40 PM, Roland Dreier wrote: > >> > Before calling ib_reg_phys_mr, printk() shows that all its arguments >> > are valid. But the system always crashes immediately after entering >> > the function ib_reg_phys_mr( ). Any possible reasons ? Thanks!! >> >> What do you mean by "immediately after entering ib_reg_phys_mr()"? Do >> you get an oops message? If so that would be very important info for >> debugging this. >> >> - R. 
>> >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From vlad at lists.openfabrics.org Fri Feb 20 03:15:07 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 20 Feb 2009 03:15:07 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090220-0200 daily build status Message-ID: <20090220111507.69DF9E301F8@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp 
Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hal.rosenstock at gmail.com Fri Feb 20 05:41:56 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 20 Feb 2009 08:41:56 -0500 Subject: ***SPAM*** Re: [ofa-general] [PATCH 1/10] libibmad: Clean up "new" interface In-Reply-To: <20090219190525.322681b8.weiny2@llnl.gov> References: <20090219190525.322681b8.weiny2@llnl.gov> Message-ID: On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny wrote: > >From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001 > From: Ira Weiny > Date: Wed, 18 Feb 2009 16:37:36 -0800 > Subject: [PATCH] libibmad: Clean up "new" interface > > type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * > Create new mad_rpc_portid(struct ibmad_port *srcport) function > which mirrors madrpc_portid(void) > Mark all "old" functions with __attribute__ ((deprecated)) > > Signed-off-by: Ira Weiny > --- > libibmad/include/infiniband/mad.h | 139 ++++++++++++++++++++++--------------- > libibmad/src/gs.c | 19 +++--- > libibmad/src/libibmad.map | 1 + > libibmad/src/resolve.c | 10 ++- > libibmad/src/rpc.c | 29 ++++---- > libibmad/src/sa.c | 4 +- > libibmad/src/smp.c | 4 +- > 7 files changed, 118 insertions(+), 88 deletions(-) > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > index 1aaaa1b..80e38be 100644 > --- a/libibmad/include/infiniband/mad.h > +++ b/libibmad/include/infiniband/mad.h > @@ -724,100 +724,125 @@ static inline int 
mad_is_vendor_range2(int mgmt) > } > > /* rpc.c */ > -MAD_EXPORT int madrpc_portid(void); > -MAD_EXPORT int madrpc_set_retries(int retries); > -MAD_EXPORT int madrpc_set_timeout(int timeout); > -void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata); > -void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, > - void *data); > +MAD_EXPORT int madrpc_portid(void) __attribute__ ((deprecated)); > +void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata) > + __attribute__ ((deprecated)); > +void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) > + __attribute__ ((deprecated)); > MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > - int num_classes); > -void madrpc_save_mad(void *madbuf, int len); > -MAD_EXPORT void madrpc_show_errors(int set); > + int num_classes) __attribute__ ((deprecated)); > +void madrpc_save_mad(void *madbuf, int len) __attribute__ ((deprecated)); Should there be a mad_rpc_save_mad() in the new interface? It looks like it would only need a few additional fields in struct ibmad_port.
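A rough sketch of that suggestion, assuming the per-port version behaves as a one-shot capture: the caller registers a buffer, and the next received MAD is copied into it, truncated to the buffer length. Whether that matches the existing global madrpc_save_mad() exactly should be checked against rpc.c. struct ibmad_port_sketch, mad_rpc_save_mad_sketch() and rpc_copy_reply_sketch() are hypothetical names used only for illustration; they are not libibmad API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-port state. The real struct ibmad_port in the patch
 * also carries port_id and class_agents[]; only the save fields are new. */
struct ibmad_port_sketch {
	int port_id;
	void *save_mad;		/* caller buffer for the next reply, or NULL */
	int save_mad_len;	/* capacity of save_mad in bytes */
};

/* Per-port counterpart of madrpc_save_mad(void *madbuf, int len). */
void mad_rpc_save_mad_sketch(struct ibmad_port_sketch *port,
			     void *madbuf, int len)
{
	port->save_mad = madbuf;
	port->save_mad_len = len;
}

/* What the RPC path would do after receiving a reply of len bytes:
 * copy at most save_mad_len bytes, then disarm the capture.
 * Returns the number of bytes copied (0 if nothing was armed). */
int rpc_copy_reply_sketch(struct ibmad_port_sketch *port,
			  const uint8_t *mad, int len)
{
	int n;

	if (!port->save_mad)
		return 0;
	n = port->save_mad_len < len ? port->save_mad_len : len;
	memcpy(port->save_mad, mad, (size_t)n);
	port->save_mad = NULL;
	return n;
}
```

Keeping the buffer in the port object means two ports opened with mad_rpc_open_port() would no longer share a single capture slot.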
> -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > +/* New interface */ Nit: /* rpc.c: new interface */ -- Hal > +MAD_EXPORT void madrpc_show_errors(int set); > +MAD_EXPORT int madrpc_set_retries(int retries); > +MAD_EXPORT int madrpc_set_timeout(int timeout); > +MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > int num_classes); > -void mad_rpc_close_port(void *ibmad_port); > -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, > - void *payload, void *rcvdata); > -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, > - ib_rmpp_hdr_t * rmpp, void *data); > +MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport); > +MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > + void *payload, void *rcvdata); > +MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > + ib_rmpp_hdr_t * rmpp, void *data); > +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); > > /* smp.c */ > MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid, > - unsigned mod, unsigned timeout); > + unsigned mod, unsigned timeout) __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid, > - unsigned mod, unsigned timeout); > + unsigned mod, unsigned timeout) __attribute__ ((deprecated)); > + > +/* smp.c new interface */ > MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, > - unsigned mod, unsigned timeout, const void *srcport); > -uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, > - unsigned timeout, const void *srcport); > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, > + unsigned timeout, const struct ibmad_port *srcport); > > 
/* sa.c */ > uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, > - unsigned timeout); > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > + unsigned timeout) __attribute__ ((deprecated)); > +MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, > + void *buf) __attribute__ ((deprecated)); > + > +/* sa.c new interface */ > +MAD_EXPORT uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid, > ib_sa_call_t * sa, unsigned timeout); > -MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */ > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > +MAD_EXPORT int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); > + /* returns lid */ > > /* resolve.c */ > -MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout); > +MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid, > - ib_portid_t * sm_id, int timeout); > + ib_portid_t * sm_id, int timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > - enum MAD_DEST dest, ib_portid_t * sm_id); > + enum MAD_DEST dest, ib_portid_t * sm_id) > + __attribute__ ((deprecated)); > MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, > - ibmad_gid_t * gid); > - > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); > -int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > - ib_portid_t * sm_id, int timeout, const void *srcport); > -int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > + ibmad_gid_t * gid) > + __attribute__ ((deprecated)); > + > +/* resolve.c new interface */ > +MAD_EXPORT int 
ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport); > +MAD_EXPORT int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > + ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport); > +MAD_EXPORT int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > enum MAD_DEST dest, ib_portid_t * sm_id, > - const void *srcport); > -int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > - const void *srcport); > + const struct ibmad_port *srcport); > +MAD_EXPORT int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > + const struct ibmad_port *srcport); > > /* gs.c */ > MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, > - int port, unsigned timeout); > + int port, unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, > - int port, unsigned timeout); > + int port, unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, > - unsigned timeout); > + unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, > - int port, unsigned timeout); > + int port, unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, > - unsigned timeout); > + unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, > - int port, unsigned timeout); > + int port, unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest, > - int port, unsigned timeout); > + int port, unsigned timeout) > + __attribute__ ((deprecated)); > > -uint8_t 
*perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > +/* gs.c new interface */ > +MAD_EXPORT uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > -uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > - unsigned timeout, const void *srcport); > -uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > + const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > + unsigned timeout, const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > unsigned mask, unsigned timeout, > - const void *srcport); > -uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > + const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > -uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > + const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, > - unsigned timeout, const void *srcport); > -uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > + unsigned timeout, > + const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > -uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > + const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > + const struct ibmad_port *srcport); > /* dump.c */ > MAD_EXPORT ib_mad_dump_fn > mad_dump_int, mad_dump_uint, mad_dump_hex, 
mad_dump_rhex, > diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c > index d2c4574..e302caf 100644 > --- a/libibmad/src/gs.c > +++ b/libibmad/src/gs.c > @@ -47,7 +47,7 @@ > > static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port, > unsigned timeout, unsigned id, > - const void *srcport) > + const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > int lid = dest->lid; > @@ -89,7 +89,7 @@ uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, > > uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO, > srcport); > @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port, > } > > uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > - unsigned timeout, const void *srcport) > + unsigned timeout, const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_COUNTERS, srcport); > @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port, > > static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, unsigned timeout, > - unsigned id, const void *srcport) > + unsigned id, const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > int lid = dest->lid; > @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > unsigned mask, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return performance_reset_via(rcvbuf, dest, port, mask, timeout, > IB_GSI_PORT_COUNTERS, srcport); > @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t 
*port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_COUNTERS_EXT, srcport); > @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, > - unsigned timeout, const void *srcport) > + unsigned timeout, > + const struct ibmad_port *srcport) > { > return performance_reset_via(rcvbuf, dest, port, mask, timeout, > IB_GSI_PORT_COUNTERS_EXT, srcport); > @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_SAMPLES_CONTROL, srcport); > @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_SAMPLES_RESULT, srcport); > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > index f944d86..94d7762 100644 > --- a/libibmad/src/libibmad.map > +++ b/libibmad/src/libibmad.map > @@ -69,6 +69,7 @@ IBMAD_1.3 { > mad_rpc_close_port; > mad_rpc; > mad_rpc_rmpp; > + mad_rpc_portid; > madrpc; > madrpc_def_timeout; > madrpc_init; > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c > index 553949d..3291f43 100644 > --- a/libibmad/src/resolve.c > +++ b/libibmad/src/resolve.c > @@ -45,7 +45,8 @@ > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN > > -int ib_resolve_smlid_via(ib_portid_t * 
sm_id, int timeout, const void *srcport) > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport) > { > ib_portid_t self = { 0 }; > uint8_t portinfo[64]; > @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) > } > > int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > - ib_portid_t * sm_id, int timeout, const void *srcport) > + ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport) > { > ib_portid_t sm_portid; > char buf[IB_SA_DATA_SIZE] = { 0 }; > @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > enum MAD_DEST dest_type, ib_portid_t * sm_id, > - const void *srcport) > + const struct ibmad_port *srcport) > { > uint64_t guid; > int lid; > @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > } > > int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > - const void *srcport) > + const struct ibmad_port *srcport) > { > ib_portid_t self = { 0 }; > uint8_t portinfo[64]; > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > index e811526..d47873b 100644 > --- a/libibmad/src/rpc.c > +++ b/libibmad/src/rpc.c > @@ -100,6 +100,11 @@ int madrpc_portid(void) > return mad_portid; > } > > +int mad_rpc_portid(struct ibmad_port *srcport) > +{ > + return (srcport->port_id); > +} > + > static int > _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > int timeout) > @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > return -1; > } > > -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, > void *payload, void *rcvdata) > { > - const struct ibmad_port *p = port_id; > int status, len; > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > @@ -177,8 +181,8 @@ void 
*mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) > return 0; > > - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > - p->class_agents[rpc->mgtclass], > + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > + port->class_agents[rpc->mgtclass], > len, rpc->timeout)) < 0) { > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > return 0; > @@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > return rcvdata; > } > > -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, > ib_rmpp_hdr_t * rmpp, void *data) > { > - const struct ibmad_port *p = port_id; > int status, len; > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) > return 0; > > - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > - p->class_agents[rpc->mgtclass], > + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > + port->class_agents[rpc->mgtclass], > len, rpc->timeout)) < 0) { > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > return 0; > @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) > } > } > > -void *mad_rpc_open_port(char *dev_name, int dev_port, > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, > int *mgmt_classes, int num_classes) > { > struct ibmad_port *p; > @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port, > return p; > } > > -void mad_rpc_close_port(void *port_id) > +void mad_rpc_close_port(struct ibmad_port *port) > { > - struct ibmad_port *p = port_id; > - > - umad_close_port(p->port_id); > - free(p); > + umad_close_port(port->port_id); > + free(port); > } > > uint8_t 
*sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, > diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c > index 7403d4f..ddeb152 100644 > --- a/libibmad/src/sa.c > +++ b/libibmad/src/sa.c > @@ -44,7 +44,7 @@ > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN > > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * portid, > ib_sa_call_t * sa, unsigned timeout) > { > ib_rpc_t rpc = { 0 }; > @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > IB_PR_COMPMASK_SGID |\ > IB_PR_COMPMASK_NUMBPATH) > > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf) > { > int npath; > diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c > index fad263c..e5489b3 100644 > --- a/libibmad/src/smp.c > +++ b/libibmad/src/smp.c > @@ -45,7 +45,7 @@ > #define DEBUG if (ibdebug) IBWARN > > uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid, > - unsigned mod, unsigned timeout, const void *srcport) > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > > @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid, > } > > uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid, > - unsigned mod, unsigned timeout, const void *srcport) > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > > -- > 1.5.4.5 > From hal.rosenstock at gmail.com Fri Feb 20 05:42:31 2009 From:
hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 20 Feb 2009 08:42:31 -0500 Subject: [ofa-general] [PATCH 4/10] infiniband-diags: Convert ibportstate to "new" ibmad interface In-Reply-To: <20090219190536.f96edca7.weiny2@llnl.gov> References: <20090219190536.f96edca7.weiny2@llnl.gov> Message-ID: On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny wrote: > >From 9ae029eec58963629f4713868f383c6dd651448d Mon Sep 17 00:00:00 2001 > From: Ira Weiny > Date: Thu, 19 Feb 2009 17:27:21 -0800 > Subject: [PATCH] infiniband-diags: Convert ibportstate to "new" ibmad interface > > > Signed-off-by: Ira Weiny > --- > infiniband-diags/src/ibportstate.c | 18 ++++++++++++------ > 1 files changed, 12 insertions(+), 6 deletions(-) > > diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c > index c0b9b34..ca72bda 100644 > --- a/infiniband-diags/src/ibportstate.c > +++ b/infiniband-diags/src/ibportstate.c > @@ -46,6 +46,8 @@ > > #include "ibdiag_common.h" > > +struct ibmad_port *srcport; > + > /*******************************************/ > > static int > @@ -53,7 +55,7 @@ get_node_info(ib_portid_t *dest, uint8_t *data) > { > int node_type; > > - if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) > + if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport)) > return -1; > > node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); > @@ -69,7 +71,7 @@ get_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op) > char buf[2048]; > char val[64]; > > - if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) > + if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) > return -1; > > if (port_op != 4) { > @@ -108,7 +110,7 @@ set_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op) > char buf[2048]; > char val[64]; > > - if (!smp_set(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) > + if (!smp_set_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) > return -1; > > if (port_op != 4) > @@ -223,9 +225,12 @@ int 
main(int argc, char **argv) > if (argc < 2) > ibdiag_show_usage(); > > - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); > + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > + if (!srcport) > + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); Is this missing the corresponding mad_rpc_close_port()? The same applies to some of the subsequent patches. -- Hal > - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) > + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, > + ibd_sm_id, srcport) < 0) > IBERROR("can't resolve destination port %s", argv[0]); > > /* First, make sure it is a switch port if it is a "set" */ > @@ -314,7 +319,8 @@ int main(int argc, char **argv) > peerportid.drpath.p[1] = (uint8_t) portnum; > > /* Set DrSLID to local lid */ > - if (ib_resolve_self(&selfportid, &selfport, 0) < 0) > + if (ib_resolve_self_via(&selfportid, > + &selfport, 0, srcport) < 0) > IBERROR("could not resolve self"); > peerportid.drpath.drslid = (uint16_t) selfportid.lid; > peerportid.drpath.drdlid = 0xffff; > -- > 1.5.4.5 > From hal.rosenstock at gmail.com Fri Feb 20 05:55:57 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 20 Feb 2009 08:55:57 -0500 Subject: [ofa-general] [PATCH 0/10 libibmad/infiniband-diags -- converting to "new" interface. In-Reply-To: <20090219190520.c18280e1.weiny2@llnl.gov> References: <20090219190520.c18280e1.weiny2@llnl.gov> Message-ID: On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny wrote: > Here is v2 of the patch series. > > I used __attribute__ ((deprecated)) on the functions which should aid others > in realizing that these functions will go away. (It sure helped me to convert > all the diags.
> > Also I did _not_ convert ibnetdiscover as my new libibnetdisc already uses the > new interface and I am hoping it will be accepted soon. A related issue is whether ibnetdiscover will support both the new library and the old way until the library is more proven via some build option. If it is to support both, then converting it should be done. -- Hal > The final patch converts perfquery, saquery, sminfo, smpquery, and vendstat > because they were all simple to convert and the patch series was getting > ridiculous. > > Thanks, > Ira > > -- > Ira Weiny > Math Programmer/Computer Scientist > Lawrence Livermore National Lab > weiny2 at llnl.gov > From weiny2 at llnl.gov Fri Feb 20 09:23:50 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 20 Feb 2009 09:23:50 -0800 Subject: [ofa-general] [PATCH 0/10 libibmad/infiniband-diags -- converting to "new" interface. In-Reply-To: References: <20090219190520.c18280e1.weiny2@llnl.gov> Message-ID: <20090220092350.7ee3ddab.weiny2@llnl.gov> On Fri, 20 Feb 2009 08:55:57 -0500 Hal Rosenstock wrote: > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny wrote: > > Here is v2 of the patch series. > > > > I used __attribute__ ((deprecated)) on the functions which should aid others > > in realizing that these functions will go away. (It sure helped me to convert > > all the diags. > > > > Also I did _not_ convert ibnetdiscover as my new libibnetdisc already uses the > > new interface and I am hoping it will be accepted soon. > > A related issue is whether ibnetdiscover will support both the new > library and the old way until the library is more proven via some > build option. If it is to support both, then converting it should be > done. The conversion is easy.
I will do it for now to remove the build warnings. And now that I think about it more, leaving in the old and new code to be chosen via configure is probably not a bad idea. I don't know what is going to happen once we standardize on the mad library for decoding strings. There are some incompatibilities there (e.g., 1x vs. 1X and 2.5Gbps vs. SDR, etc.). I will say, however, that I tested the library extensively and the first version's output was identical to the old version's, with the sole exception of the order in which ports were printed. :-D So my confidence is high that it will be accepted sooner rather than later. Ira > > -- Hal > > > The final patch converts perfquery, saquery, sminfo, smpquery, and vendstat > > because they were all simple to convert and the patch series was getting > > ridiculous. > > > > Thanks, > > Ira > > > > -- > > Ira Weiny > > Math Programmer/Computer Scientist > > Lawrence Livermore National Lab > > weiny2 at llnl.gov > -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From hnrose at comcast.net Fri Feb 20 09:37:11 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 20 Feb 2009 12:37:11 -0500 Subject: [ofa-general] [PATCH] ibsim/sim.h: Better portinfo alignment in Port struct Message-ID: <20090220173711.GA3024@comcast.net> Signed-off-by: Hal Rosenstock --- ibsim/sim.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/ibsim/sim.h b/ibsim/sim.h index ec76dac..48b4536 100644 --- a/ibsim/sim.h +++ b/ibsim/sim.h @@ -197,8 +197,8 @@ struct Port { int physstate; int lmc; int hoqlife; - uint8_t op_vls; uint8_t portinfo[64]; + uint8_t op_vls; char remotenodeid[NODEIDLEN]; char remotealias[ALIASLEN + 1]; --
1.5.6.4 From hal.rosenstock at gmail.com Fri Feb 20 10:24:35 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 20 Feb 2009 13:24:35 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] [PATCH 1/10] libibmad: Clean up "new" interface In-Reply-To: References: <20090219190525.322681b8.weiny2@llnl.gov> Message-ID: On Fri, Feb 20, 2009 at 8:41 AM, Hal Rosenstock wrote: > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny wrote: >> >From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001 >> From: Ira Weiny >> Date: Wed, 18 Feb 2009 16:37:36 -0800 >> Subject: [PATCH] libibmad: Clean up "new" interface >> >> type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * >> Create new mad_rpc_portid(struct ibmad_port *srcport) function >> which mirrors madrpc_portid(void) >> Mark all "old" functions with __attribute__ ((deprecated)) >> >> Signed-off-by: Ira Weiny >> --- >> libibmad/include/infiniband/mad.h | 139 ++++++++++++++++++++++--------------- >> libibmad/src/gs.c | 19 +++--- >> libibmad/src/libibmad.map | 1 + >> libibmad/src/resolve.c | 10 ++- >> libibmad/src/rpc.c | 29 ++++---- >> libibmad/src/sa.c | 4 +- >> libibmad/src/smp.c | 4 +- >> 7 files changed, 118 insertions(+), 88 deletions(-) >> >> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h >> index 1aaaa1b..80e38be 100644 >> --- a/libibmad/include/infiniband/mad.h >> +++ b/libibmad/include/infiniband/mad.h >> @@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt) >> } >> >> /* rpc.c */ >> -MAD_EXPORT int madrpc_portid(void); >> -MAD_EXPORT int madrpc_set_retries(int retries); >> -MAD_EXPORT int madrpc_set_timeout(int timeout); retries and timeouts could also be made per ibmad_port struct basis rather than one for all clients. Those two APIs would be deprecated in favor of new ones (mad_rpc_set_retries/timeout). 
-- Hal From sean.hefty at intel.com Fri Feb 20 10:27:33 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 20 Feb 2009 10:27:33 -0800 Subject: [ofa-general] ib-diag: use of getpass() Message-ID: saquery calls getpass, and according to the man page: 'This function is obsolete. Do not use it.' Can we remove this call? What is your preference for replacing it? (Use scanf? take the SM Key as a command line argument?) From weiny2 at llnl.gov Fri Feb 20 10:28:13 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 20 Feb 2009 10:28:13 -0800 Subject: [ofa-general] [PATCH 4/10] infiniband-diags: Convert ibportstate to "new" ibmad interface In-Reply-To: References: <20090219190536.f96edca7.weiny2@llnl.gov> Message-ID: <20090220102813.9b0bd107.weiny2@llnl.gov> On Fri, 20 Feb 2009 08:42:31 -0500 Hal Rosenstock wrote: > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny wrote: > > >From 9ae029eec58963629f4713868f383c6dd651448d Mon Sep 17 00:00:00 2001 > > From: Ira Weiny > > Date: Thu, 19 Feb 2009 17:27:21 -0800 > > Subject: [PATCH] infiniband-diags: Convert ibportstate to "new" ibmad interface > > > > > > Signed-off-by: Ira Weiny > > --- > > infiniband-diags/src/ibportstate.c | 18 ++++++++++++------ > > 1 files changed, 12 insertions(+), 6 deletions(-) > > > > diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c > > index c0b9b34..ca72bda 100644 > > --- a/infiniband-diags/src/ibportstate.c > > +++ b/infiniband-diags/src/ibportstate.c > > @@ -46,6 +46,8 @@ > > > > #include "ibdiag_common.h" > > > > +struct ibmad_port *srcport; > > + > > /*******************************************/ > > > > static int > > @@ -53,7 +55,7 @@ get_node_info(ib_portid_t *dest, uint8_t *data) > > { > > int node_type; > > > > - if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) > > + if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport)) > > return -1; > > > > node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); > > @@ -69,7 +71,7 @@ 
get_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op) > > char buf[2048]; > > char val[64]; > > > > - if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) > > + if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) > > return -1; > > > > if (port_op != 4) { > > @@ -108,7 +110,7 @@ set_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op) > > char buf[2048]; > > char val[64]; > > > > - if (!smp_set(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) > > + if (!smp_set_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) > > return -1; > > > > if (port_op != 4) > > @@ -223,9 +225,12 @@ int main(int argc, char **argv) > > if (argc < 2) > > ibdiag_show_usage(); > > > > - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); > > + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > > + if (!srcport) > > + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > > Is this missing the corresponding close_port ? Same for some of the > subsequent patches. Yep I missed a couple of them. 4/10, 6/10, and 9/10. New patches to follow. 
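Hal's review point is that every successful open of a port handle needs a matching close. The v2 ibsendtrap patch later in this thread also changes `send_trap()` to `return(1)` instead of `exit(1)` so that `main()` can still close the port. A minimal sketch of that single-exit-path pattern follows. `port_open()`/`port_close()` are hypothetical stand-ins for `mad_rpc_open_port()`/`mad_rpc_close_port()`; they just allocate and free memory and count open handles.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a port handle such as struct ibmad_port. */
struct port {
	int id;
};

static int open_ports; /* handles opened but not yet closed */

static struct port *port_open(int id)
{
	struct port *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->id = id;
	open_ports++;
	return p;
}

static void port_close(struct port *p)
{
	assert(p != NULL);
	free(p);
	open_ports--;
}

/* The worker reports failure by returning a status instead of calling
 * exit(), so the caller still gets a chance to close the port. */
static int send_request(struct port *p, const char *name)
{
	(void)p; /* a real tool would send through p here */
	if (name == NULL) {
		fprintf(stderr, "unknown request\n");
		return 1;
	}
	return 0;
}

/* Single exit path: every successful open is matched by one close. */
static int run_tool(int id, const char *name)
{
	struct port *p = port_open(id);
	int rc;

	if (!p)
		return -1;
	rc = send_request(p, name);
	port_close(p);
	return rc;
}
```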
Ira > > -- Hal > > > - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) > > + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, > > + ibd_sm_id, srcport) < 0) > > IBERROR("can't resolve destination port %s", argv[0]); > > > > /* First, make sure it is a switch port if it is a "set" */ > > @@ -314,7 +319,8 @@ int main(int argc, char **argv) > > peerportid.drpath.p[1] = (uint8_t) portnum; > > > > /* Set DrSLID to local lid */ > > - if (ib_resolve_self(&selfportid, &selfport, 0) < 0) > > + if (ib_resolve_self_via(&selfportid, > > + &selfport, 0, srcport) < 0) > > IBERROR("could not resolve self"); > > peerportid.drpath.drslid = (uint16_t) selfportid.lid; > > peerportid.drpath.drdlid = 0xffff; > > -- > > 1.5.4.5 > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general > > > -- Ira Weiny Math Programer/Computer Scientist Larence Livermore National Lab weiny2 at llnl.gov From rdreier at cisco.com Fri Feb 20 10:32:20 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Feb 2009 10:32:20 -0800 Subject: [ofa-general] ib-diag: use of getpass() In-Reply-To: (Sean Hefty's message of "Fri, 20 Feb 2009 10:27:33 -0800") References: Message-ID: > saquery calls getpass, and according to the man page: > > 'This function is obsolete. Do not use it.' I believe that information may not be totally accurate. The modern glibc implementation doesn't seem to have any problems. - R. From rdreier at cisco.com Fri Feb 20 10:39:25 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Feb 2009 10:39:25 -0800 Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Handle EEH events for active connections. 
In-Reply-To: <20090217215959.16117.17150.stgit@NTAC> (Steve Wise's message of "Tue, 17 Feb 2009 16:00:00 -0600") References: <20090217215959.16117.17150.stgit@NTAC> Message-ID: > - return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); > + return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); minor but the parens around the function call are totally unnecessary. If we're touching the line anyway may as well leave them off. > +static int iwch_post_qp_fatal(int id, void *p, void *data) > +{ > + struct ib_event event; > + struct iwch_qp *qhp = p; > + > + event.event = IB_EVENT_DEVICE_FATAL; > + event.device = qhp->ibqp.device; > + event.element.qp = &qhp->ibqp; > + BUG_ON(qhp->rhp != data); > + BUG_ON(qhp->wq.qpid != id); > + if (qhp->ibqp.event_handler) { > + PDBG("%s posting DEVICE_FATAL for qpid %u\n", > + __func__, qhp->wq.qpid); > + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); This doesn't match the IB driver behavior (or the IB spec) -- the DEVICE_FATAL event is unaffiliated and delivered for the adapter as a whole. QP events are supposed to be for events connected to a single QP, not the whole adapter failing. BTW I don't think you need the * here, do you? Would be easier to read to just call it like qhp->ibqp.event_handler(&event, qhp->ibqp.qp_context) > +int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e) > +{ > + int error=0; > + struct cxio_rdev *rdev; > + > + rdev = (struct cxio_rdev *)tdev->ulp; > + if (rdev->flags) { Might be nice to wrap this rdev->flags test up in a trivial inline function (eg iwch_eeh_set() or something like that) in case other things get put into those flags later. 
> + kfree_skb(skb); > + return -EIO; > + } > + error = l2t_send(tdev, skb, l2e); > + if (error) > + kfree_skb(skb); > + return error; > +} The kfree_skb() calls here change behavior -- eg you have the change: > - l2t_send(ep->com.tdev, skb, ep->l2t); > - return 0; > + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); and now if l2t_send() returns an error the skb is freed, where before it wasn't. Also I'm wondering why you want these wrappers in iw_cxgb3 -- would it not make more sense for the cxgb3 l2t_send() to check the eeh state and always behave appropriately? Or is it more complicated than that? - R. From hal.rosenstock at gmail.com Fri Feb 20 10:42:59 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 20 Feb 2009 13:42:59 -0500 Subject: [ofa-general] ib-diag: use of getpass() In-Reply-To: References: Message-ID: On Fri, Feb 20, 2009 at 1:27 PM, Sean Hefty wrote: > saquery calls getpass, and according to the man page: > > 'This function is obsolete. Do not use it.' > > Can we remove this call? What is your preference for replacing it? (Use scanf? > take the SM Key as a command line argument?) There was a thread on this back in June 2008: http://lists.openfabrics.org/pipermail/general/2008-June/051057.html Sasha wrote: glibc info page doesn't indicate this. Also I did some googling and looked at glibc code itself - found nothing suspicious yet. Finally it is how password handled in 'su'. 
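For a platform without `getpass()` (Sean mentions Windows), the core of a getpass-style prompt on POSIX systems is to turn off terminal echo, read one line, and restore the previous terminal settings. The sketch below uses only standard termios calls (`tcgetattr`, `tcsetattr`). `read_secret()` is a hypothetical helper, not glibc's implementation. If the input is not a terminal, it simply reads a line, which also makes it testable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <termios.h>
#include <unistd.h>

/* Read one line from `in` into buf, with echo disabled when `in` is a
 * terminal. The trailing newline is stripped. Returns buf, or NULL on
 * EOF/error. Hypothetical helper for illustration only. */
char *read_secret(FILE *in, FILE *prompt_out, const char *prompt,
		  char *buf, size_t len)
{
	struct termios saved, noecho;
	int fd = fileno(in);
	int is_tty = fd >= 0 && isatty(fd) && tcgetattr(fd, &saved) == 0;
	char *ret;

	assert(buf != NULL && len > 0);
	if (prompt && prompt_out) {
		fputs(prompt, prompt_out);
		fflush(prompt_out);
	}
	if (is_tty) {
		noecho = saved;
		noecho.c_lflag &= ~(tcflag_t)ECHO; /* stop echoing keystrokes */
		tcsetattr(fd, TCSAFLUSH, &noecho);
	}
	ret = fgets(buf, (int)len, in);
	if (is_tty) {
		tcsetattr(fd, TCSAFLUSH, &saved); /* restore original mode */
		if (prompt_out)
			fputc('\n', prompt_out); /* user's Enter was not echoed */
	}
	if (ret)
		buf[strcspn(buf, "\n")] = '\0';
	return ret;
}
```

On Windows the equivalent step is clearing `ENABLE_ECHO_INPUT` on the console input handle with `SetConsoleMode()` and restoring the old mode afterwards.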
-- Hal > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Fri Feb 20 10:59:38 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 20 Feb 2009 10:59:38 -0800 Subject: [ofa-general] ib-diag: use of getpass() In-Reply-To: References: Message-ID: <6F2C9CB988674B1B9478AFBF3D1DE11B@amr.corp.intel.com> >There was a thread on this back in June 2008: >http://lists.openfabrics.org/pipermail/general/2008-June/051057.html > >Sasha wrote: >glibc info page doesn't indicate this. Also I did some >googling and looked at glibc code itself - found nothing suspicious yet. >Finally it is how password handled in 'su'. I'll add an implementation for it on windows then... From hnrose at comcast.net Fri Feb 20 12:33:36 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 20 Feb 2009 15:33:36 -0500 Subject: [ofa-general] [PATCH] ibsim: Handle sim_init_net errors better Message-ID: <20090220203336.GA3874@comcast.net> Use define rather than constant Also, cosmetic formatting and fixed some typos Signed-off-by: Hal Rosenstock --- ibsim/ibsim.c | 7 ++++--- ibsim/sim_cmd.c | 6 ++++-- ibsim/sim_mad.c | 8 ++++---- ibsim/sim_net.c | 2 +- 4 files changed, 13 insertions(+), 10 deletions(-) diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c index 7cea9de..bfc58f5 100644 --- a/ibsim/ibsim.c +++ b/ibsim/ibsim.c @@ -379,7 +379,6 @@ static int sim_ctl_get_pkeys(Client * cl, struct sim_ctl * ctl) memcpy(ctl->data, port->pkey_tbl, size); if (size < sizeof(ctl->data)) memset(ctl->data + size, 0, sizeof(ctl->data) - size); - return 0; } @@ -730,6 +729,7 @@ int main(int argc, char **argv) extern void free_core(void); char *outfname = 0, *netfile; FILE *infile, *outfile; + int status; static char const str_opts[] = "rf:dpvIsN:S:P:L:M:l:Vhu"; static const struct option 
long_opts[] = { @@ -818,8 +818,9 @@ int main(int argc, char **argv) IBPANIC("not enough memory for core structure"); DEBUG("initializing net \"%s\"", netfile); - if (sim_init_net(netfile, outfile) < 0) - IBPANIC("sim_init failed"); + status = sim_init_net(netfile, outfile); + if (status < 0) + IBPANIC("sim_init failed, status %d", status); sim_init_console(outfile); diff --git a/ibsim/sim_cmd.c b/ibsim/sim_cmd.c index c683224..94e0a14 100644 --- a/ibsim/sim_cmd.c +++ b/ibsim/sim_cmd.c @@ -203,7 +203,8 @@ static int do_relink(FILE * f, char *line) return -1; } - rport = node_get_port(lport->previous_remotenode, lport->previous_remoteport); + rport = node_get_port(lport->previous_remotenode, + lport->previous_remoteport); if (link_ports(lport, rport) < 0) return -fprintf(f, @@ -220,7 +221,8 @@ static int do_relink(FILE * f, char *line) if (!lport->previous_remotenode) continue; - rport = node_get_port(lport->previous_remotenode, lport->previous_remoteport); + rport = node_get_port(lport->previous_remotenode, + lport->previous_remoteport); if (link_ports(lport, rport) < 0) continue; diff --git a/ibsim/sim_mad.c b/ibsim/sim_mad.c index 6e08031..2fbf96f 100644 --- a/ibsim/sim_mad.c +++ b/ibsim/sim_mad.c @@ -379,7 +379,7 @@ static int do_vlarb(Port * port, unsigned op, uint32_t mod, uint8_t * data) if (op == IB_MAD_METHOD_SET) { memcpy(vlarb, data, size); } else { - memset(data, 0, 64); + memset(data, 0, IB_SMP_DATA_SIZE); memcpy(data, vlarb, size); } @@ -395,7 +395,7 @@ static int do_guidinfo(Port * port, unsigned op, uint32_t mod, uint8_t * data) if (op != IB_MAD_METHOD_GET) // only get currently supported (non compliant) status = ERR_METHOD_UNSUPPORTED; - memset(data, 0, 64); + memset(data, 0, IB_SMP_DATA_SIZE); if (mod == 0) { if (node->type == SWITCH_NODE) mad_encode_field(data, IB_GUID_GUID0_F, &node->nodeguid); @@ -613,7 +613,7 @@ static int pc_updated(Port ** srcport, Port * destport) uint32_t madsize_div_4 = 72; //real data divided by 4 if (*srcport != destport) 
{ - //PKT get out of port .. + //PKT got out of port .. srcpc->flow_xmt_pkts = addval(srcpc->flow_xmt_pkts, 1, GS_PERF_XMT_PKTS_LIMIT); srcpc->flow_xmt_bytes = @@ -629,7 +629,7 @@ static int pc_updated(Port ** srcport, Port * destport) VERB("drop pkt due error rate %d", destport->errrate); return 0; } - //PKT get in to the port .. + //PKT got into the port .. destpc->flow_rcv_pkts = addval(destpc->flow_rcv_pkts, 1, GS_PERF_RCV_PKTS_LIMIT); destpc->flow_rcv_bytes = diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c index f0628ec..fa05c35 100644 --- a/ibsim/sim_net.c +++ b/ibsim/sim_net.c @@ -1116,7 +1116,7 @@ int link_ports(Port * lport, Port * rport) rport->remoteport = lport->portnum; set_portinfo(rport, rnode->type == SWITCH_NODE ? swport : hcaport); memcpy(rport->remotenodeid, lnode->nodeid, sizeof(rport->remotenodeid)); - lport->state = rport->state = 2; // Initialilze + lport->state = rport->state = 2; // Initialize lport->physstate = rport->physstate = 5; // LinkUP if (lnode->sw) lnode->sw->portchange = 1; -- 1.5.6.4 From neutronsharc at gmail.com Fri Feb 20 12:44:12 2009 From: neutronsharc at gmail.com (neutron) Date: Fri, 20 Feb 2009 15:44:12 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] ib_reg_phys_mr( ) results in crash In-Reply-To: <499E6826.704@sun.com> References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com> <7d5928b30902191047o25c34462w4cc51d7b88b888c6@mail.gmail.com> <499E6826.704@sun.com> Message-ID: <7d5928b30902201244i24aff45ct2cabcb99e68ce469@mail.gmail.com> When we installed the ofed, we use: "/install.pl --all". So we expect it should have installed everything. "ofed_info" shows "ofa_kernel-1.3.1" is installed, but "ofa_kernel_devel" is not. What's that package for? where to get it? It seems not located at " /SRPMS ". Thanks. Below is the output given by "ofed_info". 
----------------- OFED-1.3.1 libibverbs: git://git.openfabrics.org/ofed_1_3/libibverbs.git ofed_1_3 commit 40b771aa6a9c0ad092b2e20775b4723d3b173792 libmthca: git://git.openfabrics.org/ofed_1_3/libmthca.git ofed_1_3 commit 9501e698d257949acfab2edc90812602966dbcc9 libmlx4: git://git.openfabrics.org/ofed_1_3/libmlx4.git ofed_1_3 commit 3869d6dab7e12fe452270ca641f7dd7082b42482 libehca: git://git.openfabrics.org/ofed_1_3/libehca.git ofed_1_3 commit fd898180cfa3b737f893f432a80b91bac3396325 libipathverbs: git://git.openfabrics.org/ofed_1_3/libipathverbs.git ofed_1_3 commit 82be4d81859d1fd2edf830220fe65a9923b80a46 libcxgb3: git://git.openfabrics.org/ofed_1_3/libcxgb3.git ofed_1_3 commit 6f7485feb244d8571fcab2292ef92c97bea48df0 libnes: git://git.openfabrics.org/ofed_1_3/libnes.git ofed_1_3 commit 471fa2e5a7bb2f8946119396358c31adcc6c2fb3 libibcm: git://git.openfabrics.org/ofed_1_3/libibcm.git ofed_1_3 commit 53ec35f544bbc1838bbadc2210909c25a954a5e2 librdmacm: git://git.openfabrics.org/ofed_1_3/librdmacm.git ofed_1_3 commit a0ef80a1e0d5debdae48a844fbc8d09aec5b24b1 dapl1: git://git.openfabrics.org/ofed_1_3/dapl1.git ofed_1_3 commit 7a9b58d6c50fc0a357de540ec3eb2ab2e07f8779 dapl2: git://git.openfabrics.org/ofed_1_3/dapl2.git ofed_1_3 commit 2583f07d9d0f55eee14e0b0e6074bc6fd0712177 libsdp: git://git.openfabrics.org/ofed_1_3/libsdp.git ofed_1_3 commit c8102dccc502930442b23de658674d386456b350 sdpnetstat: git://git.openfabrics.org/ofed_1_3/sdpnetstat.git ofed_1_3 commit 3341620a7259c4f7bdd4180864b98e260c3dc223 srptools: git://git.openfabrics.org/ofed_1_3/srptools.git ofed_1_3 commit e0ce2d42eeb25f8e89b8f6daaa32a630c9b64f0d perftest: git://git.openfabrics.org/ofed_1_3/perftest.git ofed_1_3 commit 6321b5468f7293088cc003809049c02b176130d8 qlvnictools: git://git.openfabrics.org/ofed_1_3/qlvnictools.git ofed_1_3 commit 086f9cb80ee790d61bddaf201ecbae32a2ff21dd tvflash: git://git.openfabrics.org/ofed_1_3/tvflash.git ofed_1_3 commit f5e7407a7f2058448df5e5320d9843f944427429 mstflint: 
git://git.openfabrics.org/ofed_1_3/mstflint.git ofed_1_3 commit 78bbd3d521a9078553a991111ffb6f76665b9ee9 qperf: git://git.openfabrics.org/ofed_1_3/qperf.git ofed_1_3 commit 6221aabd038df0b7033e035378ca190641ed2295 management: git://git.openfabrics.org/ofed_1_3/management.git ofed_1_3 commit d9c852406dae14e8284f9cfb1c7f495bbb55fddf ibutils: git://git.openfabrics.org/ofed_1_3/ibutils.git ofed_1_3 commit 7daf94fab6eaf307316326f3f49704e6080a1508 ibsim: git://git.openfabrics.org/ofed_1_3/ibsim.git ofed_1_3 commit 55113d9f919709c7c97ea41d29991941b9c8be70 ofa_kernel-1.3.1: Git: git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel commit 39e1dc833f98e5134f91fcf7f33df402adf4bc0c # MPI mvapich-1.0.1-2533.src.rpm mvapich2-1.0.3-1.src.rpm openmpi-1.2.6-1.src.rpm mpitests-3.0-773.src.rpm =----------------- On Fri, Feb 20, 2009 at 3:21 AM, Liang Zhen wrote: > Hmm, I didn't see any problem in your code. Have you installed > ofa_kernel_devel (kernel headers of OFED) after installation of > ofa_kernel_1_3_1? > > Regards > Liang > > neutron: >> >> I'm using Mellanox HCA 'mthca0' type: MT25208, kernel version: >> 2.6.18-53.1.14.el5, ofed 1.3.1. >> >> The failed function call is like: >> >> { >> >> ctx->send_buf = dma_alloc_coherent(ctx->ib_dev->dma_device, MAX_SIZE, >> &dma_addr, GFP_KERNEL); >> >> ctx->phy_buf[0].addr = dma_addr; >> ctx->phy_buf[0].size = MAX_SIZE; >> ctx->iovstart = (u64) ctx->send_buf; >> >> printk("pd=%p, phy_buf[0].addr=%p,size=%d, iovstart=%llx\n", >> ctx->pd, ctx->phy_buf[0].addr, ctx->phy_buf[0].size, ctx->iovstart >> ); >> >> send_mr = ib_reg_phys_mr( ctx->pd, &ctx->phy_buf[0], 1, >> IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ >> | IB_ACCESS_LOCAL_WRITE, &(ctx->iovstart)); >> } >> >> The phy_buf[0] is a "ib_phys_buf" corresponding to "ctx->send_buf". >> >> Below is /var/log/messages output around the crash. 
>> ---------------- >> Feb 19 12:50:22 wci30 kernel: pd=ffff8101da3ddce0, >> phy_buf[0].addr=00000001bbe4b000,size=1024, iovstart=ffff8101bbe4b000 >> >> Feb 19 12:50:22 wci30 kernel: Unable to handle kernel NULL pointer >> dereference at 0000000000000000 >> RIP: >> Feb 19 12:50:22 wci30 kernel: [<0000000000000000>] >> _stext+0x7ffff000/0x1000 >> Feb 19 12:50:22 wci30 kernel: PGD 1c06d5067 PUD 1c9dcd067 PMD 0 >> Feb 19 12:50:22 wci30 kernel: Oops: 0010 [1] SMP >> Feb 19 12:50:22 wci30 kernel: last sysfs file: /module/libata/version >> Feb 19 12:50:22 wci30 kernel: CPU 0 >> Feb 19 12:54:05 wci30 syslogd 1.4.1: restart. >> Feb 19 12:54:05 wci30 kernel: klogd 1.4.1, log source = /proc/kmsg >> started. >> Feb 19 12:54:05 wci30 kernel: Linux version 2.6.18-53.1.14.el5 >> (brewbuilder at hs20-bc2-3.build.redha >> t.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Tue Feb >> 19 07:18:46 EST 2008 >> Feb 19 12:54:05 wci30 kernel: Command line: ro root=LABEL=/ rhgb quiet >> >> ==================== >> It's strange that the kernel doesn't print out the function call stack >> before crashing. >> >> Any hints? Thanks a lot! >> >> On Wed, Feb 18, 2009 at 7:40 PM, Roland Dreier wrote: >> >>> >>> > Before calling ib_reg_phys_mr, printk() shows that all its arguments >>> > are valid. But the system always crashes immediately after entering >>> > the function ib_reg_phys_mr( ). Any possible reasons ? Thanks!! >>> >>> What do you mean by "immediately after entering ib_reg_phys_mr()"? Do >>> you get an oops message? If so that would be very important info for >>> debugging this. >>> >>> - R. 
>>> >>> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > > From divy at chelsio.com Fri Feb 20 13:27:28 2009 From: divy at chelsio.com (Divy Le Ray) Date: Fri, 20 Feb 2009 13:27:28 -0800 Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Handle EEH events for active connections. In-Reply-To: References: <20090217215959.16117.17150.stgit@NTAC> Message-ID: <499F2040.50008@chelsio.com> Roland Dreier wrote: > > - return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); > > + return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); > > minor but the parens around the function call are totally unnecessary. > If we're touching the line anyway may as well leave them off. > > > +static int iwch_post_qp_fatal(int id, void *p, void *data) > > +{ > > + struct ib_event event; > > + struct iwch_qp *qhp = p; > > + > > + event.event = IB_EVENT_DEVICE_FATAL; > > + event.device = qhp->ibqp.device; > > + event.element.qp = &qhp->ibqp; > > + BUG_ON(qhp->rhp != data); > > + BUG_ON(qhp->wq.qpid != id); > > + if (qhp->ibqp.event_handler) { > > + PDBG("%s posting DEVICE_FATAL for qpid %u\n", > > + __func__, qhp->wq.qpid); > > + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); > > This doesn't match the IB driver behavior (or the IB spec) -- the > DEVICE_FATAL event is unaffiliated and delivered for the adapter as a > whole. QP events are supposed to be for events connected to a single > QP, not the whole adapter failing. > > BTW I don't think you need the * here, do you? 
Would be easier to read > to just call it like > > qhp->ibqp.event_handler(&event, qhp->ibqp.qp_context) > > > +int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e) > > +{ > > + int error=0; > > + struct cxio_rdev *rdev; > > + > > + rdev = (struct cxio_rdev *)tdev->ulp; > > + if (rdev->flags) { > > Might be nice to wrap this rdev->flags test up in a trivial inline > function (eg iwch_eeh_set() or something like that) in case other things > get put into those flags later. > > > + kfree_skb(skb); > > + return -EIO; > > + } > > + error = l2t_send(tdev, skb, l2e); > > + if (error) > > + kfree_skb(skb); > > + return error; > > +} > > The kfree_skb() calls here change behavior -- eg you have the change: > > > - l2t_send(ep->com.tdev, skb, ep->l2t); > > - return 0; > > + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); > > and now if l2t_send() returns an error the skb is freed, where before it > wasn't. > > Also I'm wondering why you want these wrappers in iw_cxgb3 -- would it > not make more sense for the cxgb3 l2t_send() to check the eeh state and > always behave appropriately? Or is it more complicated than that? > Hi Roland, l2t_send() is used on the connection setup/teardown path for iWARP, but it is on the data path of the iSCSI offload module.
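Roland's review comments above can be illustrated together in one short sketch:

- wrap the flags test in a trivial inline predicate;
- call through a function pointer without the explicit `*`;
- give a send wrapper one documented ownership rule for the buffer on error, since changing it silently (as the new `kfree_skb()` on failure does) affects every converted caller.

All types and names here (`struct buf`, `struct dev`, `DEV_FLAG_FATAL`, `dev_send`, `demo_xmit`) are hypothetical. They are not the cxgb3/iw_cxgb3 driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for sk_buff and the rdev state. */
struct buf {
	int len;
};

#define DEV_FLAG_FATAL 0x1u

struct dev {
	unsigned int flags;
	int (*xmit)(struct dev *d, struct buf *b); /* lower-level send */
	int frees;                                  /* demo bookkeeping */
};

/* Trivial predicate, so new flag bits do not change every call site. */
static inline int dev_is_fatal(const struct dev *d)
{
	return (d->flags & DEV_FLAG_FATAL) != 0;
}

static void buf_free(struct dev *d, struct buf *b)
{
	free(b);
	d->frees++;
}

/* Contract: dev_send() always consumes b. On success the lower layer
 * owns it; on any error it is freed here, so callers must not touch b
 * after the call. Callers converted from the raw send must follow the
 * same rule, or a buffer is leaked or freed twice. */
static int dev_send(struct dev *d, struct buf *b)
{
	int err;

	if (dev_is_fatal(d)) {
		buf_free(d, b);
		return -EIO;
	}
	/* Call through the pointer directly; (*d->xmit)(...) is not needed. */
	err = d->xmit(d, b);
	if (err)
		buf_free(d, b);
	return err;
}

/* Demo lower layer: consumes the buffer only on success and rejects
 * empty buffers without freeing them. */
static int demo_xmit(struct dev *d, struct buf *b)
{
	if (b->len == 0)
		return -EINVAL;
	buf_free(d, b);
	return 0;
}
```

Whichever rule is chosen (free on error, or leave the buffer to the caller), the point of Roland's comment is that it must match what callers of the old `l2t_send()` path already assume.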
Cheers, Divy From weiny2 at llnl.gov Fri Feb 20 13:51:44 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 20 Feb 2009 13:51:44 -0800 Subject: [ofa-general] [PATCH 4/10] infiniband-diags: Convert ibportstate to "new" ibmad interface In-Reply-To: <20090220102813.9b0bd107.weiny2@llnl.gov> References: <20090219190536.f96edca7.weiny2@llnl.gov> <20090220102813.9b0bd107.weiny2@llnl.gov> Message-ID: <20090220135144.9e3cc6db.weiny2@llnl.gov> On Fri, 20 Feb 2009 10:28:13 -0800 Ira Weiny wrote: > On Fri, 20 Feb 2009 08:42:31 -0500 > Hal Rosenstock wrote: > > > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny wrote: > > > >From 9ae029eec58963629f4713868f383c6dd651448d Mon Sep 17 00:00:00 2001 > > > From: Ira Weiny > > > Date: Thu, 19 Feb 2009 17:27:21 -0800 > > > Subject: [PATCH] infiniband-diags: Convert ibportstate to "new" ibmad interface > > > > > > > > > Signed-off-by: Ira Weiny > > > --- > > > infiniband-diags/src/ibportstate.c | 18 ++++++++++++------ > > > > > > - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); > > > + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > > > + if (!srcport) > > > + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > > > > Is this missing the corresponding close_port ? Same for some of the > > subsequent patches. > > Yep I missed a couple of them. 4/10, 6/10, and 9/10. New patches to follow. > > Ira > Nope 9/10 does not require this as it uses umad to close the port. The 2 new patches for 4/10 and 6/10 follow. 
Ira From weiny2 at llnl.gov Fri Feb 20 13:51:50 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 20 Feb 2009 13:51:50 -0800 Subject: [ofa-general] [PATCHv2 4/10] infiniband-diags: Convert ibportstate to "new" ibmad interface In-Reply-To: References: <20090219190536.f96edca7.weiny2@llnl.gov> Message-ID: <20090220135150.cd171cc2.weiny2@llnl.gov> >From 5630f01688b7ea755b02d183d73edc86339f2e8b Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 17:27:21 -0800 Subject: [PATCH] infiniband-diags: Convert ibportstate to "new" ibmad interface Signed-off-by: Ira Weiny --- infiniband-diags/src/ibportstate.c | 19 +++++++++++++------ 1 files changed, 13 insertions(+), 6 deletions(-) diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c index c0b9b34..65c9ca1 100644 --- a/infiniband-diags/src/ibportstate.c +++ b/infiniband-diags/src/ibportstate.c @@ -46,6 +46,8 @@ #include "ibdiag_common.h" +struct ibmad_port *srcport; + /*******************************************/ static int @@ -53,7 +55,7 @@ get_node_info(ib_portid_t *dest, uint8_t *data) { int node_type; - if (!smp_query(data, dest, IB_ATTR_NODE_INFO, 0, 0)) + if (!smp_query_via(data, dest, IB_ATTR_NODE_INFO, 0, 0, srcport)) return -1; node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); @@ -69,7 +71,7 @@ get_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op) char buf[2048]; char val[64]; - if (!smp_query(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) + if (!smp_query_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) return -1; if (port_op != 4) { @@ -108,7 +110,7 @@ set_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op) char buf[2048]; char val[64]; - if (!smp_set(data, dest, IB_ATTR_PORT_INFO, portnum, 0)) + if (!smp_set_via(data, dest, IB_ATTR_PORT_INFO, portnum, 0, srcport)) return -1; if (port_op != 4) @@ -223,9 +225,12 @@ int main(int argc, char **argv) if (argc < 2) ibdiag_show_usage(); - madrpc_init(ibd_ca, ibd_ca_port, 
mgmt_classes, 3); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); - if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) + if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, + ibd_sm_id, srcport) < 0) IBERROR("can't resolve destination port %s", argv[0]); /* First, make sure it is a switch port if it is a "set" */ @@ -314,7 +319,8 @@ int main(int argc, char **argv) peerportid.drpath.p[1] = (uint8_t) portnum; /* Set DrSLID to local lid */ - if (ib_resolve_self(&selfportid, &selfport, 0) < 0) + if (ib_resolve_self_via(&selfportid, + &selfport, 0, srcport) < 0) IBERROR("could not resolve self"); peerportid.drpath.drslid = (uint16_t) selfportid.lid; peerportid.drpath.drdlid = 0xffff; @@ -354,5 +360,6 @@ int main(int argc, char **argv) } } + mad_rpc_close_port(srcport); exit(0); } -- 1.5.4.5 From weiny2 at llnl.gov Fri Feb 20 13:51:55 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 20 Feb 2009 13:51:55 -0800 Subject: [ofa-general] [PATCH v2 6/10] infiniband-diags: Convert ibsendtrap to "new" ibmad interface In-Reply-To: <20090219190546.4fcaa158.weiny2@llnl.gov> References: <20090219190546.4fcaa158.weiny2@llnl.gov> Message-ID: <20090220135155.39cbe4e6.weiny2@llnl.gov> >From f70635f4d62fb57221a4239a2013e602f6449548 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 19 Feb 2009 17:53:30 -0800 Subject: [PATCH] infiniband-diags: Convert ibsendtrap to "new" ibmad interface also make mad_send_via public to do the conversion Signed-off-by: Ira Weiny --- infiniband-diags/src/ibsendtrap.c | 21 ++++++++++++++------- libibmad/src/libibmad.map | 1 + 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index ba6aa8b..75120f0 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -47,6 +47,8 @@ #include "ibdiag_common.h" +struct 
ibmad_port *srcport; + static int send_144_node_desc_update(void) { ib_portid_t sm_port; @@ -55,10 +57,10 @@ static int send_144_node_desc_update(void) ib_rpc_t trap_rpc; ib_mad_notice_attr_t notice; - if (ib_resolve_self(&selfportid, &selfport, NULL)) + if (ib_resolve_self_via(&selfportid, &selfport, NULL, srcport)) IBERROR("can't resolve self"); - if (ib_resolve_smlid(&sm_port, 0)) + if (ib_resolve_smlid_via(&sm_port, 0, srcport)) IBERROR("can't resolve SM destination port"); memset(&trap_rpc, 0, sizeof(trap_rpc)); @@ -80,7 +82,7 @@ static int send_144_node_desc_update(void) notice.data_details.ntc_144.change_flgs = TRAP_144_MASK_NODE_DESCRIPTION_CHANGE; - return (mad_send(&trap_rpc, &sm_port, NULL, ¬ice)); + return (mad_send_via(&trap_rpc, &sm_port, NULL, ¬ice, srcport)); } typedef struct _trap_def { @@ -103,7 +105,7 @@ int send_trap(char *trap_name) } } ibdiag_show_usage(); - exit(1); + return(1); } int main(int argc, char **argv) @@ -111,7 +113,7 @@ int main(int argc, char **argv) char usage_args[1024]; int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS }; char *trap_name = NULL; - int i, n; + int i, n, rc; n = sprintf(usage_args, "[]\n" "\nArgument can be one of the following:\n"); @@ -137,7 +139,12 @@ int main(int argc, char **argv) } madrpc_show_errors(1); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2); - return (send_trap(trap_name)); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + + rc = send_trap(trap_name); + mad_rpc_close_port(srcport); + return (rc); } diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index bac74a9..0412027 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -91,6 +91,7 @@ IBMAD_1.3 { mad_receive_via; mad_respond_via; mad_send; + mad_send_via; smp_query; smp_set; ib_vendor_call; -- 1.5.4.5 From hnrose at comcast.net Fri Feb 20 13:59:38 2009 From: hnrose at comcast.net (Hal 
Rosenstock) Date: Fri, 20 Feb 2009 16:59:38 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] infiniband-diags/saquery.c: Convert more LID prints to unsigned decimal Message-ID: <20090220215938.GB7360@comcast.net> Signed-off-by: Hal Rosenstock --- infiniband-diags/src/saquery.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 9726d22..bcd1f61 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -332,13 +332,13 @@ static void dump_class_port_info(void *data) "\t\tResponse time value......0x%02X\n" "\t\tRedirect GID.............%s\n" "\t\tRedirect TC/SL/FL........0x%08X\n" - "\t\tRedirect LID.............0x%04X\n" + "\t\tRedirect LID.............%u\n" "\t\tRedirect PKey............0x%04X\n" "\t\tRedirect QP..............0x%08X\n" "\t\tRedirect QKey............0x%08X\n" "\t\tTrap GID.................%s\n" "\t\tTrap TC/SL/FL............0x%08X\n" - "\t\tTrap LID.................0x%04X\n" + "\t\tTrap LID.................%u\n" "\t\tTrap PKey................0x%04X\n" "\t\tTrap HL/QP...............0x%08X\n" "\t\tTrap QKey................0x%08X\n", @@ -360,7 +360,7 @@ static void dump_portinfo_record(void *data) const ib_port_info_t *const p_pi = &p_pir->port_info; printf("PortInfoRecord dump:\n" - "\t\tEndPortLid..............0x%X\n" + "\t\tEndPortLid..............%u\n" "\t\tPortNum.................0x%X\n" "\t\tbase_lid................0x%X\n" "\t\tmaster_sm_base_lid......0x%X\n" -- 1.5.6.4 From hnrose at comcast.net Fri Feb 20 13:58:45 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 20 Feb 2009 16:58:45 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] libibmad/fields.c: Dump LIDs as unsigned decimal Message-ID: <20090220215845.GA7360@comcast.net> Signed-off-by: Hal Rosenstock --- libibmad/src/fields.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index d6742f9..e14dbb5 
100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -123,8 +123,8 @@ static const ib_field_t ib_mad_f[] = { */ {0, 64, "Mkey", mad_dump_hex}, {64, 64, "GidPrefix", mad_dump_hex}, - {BITSOFFS(128, 16), "Lid", mad_dump_hex}, - {BITSOFFS(144, 16), "SMLid", mad_dump_hex}, + {BITSOFFS(128, 16), "Lid", mad_dump_uint}, + {BITSOFFS(144, 16), "SMLid", mad_dump_uint}, {160, 32, "CapMask", mad_dump_portcapmask}, {BITSOFFS(192, 16), "DiagCode", mad_dump_hex}, {BITSOFFS(208, 16), "MkeyLeasePeriod", mad_dump_uint}, -- 1.5.6.4 From weiny2 at llnl.gov Fri Feb 20 14:34:02 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 20 Feb 2009 14:34:02 -0800 Subject: [ofa-general] [PATCH 11/10] libibmad:infiniband-diags: deprecate madrpc_set_[retries|timeout] WAS: [PATCH 1/10] libibmad: Clean up "new" interface In-Reply-To: References: <20090219190525.322681b8.weiny2@llnl.gov> Message-ID: <20090220143402.c3b23b0a.weiny2@llnl.gov> On Fri, 20 Feb 2009 13:24:35 -0500 Hal Rosenstock wrote: > On Fri, Feb 20, 2009 at 8:41 AM, Hal Rosenstock > wrote: > > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny wrote: > >> >From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001 > >> From: Ira Weiny > >> Date: Wed, 18 Feb 2009 16:37:36 -0800 > >> Subject: [PATCH] libibmad: Clean up "new" interface > >> > >> type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * > >> Create new mad_rpc_portid(struct ibmad_port *srcport) function > >> which mirrors madrpc_portid(void) > >> Mark all "old" functions with __attribute__ ((deprecated)) > >> > >> Signed-off-by: Ira Weiny > >> --- > >> libibmad/include/infiniband/mad.h | 139 ++++++++++++++++++++++--------------- > >> libibmad/src/gs.c | 19 +++--- > >> libibmad/src/libibmad.map | 1 + > >> libibmad/src/resolve.c | 10 ++- > >> libibmad/src/rpc.c | 29 ++++---- > >> libibmad/src/sa.c | 4 +- > >> libibmad/src/smp.c | 4 +- > >> 7 files changed, 118 insertions(+), 88 deletions(-) > >> > >> diff --git 
a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > >> index 1aaaa1b..80e38be 100644 > >> --- a/libibmad/include/infiniband/mad.h > >> +++ b/libibmad/include/infiniband/mad.h > >> @@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt) > >> } > >> > >> /* rpc.c */ > >> -MAD_EXPORT int madrpc_portid(void); > >> -MAD_EXPORT int madrpc_set_retries(int retries); > >> -MAD_EXPORT int madrpc_set_timeout(int timeout); > > retries and timeouts could also be made per ibmad_port struct basis > rather than one for all clients. Those two APIs would be deprecated in > favor of new ones (mad_rpc_set_retries/timeout). > Patch below. (To be applied after the others.) >From d12b291041bdfe0d3bddecb7a71ee769a601fd83 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 20 Feb 2009 14:30:52 -0800 Subject: [PATCH] libibmad:infiniband-diags: deprecate madrpc_set_[retries|timeout] replace with mad_rpc_set_[retries|timeout] which are per ibmad_port object Update all diags with new functions Signed-off-by: Ira Weiny --- infiniband-diags/src/ibaddr.c | 1 + infiniband-diags/src/ibdiag_common.c | 1 - infiniband-diags/src/ibping.c | 1 + infiniband-diags/src/ibportstate.c | 1 + infiniband-diags/src/ibroute.c | 1 + infiniband-diags/src/ibsendtrap.c | 1 + infiniband-diags/src/ibsysstat.c | 1 + infiniband-diags/src/ibtracert.c | 1 + infiniband-diags/src/perfquery.c | 1 + infiniband-diags/src/saquery.c | 1 + infiniband-diags/src/sminfo.c | 1 + infiniband-diags/src/smpquery.c | 1 + infiniband-diags/src/vendstat.c | 1 + libibmad/include/infiniband/mad.h | 6 ++++-- libibmad/src/libibmad.map | 2 ++ libibmad/src/mad_internal.h | 2 ++ libibmad/src/rpc.c | 29 ++++++++++++++++++++--------- 17 files changed, 40 insertions(+), 12 deletions(-) diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c index bb22be9..e782b36 100644 --- a/infiniband-diags/src/ibaddr.c +++ b/infiniband-diags/src/ibaddr.c @@ -142,6 +142,7 @@ int main(int argc, char **argv) 
srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (!srcport) IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); if (argc) { if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index 609df69..38d6cd3 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -175,7 +175,6 @@ static int process_opt(int ch, char *optarg) break; case 't': val = strtoul(optarg, 0, 0); - madrpc_set_timeout(val); ibd_timeout = val; break; case 's': diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c index 901079f..28e3a64 100644 --- a/infiniband-diags/src/ibping.c +++ b/infiniband-diags/src/ibping.c @@ -213,6 +213,7 @@ int main(int argc, char **argv) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (!srcport) IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); if (server) { if (mad_register_server_via(ping_class, 0, 0, oui, srcport) < 0) diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c index 65c9ca1..deaad51 100644 --- a/infiniband-diags/src/ibportstate.c +++ b/infiniband-diags/src/ibportstate.c @@ -228,6 +228,7 @@ int main(int argc, char **argv) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (!srcport) IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, ibd_sm_id, srcport) < 0) diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c index 60bfdd8..07eddc4 100644 --- a/infiniband-diags/src/ibroute.c +++ b/infiniband-diags/src/ibroute.c @@ -410,6 +410,7 @@ int main(int argc, char **argv) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (!srcport) IBERROR("Failed to 
open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); if (!argc) { if (ib_resolve_self_via(&portid, 0, 0, srcport) < 0) diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index 75120f0..916b537 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -143,6 +143,7 @@ int main(int argc, char **argv) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); if (!srcport) IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); rc = send_trap(trap_name); mad_rpc_close_port(srcport); diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c index d7daa37..7e668e8 100644 --- a/infiniband-diags/src/ibsysstat.c +++ b/infiniband-diags/src/ibsysstat.c @@ -339,6 +339,7 @@ int main(int argc, char **argv) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (!srcport) IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); if (server) { if (mad_register_server_via(sysstat_class, 1, 0, oui, srcport) < 0) diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c index 1965aa0..87b5b17 100644 --- a/infiniband-diags/src/ibtracert.c +++ b/infiniband-diags/src/ibtracert.c @@ -753,6 +753,7 @@ int main(int argc, char **argv) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (!srcport) IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); node_name_map = open_node_name_map(node_name_map_file); diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 2f104b8..3d89cc7 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -389,6 +389,7 @@ int main(int argc, char **argv) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4); if (!srcport) IBERROR("Failed to open '%s' port 
'%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); if (argc) { if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index e6cbe50..43eff85 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -1323,6 +1323,7 @@ static bind_handle_t get_bind_handle(void) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); if (!srcport) IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); ib_resolve_smlid_via(&handle.dport, ibd_timeout, srcport); if (!handle.dport.lid) diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c index ebf6a47..0caa3f3 100644 --- a/infiniband-diags/src/sminfo.c +++ b/infiniband-diags/src/sminfo.c @@ -118,6 +118,7 @@ int main(int argc, char **argv) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (!srcport) IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); if (argc) { if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c index 2ed1e65..dc6b685 100644 --- a/infiniband-diags/src/smpquery.c +++ b/infiniband-diags/src/smpquery.c @@ -455,6 +455,7 @@ int main(int argc, char **argv) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (!srcport) IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); node_name_map = open_node_name_map(node_name_map_file); diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c index d001a01..1c1c08f 100644 --- a/infiniband-diags/src/vendstat.c +++ b/infiniband-diags/src/vendstat.c @@ -157,6 +157,7 @@ int main(int argc, char **argv) srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4); if (!srcport) IBERROR("Failed to open '%s' port '%d'", 
ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); if (argc) { if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 5cf135e..cbd3049 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -693,8 +693,6 @@ MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t * rpc, ib_portid_t * dport, /* New interface */ MAD_EXPORT void madrpc_show_errors(int set); -MAD_EXPORT int madrpc_set_retries(int retries); -MAD_EXPORT int madrpc_set_timeout(int timeout); MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, int num_classes); MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport); @@ -703,6 +701,8 @@ MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_po MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data); MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); +MAD_EXPORT int mad_rpc_set_retries(int retries, struct ibmad_port *srcport); +MAD_EXPORT int mad_rpc_set_timeout(int timeout_ms, struct ibmad_port *srcport); /* register.c */ MAD_EXPORT int mad_register_port_client(int port_id, int mgmt, @@ -761,6 +761,8 @@ static inline int mad_is_vendor_range2(int mgmt) } /* rpc.c */ +MAD_EXPORT int madrpc_set_retries(int retries) __attribute__ ((deprecated)); +MAD_EXPORT int madrpc_set_timeout(int timeout) __attribute__ ((deprecated)); MAD_EXPORT int madrpc_portid(void) __attribute__ ((deprecated)); void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata) __attribute__ ((deprecated)); diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 0412027..f231485 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -80,6 +80,8 @@ IBMAD_1.3 { madrpc_save_mad; madrpc_set_retries; madrpc_set_timeout; + mad_rpc_set_retries; + 
mad_rpc_set_timeout; madrpc_show_errors; ib_path_query; sa_call; diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h index 9afe7a9..3991cc3 100644 --- a/libibmad/src/mad_internal.h +++ b/libibmad/src/mad_internal.h @@ -39,6 +39,8 @@ struct ibmad_port { int port_id; /* file descriptor returned by umad_open() */ int class_agents[MAX_CLASS]; /* class2agent mapper */ + int retries; + int timeout_ms; }; #endif /* _MAD_INTERNAL_H_ */ diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index 210f0c2..229020d 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -49,7 +49,7 @@ int ibdebug; static int mad_portid = -1; static int iberrs; - + int timeout; static int madrpc_retries = MAD_DEF_RETRIES; static int def_madrpc_timeout = MAD_DEF_TIMEOUT_MS; static void *save_mad; @@ -85,9 +85,17 @@ int madrpc_set_timeout(int timeout) return 0; } -int madrpc_def_timeout(void) +int mad_rpc_set_retries(int retries, struct ibmad_port *srcport) +{ + if (retries > 0) + srcport->retries = retries; + return srcport->retries; +} + +int mad_rpc_set_timeout(int timeout_ms, struct ibmad_port *srcport) { - return def_madrpc_timeout; + srcport->timeout_ms = timeout_ms; + return 0; } int madrpc_portid(void) @@ -102,14 +110,14 @@ int mad_rpc_portid(struct ibmad_port *srcport) static int _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, - int timeout) + int timeout, const struct ibmad_port *srcport) { uint32_t trid; /* only low 32 bits */ - int retries; + int retries, max_retries; int length, status; if (!timeout) - timeout = def_madrpc_timeout; + timeout = srcport ? srcport->timeout_ms : def_madrpc_timeout; if (ibdebug > 1) { IBWARN(">>> sending: len %d pktsz %zu", len, umad_size() + len); @@ -125,7 +133,8 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, trid = (uint32_t) mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F); - for (retries = 0; retries < madrpc_retries; retries++) { + max_retries = srcport ? 
srcport->retries : madrpc_retries; + for (retries = 0; retries < max_retries; retries++) { if (retries) { ERRS("retry %d (timeout %d ms)", retries, timeout); } @@ -178,7 +187,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, port->class_agents[rpc->mgtclass], - len, rpc->timeout)) < 0) { + len, rpc->timeout, port)) < 0) { IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); return 0; } @@ -217,7 +226,7 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, port->class_agents[rpc->mgtclass], - len, rpc->timeout)) < 0) { + len, rpc->timeout, port)) < 0) { IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); return 0; } @@ -356,6 +365,8 @@ struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, } p->port_id = port_id; + p->retries = MAD_DEF_RETRIES; + p->timeout_ms = MAD_DEF_TIMEOUT_MS; return p; } -- 1.5.4.5 From weiny2 at llnl.gov Fri Feb 20 14:45:29 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 20 Feb 2009 14:45:29 -0800 Subject: [ofa-general] [PATCH 12/10] infiniband-diags: convert ibnetdiscover to "new" ibmad interface WAS: [PATCH 0/10] libibmad/infiniband-diags -- converting to "new" interface. In-Reply-To: <20090220092350.7ee3ddab.weiny2@llnl.gov> References: <20090219190520.c18280e1.weiny2@llnl.gov> <20090220092350.7ee3ddab.weiny2@llnl.gov> Message-ID: <20090220144529.018d8675.weiny2@llnl.gov> On Fri, 20 Feb 2009 09:23:50 -0800 Ira Weiny wrote: > On Fri, 20 Feb 2009 08:55:57 -0500 > Hal Rosenstock wrote: > > > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny wrote: > > > Here is v2 of the patch series. > > > > > > I used __attribute__ ((deprecated)) on the functions which should aid others > > > in realizing that these functions will go away. (It sure helped me to convert > > > all the diags. 
> > > > > > Also I did _not_ convert ibnetdiscover as my new libibnetdisc already uses the > > > new interface and I am hoping it will be accepted soon. > > > > A related issue is whether ibnetdiscover will support both the new > > library and the old way until the library is more proven via some > > build option. If it is to support both, then converting it should be > > done. > > The conversion is easy. I will do it for now to remove the build warnings. > And now that I think about it more leaving in the old and new code to be > chosen via configure is probably not a bad idea. I don't know what is going > to happen once we standardize on the mad library for decoding strings. There > are some incompatibilities there (ie 1x vs 1X and 2.5Gbps vs SDR etc.) > > I will say, however, that I tested the library extensively and the first > version's output was identical to the old version with the sole exception of > the order ports were printed in. :-D So my confidence is high it will be > accepted sooner rather than later. 
> > Ira Patch below: >From ad8cbf227a803d64c02872f74d7d542b815c6092 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 20 Feb 2009 14:43:48 -0800 Subject: [PATCH] infiniband-diags: convert ibnetdiscover to "new" ibmad interface Signed-off-by: Ira Weiny --- infiniband-diags/src/ibnetdiscover.c | 23 ++++++++++++++++------- 1 files changed, 16 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 466d522..8a840be 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -53,6 +53,8 @@ #include "grouping.h" #include "ibdiag_common.h" +struct ibmad_port *srcport; + static char *node_type_str[] = { "???", "ca", @@ -143,7 +145,8 @@ get_port(Port *port, int portnum, ib_portid_t *portid) port->portnum = portnum; - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout)) + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout, + srcport)) return -1; decode_port_info(pi, port); @@ -162,7 +165,7 @@ get_node(Node *node, Port *port, ib_portid_t *portid) void *pi = portinfo, *ni = node->nodeinfo, *nd = node->nodedesc; void *si = switchinfo; - if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout)) + if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout, srcport)) return -1; mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid); @@ -176,10 +179,10 @@ get_node(Node *node, Port *port, ib_portid_t *portid) port->portnum = node->localport; port->portguid = node->portguid; - if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, timeout)) + if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout, srcport)) return -1; - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, 0, timeout)) + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout, srcport)) return -1; decode_port_info(pi, port); @@ -190,11 +193,12 @@ get_node(Node *node, Port *port, ib_portid_t *portid) node->smalmc = port->lmc; /* after we have the sma information find out the real 
PortInfo for this port */ - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, node->localport, timeout)) + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, node->localport, + timeout, srcport)) return -1; decode_port_info(pi, port); - if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout)) + if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport)) node->smaenhsp0 = 0; /* assume base SP0 */ else mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); @@ -985,7 +989,11 @@ int main(int argc, char **argv) if (argc && !(f = fopen(argv[0], "w"))) IBERROR("can't open file %s for writing", argv[0]); - madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2); + srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); + if (!srcport) + IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); + mad_rpc_set_timeout(ibd_timeout, srcport); + node_name_map = open_node_name_map(node_name_map_file); if (discover(&my_portid) < 0) @@ -1000,5 +1008,6 @@ int main(int argc, char **argv) dump_topology(list, group); close_node_name_map(node_name_map); + mad_rpc_close_port(srcport); exit(0); } -- 1.5.4.5 From arlin.r.davis at intel.com Fri Feb 20 18:59:37 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Fri, 20 Feb 2009 18:59:37 -0800 Subject: [ofa-general] RDMA write with immediate data. In-Reply-To: <499E25DE.5020703@cs.anu.edu.au> References: <499CBEF2.2010909@cs.anu.edu.au> <499E25DE.5020703@cs.anu.edu.au> Message-ID: > >> Do you have receives posted at the remote side for immed data? >> >Nope, the remote side didn't get an event (dat_evd_wait timed out). >The way to find out the immed data is to check the outgoing >parameter &event of the dat_evd_wait function. I don't understand your answer. Do you have a receive buffer pre-posted on the EP to receive the inbound immediate data? Just waiting on the event is not enough.
For immediate data, you don't need a buffer associated with the work request, but you do need a receive work request posted for each inbound rdma_write with immed that you expect. -arlin From vuhuong at mellanox.com Sat Feb 21 01:33:19 2009 From: vuhuong at mellanox.com (Vu Pham) Date: Sat, 21 Feb 2009 01:33:19 -0800 Subject: [ofa-general] NFSRDMA connectathon prelim. testing status, Message-ID: <499FCA5F.5070604@mellanox.com> Hi Tom, I have both the nfsrdma client and server on a 2.6.29-rc5 kernel with nfs-utils-1.1.4. I'm using both InfiniHost III (ib_mthca) and ConnectX (mlx4_ib) HCAs. I have seen several problems during my testing at NFS Connectathon 2009: 1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the client cannot mount. After talking to Tom Talpey and scanning the code, I saw that the xprtrdma module uses ib_reg_phys_mr(), and the mlx4_ib verbs provider does not implement this verb. If I have the client on mlx4_ib and the server on ib_mthca, I hit the following crash because of bad error handling in xprtrdma (see attached file mlx4_mount_problem.log). Because of this problem, I used InfiniHost III (ib_mthca) for all of my tests at Connectathon. 2. Testing the Linux nfsrdma client against both Linux and OpenSolaris nfsrdma servers, I hit a hung-process problem during Connectathon's lock test (see attached files sync_page_1.log and sync_page_2.log). I can only reproduce it when running Connectathon for more than 500 iterations (-N 1000). I can NOT reproduce the problem with an NFS client/server over IPoIB. 3. Testing the OpenSolaris nfsrdma client against the Linux nfsrdma server, I hit the following BUG_ON() right away (see attached file svcrdma_send.log). thanks, -vu -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: mlx4_mount_problem.log URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: sync_page_1.log URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: sync_page_1.log URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: sync_page_2.log URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: svcrdma_send.log URL: From vlad at lists.openfabrics.org Sat Feb 21 03:18:08 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 21 Feb 2009 03:18:08 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090221-0200 daily build status Message-ID: <20090221111808.325D4E61072@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with 
linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From jackm at dev.mellanox.co.il Sat Feb 21 23:09:11 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 22 Feb 2009 09:09:11 +0200 Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver In-Reply-To: References: <200902171742.38223.jackm@dev.mellanox.co.il> Message-ID: <200902220909.11784.jackm@dev.mellanox.co.il> On Friday 20 February 2009 08:50, Roland Dreier wrote: > What test are you using to hit this race?  Are you using a distro kernel > with OFED? > I ran on RHEL5.2, with a ConnectX card, using the following test (source given at the end of this post): 1. Start the driver. 2. In one console window, compile (just gcc) and run the app below which prints out pkeys in a tight loop via libsysfs. 3. In another console window, run the bash script below (which loads/unloads the driver, with some time randomization added). After a few hours of this test, I got a kernel panic, and adding a mutex to make the low-level driver access atomic (wrt ib_core) for showing pkeys fixed the problem entirely. 
When I added printouts to the low-level driver and to sysfs.c (printout in procedure show_port_pkey just before call to ib_query_pkey), I noticed that the crash occurred as follows (note that mlx4_ib is not in the list of loaded modules, and that the paging request address failure is in virtual function "query_pkey"): ENTERING mlx4_ib_remove: ibdev = ffff81010dfdf800 show_port_pkey: ibdev=ffff81010dfdf800, query_pkey=ffffffff88422f28, portnum=1, ix=127 show_port_pkey: ibdev=ffff81010dfdf800, query_pkey=ffffffff88422f28, portnum=1, ix=126 show_port_pkey: ibdev=ffff81010dfdf800, query_pkey=ffffffff88422f28, portnum=1, ix=125 ... show_port_pkey: ibdev=ffff81010dfdf800, query_pkey=ffffffff88422f28, portnum=1, ix=79 show_port_pkey: ibdev=ffff81010dfdf800, query_pkey=ffffffff88422f28, portnum=1, ix=78 ib_device_unregister_sysfs: ibd=ffff81010dfdf800, portnum=1 ib_device_unregister_sysfs: ibd=ffff81010dfdf800, portnum=2 LEAVING mlx4_ib_remove: ibdev = ffff81010dfdf800 Unable to handle kernel paging request at ffffffff88422f53 RIP: [] PGD 203067 PUD 205063 PMD 11658b067 PTE 0 Oops: 0010 [1] SMP last sysfs file: /class/infiniband/mlx4_0/ports/1/pkeys/78 CPU 0 Modules linked in: ib_ipoib(U) ib_cm(U) ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_core(U) ib_mad(U) ib_core(U) hfsplus netconsole nfsd exportfs auth_rpcgss autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc ipoib_helper(U) ipv6 xfrm_nalgo crypto_api dm_mirror dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_piix4 ide_cd k8_edac cdrom edac_mc i2c_core k8temp hwmon sg bnx2 serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache sata_svw libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 26829, comm: opensm Tainted: G 2.6.18-128.el5 #1 RIP: 0010:[] [] RSP: 0018:ffff810212a27e58 EFLAGS: 00010246 RAX: ffff81010ccec180 RBX: ffff81012194bc80 RCX: 0000000000000000 RDX: ffff81010ccec180 
RSI: 0000000000000202 RDI: ffff81010ccec280 RBP: ffff81010da7d000 R08: ffff810212a26000 R09: 000000000000003c R10: ffff810123f88800 R11: 0000000000000001 R12: ffff810115354701 R13: 000000000000004e R14: ffff81010dfdf800 R15: ffff810212a27ea6 FS: 00002ad1a47afc00(0000) GS:ffffffff803ac000(0000) knlGS:00000000f75fdb90 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffffffff88422f53 CR3: 0000000121354000 CR4: 00000000000006e0 Process opensm (pid: 26829, threadinfo ffff810212a26000, task ffff81021c4dc820) Stack: 00000010000280d0 ffff81012194bc80 ffff81010da7d000 ffff810115354740 ffff810212a27f50 ffffffff882665e0 ffff81012194bc80 ffffffff88256e71 ffff810212a27f50 ffffffff882665e0 ffff810115bcbc90 ffff81010f8ef140 Call Trace: [] :ib_core:show_port_pkey+0x59/0x7d [] sysfs_read_file+0xa5/0x13f [] vfs_read+0xcb/0x171 [] sys_read+0x45/0x6e [] tracesys+0xd5/0xe0 Code: Bad RIP value. RIP [] RSP CR2: ffffffff88422f53 <0>Kernel panic - not syncing: Fatal exception - Jack ================================= 1. Pkeys print app: /* * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. 
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 *
 */

#define _GNU_SOURCE

#if HAVE_CONFIG_H
#  include <config.h>
#endif /* HAVE_CONFIG_H */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Convert errno into a negative return code. */
static int ret_code(void)
{
	int e = errno;

	if (e > 0)
		return -e;
	return e;
}

/* Read a sysfs attribute into str, dropping the trailing newline.
 * Returns 0 on success or a negative errno value. */
int sys_read_string(char *dir_name, char *file_name, char *str, int max_len)
{
	char path[256], *s;
	int fd, r;

	snprintf(path, sizeof(path), "%s/%s", dir_name, file_name);

	if ((fd = open(path, O_RDONLY)) < 0)
		return ret_code();

	if ((r = read(fd, str, max_len)) < 0) {
		int e = errno;
		close(fd);
		errno = e;
		return ret_code();
	}

	str[(r < max_len) ? r : max_len - 1] = 0;

	if ((s = strrchr(str, '\n')))
		*s = 0;

	close(fd);
	return 0;
}

int sys_read_uint(char *dir_name, char *file_name, unsigned *u)
{
	char buf[32];
	int r;

	if ((r = sys_read_string(dir_name, file_name, buf, sizeof(buf))) < 0)
		return r;

	*u = strtoul(buf, 0, 0);
	return 0;
}

int main()
{
	int i;
	char *path = "/sys/class/infiniband/mlx4_0/ports/1/pkeys";
	char pkey_is[20];
	unsigned u;

	/* Read P_Key entries 127..0 of port 1 in a tight loop; if a read
	 * fails (e.g. the driver is being unloaded), wait 1 s and restart. */
	while (1)
		for (i = 127; i >= 0; --i) {
			sprintf(pkey_is, "%d", i);
			if (sys_read_uint(path, pkey_is, &u)) {
				sleep(1);
				break;
			}
			printf("%d: %u\n", i, u);
		}
	return 0;
}

========================================================
Bash driver up-down script:

#!/bin/bash -x
i=0
while true; do
	echo iteration number $i; date
	/etc/init.d/openibd start
	opensm &
	sleep 10.$RANDOM
	pkill -9 opensm
	wait
	/etc/init.d/openibd stop
	let i=$i+1
done

From rdreier at cisco.com Sat Feb 21 23:15:24 2009 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 21 Feb 2009 23:15:24 -0800 Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver In-Reply-To: <200902220909.11784.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 22 Feb 2009 09:09:11 +0200") References: <200902171742.38223.jackm@dev.mellanox.co.il> <200902220909.11784.jackm@dev.mellanox.co.il> Message-ID: > I ran on RHEL5.2 ... I suspect that at some point in the 2+ years since 2.6.18 more locking was added to sysfs so that this race no longer exists. You could try and see if my test (add a sleep to the show method and make sure you remove the low-level driver during that window) results in an instant crash with the RHEL 5.2 kernel. - R.
From eli at dev.mellanox.co.il Sat Feb 21 23:39:54 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Sun, 22 Feb 2009 09:39:54 +0200
Subject: [ofa-general] Re: [ewg] iscsi initiator ipoib+lro crash on upstream kernel
In-Reply-To: <15ddcffd0902191140p3a72c1b4p2bab0aa7f0aef87a@mail.gmail.com>
References: <20090219165505.GA13617@mtls03> <15ddcffd0902191140p3a72c1b4p2bab0aa7f0aef87a@mail.gmail.com>
Message-ID: <4e6a6b3c0902212339t603b13ccs17160a893b0892e4@mail.gmail.com>

Thanks!

On Thu, Feb 19, 2009 at 9:40 PM, Or Gerlitz wrote:
> On Thu, Feb 19, 2009 at 6:55 PM, Eli Cohen wrote:
>
>> I have encountered a kernel crash when running an iSCSI initiator on
>> IPoIB configured with LRO (if LRO is off it does not happen). This
>> was seen first on Sles10sp2 but then I verified it happens on 2.6.28.2 too.
>
> Eli,
>
> This is a known issue
> (http://bugzilla.kernel.org/show_bug.cgi?id=11804); a fix was submitted
> upstream and will be included in the next kernel.
>
> Or.
>

From jackm at dev.mellanox.co.il Sun Feb 22 00:15:45 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 22 Feb 2009 10:15:45 +0200
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver
In-Reply-To: 
References: <200902171742.38223.jackm@dev.mellanox.co.il> <200902220909.11784.jackm@dev.mellanox.co.il>
Message-ID: <200902221015.46090.jackm@dev.mellanox.co.il>

On Sunday 22 February 2009 09:15, Roland Dreier wrote:
>  > I ran on RHEL5.2 ...
>
> I suspect that at some point in the 2+ years since 2.6.18 more locking
> was added to sysfs so that this race no longer exists. You could try
> and see if my test (add a sleep to the show method and make sure you
> remove the low-level driver during that window) results in an instant
> crash with the RHEL 5.2 kernel.
>
>  - R.

You're right -- your test does crash the RHEL5.2 kernel, with the appropriate
stack dump (page fault for query_pkey low-level driver function).
I'll try to determine in which kernel this was fixed.

- Jack

From vlad at lists.openfabrics.org Sun Feb 22 03:15:03 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 22 Feb 2009 03:15:03 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090222-0200 daily build status
Message-ID: <20090222111503.68A54E6104C@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From jackm at dev.mellanox.co.il Sun Feb 22 03:37:00 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 22 Feb 2009 13:37:00 +0200
Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver
In-Reply-To: 
References: <200902171742.38223.jackm@dev.mellanox.co.il> <200902220909.11784.jackm@dev.mellanox.co.il>
Message-ID: <200902221337.01172.jackm@dev.mellanox.co.il>

On Sunday 22 February 2009 09:15, Roland Dreier wrote:
>  > I ran on RHEL5.2 ...
>
> I suspect that at some point in the 2+ years since 2.6.18 more locking
> was added to sysfs so that this race no longer exists. You could try
> and see if my test (add a sleep to the show method and make sure you
> remove the low-level driver during that window) results in an instant
> crash with the RHEL 5.2 kernel.
>
>  - R.
>

There is still a problem, which we do not see with ConnectX (because of the
separation between mlx4_ib and mlx4_core -- and we are unloading only mlx4_ib,
leaving all the mlx4_core infrastructure intact).

I tried your test with a Sinai card (mthca) and got the following kernel Oops
(on kernel 2.6.27.4). (Note that ib_mthca is still loaded, but with "(-)"
following.)
- Jack

======================
enter show_port_pkey
call ib_query_pkey
BUG: unable to handle kernel paging request at ffffc20000648698
IP: [] mthca_cmd_post+0x168/0x24c [ib_mthca]
PGD 7fc59067 PUD 11fc30067 PMD 11ff34067 PTE 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: rdma_ucm rds ib_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ib_uverbs ib_umad iw_nes mlx4_en inet_lro mlx4_ib mlx4_core ib_mthca(-) ib_mad ib_core memtrack mst_pciconf mst_pci nfsd auth_rpcgss exportfs autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror dm_log dm_multipath dm_mod sbs sbshc battery acpi_memhotplug ac parport_pc lp parport rtc_cmos ide_cd_mod floppy sg button rtc_core cdrom serio_raw i2c_nforce2 rtc_lib k8temp shpchp forcedeth i2c_core hwmon pcspkr sata_nv libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: inet_lro]
Pid: 23114, comm: cat Not tainted 2.6.27.4 #1
RIP: 0010:[] [] mthca_cmd_post+0x168/0x24c [ib_mthca]
RSP: 0018:ffff88011695bcd8 EFLAGS: 00010246
RAX: ffffc20000648680 RBX: ffff8800734b6000 RCX: 0000000000000001
RDX: ffffffff8021072e RSI: 000000005102d000 RDI: ffff8800734b66c8
RBP: 0000000000000001 R08: 0000000000000003 R09: 0000000000000024
R10: ffff88011695be5f R11: 000000000000ea60 R12: 000000000000ffff
R13: 000000005102d000 R14: 000000007e57b000 R15: 000000005102d003
FS: 00007fb5ad5c86f0(0000) GS:ffffffff806fca80(0000) knlGS:00000000f735fb90
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffc20000648698 CR3: 000000006f897000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process cat (pid: 23114, threadinfo ffff88011695a000, task ffff88011f4ec240)
Stack: 000000005fe51850 002488006b0c5e80 ffff88005fe5ffff ffff88006b0c5e80
 ffff88006b0c5e80 002488005fe51840 0000000000000296 ffff8800734b6000
 0000000000000024 000000005102d003 ffff88011695bdb8 0000000000000001
Call Trace:
 [] ? mthca_cmd_poll+0x61/0x118 [ib_mthca]
 [] ? mthca_cmd_box+0x5d/0x62 [ib_mthca]
 [] ? mthca_MAD_IFC+0x171/0x1bc [ib_mthca]
 [] ? mthca_query_pkey+0x103/0x18a [ib_mthca]
 [] ? process_timeout+0x0/0x5
 [] ? show_port_pkey+0x4f/0x74 [ib_core]
 [] ? sysfs_read_file+0xa8/0x12f
 [] ? vfs_read+0xaa/0x133
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
Code: c0 48 87 02 e8 73 89 26 e0 48 8b 83 98 06 00 00 8b 40 18 66 85 c0 79 0c 48 8b 05 14 66 53 e0 4c 39 e0 78 d2 48 8b 83 98 06 00 00 <8b> 40 18 66 85 c0 41 bc f5 ff ff ff 0f 88 b4 00 00 00 4c 89 e8
RIP [] mthca_cmd_post+0x168/0x24c [ib_mthca]
 RSP 
CR2: ffffc20000648698
---[ end trace 7cb234a047e4a788 ]---

From jackm at dev.mellanox.co.il Sun Feb 22 08:04:21 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 22 Feb 2009 18:04:21 +0200
Subject: [ofa-general] [PATCH] ib_core: avoid race condition between sysfs access and low-level module unload
Message-ID: <200902221804.21627.jackm@dev.mellanox.co.il>

In newer kernels, a low-level module will not be unloaded while its sysfs
interface is being accessed, so its code pages will be available for the
sysfs access. However, nothing prevents the low-level module from freeing
its memory resources during such access. This can cause a kernel Oops.

To avoid this, we protect the device reg_state with a mutex, and perform
all sysfs operations (show, store) atomically within this mutex: we lock
the mutex, test whether the device is still "alive", invoke the low-level
module functions only if it is, and finally release the mutex.

Signed-off-by: Jack Morgenstein 
---
Roland,
I think this patch is a reasonable solution to the sysfs problem of a
low-level driver module being unloaded while sysfs is being accessed for
the device.

ib_unregister_device() is always called before the device driver frees up
its resources.
Since this patch makes sysfs accesses atomic with respect to the device
registration state, it solves the problem of the race between freeing device
resources and accessing the low-level driver to retrieve device data.

(I ran checkpatch.pl on this, and I do have several lines slightly more than
80 chars long -- but that's all.)

Jack

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 7913b80..6254202 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -172,9 +172,14 @@ static int end_port(struct ib_device *device) */ struct ib_device *ib_alloc_device(size_t size) { + struct ib_device *ibdev; + BUG_ON(size < sizeof (struct ib_device)); - return kzalloc(size, GFP_KERNEL); + ibdev = kzalloc(size, GFP_KERNEL); + if (ibdev) + mutex_init(&ibdev->sysfs_mutex); + return ibdev; } EXPORT_SYMBOL(ib_alloc_device); @@ -305,9 +310,10 @@ int ib_register_device(struct ib_device *device) goto out; } + mutex_lock(&device->sysfs_mutex); list_add_tail(&device->core_list, &device_list); - device->reg_state = IB_DEV_REGISTERED; + mutex_unlock(&device->sysfs_mutex); { struct ib_client *client; @@ -353,7 +359,9 @@ void ib_unregister_device(struct ib_device *device) kfree(context); spin_unlock_irqrestore(&device->client_data_lock, flags); + mutex_lock(&device->sysfs_mutex); device->reg_state = IB_DEV_UNREGISTERED; + mutex_unlock(&device->sysfs_mutex); } EXPORT_SYMBOL(ib_unregister_device); diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index b43f7d3..29f0ce1 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -94,7 +94,7 @@ static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, char *buf) { struct ib_port_attr attr; - ssize_t ret; + ssize_t ret = -ENODEV; static const char *state_name[] = { [IB_PORT_NOP] = "NOP", @@ -105,26 +105,33 @@ static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, [IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER" }; - 
ret = ib_query_port(p->ibdev, p->port_num, &attr); - if (ret) - return ret; - - return sprintf(buf, "%d: %s\n", attr.state, - attr.state >= 0 && attr.state < ARRAY_SIZE(state_name) ? - state_name[attr.state] : "UNKNOWN"); + mutex_lock(&p->ibdev->sysfs_mutex); + if (ibdev_is_alive(p->ibdev)) { + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (!ret) + ret = sprintf(buf, "%d: %s\n", attr.state, + attr.state >= 0 && + attr.state < ARRAY_SIZE(state_name) ? + state_name[attr.state] : "UNKNOWN"); + } + mutex_unlock(&p->ibdev->sysfs_mutex); + return ret; } static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused, char *buf) { struct ib_port_attr attr; - ssize_t ret; - - ret = ib_query_port(p->ibdev, p->port_num, &attr); - if (ret) - return ret; + ssize_t ret = -ENODEV; - return sprintf(buf, "0x%x\n", attr.lid); + mutex_lock(&p->ibdev->sysfs_mutex); + if (ibdev_is_alive(p->ibdev)) { + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (!ret) + ret = sprintf(buf, "0x%x\n", attr.lid); + } + mutex_unlock(&p->ibdev->sysfs_mutex); + return ret; } static ssize_t lid_mask_count_show(struct ib_port *p, @@ -132,52 +139,64 @@ static ssize_t lid_mask_count_show(struct ib_port *p, char *buf) { struct ib_port_attr attr; - ssize_t ret; - - ret = ib_query_port(p->ibdev, p->port_num, &attr); - if (ret) - return ret; + ssize_t ret = -ENODEV; - return sprintf(buf, "%d\n", attr.lmc); + mutex_lock(&p->ibdev->sysfs_mutex); + if (ibdev_is_alive(p->ibdev)) { + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (!ret) + ret = sprintf(buf, "%d\n", attr.lmc); + } + mutex_unlock(&p->ibdev->sysfs_mutex); + return ret; } static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused, char *buf) { struct ib_port_attr attr; - ssize_t ret; - - ret = ib_query_port(p->ibdev, p->port_num, &attr); - if (ret) - return ret; + ssize_t ret = -ENODEV; - return sprintf(buf, "0x%x\n", attr.sm_lid); + mutex_lock(&p->ibdev->sysfs_mutex); + if (ibdev_is_alive(p->ibdev)) { 
+ ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (!ret) + ret = sprintf(buf, "0x%x\n", attr.sm_lid); + } + mutex_unlock(&p->ibdev->sysfs_mutex); + return ret; } static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused, char *buf) { struct ib_port_attr attr; - ssize_t ret; - - ret = ib_query_port(p->ibdev, p->port_num, &attr); - if (ret) - return ret; + ssize_t ret = -ENODEV; - return sprintf(buf, "%d\n", attr.sm_sl); + mutex_lock(&p->ibdev->sysfs_mutex); + if (ibdev_is_alive(p->ibdev)) { + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (!ret) + ret = sprintf(buf, "%d\n", attr.sm_sl); + } + mutex_unlock(&p->ibdev->sysfs_mutex); + return ret; } static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused, char *buf) { struct ib_port_attr attr; - ssize_t ret; - - ret = ib_query_port(p->ibdev, p->port_num, &attr); - if (ret) - return ret; + ssize_t ret = -ENODEV; - return sprintf(buf, "0x%08x\n", attr.port_cap_flags); + mutex_lock(&p->ibdev->sysfs_mutex); + if (ibdev_is_alive(p->ibdev)) { + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (!ret) + ret = sprintf(buf, "0x%08x\n", attr.port_cap_flags); + } + mutex_unlock(&p->ibdev->sysfs_mutex); + return ret; } static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, @@ -186,24 +205,33 @@ static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, struct ib_port_attr attr; char *speed = ""; int rate; - ssize_t ret; - - ret = ib_query_port(p->ibdev, p->port_num, &attr); - if (ret) - return ret; - - switch (attr.active_speed) { - case 2: speed = " DDR"; break; - case 4: speed = " QDR"; break; + ssize_t ret = -ENODEV; + + mutex_lock(&p->ibdev->sysfs_mutex); + if (ibdev_is_alive(p->ibdev)) { + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (!ret) { + switch (attr.active_speed) { + case 2: speed = " DDR"; break; + case 4: speed = " QDR"; break; + } + + rate = 25 * ib_width_enum_to_int(attr.active_width) * + attr.active_speed; 
+ if (rate < 0) { + ret = -EINVAL; + goto out; + } + + ret = sprintf(buf, "%d%s Gb/sec (%dX%s)\n", + rate / 10, rate % 10 ? ".5" : "", + ib_width_enum_to_int(attr.active_width), + speed); + } } - - rate = 25 * ib_width_enum_to_int(attr.active_width) * attr.active_speed; - if (rate < 0) - return -EINVAL; - - return sprintf(buf, "%d%s Gb/sec (%dX%s)\n", - rate / 10, rate % 10 ? ".5" : "", - ib_width_enum_to_int(attr.active_width), speed); +out: + mutex_unlock(&p->ibdev->sysfs_mutex); + return ret; } static ssize_t phys_state_show(struct ib_port *p, struct port_attribute *unused, @@ -211,22 +239,26 @@ static ssize_t phys_state_show(struct ib_port *p, struct port_attribute *unused, { struct ib_port_attr attr; - ssize_t ret; - - ret = ib_query_port(p->ibdev, p->port_num, &attr); - if (ret) - return ret; - - switch (attr.phys_state) { - case 1: return sprintf(buf, "1: Sleep\n"); - case 2: return sprintf(buf, "2: Polling\n"); - case 3: return sprintf(buf, "3: Disabled\n"); - case 4: return sprintf(buf, "4: PortConfigurationTraining\n"); - case 5: return sprintf(buf, "5: LinkUp\n"); - case 6: return sprintf(buf, "6: LinkErrorRecovery\n"); - case 7: return sprintf(buf, "7: Phy Test\n"); - default: return sprintf(buf, "%d: \n", attr.phys_state); + ssize_t ret = -ENODEV; + + mutex_lock(&p->ibdev->sysfs_mutex); + if (ibdev_is_alive(p->ibdev)) { + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (!ret) { + switch (attr.phys_state) { + case 1: ret = sprintf(buf, "1: Sleep\n"); break; + case 2: ret = sprintf(buf, "2: Polling\n"); break; + case 3: ret = sprintf(buf, "3: Disabled\n"); break; + case 4: ret = sprintf(buf, "4: PortConfigurationTraining\n"); break; + case 5: ret = sprintf(buf, "5: LinkUp\n"); break; + case 6: ret = sprintf(buf, "6: LinkErrorRecovery\n"); break; + case 7: ret = sprintf(buf, "7: Phy Test\n"); break; + default: ret = sprintf(buf, "%d: \n", attr.phys_state); + } + } } + mutex_unlock(&p->ibdev->sysfs_mutex); + return ret; } static PORT_ATTR_RO(state); @@ -256,13 +288,16 @@ static ssize_t
show_port_gid(struct ib_port *p, struct port_attribute *attr, struct port_table_attribute *tab_attr = container_of(attr, struct port_table_attribute, attr); union ib_gid gid; - ssize_t ret; - - ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); - if (ret) - return ret; + ssize_t ret = -ENODEV; - return sprintf(buf, "%pI6\n", gid.raw); + mutex_lock(&p->ibdev->sysfs_mutex); + if (ibdev_is_alive(p->ibdev)) { + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (!ret) + ret = sprintf(buf, "%pI6\n", gid.raw); + } + mutex_unlock(&p->ibdev->sysfs_mutex); + return ret; } static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, @@ -271,13 +306,16 @@ static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, struct port_table_attribute *tab_attr = container_of(attr, struct port_table_attribute, attr); u16 pkey; - ssize_t ret; - - ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); - if (ret) - return ret; + ssize_t ret = -ENODEV; - return sprintf(buf, "0x%04x\n", pkey); + mutex_lock(&p->ibdev->sysfs_mutex); + if (ibdev_is_alive(p->ibdev)) { + ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); + if (!ret) + ret = sprintf(buf, "0x%04x\n", pkey); + } + mutex_unlock(&p->ibdev->sysfs_mutex); + return ret; } #define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ @@ -300,6 +338,12 @@ static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, if (!p->ibdev->process_mad) return sprintf(buf, "N/A (no PMA)\n"); + mutex_lock(&p->ibdev->sysfs_mutex); + if (!ibdev_is_alive(p->ibdev)) { + ret = -ENODEV; + goto out; + } + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) { @@ -346,7 +390,7 @@ static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, out: kfree(in_mad); kfree(out_mad); - + mutex_unlock(&p->ibdev->sysfs_mutex); return ret; } @@ -579,20 +623,20 @@ static
ssize_t show_sys_image_guid(struct device *device, { struct ib_device *dev = container_of(device, struct ib_device, dev); struct ib_device_attr attr; - ssize_t ret; - - if (!ibdev_is_alive(dev)) - return -ENODEV; - - ret = ib_query_device(dev, &attr); - if (ret) - return ret; - - return sprintf(buf, "%04x:%04x:%04x:%04x\n", - be16_to_cpu(((__be16 *) &attr.sys_image_guid)[0]), - be16_to_cpu(((__be16 *) &attr.sys_image_guid)[1]), - be16_to_cpu(((__be16 *) &attr.sys_image_guid)[2]), - be16_to_cpu(((__be16 *) &attr.sys_image_guid)[3])); + ssize_t ret = -ENODEV; + + mutex_lock(&dev->sysfs_mutex); + if (ibdev_is_alive(dev)) { + ret = ib_query_device(dev, &attr); + if (!ret) + ret = sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((__be16 *) &attr.sys_image_guid)[0]), + be16_to_cpu(((__be16 *) &attr.sys_image_guid)[1]), + be16_to_cpu(((__be16 *) &attr.sys_image_guid)[2]), + be16_to_cpu(((__be16 *) &attr.sys_image_guid)[3])); + } + mutex_unlock(&dev->sysfs_mutex); + return ret; } static ssize_t show_node_guid(struct device *device, @@ -624,17 +668,20 @@ static ssize_t set_node_desc(struct device *device, { struct ib_device *dev = container_of(device, struct ib_device, dev); struct ib_device_modify desc = {}; - int ret; + int ret = -ENODEV; if (!dev->modify_device) return -EIO; memcpy(desc.node_desc, buf, min_t(int, count, 64)); - ret = ib_modify_device(dev, IB_DEVICE_MODIFY_NODE_DESC, &desc); - if (ret) - return ret; - - return count; + mutex_lock(&dev->sysfs_mutex); + if (ibdev_is_alive(dev)) { + ret = ib_modify_device(dev, IB_DEVICE_MODIFY_NODE_DESC, &desc); + if (!ret) + ret = count; + } + mutex_unlock(&dev->sysfs_mutex); + return ret; } static DEVICE_ATTR(node_type, S_IRUGO, show_node_type, NULL); @@ -662,14 +709,18 @@ static ssize_t show_protocol_stat(const struct device *device, { struct ib_device *dev = container_of(device, struct ib_device, dev); union rdma_protocol_stats stats; - ssize_t ret; - - ret = dev->get_protocol_stats(dev, &stats); - if (ret) - return 
ret; - - return sprintf(buf, "%llu\n", - (unsigned long long) ((u64 *) &stats)[offset]); + ssize_t ret = -ENODEV; + + mutex_lock(&dev->sysfs_mutex); + if (ibdev_is_alive(dev)) { + ret = dev->get_protocol_stats(dev, &stats); + if (!ret) + ret = sprintf(buf, "%llu\n", + (unsigned long long) + ((u64 *) &stats)[offset]); + } + mutex_unlock(&dev->sysfs_mutex); + return ret; } /* generate a read-only iwarp statistics attribute */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 936e333..3b2768c 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -47,6 +47,7 @@ #include #include #include +#include #include #include @@ -1143,6 +1144,7 @@ struct ib_device { IB_DEV_REGISTERED, IB_DEV_UNREGISTERED } reg_state; + struct mutex sysfs_mutex; u64 uverbs_cmd_mask; int uverbs_abi_ver;

From rdreier at cisco.com Sun Feb 22 20:05:19 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 22 Feb 2009 20:05:19 -0800
Subject: [ofa-general] [PATCH] IB/ipath: Fix memory leak in init_shadow_tids() error path
Message-ID: 

If the second vmalloc() fails, the wrong pointer is passed to vfree(),
so the first vmalloc() ends up getting leaked. This was spotted by
the Coverity checker (CID 2709).

Signed-off-by: Roland Dreier 
---
Unless someone objects I'll merge this for 2.6.30.
 drivers/infiniband/hw/ipath/ipath_init_chip.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c
index 64aeefb..077879c 100644
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c
@@ -455,7 +455,7 @@ static void init_shadow_tids(struct ipath_devdata *dd)
 	if (!addrs) {
 		ipath_dev_err(dd, "failed to allocate shadow dma handle "
 			      "array, no expected sends!\n");
-		vfree(dd->ipath_pageshadow);
+		vfree(pages);
 		dd->ipath_pageshadow = NULL;
 		return;
 	}
-- 
1.6.0.4

From rdreier at cisco.com Sun Feb 22 20:17:00 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 22 Feb 2009 20:17:00 -0800
Subject: [ofa-general] [PATCH] IB/ipath: Really run work in ipath_release_user_pages_on_close()
In-Reply-To: (Roland Dreier's message of "Sun, 22 Feb 2009 20:05:19 -0800")
References: 
Message-ID: 

ipath_release_user_pages_on_close() allocated a structure to schedule
work with, but then returned (leaking the structure) instead of actually
calling schedule_work(). Fix the logic to do what was intended.

This was spotted by the Coverity checker (CID 2700).

Signed-off-by: Roland Dreier 
---
I'm only 99% sure this patch is correct... so someone who knows please
review.
 drivers/infiniband/hw/ipath/ipath_user_pages.c | 8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_user_pages.c b/drivers/infiniband/hw/ipath/ipath_user_pages.c
index 0190edc..855911e 100644
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c
@@ -209,20 +209,20 @@ void ipath_release_user_pages_on_close(struct page **p, size_t num_pages)
 
 	mm = get_task_mm(current);
 	if (!mm)
-		goto bail;
+		return;
 
 	work = kmalloc(sizeof(*work), GFP_KERNEL);
 	if (!work)
 		goto bail_mm;
 
-	goto bail;
-
 	INIT_WORK(&work->work, user_pages_account);
 	work->mm = mm;
 	work->num_pages = num_pages;
 
+	schedule_work(&work->work);
+	return;
+
 bail_mm:
 	mmput(mm);
-bail:
 	return;
 }
-- 
1.6.0.4

From Jie.Cai at cs.anu.edu.au Sun Feb 22 20:46:30 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Mon, 23 Feb 2009 15:46:30 +1100
Subject: [ofa-general] RDMA write with immediate data.
In-Reply-To: 
References: <499CBEF2.2010909@cs.anu.edu.au> <499E25DE.5020703@cs.anu.edu.au>
Message-ID: <49A22A26.50809@cs.anu.edu.au>

Davis, Arlin R wrote:
>
>
>>> Do you have receives posted at the remote side for immed data?
>>>
>>>
>> Nope, the remote side didn't get an event (dat_evd_wait timed out).
>> The way to find out the immed data is to check the outgoing
>> parameter &event of the dat_evd_wait function.
>>
>
> I don't understand your answer. Do you have a receive buffer pre-posted
> on the EP to receive the inbound immediate data? Just waiting on the
> event is not enough. For immediate data you don't need a buffer associated
> with the work request but you do need the work request posted for each
> inbound rdma_write with immed that is expected.
>
This does help. I forgot to pre-post a receive for the immediate data.
> -arlin > > > From rdreier at cisco.com Sun Feb 22 20:40:53 2009 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 22 Feb 2009 20:40:53 -0800 Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver In-Reply-To: <200902221337.01172.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 22 Feb 2009 13:37:00 +0200") References: <200902171742.38223.jackm@dev.mellanox.co.il> <200902220909.11784.jackm@dev.mellanox.co.il> <200902221337.01172.jackm@dev.mellanox.co.il> Message-ID: > There is still a problem, which we do not see with ConnectX (because > of the separation between mlx4_ib and mlx4_core -- and we are > unloading only mlx4_ib, leaving all the mlx4_core infrastructure > intact). > > I tried your test with a Sinai card (mthca, and got the following > Kernel Oops (on Kernel 2,6,27.4) (Note that ib_mthca is still loaded, > but with "(-)" following). Oh I see... we leave the sysfs stuff around way too long, since we want to use it for tracking the lifetime of our class device. the patch below fixes things for me here... 
there's still room for substantial cleanup but I think this gets the
crashes fixed at least:

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 7913b80..d1fba41 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -193,7 +193,7 @@ void ib_dealloc_device(struct ib_device *device)
 
 	BUG_ON(device->reg_state != IB_DEV_UNREGISTERED);
 
-	ib_device_unregister_sysfs(device);
+	kobject_put(&device->dev.kobj);
 }
 EXPORT_SYMBOL(ib_dealloc_device);
 
@@ -348,6 +348,8 @@ void ib_unregister_device(struct ib_device *device)
 
 	mutex_unlock(&device_mutex);
 
+	ib_device_unregister_sysfs(device);
+
 	spin_lock_irqsave(&device->client_data_lock, flags);
 	list_for_each_entry_safe(context, tmp, &device->client_data_list, list)
 		kfree(context);
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index b43f7d3..5270aeb 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -848,6 +848,9 @@ void ib_device_unregister_sysfs(struct ib_device *device)
 	struct kobject *p, *t;
 	struct ib_port *port;
 
+	/* Hold kobject until ib_dealloc_device() */
+	kobject_get(&device->dev.kobj);
+
 	list_for_each_entry_safe(p, t, &device->port_list, entry) {
 		list_del(&p->entry);
 		port = container_of(p, struct ib_port, kobj);

From YJia at tmriusa.com Sun Feb 22 21:38:00 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Sun, 22 Feb 2009 23:38:00 -0600
Subject: [ofa-general] opensm 3.2.1 lock up problem during initialization
Message-ID: 

Hi Folks,

I ran into a lockup problem during the opensm initialization process. The
version I am using is 3.2.1.

I noticed that there's a patch to fix a race condition in the main OpenSM
flow for version 3.2.1:
http://www.openfabrics.org/git/?p=~sashak/management.git;a=commit;h=adcdb327112c7261077cf4e4076a7499ce36c86f

But the OpenSM I am using is compiled without the HAVE_LIBPTHREAD macro,
and the patch above is for HAVE_LIBPTHREAD code only. So my questions are:

1. What is the difference between code compiled with HAVE_LIBPTHREAD and
   without HAVE_LIBPTHREAD?
2. Could the race condition occur in an OpenSM that's compiled without the
   HAVE_LIBPTHREAD macro?

Thanks!

Yicheng Jia

From jackm at dev.mellanox.co.il Sun Feb 22 23:28:34 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 23 Feb 2009 09:28:34 +0200
Subject: [ofa-general] build warnings on rhel4 U6
In-Reply-To: <1233949198.3257.19.camel@pc.interlinx.bc.ca>
References: <1233949198.3257.19.camel@pc.interlinx.bc.ca>
Message-ID: <200902230928.34846.jackm@dev.mellanox.co.il>

On Friday 06 February 2009 21:39, Brian J. Murrell wrote:
> I get these warnings trying to build with RHEL4U6 and ofa_kernel from OFED 1.4:
>
> include/linux/jbd.h:1204:1: warning: "assert_spin_locked" redefined
> In file included from include/linux/wait.h:25,
>                  from include/linux/fs.h:12,
>                  from /cache/build/BUILD/lustre-kernel-2.6.9/lustre/kernel-ib-devel/usr/src/ofa_kernel/kernel_addons/backport/2.6.9_U6/include/linux/fs.h:4,
>                  from /cache/build/BUILD/lustre-1.6.7.50/lustre/lvfs/fsfilt.c:42:
> /cache/build/BUILD/lustre-kernel-2.6.9/lustre/kernel-ib-devel/usr/src/ofa_kernel/kernel_addons/backport/2.6.9_U6/include/linux/spinlock.h:8:1: warning: this is the location of the previous definition
>
> The code in question is (from jbd.h):
>
> #ifdef __KERNEL__
>
> #ifdef CONFIG_SMP
> #define assert_spin_locked(lock) J_ASSERT(spin_is_locked(lock))
> #else
> #define assert_spin_locked(lock) do {} while(0)
> #endif
>
> and (from the backport spinlock.h):
>
> #ifndef BACKPORT_LINUX_SPINLOCK_H
> #define BACKPORT_LINUX_SPINLOCK_H
>
> #include_next
>
> #define spin_lock_nested(lock, subclass) spin_lock(lock)
>
> #define assert_spin_locked(lock) do { (void)(lock); } while(0)
>
> #endif
>
> Any thoughts on how to resolve?
>
> b.

In the backport spinlock.h file, try the following:

#ifndef assert_spin_locked
#define assert_spin_locked(lock) do { (void)(lock); } while(0)
#endif

- Jack

From dorfman.eli at gmail.com Sun Feb 22 23:40:46 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Mon, 23 Feb 2009 09:40:46 +0200
Subject: [ofa-general] opensm segmentation using git head
Message-ID: <49A252FE.4010006@gmail.com>

Command Line Arguments:
 Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 3.3.0_c4d9bcf
Entering DISCOVERING state

Using default GUID 0x2c9020022f019
Loading Cached Option:qos_vlarb_high = 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
*** glibc detected *** ./sbin/opensm: double free or corruption (!prev): 0x000000001bd932b0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x371c871634]
/lib64/libc.so.6(cfree+0x8c)[0x371c874c5c]
./sbin/opensm[0x44e824]
./sbin/opensm(osm_subn_rescan_conf_files+0x20f)[0x4507bb]
./sbin/opensm[0x44d64c]
./sbin/opensm(osm_state_mgr_process+0xbc)[0x44de18]
./sbin/opensm[0x445d23]
./sbin/opensm[0x445e8b]
/tmp/mgmtbin/lib/libosmcomp.so.2[0x2b469b4e4472]
/lib64/libpthread.so.0[0x371d4062f7]
/lib64/libc.so.6(clone+0x6d)[0x371c8d1b6d]
======= Memory map: ========
00400000-004b5000 r-xp 00000000 08:01 655607 /tmp/mgmtbin/sbin/opensm 006b4000-006b5000 rw-p 000b4000 08:01 655607 /tmp/mgmtbin/sbin/opensm 006b5000-006ba000 rw-p 006b5000 00:00 0 1bd93000-1bdb4000 rw-p 1bd93000 00:00 0 40497000-40498000 ---p 40497000 00:00 0 40498000-40e98000 rw-p 40498000 00:00 0 4167a000-4167b000 ---p 4167a000 00:00 0 4167b000-4207b000 rw-p 4167b000 00:00 0 4207b000-4207c000 ---p 4207b000 00:00 0 4207c000-42a7c000 rw-p 4207c000 00:00 0 42a7c000-42a7d000 ---p 42a7c000 00:00 0 42a7d000-4347d000 rw-p 42a7d000
00:00 0 4347d000-4347e000 ---p 4347d000 00:00 0 4347e000-43e7e000 rw-p 4347e000 00:00 0 43e7e000-43e7f000 ---p 43e7e000 00:00 0 43e7f000-4487f000 rw-p 43e7f000 00:00 0 4487f000-44880000 ---p 4487f000 00:00 0 44880000-45280000 rw-p 44880000 00:00 0 45280000-45281000 ---p 45280000 00:00 0 45281000-45c81000 rw-p 45281000 00:00 0 45c81000-45c82000 ---p 45c81000 00:00 0 45c82000-46682000 rw-p 45c82000 00:00 0 46682000-46683000 ---p 46682000 00:00 0 46683000-47083000 rw-p 46683000 00:00 0 47083000-47084000 ---p 47083000 00:00 0 47084000-47a84000 rw-p 47084000 00:00 0 47a84000-47a85000 ---p 47a84000 00:00 0 47a85000-48485000 rw-p 47a85000 00:00 0 371c400000-371c41a000 r-xp 00000000 08:01 1769759 /lib64/ld-2.5.so 371c61a000-371c61b000 r--p 0001a000 08:01 1769759 /lib64/ld-2.5.so 371c61b000-371c61c000 rw-p 0001b000 08:01 1769759 /lib64/ld-2.5.so 371c800000-371c94a000 r-xp 00000000 08:01 1769760 /lib64/libc-2.5.so 371c94a000-371cb49000 ---p 0014a000 08:01 1769760 /lib64/libc-2.5.so 371cb49000-371cb4d000 r--p 00149000 08:01 1769760 /lib64/libc-2.5.so 371cb4d000-371cb4e000 rw-p 0014d000 08:01 1769760 /lib64/libc-2.5.so 371cb4e000-371cb53000 rw-p 371cb4e000 00:00 0 371d000000-371d002000 r-xp 00000000 08:01 1769665 /lib64/libdl-2.5.so 371d002000-371d202000 ---p 00002000 08:01 1769665 /lib64/libdl-2.5.so 371d202000-371d203000 r--p 00002000 08:01 1769665 /lib64/libdl-2.5.so 371d203000-371d204000 rw-p 00003000 08:01 1769665 /lib64/libdl-2.5.so 371d400000-371d415000 r-xp 00000000 08:01 1769762 /lib64/libpthread-2.5.so 371d415000-371d614000 ---p 00015000 08:01 1769762 /lib64/libpthread-2.5.so 371d614000-371d615000 r--p 00014000 08:01 1769762 /lib64/libpthread-2.5.so 371d615000-371d616000 rw-p 00015000 08:01 1769762 /lib64/libpthread-2.5.so 371d616000-371d61a000 rw-p 371d616000 00:00 0 371ec00000-371ec0d000 r-xp 00000000 08:01 1769765 /lib64/libgcc_s-4.1.2-20080102.so.1 371ec0d000-371ee0d000 ---p 0000d000 08:01 1769765 /lib64/libgcc_s-4.1.2-20080102.so.1 371ee0d000-371ee0e000 rw-p 
0000d000 08:01 1769765 /lib64/libgcc_s-4.1.2-20080102.so.1 2aaaaaaab000-2aaaaaaad000 rw-p 2aaaaaaab000 00:00 0 2aaaac000000-2aaaac021000 rw-p 2aaaac000000 00:00 0 2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0 2b469b2cf000-2b469b2d1000 rw-p 2b469b2cf000 00:00 0 2b469b2d1000-2b469b2d9000 r-xp 00000000 08:01 630370 /tmp/mgmtbin/lib/libosmvendor.so.2.0.0 2b469b2d9000-2b469b4d9000 ---p 00008000 08:01 630370 /tmp/mgmtbin/lib/libosmvendor.so.2.0.0 2b469b4d9000-2b469b4da000 rw-p 00008000 08:01 630370 /tmp/mgmtbin/lib/libosmvendor.so.2.0.0 2b469b4da000-2b469b4ea000 r-xp 00000000 08:01 630322 /tmp/mgmtbin/lib/libosmcomp.so.2.0.4 2b469b4ea000-2b469b6ea000 ---p 00010000 08:01 630322 /tmp/mgmtbin/lib/libosmcomp.so.2.0.4 2b469b6ea000-2b469b6eb000 rw-p 00010000 08:01 630322 /tmp/mgmtbin/lib/libosmcomp.so.2.0.4 2b469b6eb000-2b469b6fb000 r-xp 00000000 08:01 630374 /tmp/mgmtbin/lib/libopensm.so.2.1.3 2b469b6fb000-2b469b8fa000 ---p 00010000 08:01 630374 /tmp/mgmtbin/lib/libopensm.so.2.1.3 2b469b8fa000-2b469b8fc000 rw-p 0000f000 08:01 630374 /tmp/mgmtbin/lib/libopensm.so.2.1.3 2b469b8fc000-2b469b8fd000 rw-p 2b469b8fc000 00:00 0 2b469b8fd000-2b469b903000 r-xp 00000000 08:01 630037 /tmp/mgmtbin/lib/libibumad.so.1.0.3 2b469b903000-2b469bb03000 ---p 00006000 08:01 630037 /tmp/mgmtbin/lib/libibumad.so.1.0.3 2b469bb03000-2b469bb04000 rw-p 00006000 08:01 630037 /tmp/mgmtbin/lib/libibumad.so.1.0.3 2b469bb04000-2b469bb05000 rw-p 2b469bb04000 00:00 0 2b469bb18000-2b469bb1a000 rw-p 2b469bb18000 00:00 0 7fff0f7b5000-7fff0f7db000 rw-p 7fff0f7b5000 00:00 0 [stack] ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso] Aborted From vlad at lists.openfabrics.org Mon Feb 23 03:25:49 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 23 Feb 2009 03:25:49 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090223-0200 daily build status Message-ID: <20090223112549.DBCF9E60C4D@openfabrics.org> This email was generated automatically, please do not reply 
git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From jackm at dev.mellanox.co.il Mon Feb 23 03:30:29 2009 
From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 23 Feb 2009 13:30:29 +0200 Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver In-Reply-To: References: <200902171742.38223.jackm@dev.mellanox.co.il> <200902221337.01172.jackm@dev.mellanox.co.il> Message-ID: <200902231330.29669.jackm@dev.mellanox.co.il> On Monday 23 February 2009 06:40, Roland Dreier wrote: > Oh I see... we leave the sysfs stuff around way too long, since we want > to use it for tracking the lifetime of our class device. the patch > below fixes things for me here... there's still room for substantial > cleanup but I think this gets the crashes fixed at least: > I'm not sure that it does. This does not make sysfs access atomic wrt module unloading. I think an app can still lose its timeslice while inside the sysfs access, and module unload can still occur while the app is waiting for a new time slice (although the code pages will not be removed as yet -- see below). While the module code pages will still be available, what prevents module cleanup from deleting all the module's resources? In this case, the app will succeed in invoking the low-level driver (its code is still loaded), but may cause an Oops when that low-level driver code attempts to access low-level driver data structures (which have been freed). What about the patch I just submitted? http://lists.openfabrics.org/pipermail/general/2009-February/057565.html ([ofa-general] [PATCH] ib_core: avoid race condition between sysfs access and low-level module unload) - Jack From eli at dev.mellanox.co.il Mon Feb 23 05:20:08 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 23 Feb 2009 15:20:08 +0200 Subject: [ofa-general] Too many calls to mlx4_CLOSE_PORT()? Message-ID: <20090223132008.GA1188@mtls03> Roland, browsing the code, I see that mlx4_CLOSE_PORT() gets called from, seemingly, too many places.
I would expect it to get called only from __mlx4_ib_modify_qp() when QP0 gets closed, but mlx4_ib_remove() calls it too even though it is soon to be called by __mlx4_ib_modify_qp() due to destroying the MAD QP. It also gets called from mlx4_remove_one() even though by the time this function gets called, the port is already closed. Is there a reason for that? From swise at opengridcomputing.com Mon Feb 23 07:36:49 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 23 Feb 2009 09:36:49 -0600 Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Handle EEH events for active connections. In-Reply-To: References: <20090217215959.16117.17150.stgit@NTAC> Message-ID: <49A2C291.20706@opengridcomputing.com> Roland Dreier wrote: > > - return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); > > + return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); > > minor but the parens around the function call are totally unnecessary. > If we're touching the line anyway may as well leave them off. > > Sure. > > +static int iwch_post_qp_fatal(int id, void *p, void *data) > > +{ > > + struct ib_event event; > > + struct iwch_qp *qhp = p; > > + > > + event.event = IB_EVENT_DEVICE_FATAL; > > + event.device = qhp->ibqp.device; > > + event.element.qp = &qhp->ibqp; > > + BUG_ON(qhp->rhp != data); > > + BUG_ON(qhp->wq.qpid != id); > > + if (qhp->ibqp.event_handler) { > > + PDBG("%s posting DEVICE_FATAL for qpid %u\n", > > + __func__, qhp->wq.qpid); > > + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); > > This doesn't match the IB driver behavior (or the IB spec) -- the > DEVICE_FATAL event is unaffiliated and delivered for the adapter as a > whole. QP events are supposed to be for events connected to a single > QP, not the whole adapter failing. > > I'll change this to QP_FATAL then. > BTW I don't think you need the * here, do you? Would be easier to read > to just call it like > > qhp->ibqp.event_handler(&event, qhp->ibqp.qp_context) > > Ok. 
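Roland's remark that the explicit `*` is unnecessary holds for C in general: calling through a function pointer as `fp(args)` or as `(*fp)(args)` has the same effect. The following is a minimal user-space sketch of that point; `struct demo_event`, `demo_handler_t`, `count_event()` and `dispatch_both_ways()` are hypothetical stand-ins, not the kernel's `ib_event` or `event_handler` definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event type, loosely modelled on struct ib_event. */
struct demo_event {
	int type;
};

/* Hypothetical handler signature: (event, context), like event_handler. */
typedef void (*demo_handler_t)(struct demo_event *ev, void *ctx);

/* Example handler: adds the event type to an int counter passed as ctx. */
static void count_event(struct demo_event *ev, void *ctx)
{
	int *count = ctx;

	assert(ev != NULL && ctx != NULL);
	*count += ev->type;
}

/* Invokes the handler once with each call syntax; both forms call the
 * same function, so the handler runs twice. */
static void dispatch_both_ways(demo_handler_t handler, struct demo_event *ev,
			       void *ctx)
{
	if (handler != NULL) {
		(*handler)(ev, ctx);	/* explicit dereference */
		handler(ev, ctx);	/* plain call, identical effect */
	}
}
```

Since the two forms are interchangeable, the choice between them is purely one of readability, which is the basis of the review comment.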
> > +int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e) > > +{ > > + int error=0; > > + struct cxio_rdev *rdev; > > + > > + rdev = (struct cxio_rdev *)tdev->ulp; > > + if (rdev->flags) { > > Might be nice to wrap this rdev->flags test up in a trivial inline > function (eg iwch_eeh_set() or something like that) in case other things > get put into those flags later. > Agreed. > > + kfree_skb(skb); > > + return -EIO; > > + } > > + error = l2t_send(tdev, skb, l2e); > > + if (error) > > + kfree_skb(skb); > > + return error; > > +} > > The kfree_skb() calls here change behavior -- eg you have the change: > > > - l2t_send(ep->com.tdev, skb, ep->l2t); > > - return 0; > > + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); > > and now if l2t_send() returns an error the skb is freed, where before it > wasn't. > In looking at the l2t_send code, it doesn't free on failure, so I believe this was a memory leak in the existing error path. > Also I'm wondering why you want these wrappers in iw_cxgb3 -- would it > not make more sense for the cxgb3 l2t_send() to check the eeh state and > always behave appropriately? Or is it more complicated than that? > > Maybe. Divy, what do you think? Steve. > - R. > From stijn.deweirdt at ugent.be Mon Feb 23 06:40:04 2009 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Mon, 23 Feb 2009 15:40:04 +0100 Subject: [ofa-general] el5.3 backport of 1.4(.0) Message-ID: <1235400004.4588.43.camel@spike.ugent.be> hi all, i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones). one thing we would also like to look at is switching from OFED 1.3.2 to OFED 1.4. and one thing i noticed is that the necessary 5.3 backport fixes only exist in the current 1.4.1 daily snapshots. did anyone already try to backport the el5.3 backport fixes from 1.4.1 to 1.4.0? 
many thanks, stijn From tziporet at mellanox.co.il Mon Feb 23 08:10:38 2009 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 23 Feb 2009 18:10:38 +0200 Subject: [ofa-general] OFED (EWG) meeting agenda for today (Feb 23) In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com> Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01D8771F@mtlexch01.mtl.com> Hi All, Due to an unexpected issue I cannot attend the meeting today :-( I sent a mail to Gopal asking him to replace me but have not received a response yet. If he can't, maybe Woody or Betsy can. In any case - these are the items that should be covered: a. OFED 1.4.1 release: 1. SLES 11 - backport progress - Jeff Becker 2. Open MPI 1.3.1 - Jeff Squyres 3. RDS with iWARP support - Steve Wise 4. NFS/RDMA backports - at least to RH 5.2/3 - Steve Wise 5. Critical bugs: 1287 maj RHEL jackm at mellanox.co.il IPoIB datagram mode initial packet loss 1516 cri RHEL andy.grover at oracle.com Kernel panic on RHAS4.x loading RDS Note: There is a 1.4.1 release number in bugzilla - please change the bug release number to 1.4.1 if you wish it to be fixed for OFED 1.4.1 b. Open discussion Tziporet From john.russo at qlogic.com Mon Feb 23 08:11:13 2009 From: john.russo at qlogic.com (John Russo) Date: Mon, 23 Feb 2009 10:11:13 -0600 Subject: [ofa-general] RE: OFED (EWG) meeting agenda for today (Feb 23) In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01D8771F@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com> <5D49E7A8952DC44FB38C38FA0D758EAD01D8771F@mtlexch01.mtl.com> Message-ID: Betsy can't make it today. I will be covering for her. Worst case, I will cover the items that you listed.
-----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren Sent: Monday, February 23, 2009 11:11 AM To: ewg at lists.openfabrics.org Cc: general at lists.openfabrics.org Subject: [ewg] OFED (EWG) meeting agenda for today (Feb 23) Hi All, Due to an unexpected issue I cannot attend the meeting today :-( I sent a mail to Gopal asking him to replace me but have not received a response yet. If he can't, maybe Woody or Betsy can. In any case - these are the items that should be covered: a. OFED 1.4.1 release: 1. SLES 11 - backport progress - Jeff Becker 2. Open MPI 1.3.1 - Jeff Squyres 3. RDS with iWARP support - Steve Wise 4. NFS/RDMA backports - at least to RH 5.2/3 - Steve Wise 5. Critical bugs: 1287 maj RHEL jackm at mellanox.co.il IPoIB datagram mode initial packet loss 1516 cri RHEL andy.grover at oracle.com Kernel panic on RHAS4.x loading RDS Note: There is a 1.4.1 release number in bugzilla - please change the bug release number to 1.4.1 if you wish it to be fixed for OFED 1.4.1 b. Open discussion Tziporet _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From tziporet at dev.mellanox.co.il Mon Feb 23 08:16:38 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 23 Feb 2009 18:16:38 +0200 Subject: [ofa-general] Re: [ewg] RE: OFED (EWG) meeting agenda for today (Feb 23) In-Reply-To: References: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com> <5D49E7A8952DC44FB38C38FA0D758EAD01D8771F@mtlexch01.mtl.com> Message-ID: <49A2CBE6.7050903@mellanox.co.il> John Russo wrote: > Betsy can't make it today. I will be covering for her. Worst case, I will cover the items that you listed.
> > > Many thanks Tziporet From tom at opengridcomputing.com Mon Feb 23 08:30:37 2009 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 23 Feb 2009 10:30:37 -0600 Subject: [ofa-general] Re: NFSRDMA connectathon prelim. testing status, In-Reply-To: <499FCA5F.5070604@mellanox.com> References: <499FCA5F.5070604@mellanox.com> Message-ID: <49A2CF2D.6020002@opengridcomputing.com> Vu: What memory registration model are you using? Vu Pham wrote: > Hi Tom, > > I have both nfsrdma client and server on 2.6.29-rc5 kernel, > nfs-utils-1.1.4. I'm using both Infinihost III (ib_mthca) and ConnectX > (mlx4_ib) HCAs > I have seen several problems during my testing at NFS Connectathon 2009 > > 1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the > client can not mount. Talking to Tom Talpey and scanning the code, I saw > that xprtrdma module is using ib_reg_phys_mr() and mlx4_ib verbs > provider does not have the implementation for this verb. > If I have client on mlx4_ib and server on ib_mthca, I hit the following > crash because of bad error handling in xprtrdma (see file attached - > mlx4_mount_problem.log) > > Because of this problem, I use InfiniHost III (ib_mthca) for all of my > tests at Connectathon > > 2. Testing Linux nfsrdma client against both Linux and OpenSolaris > nfsrdma servers, I hit the process hung problem during the > connectathon's lock test (seeing sync_page_1.log and sync_page_2.log > attached files). I can only reproduce it when I ran connectathon more > than 500 iterations (-N 1000) > I can NOT reproduce the problem with nfs client/server over IPoIB > > 3. 
Testing openSolaris nfsrdma client against linux nfsrdma server, I > hit the following BUG_ON() right away(see file attached - svcrdma_send.log) > > thanks, > -vu > From sashak at voltaire.com Mon Feb 23 09:03:42 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 23 Feb 2009 19:03:42 +0200 Subject: [ofa-general] [PATCH] opensm/osm_subnet: fix crash in qos string config parameters reloading In-Reply-To: <49A252FE.4010006@gmail.com> References: <49A252FE.4010006@gmail.com> Message-ID: <20090223170342.GE7641@sashak.voltaire.com> This fixes a double free() crash when reloading qos string config parameters. Assuming that qos parameters can be specified only in the config file, we now always keep them in sync with the options copy loaded from the file. Signed-off-by: Sasha Khapyorsky --- On 09:40 Mon 23 Feb , Eli Dorfman (Voltaire) wrote: > Command Line Arguments: > Log File: /var/log/opensm.log > ------------------------------------------------- > OpenSM 3.3.0_c4d9bcf [snip...] > Using default GUID 0x2c9020022f019 > Loading Cached Option:qos_vlarb_high = 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 > *** glibc detected *** ./sbin/opensm: double free or corruption (!prev): 0x000000001bd932b0 *** This happens because the qos string parameter is freed separately in subn_init_qos_options(), while its mirror pointer in the file config copy still refers to the already-freed memory. Thanks for finding this. The patch should fix the issue.
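The failure mode described above, where two copies of the options hold the same heap pointer so that freeing through one leaves the other dangling, can be illustrated in user-space C. This is a minimal sketch under stated assumptions: `struct qos_opts`, `load_qos_opts()` and `init_qos_opts()` are hypothetical simplifications, not OpenSM's actual types or functions, and the sketch assumes the file copy is a shallow copy that shares the string pointer. `init_qos_opts()` follows the idea of the patch by copying the reset state into the file copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for osm_qos_options_t: only one
 * heap-allocated string field is modelled here. */
struct qos_opts {
	int max_vls;
	char *sl2vl;
};

/* Simulate loading a string option: the live options get a freshly
 * allocated string, and the "file copy" is a shallow copy, so both
 * structures end up pointing at the same buffer (assumption). */
static void load_qos_opts(struct qos_opts *opt, struct qos_opts *file_copy,
			  const char *sl2vl)
{
	size_t len = strlen(sl2vl) + 1;

	opt->sl2vl = malloc(len);
	if (opt->sl2vl)
		memcpy(opt->sl2vl, sl2vl, len);
	memcpy(file_copy, opt, sizeof(*file_copy));
}

/* Reset the live options and free their string.  Without a mirror,
 * file_copy->sl2vl would still hold the freed pointer, and a later
 * free() of it would be a double free.  Passing the file copy as the
 * mirror overwrites it with the reset state, in the spirit of the
 * patched subn_init_qos_options(). */
static void init_qos_opts(struct qos_opts *opt, struct qos_opts *mirror)
{
	opt->max_vls = 0;
	free(opt->sl2vl);	/* free(NULL) is a no-op */
	opt->sl2vl = NULL;
	if (mirror)
		memcpy(mirror, opt, sizeof(*mirror));
}
```

After `init_qos_opts(&live, &file)`, both pointers are NULL, so a later `free(file.sl2vl)` is harmless instead of triggering glibc's "double free or corruption" abort.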
Sasha opensm/opensm/osm_subnet.c | 29 ++++++++++++++++++----------- 1 files changed, 18 insertions(+), 11 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 01478be..b3100a4 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -640,7 +640,7 @@ static void subn_set_default_qos_options(IN osm_qos_options_t * opt) opt->sl2vl = OSM_DEFAULT_QOS_SL2VL; } -static void subn_init_qos_options(IN osm_qos_options_t * opt) +static void subn_init_qos_options(osm_qos_options_t *opt, osm_qos_options_t *f) { opt->max_vls = 0; opt->high_limit = -1; @@ -653,6 +653,8 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt) if (opt->sl2vl) free(opt->sl2vl); opt->sl2vl = NULL; + if (f) + memcpy(f, opt, sizeof(*f)); } /********************************************************************** @@ -743,11 +745,11 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->no_clients_rereg = FALSE; p_opt->prefix_routes_file = strdup(OSM_DEFAULT_PREFIX_ROUTES_FILE); p_opt->consolidate_ipv6_snm_req = FALSE; - subn_init_qos_options(&p_opt->qos_options); - subn_init_qos_options(&p_opt->qos_ca_options); - subn_init_qos_options(&p_opt->qos_sw0_options); - subn_init_qos_options(&p_opt->qos_swe_options); - subn_init_qos_options(&p_opt->qos_rtr_options); + subn_init_qos_options(&p_opt->qos_options, NULL); + subn_init_qos_options(&p_opt->qos_ca_options, NULL); + subn_init_qos_options(&p_opt->qos_sw0_options, NULL); + subn_init_qos_options(&p_opt->qos_swe_options, NULL); + subn_init_qos_options(&p_opt->qos_rtr_options, NULL); } /********************************************************************** @@ -1192,11 +1194,16 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn) return -1; } - subn_init_qos_options(&p_opts->qos_options); - subn_init_qos_options(&p_opts->qos_ca_options); - subn_init_qos_options(&p_opts->qos_sw0_options); - subn_init_qos_options(&p_opts->qos_swe_options); - 
subn_init_qos_options(&p_opts->qos_rtr_options); + subn_init_qos_options(&p_opts->qos_options, + &p_opts->file_opts->qos_options); + subn_init_qos_options(&p_opts->qos_ca_options, + &p_opts->file_opts->qos_ca_options); + subn_init_qos_options(&p_opts->qos_sw0_options, + &p_opts->file_opts->qos_sw0_options); + subn_init_qos_options(&p_opts->qos_swe_options, + &p_opts->file_opts->qos_swe_options); + subn_init_qos_options(&p_opts->qos_rtr_options, + &p_opts->file_opts->qos_rtr_options); while (fgets(line, 1023, opts_file) != NULL) { /* get the first token */ -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Mon Feb 23 09:21:47 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 23 Feb 2009 19:21:47 +0200 Subject: [ofa-general] [PATCH] opensm/main.c: remove enable_stack_dump() call Message-ID: <20090223172147.GH7641@sashak.voltaire.com> The enable_stack_dump() symbol was defined in the already removed libibcommon. There is still a conditional call to this function (under #ifdef _DEBUG_) in opensm/main.c, which breaks opensm linkage when configured with --enable-debug. Remove it. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/main.c | 3 --- 1 files changed, 0 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index e22c2c4..47fd658 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -596,9 +596,6 @@ int main(int argc, char *argv[]) osm_is_debug(), cl_is_debug()); exit(1); } -#if defined (_DEBUG_) && defined (OSM_VENDOR_INTF_OPENIB) - enable_stack_dump(1); -#endif printf("-------------------------------------------------\n"); printf("%s\n", OSM_VERSION); -- 1.6.1.2.319.gbd9e From vuhuong at mellanox.com Mon Feb 23 10:03:24 2009 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 23 Feb 2009 10:03:24 -0800 Subject: [ofa-general] Re: NFSRDMA connectathon prelim.
testing status, In-Reply-To: <49A2CF2D.6020002@opengridcomputing.com> References: <499FCA5F.5070604@mellanox.com> <49A2CF2D.6020002@opengridcomputing.com> Message-ID: <49A2E4EC.7010202@mellanox.com> Tom, > Vu: > > What memory registration model are you using? It is 6 (when the connection/mount established) > > Vu Pham wrote: >> Hi Tom, >> >> I have both nfsrdma client and server on 2.6.29-rc5 kernel, >> nfs-utils-1.1.4. I'm using both Infinihost III (ib_mthca) and >> ConnectX (mlx4_ib) HCAs >> I have seen several problems during my testing at NFS Connectathon 2009 >> >> 1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the >> client can not mount. Talking to Tom Talpey and scanning the code, I >> saw that xprtrdma module is using ib_reg_phys_mr() and mlx4_ib verbs >> provider does not have the implementation for this verb. >> If I have client on mlx4_ib and server on ib_mthca, I hit the >> following crash because of bad error handling in xprtrdma (see file >> attached - mlx4_mount_problem.log) >> >> Because of this problem, I use InfiniHost III (ib_mthca) for all of >> my tests at Connectathon >> >> 2. Testing Linux nfsrdma client against both Linux and OpenSolaris >> nfsrdma servers, I hit the process hung problem during the >> connectathon's lock test (seeing sync_page_1.log and sync_page_2.log >> attached files). I can only reproduce it when I ran connectathon more >> than 500 iterations (-N 1000) >> I can NOT reproduce the problem with nfs client/server over IPoIB >> >> 3. Testing openSolaris nfsrdma client against linux nfsrdma server, I >> hit the following BUG_ON() right away(see file attached - >> svcrdma_send.log) >> >> thanks, >> -vu >> > From tmtalpey at rcn.com Mon Feb 23 10:10:33 2009 From: tmtalpey at rcn.com (Tom Talpey) Date: Mon, 23 Feb 2009 13:10:33 -0500 Subject: [ofa-general] Re: NFSRDMA connectathon prelim. 
testing status, In-Reply-To: <49A2E4EC.7010202@mellanox.com> References: <499FCA5F.5070604@mellanox.com> <49A2CF2D.6020002@opengridcomputing.com> <49A2E4EC.7010202@mellanox.com> Message-ID: <20090223181737.90686E61019@openfabrics.org> At 01:03 PM 2/23/2009, Vu Pham wrote: >Tom, > >> Vu: >> >> What memory registration model are you using? > >It is 6 (when the connection/mount established) i.e. all physical (get_dma_mr). Long chunk lists due to discontiguous physical pages. We'll try with ConnectX and frmr's later today here at Connectathon. This will reduce the chunk lists to roughly three entries (head, pages, tail). With the two assertions disabled, we're again passing all general and special tests from the OpenSolaris client, btw. :-) Tom. > > >> >> Vu Pham wrote: >>> Hi Tom, >>> >>> I have both nfsrdma client and server on 2.6.29-rc5 kernel, >>> nfs-utils-1.1.4. I'm using both Infinihost III (ib_mthca) and >>> ConnectX (mlx4_ib) HCAs >>> I have seen several problems during my testing at NFS Connectathon 2009 >>> >>> 1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the >>> client can not mount. Talking to Tom Talpey and scanning the code, I >>> saw that xprtrdma module is using ib_reg_phys_mr() and mlx4_ib verbs >>> provider does not have the implementation for this verb. >>> If I have client on mlx4_ib and server on ib_mthca, I hit the >>> following crash because of bad error handling in xprtrdma (see file >>> attached - mlx4_mount_problem.log) >>> >>> Because of this problem, I use InfiniHost III (ib_mthca) for all of >>> my tests at Connectathon >>> >>> 2. Testing Linux nfsrdma client against both Linux and OpenSolaris >>> nfsrdma servers, I hit the process hung problem during the >>> connectathon's lock test (seeing sync_page_1.log and sync_page_2.log >>> attached files). I can only reproduce it when I ran connectathon more >>> than 500 iterations (-N 1000) >>> I can NOT reproduce the problem with nfs client/server over IPoIB >>> >>> 3. 
Testing openSolaris nfsrdma client against linux nfsrdma server, I >>> hit the following BUG_ON() right away(see file attached - >>> svcrdma_send.log) >>> >>> thanks, >>> -vu >>> >> > > From rdreier at cisco.com Mon Feb 23 10:31:24 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 23 Feb 2009 10:31:24 -0800 Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver In-Reply-To: <200902231330.29669.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 23 Feb 2009 13:30:29 +0200") References: <200902171742.38223.jackm@dev.mellanox.co.il> <200902221337.01172.jackm@dev.mellanox.co.il> <200902231330.29669.jackm@dev.mellanox.co.il> Message-ID: > I'm not sure that it does. This does not make sysfs access atomic wrt module unloading. > I think an app can still lose its timeslice while inside the sysfs access, and module > unload can still occur while the app is waiting for a new time slice (although the code pages > will not be removed as yet -- see below). Not sure I follow... the low-level driver must handle requests until ib_unregister_device() returns, and with the change I proposed, ib_unregister_device() will not return until all sysfs files are gone (and no open file handles remain). > What about the patch I just submitted? I'd rather not add a superfluous mutex that adds complexity when a simpler solution is available. - R. From brian at sun.com Mon Feb 23 12:08:07 2009 From: brian at sun.com (Brian J.
Murrell) Date: Mon, 23 Feb 2009 15:08:07 -0500 Subject: [ofa-general] build warnings on rhel4 U6 In-Reply-To: <200902230928.34846.jackm@dev.mellanox.co.il> References: <1233949198.3257.19.camel@pc.interlinx.bc.ca> <200902230928.34846.jackm@dev.mellanox.co.il> Message-ID: <1235419687.12136.111.camel@pc.interlinx.bc.ca> On Mon, 2009-02-23 at 09:28 +0200, Jack Morgenstein wrote: > In the backport spinlock.h file, try the following: > > #ifndef assert_spin_locked > #define assert_spin_locked(lock) do { (void)(lock); } while(0) > #endif Indeed. That would be a solution for the end-user but that doesn't help us as a third-party software developer (i.e. being restricted to building our software with "GA" releases of OFED -- so that our release doesn't turn into a patching nightmare for our end-users). Indeed, this probably should have been a BZ filing as my goal was equally as much to alert somebody to the problem to ensure future releases don't have the same problem. Cheers and many thanks for the input. b. From john.russo at qlogic.com Mon Feb 23 13:06:32 2009 From: john.russo at qlogic.com (John Russo) Date: Mon, 23 Feb 2009 15:06:32 -0600 Subject: [ofa-general] OFED Minutes: 02/23/09 Message-ID: These are the OFED (EWG) meeting minutes for Feb 23 on OFED 1.4.1 release Meeting Summary: ============== 1. Update on 1.4.1. 2. Update on 1.4.1. PRs 3. Update on Sonoma agenda Details: ====== 1. Update on 1.4.1.: 1. SLES 11 - backport progress - Jeff Becker Just received access to RC4 source and started to build on Sunday Basic IB builds without change. MTHCA builds Connect-X next... Followed by ULPs 2. Open MPI 1.3.1 - Jeff Squyres 1.3.1 has not been released yet.
Weekly Open MPI on Tuesdays Could release in 1 or 2 days if things go well Will send email and upload to Vlad 3. RDS with iWARP support - Steve Wise All of the latest updates pushed. Will begin testing with Oracle this week. CRTEST on 4 node cluster. Testing normally takes a couple of weeks Rupert asked about updating test plans for April event Steve will try to supply some info. Some tests are in OFED release May have to go to Oracle directly for other tests 4. NFS/RDMA backports - at least to RH 5.2/3 - Steve Wise 2.6.25 & 2.6.22 backports pass basic tests. Will try to push changes out this week RedHat 5.2: Most tests passing. Will push after .25 and .26 RedHat 5.3: In queue behind RedHat 5.2 Rupert asked for tests on these changes also 2. Update on 1.4.1. PRs 1287 maj RHEL jackm at mellanox.co.il IPoIB datagram mode initial packet loss No one on the call to respond to this issue 1516 cri RHEL andy.grover at oracle.com Kernel panic on RHAS4.x loading RDS No one on the call to address this either. Was told that Andy will be pinged and asked to respond Numerous PRs are still listed in Bugzilla as Blocking or Critical. John asked all participants to look at the PRs assigned to them and adjust their status as appropriate. 3. Sonoma updates from Bill Boas: Still struggling to get attendees and speakers Hope to extend early bird discounts into early May A side conversation started at this point which diverted off into general issues/wishlist for OFED as well as other topics to be discussed at Sonoma. I will not capture the details of those discussions here. Rupert reminded everyone of the UNH/IHL testing of OFED 1.4.1 the week of March 16-20 and pushed us to have as many patches in place at that time as possible. John Russo __________________________ John F.
Russo Manager, Engineering QLogic Corporation 780 Fifth Avenue, Suite 140 King of Prussia, PA 19406 Direct: 610-233-4866 Main: 610-233-4800 Fax: 610-233-4777 Cell: 610-246-9903 Email: John.Russo at qlogic.com www.qlogic.com True success is the undeniable truth that we have proved ourselves. -Joe Luppino-Esposito From vuhuong at mellanox.com Mon Feb 23 15:21:04 2009 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 23 Feb 2009 15:21:04 -0800 Subject: [ofa-general] Re: NFSRDMA connectathon prelim. testing status, In-Reply-To: <49A2E4EC.7010202@mellanox.com> References: <499FCA5F.5070604@mellanox.com> <49A2CF2D.6020002@opengridcomputing.com> <49A2E4EC.7010202@mellanox.com> Message-ID: <49A32F60.2010803@mellanox.com> Tom, >> What memory registration model are you using? > > It is 6 (when the connection/mount established) > > >> >> Vu Pham wrote: >>> >>> >>> 2. Testing Linux nfsrdma client against both Linux and OpenSolaris >>> nfsrdma servers, I hit the process hung problem during the >>> connectathon's lock test (seeing sync_page_1.log and sync_page_2.log >>> attached files). I can only reproduce it when I ran connectathon >>> more than 500 iterations (-N 1000) >>> I can NOT reproduce the problem with nfs client/server over IPoIB With mem_reg=4, I cannot reproduce this problem (running against both OpenSolaris and Linux servers). >>> >>> 3.
Testing openSolaris nfsrdma client against linux nfsrdma server, >>> I hit the following BUG_ON() right away(see file attached - >>> svcrdma_send.log) >>> After disabling the two BUG_ON()s, we can run the test multiple times without problems so far. -vu From swise at opengridcomputing.com Mon Feb 23 15:54:45 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 23 Feb 2009 17:54:45 -0600 Subject: [ofa-general] [PATCH v2] RDMA/cxgb3: Handle EEH events for active connections. Message-ID: <20090223235445.21618.85001.stgit@build.ogc.int> - Wrap calls into cxgb3 and fail them if we're in the middle of an EEH event. - Correctly unwind and release the endpoint and other resources when we are in an EEH event. - Post a QP_FATAL event on all active QPs when cxgb3 notifies iw_cxgb3 of a fatal error. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 10 ++-- drivers/infiniband/hw/cxgb3/cxio_hal.h | 6 ++ drivers/infiniband/hw/cxgb3/iwch.c | 26 +++++++++ drivers/infiniband/hw/cxgb3/iwch.h | 5 ++ drivers/infiniband/hw/cxgb3/iwch_cm.c | 90 +++++++++++++++++++++++--------- drivers/infiniband/hw/cxgb3/iwch_qp.c | 4 + 6 files changed, 107 insertions(+), 34 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index eeae5f5..1db88dd 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -152,7 +152,7 @@ static int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) sge_cmd = qpid << 8 | 3; wqe->sge_cmd = cpu_to_be64(sge_cmd); skb->priority = CPL_PRIORITY_CONTROL; - return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); + return iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb); } int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) @@ -571,7 +571,7 @@ static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) (unsigned long long) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq, 1 << T3_CTRL_QP_SIZE_LOG2); skb->priority = CPL_PRIORITY_CONTROL; - return
(cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); + return iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb); err: kfree_skb(skb); return err; @@ -701,7 +701,7 @@ static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, u32 stag_idx; u32 wptr; - if (rdev_p->flags) + if (cxio_fatal_error(rdev_p)) return -EIO; stag_state = stag_state > 0; @@ -858,7 +858,7 @@ int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size); wqe->irs = cpu_to_be32(attr->irs); skb->priority = 0; /* 0=>ToeQ; 1=>CtrlQ */ - return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); + return iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb); } void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb) @@ -1024,9 +1024,9 @@ void cxio_rdev_close(struct cxio_rdev *rdev_p) cxio_hal_pblpool_destroy(rdev_p); cxio_hal_rqtpool_destroy(rdev_p); list_del(&rdev_p->entry); - rdev_p->t3cdev_p->ulp = NULL; cxio_hal_destroy_ctrl_qp(rdev_p); cxio_hal_destroy_resource(rdev_p->rscp); + rdev_p->t3cdev_p->ulp = NULL; } } diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 9ed65b0..2fd5d03 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -112,6 +112,11 @@ struct cxio_rdev { #define CXIO_ERROR_FATAL 1 }; +static inline int cxio_fatal_error(struct cxio_rdev *rdev_p) +{ + return (rdev_p->flags & CXIO_ERROR_FATAL); +} + static inline int cxio_num_stags(struct cxio_rdev *rdev_p) { return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5)); @@ -185,6 +190,7 @@ void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count); void cxio_flush_hw_cq(struct t3_cq *cq); int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, u8 *cqe_flushed, u64 *cookie, u32 *credit); +int iwch_cxgb3_ofld_send(struct t3cdev *tdev, struct sk_buff *skb); #define MOD "iw_cxgb3: " #define PDBG(fmt, args...) 
pr_debug(MOD fmt, ## args) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c index 37a4fc2..3548861 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.c +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -162,15 +162,37 @@ static void close_rnic_dev(struct t3cdev *tdev) mutex_unlock(&dev_mutex); } +static int iwch_post_qp_fatal(int id, void *p, void *data) +{ + struct ib_event event; + struct iwch_qp *qhp = p; + + event.event = IB_EVENT_QP_FATAL; + event.device = qhp->ibqp.device; + event.element.qp = &qhp->ibqp; + BUG_ON(qhp->rhp != data); + BUG_ON(qhp->wq.qpid != id); + if (qhp->ibqp.event_handler) { + PDBG("%s posting QP_FATAL for qpid %u\n", + __func__, qhp->wq.qpid); + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); + } + return 0; +} + static void iwch_err_handler(struct t3cdev *tdev, u32 status, u32 error) { struct cxio_rdev *rdev = tdev->ulp; + struct iwch_dev *rnicp = rdev_to_iwch_dev(rdev); - if (status == OFFLOAD_STATUS_DOWN) + if (status == OFFLOAD_STATUS_DOWN) { rdev->flags = CXIO_ERROR_FATAL; + spin_lock_irq(&rnicp->lock); + idr_for_each(&rnicp->qpidr, iwch_post_qp_fatal, rnicp); + spin_unlock_irq(&rnicp->lock); + } return; - } static int __init iwch_init_module(void) diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h index 3773453..8473550 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.h +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -117,6 +117,11 @@ static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) return container_of(ibdev, struct iwch_dev, ibdev); } +static inline struct iwch_dev *rdev_to_iwch_dev(struct cxio_rdev *rdev) +{ + return container_of(rdev, struct iwch_dev, rdev); +} + static inline int t3b_device(const struct iwch_dev *rhp) { return rhp->rdev.t3cdev_p->type == T3B; diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 8699947..ad38c45 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ 
b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -139,6 +139,38 @@ static void stop_ep_timer(struct iwch_ep *ep) put_ep(&ep->com); } +int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e) +{ + int error=0; + struct cxio_rdev *rdev; + + rdev = (struct cxio_rdev *)tdev->ulp; + if (cxio_fatal_error(rdev)) { + kfree_skb(skb); + return -EIO; + } + error = l2t_send(tdev, skb, l2e); + if (error) + kfree_skb(skb); + return error; +} + +int iwch_cxgb3_ofld_send(struct t3cdev *tdev, struct sk_buff *skb) +{ + int error=0; + struct cxio_rdev *rdev; + + rdev = (struct cxio_rdev *)tdev->ulp; + if (cxio_fatal_error(rdev)) { + kfree_skb(skb); + return -EIO; + } + error = cxgb3_ofld_send(tdev, skb); + if (error) + kfree_skb(skb); + return error; +} + static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) { struct cpl_tid_release *req; @@ -150,7 +182,7 @@ static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid)); skb->priority = CPL_PRIORITY_SETUP; - cxgb3_ofld_send(tdev, skb); + iwch_cxgb3_ofld_send(tdev, skb); return; } @@ -172,8 +204,7 @@ int iwch_quiesce_tid(struct iwch_ep *ep) req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE); skb->priority = CPL_PRIORITY_DATA; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + return iwch_cxgb3_ofld_send(ep->com.tdev, skb); } int iwch_resume_tid(struct iwch_ep *ep) @@ -194,8 +225,7 @@ int iwch_resume_tid(struct iwch_ep *ep) req->val = 0; skb->priority = CPL_PRIORITY_DATA; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + return iwch_cxgb3_ofld_send(ep->com.tdev, skb); } static void set_emss(struct iwch_ep *ep, u16 opt) @@ -382,7 +412,7 @@ static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb) PDBG("%s t3cdev %p\n", __func__, dev); req->cmd = CPL_ABORT_NO_RST; - cxgb3_ofld_send(dev, skb); + iwch_cxgb3_ofld_send(dev, skb); } static int 
send_halfclose(struct iwch_ep *ep, gfp_t gfp) @@ -402,8 +432,7 @@ static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON)); req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid)); - l2t_send(ep->com.tdev, skb, ep->l2t); - return 0; + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); } static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) @@ -424,8 +453,7 @@ static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid)); req->cmd = CPL_ABORT_SEND_RST; - l2t_send(ep->com.tdev, skb, ep->l2t); - return 0; + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); } static int send_connect(struct iwch_ep *ep) @@ -469,8 +497,7 @@ static int send_connect(struct iwch_ep *ep) req->opt0l = htonl(opt0l); req->params = 0; req->opt2 = htonl(opt2); - l2t_send(ep->com.tdev, skb, ep->l2t); - return 0; + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); } static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) @@ -527,7 +554,7 @@ static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) req->sndseq = htonl(ep->snd_seq); BUG_ON(ep->mpa_skb); ep->mpa_skb = skb; - l2t_send(ep->com.tdev, skb, ep->l2t); + iwch_l2t_send(ep->com.tdev, skb, ep->l2t); start_ep_timer(ep); state_set(&ep->com, MPA_REQ_SENT); return; @@ -578,8 +605,7 @@ static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen) req->sndseq = htonl(ep->snd_seq); BUG_ON(ep->mpa_skb); ep->mpa_skb = skb; - l2t_send(ep->com.tdev, skb, ep->l2t); - return 0; + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); } static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) @@ -630,8 +656,7 @@ static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) req->sndseq = htonl(ep->snd_seq); ep->mpa_skb = skb; state_set(&ep->com, 
MPA_REP_SENT); - l2t_send(ep->com.tdev, skb, ep->l2t); - return 0; + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t); } static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) @@ -795,7 +820,7 @@ static int update_rx_credits(struct iwch_ep *ep, u32 credits) OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid)); req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1)); skb->priority = CPL_PRIORITY_ACK; - cxgb3_ofld_send(ep->com.tdev, skb); + iwch_cxgb3_ofld_send(ep->com.tdev, skb); return credits; } @@ -1203,8 +1228,7 @@ static int listen_start(struct iwch_listen_ep *ep) req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); skb->priority = 1; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + return iwch_cxgb3_ofld_send(ep->com.tdev, skb); } static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) @@ -1237,8 +1261,7 @@ static int listen_stop(struct iwch_listen_ep *ep) req->cpu_idx = 0; OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); skb->priority = 1; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + return iwch_cxgb3_ofld_send(ep->com.tdev, skb); } static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, @@ -1286,7 +1309,7 @@ static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) rpl->opt2 = htonl(opt2); rpl->rsvd = rpl->opt2; /* workaround for HW bug */ skb->priority = CPL_PRIORITY_SETUP; - l2t_send(ep->com.tdev, skb, ep->l2t); + iwch_l2t_send(ep->com.tdev, skb, ep->l2t); return; } @@ -1315,7 +1338,7 @@ static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip, rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); rpl->opt2 = 0; rpl->rsvd = rpl->opt2; - cxgb3_ofld_send(tdev, skb); + iwch_cxgb3_ofld_send(tdev, skb); } } @@ -1613,7 +1636,7 @@ static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid)); 
rpl->cmd = CPL_ABORT_NO_RST; - cxgb3_ofld_send(ep->com.tdev, rpl_skb); + iwch_cxgb3_ofld_send(ep->com.tdev, rpl_skb); out: if (release) release_ep_resources(ep); @@ -2017,8 +2040,11 @@ int iwch_destroy_listen(struct iw_cm_id *cm_id) ep->com.rpl_done = 0; ep->com.rpl_err = 0; err = listen_stop(ep); + if (err) + goto done; wait_event(ep->com.waitq, ep->com.rpl_done); cxgb3_free_stid(ep->com.tdev, ep->stid); +done: err = ep->com.rpl_err; cm_id->rem_ref(cm_id); put_ep(&ep->com); @@ -2030,12 +2056,22 @@ int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) int ret=0; unsigned long flags; int close = 0; + int fatal = 0; + struct t3cdev *tdev; + struct cxio_rdev *rdev; spin_lock_irqsave(&ep->com.lock, flags); PDBG("%s ep %p state %s, abrupt %d\n", __func__, ep, states[ep->com.state], abrupt); + tdev = (struct t3cdev *)ep->com.tdev; + rdev = (struct cxio_rdev *)tdev->ulp; + if (cxio_fatal_error(rdev)) { + fatal = 1; + close_complete_upcall(ep); + ep->com.state = DEAD; + } switch (ep->com.state) { case MPA_REQ_WAIT: case MPA_REQ_SENT: @@ -2075,7 +2111,11 @@ int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) ret = send_abort(ep, NULL, gfp); else ret = send_halfclose(ep, gfp); + if (ret) + fatal = 1; } + if (fatal) + release_ep_resources(ep); return ret; } diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index aa72d18..9324aa1 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -751,7 +751,7 @@ int iwch_post_zb_read(struct iwch_qp *qhp) wqe->send.wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(qhp->ep->hwtid)| V_FW_RIWR_LEN(flit_cnt)); skb->priority = CPL_PRIORITY_DATA; - return cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb); + return iwch_cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb); } /* @@ -783,7 +783,7 @@ int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg) V_FW_RIWR_FLAGS(T3_COMPLETION_FLAG | T3_NOTIFY_FLAG)); wqe->send.wrh.gen_tid_len = 
cpu_to_be32(V_FW_RIWR_TID(qhp->ep->hwtid)); skb->priority = CPL_PRIORITY_DATA; - return cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb); + return iwch_cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb); } /* From sean.hefty at intel.com Mon Feb 23 17:34:51 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 23 Feb 2009 17:34:51 -0800 Subject: [ofa-general] [PATCH] [ib-diag] saquery: add support for WinOF Message-ID: <608DC3F308254BB78890D5FD91B7CB33@amr.corp.intel.com> A lot of type casting with include fix-ups. Luckily, because the macro CHECK_AND_SET_VAL() was added, I could add type casts into the macro and avoid sprinkling even more throughout the code. Signed-off-by: Sean Hefty --- infiniband-diags/src/saquery.c | 80 ++++++++++++++++++++++------------------ 1 files changed, 44 insertions(+), 36 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 9726d22..9d5f475 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -37,20 +37,25 @@ * */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + #include #include #include #include #include #include +#include #define _GNU_SOURCE #include #include #include -#include -#include +#include +#include #include "ibdiag_common.h" @@ -170,7 +175,7 @@ recv_mad: if (ibdebug > 1) xdump(stdout, "SA Response:\n", mad, len); - method = mad_get_field(mad, 0, IB_MAD_METHOD_F); + method = (uint8_t) mad_get_field(mad, 0, IB_MAD_METHOD_F); offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F); result.status = mad_get_field(mad, 0, IB_MAD_STATUS_F); result.p_result_madw = mad; @@ -189,12 +194,12 @@ recv_mad: static void *get_query_rec(void *mad, unsigned i) { int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F); - return mad + IB_SA_DATA_OFFS + i * (offset << 3); + return (char *) mad + IB_SA_DATA_OFFS + i * (offset << 3); } static unsigned valid_gid(ib_gid_t *gid) { - ib_gid_t zero_gid = { }; + ib_gid_t zero_gid = { 0 }; return memcmp(&zero_gid, gid, 
sizeof(*gid)); } @@ -442,7 +447,7 @@ static void dump_multicast_member_record(void *data) char gid_str2[INET6_ADDRSTRLEN]; ib_member_rec_t *p_mcmr = data; uint16_t mlid = cl_ntoh16(p_mcmr->mlid); - int i = 0; + unsigned i = 0; char *node_name = ""; /* go through the node records searching for a port guid which matches @@ -758,7 +763,7 @@ static void dump_one_mft_record(void *data) static void dump_results(struct query_res *r, void (*dump_func) (void *)) { - int i; + unsigned i; for (i = 0; i < r->result_cnt; i++) { void *data = get_query_rec(r->p_result_madw, i); dump_func(data); @@ -768,7 +773,7 @@ static void dump_results(struct query_res *r, void (*dump_func) (void *)) static void return_mad(void) { if (result.p_result_madw) { - free(result.p_result_madw - umad_size()); + free((char *) result.p_result_madw - umad_size()); result.p_result_madw = NULL; } } @@ -839,7 +844,8 @@ get_lid_from_name(bind_handle_t h, const char *name, uint16_t* lid) { ib_node_record_t *node_record = NULL; ib_node_info_t *p_ni = NULL; - int i = 0, ret; + unsigned i; + int ret; ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0); if (ret) @@ -869,7 +875,7 @@ static uint16_t get_lid(bind_handle_t h, const char *name) if (isalpha(name[0])) assert(get_lid_from_name(h, name, &rc_lid) == IB_SUCCESS); else - rc_lid = atoi(name); + rc_lid = (uint16_t) atoi(name); if (rc_lid == 0) fprintf(stderr, "Failed to find lid for \"%s\"\n", name); return rc_lid; @@ -917,8 +923,8 @@ static int parse_lid_and_ports(bind_handle_t h, #define cl_hton8(x) (x) #define CHECK_AND_SET_VAL(val, size, comp_with, target, name, mask) \ - if (val > comp_with) { \ - target = cl_hton##size(val); \ + if ((uint##size##_t) val > (uint##size##_t) comp_with) { \ + target = cl_hton##size((uint##size##_t) val); \ comp_mask |= IB_##name##_COMPMASK_##mask; \ } @@ -951,7 +957,8 @@ static int get_issm_records(bind_handle_t h, ib_net32_t capability_mask) static int print_node_records(bind_handle_t h) { - int i = 0, ret; + unsigned i; + 
int ret; ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0); if (ret) @@ -1027,7 +1034,7 @@ static int query_path_records(const struct query_cmd *q, bind_handle_t h, CHECK_AND_SET_VAL(p->dlid, 16, 0, pr.dlid, PR, DLID); CHECK_AND_SET_VAL(p->hop_limit, 32, -1, pr.hop_flow_raw, PR, HOPLIMIT); CHECK_AND_SET_VAL(p->flow_label, 8, 0, flow, PR, FLOWLABEL); - pr.hop_flow_raw |= cl_hton32(flow << 8); + pr.hop_flow_raw |= (uint8_t) cl_hton32(flow << 8); CHECK_AND_SET_VAL(p->tclass, 8, 0, pr.tclass, PR, TCLASS); CHECK_AND_SET_VAL(p->reversible, 8, -1, reversible, PR, REVERSIBLE); CHECK_AND_SET_VAL(p->numb_path, 8, -1, pr.num_path, PR, NUMBPATH); @@ -1089,7 +1096,7 @@ static int print_multicast_member_records(bind_handle_t h) return_mc: if (mc_group_result.p_result_madw) - free(mc_group_result.p_result_madw - umad_size()); + free((char *) mc_group_result.p_result_madw - umad_size()); return ret; } @@ -1267,7 +1274,7 @@ static int query_pkey_tbl_records(const struct query_cmd *q, memset(&pktr, 0, sizeof(pktr)); CHECK_AND_SET_VAL(lid, 16, 0, pktr.lid, PKEY, LID); CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT); - CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK); + CHECK_AND_SET_VAL(block, 16, -1, pktr.block_num, PKEY, BLOCK); return get_and_dump_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0, comp_mask, &pktr, smkey, @@ -1503,13 +1510,13 @@ static int process_opt(void *context, int ch, char *optarg) query_type = IB_SA_ATTR_LINKRECORD; break; case 5: - p->slid = strtoul(optarg, NULL, 0); + p->slid = (uint16_t) strtoul(optarg, NULL, 0); break; case 6: - p->dlid = strtoul(optarg, NULL, 0); + p->dlid = (uint16_t) strtoul(optarg, NULL, 0); break; case 7: - p->mlid = strtoul(optarg, NULL, 0); + p->mlid = (uint16_t) strtoul(optarg, NULL, 0); break; case 14: if (inet_pton(AF_INET6, optarg, &p->sgid) <= 0) @@ -1534,7 +1541,7 @@ static int process_opt(void *context, int ch, char *optarg) p->numb_path = strtoul(optarg, NULL, 0); break; case 18: - p->pkey = 
strtoul(optarg, NULL, 0); + p->pkey = (uint16_t) strtoul(optarg, NULL, 0); break; case 'Q': p->qos_class = strtoul(optarg, NULL, 0); @@ -1543,19 +1550,19 @@ static int process_opt(void *context, int ch, char *optarg) p->sl = strtoul(optarg, NULL, 0); break; case 'M': - p->mtu = strtoul(optarg, NULL, 0); + p->mtu = (uint8_t) strtoul(optarg, NULL, 0); break; case 'R': - p->rate = strtoul(optarg, NULL, 0); + p->rate = (uint8_t) strtoul(optarg, NULL, 0); break; case 20: - p->pkt_life = strtoul(optarg, NULL, 0); + p->pkt_life = (uint8_t) strtoul(optarg, NULL, 0); break; case 'q': p->qkey = strtoul(optarg, NULL, 0); break; case 'T': - p->tclass = strtoul(optarg, NULL, 0); + p->tclass = (uint8_t) strtoul(optarg, NULL, 0); break; case 'F': p->flow_label = strtoul(optarg, NULL, 0); @@ -1564,10 +1571,10 @@ static int process_opt(void *context, int ch, char *optarg) p->hop_limit = strtoul(optarg, NULL, 0); break; case 21: - p->scope = strtoul(optarg, NULL, 0); + p->scope = (uint8_t) strtoul(optarg, NULL, 0); break; case 'J': - p->join_state = strtoul(optarg, NULL, 0); + p->join_state = (uint8_t) strtoul(optarg, NULL, 0); break; case 'X': p->proxy_join = strtoul(optarg, NULL, 0); @@ -1582,14 +1589,7 @@ int main(int argc, char **argv) { char usage_args[1024]; bind_handle_t h; - struct query_params params = { - .hop_limit = -1, - .reversible = -1, - .numb_path = -1, - .qos_class = -1, - .sl = -1, - .proxy_join = -1, - }; + struct query_params params; const struct query_cmd *q; ib_api_status_t status; int n; @@ -1643,9 +1643,17 @@ int main(int argc, char **argv) { "scope", 21, 1, NULL, "Scope (MCMemberRecord)" }, { "join_state", 'J', 1, NULL, "Join state (MCMemberRecord)" }, { "proxy_join", 'X', 1, NULL, "Proxy join (MCMemberRecord)" }, - {} + { 0 } }; + memset(&params, 0, sizeof params); + params.hop_limit = -1; + params.reversible = -1; + params.numb_path = -1; + params.qos_class = -1; + params.sl = -1; + params.proxy_join = -1; + n = sprintf(usage_args, "[query-name] [ | | ]\n"
"\nSupported query names (and aliases):\n"); for (q = query_cmds; q->name; q++) { @@ -1680,7 +1688,7 @@ int main(int argc, char **argv) if (argc) { if (node_print_desc == NAME_OF_LID) { - requested_lid = strtoul(argv[0], NULL, 0); + requested_lid = (uint16_t) strtoul(argv[0], NULL, 0); requested_lid_flag++; } else if (node_print_desc == NAME_OF_GUID) { requested_guid = strtoul(argv[0], NULL, 0); From Jie.Cai at cs.anu.edu.au Mon Feb 23 20:34:04 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Tue, 24 Feb 2009 15:34:04 +1100 Subject: [ofa-general] Bandwidth of performance with multirail IB In-Reply-To: <20090223211155.730AFE28137@openfabrics.org> References: <20090223211155.730AFE28137@openfabrics.org> Message-ID: <49A378BC.5010806@cs.anu.edu.au> I have implemented a uDAPL program to measure the bandwidth on IB with multirail connections. The HCA used in the cluster is a Mellanox ConnectX HCA. Each HCA has two ports. The program utilizes the two ports on each node of the cluster to build multirail IB connections. The peak bandwidth I can get is ~1.3 GB/s (unidirectional), which is almost the same as with single-rail connections. Has anyone had a similar experience? From jackm at dev.mellanox.co.il Mon Feb 23 23:07:09 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 24 Feb 2009 09:07:09 +0200 Subject: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver In-Reply-To: References: <200902171742.38223.jackm@dev.mellanox.co.il> <200902231330.29669.jackm@dev.mellanox.co.il> Message-ID: <200902240907.09398.jackm@dev.mellanox.co.il> On Monday 23 February 2009 20:31, Roland Dreier wrote: > > I'm not sure that it does. This does not make sysfs access atomic wrt module unloading. > > I think an app can still lose its timeslice while inside the sysfs access, and module > > unload can still occur while the app is waiting for a new time slice (although the code pages > > will not be removed as yet -- see below).
> > Not sure I follow... the low-level driver must handle requests until > ib_unregister_device() returns, and with the change I proposed, > ib_unregister_device() will not return until all sysfs files are gone > (and no open file handles remain). > > > What about the patch I just submitted? > > I'd rather not add a superfluous mutex that adds complexity when a > simpler solution is available. You're right, your solution does work. I was just concerned that the unregister-sysfs calls would simply prevent new accessors from seeing the files, but would return before the file reference count reached zero (thus allowing low-level driver cleanup while current accessors were still in progress). I checked, and this does not happen. As you mention in your answer, the unregister-sysfs calls do not return while someone still has an open file handle on these files. - Jack From cap at nsc.liu.se Tue Feb 24 00:41:53 2009 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Tue, 24 Feb 2009 09:41:53 +0100 Subject: [ofa-general] Bandwidth of performance with multirail IB In-Reply-To: <49A378BC.5010806@cs.anu.edu.au> References: <20090223211155.730AFE28137@openfabrics.org> <49A378BC.5010806@cs.anu.edu.au> Message-ID: <200902240941.58634.cap@nsc.liu.se> On Tuesday 24 February 2009, Jie Cai wrote: > I have implemented a uDAPL program to measure the bandwidth on IB with > multirail connections. > > The HCA used in the cluster is Mellanox ConnectX HCA. Each HCA has two > ports. > > The program utilize the two port on each node of cluster to build > multirail IB connections. > > The peak bandwidth I can get is ~ 1.3 GB/s (not bi-directional), which > is almost the same as single rail connections. Assuming you have a 2.5 GT/s PCI Express x8 slot, that speed is the result of the bus not being able to keep up with the HCA. Since the bus is holding back even a single DDR IB port, you see no improvement with two ports. To fully drive a DDR IB port you need either x16 PCI Express at 2.5 GT/s or x8 at 5 GT/s.
For one QDR port or two DDR ports you'll need even more... /Peter > Does anyone have similar experience? From tziporet at dev.mellanox.co.il Tue Feb 24 01:59:41 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 24 Feb 2009 11:59:41 +0200 Subject: [ofa-general] el5.3 backport of 1.4(.0) In-Reply-To: <1235400004.4588.43.camel@spike.ugent.be> References: <1235400004.4588.43.camel@spike.ugent.be> Message-ID: <49A3C50D.4050609@mellanox.co.il> Stijn De Weirdt wrote: > hi all, > > i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones). > one thing we would also like to look at is switching from OFED 1.3.2 to > OFED 1.4. and one thing i noticed is that the necessary 5.3 backport > fixes only exist in the current 1.4.1 daily snapshots. > did anyone already try to backport the el5.3 backport fixes from 1.4.1 > to 1.4.0? > > many thanks, > > stijn > > It's the same tree, so the RHEL 5.3 backports from 1.4.1 should work on 1.4 too Tziporet From stijn.deweirdt at ugent.be Tue Feb 24 02:26:15 2009 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Tue, 24 Feb 2009 11:26:15 +0100 Subject: [ofa-general] el5.3 backport of 1.4(.0) In-Reply-To: <49A3C50D.4050609@mellanox.co.il> References: <1235400004.4588.43.camel@spike.ugent.be> <49A3C50D.4050609@mellanox.co.il> Message-ID: <1235471175.21577.15.camel@spike.ugent.be> > Stijn De Weirdt wrote: > > hi all, > > > > i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones). > > one thing we would also like to look at is switching from OFED 1.3.2 to > > OFED 1.4. and one thing i noticed is that the necessary 5.3 backport > > fixes only exist in the current 1.4.1 daily snapshots. > > did anyone already try to backport the el5.3 backport fixes from 1.4.1 > > to 1.4.0?
> > > > many thanks, > > > > stijn > > > > > Its the same tree so backports of RHEL 5.3 from 1.4.1 should work on 1.4 too > hi tziporet, i actually already tried that, moving the following files from a recent 1.4.1 daily to the 1.4.0 ofa_kernel src rpm ofed_scripts/get_backport_dir.sh kernel_addons/backport/2.6.18-EL5.3/ kernel_addons/backport/2.6.18-EL5.3/ but rebuilding this gave the following error: (i have to say that the kernel i used was 2.6.18-128.1.1 instead of the original el5.3 2.6.18-128) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch patching file drivers/net/mlx4/en_netdev.c Reversed (or previously applied) patch detected! Assume -R? [n] Apply anyway? [n] Skipping patch. 2 out of 2 hunks ignored -- saving rejects to file drivers/net/mlx4/en_netdev.c.rej patching file drivers/net/mlx4/en_tx.c Reversed (or previously applied) patch detected! Assume -R? [n] Apply anyway? [n] Skipping patch. 4 out of 4 hunks ignored -- saving rejects to file drivers/net/mlx4/en_tx.c.rej patching file drivers/net/mlx4/mlx4_en.h Reversed (or previously applied) patch detected! Assume -R? [n] Apply anyway? [n] Skipping patch. 1 out of 1 hunk ignored -- saving rejects to file drivers/net/mlx4/mlx4_en.h.rej Failed to apply patch: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch it is also a patch file that doesn't exist in the el5.2 backport, so i was thinking that this was a patch for 1.4.1, not 1.4.0, that's why i asked it here. anyway, many thanks for looking into this! stijn > Tziporet > -- The system will shutdown in 5 minutes.
From vlad at lists.openfabrics.org Tue Feb 24 03:19:02 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 24 Feb 2009 03:19:02 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090224-0200 daily build status Message-ID: <20090224111902.841FAE61203@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 
with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From vlad at dev.mellanox.co.il Tue Feb 24 03:26:43 2009 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 24 Feb 2009 13:26:43 +0200 Subject: [ofa-general] el5.3 backport of 1.4(.0) In-Reply-To: <1235471175.21577.15.camel@spike.ugent.be> References: <1235400004.4588.43.camel@spike.ugent.be> <49A3C50D.4050609@mellanox.co.il> <1235471175.21577.15.camel@spike.ugent.be> Message-ID: <49A3D973.9010601@dev.mellanox.co.il> Stijn De Weirdt wrote: >> Stijn De Weirdt wrote: >> >>> hi all, >>> >>> i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones). >>> one thing we would also like to look at is switching from OFED 1.3.2 to >>> OFED 1.4. and one thing i noticed is that the necessary 5.3 backport >>> fixes only exist in the current 1.4.1 daily snapshots. >>> did anyone already try to backport the el5.3 backport fixes from 1.4.1 >>> to 1.4.0? >>> >>> many thanks, >>> >>> stijn >>> >>> >>> >> Its the same tree so backports of RHEL 5.3 from 1.4.1 should work on 1.4 too >> >> > hi tziporet, > > i actually already tried that, moving the following files from a recent > 1.4.1 daily to the 1.4.0 ofa_kernel src rpm > ofed_scripts/get_backport_dir.sh > kernel_addons/backport/2.6.18-EL5.3/ > kernel_addons/backport/2.6.18-EL5.3/ > > but rebuilding this gave the following error: > (i have to say that the kernel i used was 2.6.18-128.1.1 instead the > original el5.3 2.6.18-128) > > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch > patching file drivers/net/mlx4/en_netdev.c > Reversed (or previously applied) patch detected! Assume -R? [n] > Apply anyway? [n] > Skipping patch. 
> 2 out of 2 hunks ignored -- saving rejects to file > drivers/net/mlx4/en_netdev.c.rej > patching file drivers/net/mlx4/en_tx.c > Reversed (or previously applied) patch detected! Assume -R? [n] > Apply anyway? [n] > Skipping patch. > 4 out of 4 hunks ignored -- saving rejects to file > drivers/net/mlx4/en_tx.c.rej > patching file drivers/net/mlx4/mlx4_en.h > Reversed (or previously applied) patch detected! Assume -R? [n] > Apply anyway? [n] > Skipping patch. > 1 out of 1 hunk ignored -- saving rejects to file > drivers/net/mlx4/mlx4_en.h.rej > Failed to apply > patch: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch > > it is also a patch file that doesn't exist in the el5.2 backport, so i > was thinking that this was a patch for 1.4.1, not 1.4.0, that's why i > asked it here. > > anyway, many thanks for looking into this! > > stijn > > Hi Stijn, You have, probably, copied RHEL 5.3 backports into ofa_kernel-1.4 directory where the patches (RHEL5.0) already were applied. In any case, it is better to take the latest ofa_kernel src rpm instead of updating source rpm coming from OFED-1.4. The difference is RHEL5.3 support and some bug fixes (see git log). Regards, Vladimir From stijn.deweirdt at ugent.be Tue Feb 24 04:36:04 2009 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Tue, 24 Feb 2009 13:36:04 +0100 Subject: [ofa-general] el5.3 backport of 1.4(.0) In-Reply-To: <49A3D973.9010601@dev.mellanox.co.il> References: <1235400004.4588.43.camel@spike.ugent.be> <49A3C50D.4050609@mellanox.co.il> <1235471175.21577.15.camel@spike.ugent.be> <49A3D973.9010601@dev.mellanox.co.il> Message-ID: <1235478964.21577.69.camel@spike.ugent.be> On Tue, 2009-02-24 at 13:26 +0200, Vladimir Sokolovsky wrote: > Stijn De Weirdt wrote: > >> Stijn De Weirdt wrote: > >> > >>> hi all, > >>> > >>> i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones). 
> >>> one thing we would also like to look at is switching from OFED 1.3.2 to > >>> OFED 1.4. and one thing i noticed is that the necessary 5.3 backport > >>> fixes only exist in the current 1.4.1 daily snapshots. > >>> did anyone already try to backport the el5.3 backport fixes from 1.4.1 > >>> to 1.4.0? > >>> > >>> many thanks, > >>> > >>> stijn > >>> > >>> > >>> > >> Its the same tree so backports of RHEL 5.3 from 1.4.1 should work on 1.4 too > >> > >> > > hi tziporet, > > > > i actually already tried that, moving the following files from a recent > > 1.4.1 daily to the 1.4.0 ofa_kernel src rpm > > ofed_scripts/get_backport_dir.sh > > kernel_addons/backport/2.6.18-EL5.3/ > > kernel_addons/backport/2.6.18-EL5.3/ > > > > but rebuilding this gave the following error: > > (i have to say that the kernel i used was 2.6.18-128.1.1 instead the > > original el5.3 2.6.18-128) > > > > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch > > patching file drivers/net/mlx4/en_netdev.c > > Reversed (or previously applied) patch detected! Assume -R? [n] > > Apply anyway? [n] > > Skipping patch. > > 2 out of 2 hunks ignored -- saving rejects to file > > drivers/net/mlx4/en_netdev.c.rej > > patching file drivers/net/mlx4/en_tx.c > > Reversed (or previously applied) patch detected! Assume -R? [n] > > Apply anyway? [n] > > Skipping patch. > > 4 out of 4 hunks ignored -- saving rejects to file > > drivers/net/mlx4/en_tx.c.rej > > patching file drivers/net/mlx4/mlx4_en.h > > Reversed (or previously applied) patch detected! Assume -R? [n] > > Apply anyway? [n] > > Skipping patch. 
> > 1 out of 1 hunk ignored -- saving rejects to file > > drivers/net/mlx4/mlx4_en.h.rej > > Failed to apply > > patch: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_patches/backport/2.6.18-EL5.3/mlx4_en_0099_no_multiqueue.patch > > > > it is also a patch file that doesn't exist in the el5.2 backport, so i > > was thinking that this was a patch for 1.4.1, not 1.4.0, that's why i > > asked it here. > > > > anyway, many thanks for looking into this! > > > > stijn > > > > > Hi Stijn, > hi vladimir, > You have, probably, copied RHEL 5.3 backports into ofa_kernel-1.4 > directory where the patches (RHEL5.0) already were applied. > i did what the ofed_patch.sh script does to make a new src.rpm, but instead of patching i copied said file and directories. > In any case, it is better to take the latest ofa_kernel src rpm instead > of updating source rpm coming from OFED-1.4. > The difference is RHEL5.3 support and some bug fixes (see git log). thanks (and ofa_kernel builds ok) stijn > Regards, > Vladimir -- The system will shutdown in 5 minutes. From sashak at voltaire.com Tue Feb 24 06:37:06 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 24 Feb 2009 16:37:06 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node In-Reply-To: <499C7E2D.8050301@dev.mellanox.co.il> References: <499AB068.2020205@dev.mellanox.co.il> <20090218181955.GX5910@sashak.voltaire.com> <499C7E2D.8050301@dev.mellanox.co.il> Message-ID: <20090224143706.GO7641@sashak.voltaire.com> Hi Yevgeny, On 23:31 Wed 18 Feb , Yevgeny Kliteynik wrote: [snip...] > > Good point. > I'll repost the patch when we finish discussing it. Let's go this way now. Please resend the patch. After looking closer into the scenario with the SwitchInfo/PortInfo race I'm thinking about two optimizations there: 1.
Initialize all switch ports (and not only local and port 0) right on first NodeInfo receiving (via osm_node_new()) - this makes your patch unnecessary, but it is a bigger change which will definitely require some heavy testing, so it is fine IMO to do it subsequently. 2. Request PortInfo for all switch ports right on first NodeInfo receiving (not wait for SwitchInfo), just in parallel with SwitchInfo request. This should simplify subnet discovery flow and speed it up. And also this will require some heavy testing... What do you think about (1) and (2)? Could you see any disadvantages? Sasha From cameron at harr.org Tue Feb 24 09:18:33 2009 From: cameron at harr.org (Cameron Harr) Date: Tue, 24 Feb 2009 10:18:33 -0700 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <4995D1EE.4000807@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net> Message-ID: <49A42BE9.4030603@harr.org> Vladislav Bolkhovitin wrote: >> >> Vladislav Bolkhovitin wrote: >>> Try the following variants: >>> >>> 1.
Affine IRQ 82, scsi_tgt0 to CPU0, fct0-worker to CPU2, IRQs 169 >>> and 177 to CPU4, scsi_tgt1 to CPU1, fct1-worker to CPU3, scsi_tgt2 >>> to CPU5, fct2-worker to CPU7 >>> >>> 2. Affine IRQ 82 to CPU0, fct0-worker to CPU2, IRQs 169 and 177 to >>> CPU4, fct1-worker to CPU3, fct2-worker to CPU7, no affinity for >>> other processes. >>> >>> 3. Affine IRQ 82 to CPU0, IRQs 169 and 177 to CPU4, fct1-worker's to >>> all CPUs, except CPU0 and CPU4, no affinity for other processes. >> These are tests 1, 2 and 3, respectively >>> Or other similar variants you'd like (even CPUs relate to physical >>> CPU0, odd CPUs relate to physical CPU1). For instance, you can try >>> to affine IRQs 169 and 177 to CPU1. >> I did two other tests (Tests 4,5), that has the mlx4_core (comp) IRQ >> (formerly known as IRQ 82) pinned to CPU0, the two ioDrive IRQs (169, >> 177) pinned to CPU 4, fct0 and scsi_tgt0 on CPUs 2&3, fct1 and >> scsi_tgt1 on CPUs 4&6 (test 4) OR fct1 and scsi_tgt1 on CPUs 5&6. >>> No points to run for srptthread=1, for it just produce a baseline >>> with no affinity at all. >> I ran with these anyway to look at differences among the tests. >> Having this thread enabled always results in better performance. >>> Please do each run several times and write down an average result >>> between runs and approximate variation between them in %%. Otherwise >>> we can't make any reliable conclusions. >> I ran each test 3 times and took the averages. In order to get a >> quick look at performance per run, I added a column in the summary >> that sums the IOPs for each test with SRPT thread enabled and then >> not enabled. Test 4 seems to give the best results. Here's a brief >> summary of that summary with just SRPT thread=0: >> >> Baseline: 356226.39 >> Test 1: 371217.6533 >> Test 2: 370553.78 >> Test 3: 373295.2033 >> Test 4: 399385.2233 >> Test 5: 393204.5833 > > Linux CPU scheduler does really impressive job! > > Interesting, will something change with: > > 1. The latest SVN. 
It has some changes, which might make a difference. Sorry for the delay. This is with SVN rev 673. I don't hit the high I hit before, but at a 1.8% difference (with test 4), it's statistically noise. Test 1: 390631.5133 Test 2: 386125.4133 Test 3: 356268.0267 Test 4: 392237.7867 Test 5: 390012.1467 > > 2. Pass-through dev handler instead of BLOCKIO, which you are using. > The ioDrive driver doesn't provide a full SCSI emulation layer and shows up as /dev/fio[abc...]. From my understanding of the pass-through handler, I need to have the SCSI Host:Channel:ID:LUN and those aren't available to me. Cameron From daniel.miles at rnanetworks.com Tue Feb 24 09:48:03 2009 From: daniel.miles at rnanetworks.com (Daniel Miles) Date: Tue, 24 Feb 2009 09:48:03 -0800 Subject: [ofa-general] how do I take IB interfaces offline? Message-ID: Hello, everyone. I wonder if anyone can tell me how to take an IB interface offline on a running Linux (CENTOS 5 with OFED 1.3.1) system? I can cause it to lose its IP address with ifdown but it seems that the IP address is only involved in establishing new connections and removing it doesn't prevent the device from fielding traffic on established connections. Does anybody know how this is done? -------------- next part -------------- An HTML attachment was scrubbed... URL: From cameron at harr.org Tue Feb 24 09:50:55 2009 From: cameron at harr.org (Cameron Harr) Date: Tue, 24 Feb 2009 10:50:55 -0700 Subject: [ofa-general] how do I take IB interfaces offline? In-Reply-To: References: Message-ID: <49A4337F.8040304@harr.org> An HTML attachment was scrubbed...
URL: From vst at vlnb.net Tue Feb 24 09:54:01 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Tue, 24 Feb 2009 20:54:01 +0300 Subject: Re: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49A42BE9.4030603@harr.org> References: <48E386F6.5040502@fusionio.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net> <49A42BE9.4030603@harr.org> Message-ID: <49A43439.7080405@vlnb.net> Cameron Harr, on 02/24/2009 08:18 PM wrote: > Vladislav Bolkhovitin wrote: >>> Vladislav Bolkhovitin wrote: >>>> Try the following variants: >>>> >>>> 1. Affine IRQ 82, scsi_tgt0 to CPU0, fct0-worker to CPU2, IRQs 169 >>>> and 177 to CPU4, scsi_tgt1 to CPU1, fct1-worker to CPU3, scsi_tgt2 >>>> to CPU5, fct2-worker to CPU7 >>>> >>>> 2. Affine IRQ 82 to CPU0, fct0-worker to CPU2, IRQs 169 and 177 to >>>> CPU4, fct1-worker to CPU3, fct2-worker to CPU7, no affinity for >>>> other processes. >>>> >>>> 3. Affine IRQ 82 to CPU0, IRQs 169 and 177 to CPU4, fct1-worker's to >>>> all CPUs, except CPU0 and CPU4, no affinity for other processes.
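The affinity variants above are written as CPU sets. As an editorial illustration that is not part of the thread: on Linux, an IRQ's affinity is set by writing a hexadecimal CPU bitmask (bit i selects CPU i) to /proc/irq/<N>/smp_affinity, while a thread such as fct0-worker is pinned with sched_setaffinity(2) or taskset(1). The helper below is hypothetical and only builds the mask string. It assumes at most 32 CPUs, so that the mask fits in a single 32-bit group of the kernel's comma-separated format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: convert a list of CPU numbers into the hex bitmask
 * text accepted by /proc/irq/<N>/smp_affinity (bit i = CPU i).
 * Assumes at most 32 CPUs, so the result is a single mask group.
 * Returns 0 on success, -1 on invalid input or a too-small buffer.
 */
int cpus_to_affinity_mask(const int *cpus, size_t ncpus,
                          char *buf, size_t buflen)
{
    unsigned long mask = 0;
    size_t i;
    int n;

    assert(cpus != NULL || ncpus == 0);

    for (i = 0; i < ncpus; i++) {
        if (cpus[i] < 0 || cpus[i] > 31)
            return -1;            /* outside the single 32-bit group */
        mask |= 1UL << cpus[i];
    }
    if (mask == 0)
        return -1;                /* an empty CPU set is not a usable affinity */

    n = snprintf(buf, buflen, "%lx", mask);
    if (n < 0 || (size_t)n >= buflen)
        return -1;                /* output would have been truncated */
    return 0;
}
```

Under these assumptions, on the 8-CPU target (CPU0 to CPU7), variant 3 maps to "1" for IRQ 82 on CPU0 and to "10" for IRQs 169 and 177 on CPU4. A worker allowed on every CPU except CPU0 and CPU4 corresponds to the mask "ee", although for a thread that mask would be passed to taskset or sched_setaffinity rather than to smp_affinity. The thread does not say how the pinning was actually performed.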
>>> These are tests 1, 2 and 3, respectively >>>> Or other similar variants you'd like (even CPUs relate to physical >>>> CPU0, odd CPUs relate to physical CPU1). For instance, you can try >>>> to affine IRQs 169 and 177 to CPU1. >>> I did two other tests (Tests 4,5), that has the mlx4_core (comp) IRQ >>> (formerly known as IRQ 82) pinned to CPU0, the two ioDrive IRQs (169, >>> 177) pinned to CPU 4, fct0 and scsi_tgt0 on CPUs 2&3, fct1 and >>> scsi_tgt1 on CPUs 4&6 (test 4) OR fct1 and scsi_tgt1 on CPUs 5&6. >>>> No points to run for srptthread=1, for it just produce a baseline >>>> with no affinity at all. >>> I ran with these anyway to look at differences among the tests. >>> Having this thread enabled always results in better performance. >>>> Please do each run several times and write down an average result >>>> between runs and approximate variation between them in %%. Otherwise >>>> we can't make any reliable conclusions. >>> I ran each test 3 times and took the averages. In order to get a >>> quick look at performance per run, I added a column in the summary >>> that sums the IOPs for each test with SRPT thread enabled and then >>> not enabled. Test 4 seems to give the best results. Here's a brief >>> summary of that summary with just SRPT thread=0: >>> >>> Baseline: 356226.39 >>> Test 1: 371217.6533 >>> Test 2: 370553.78 >>> Test 3: 373295.2033 >>> Test 4: 399385.2233 >>> Test 5: 393204.5833 >> Linux CPU scheduler does really impressive job! >> >> Interesting, will something change with: >> >> 1. The latest SVN. It has some changes, which might make a difference. > Sorry for the delay. > This is with SVN rev 673. I don't hit the high I hit before, but at a > 1.8% difference (with test 4), it's statistically noise. > > Test 1: 390631.5133 > Test 2: 386125.4133 > Test 3: 356268.0267 > Test 4: 392237.7867 > Test 5: 390012.1467 >> 2. Pass-through dev handler instead of BLOCKIO, which you are using. 
>> > The ioDrive driver doesn't provide a full SCSI emulation layer and shows > up as /dev/fio[abc...]. From my understanding of the pass-through > handler, I need to have the SCSI Host:Channel:ID:LUN and those aren't > available to me. Yes. Although this is strange, because you use sdX devices, hence they should have full SCSI emulation and lsscsi should show the Host:Channel:ID:LUN numbers. Thanks, Vlad From cameron at harr.org Tue Feb 24 09:55:25 2009 From: cameron at harr.org (Cameron Harr) Date: Tue, 24 Feb 2009 10:55:25 -0700 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49A43439.7080405@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net> <49A42BE9.4030603@harr.org> <49A43439.7080405@vlnb.net> Message-ID: <49A4348D.6020303@harr.org> Vladislav Bolkhovitin wrote: >> 2. Pass-through dev handler instead of BLOCKIO, which you are using. >>> >> The ioDrive driver doesn't provide a full SCSI emulation layer and >> shows up as /dev/fio[abc...].
From my understanding of the >> pass-through handler, I need to have the SCSI Host:Channel:ID:LUN and >> those aren't available to me. > > Yes. Although this is strange, because you use sdX devices, hence they > should have full SCSI emulation and lsscsi should show the > Host:Channel:ID:LUN numbers. I actually don't have sdX devices unless they are SRP targets on an initiator. On the target server, the native drive is /dev/fioX. Cameron From daniel.miles at rnanetworks.com Tue Feb 24 09:56:13 2009 From: daniel.miles at rnanetworks.com (Daniel Miles) Date: Tue, 24 Feb 2009 09:56:13 -0800 Subject: [ofa-general] how do I take IB interfaces offline? In-Reply-To: <49A4337F.8040304@harr.org> Message-ID: Well, that would work, but it fails, telling me the mlx4_ib module is in use. I suspect this is because there are active RDMA connections on it, which is the reason I want to bring it down (I'm doing QA, I need to know what happens if the card goes offline). On 2/24/09 9:50 AM, "Cameron Harr" wrote: > Have you tried /etc/init.d/openibd stop, or are you wanting something that > doesn't shut down the whole IB system? > > Daniel Miles wrote: >> how do I take IB interfaces offline? Hello, everyone. I wonder if anyone can >> tell me how to take an IB interface offline on a running Linux (CENTOS 5 with >> OFED 1.3.1) system? I can cause it to loose its IP address with ifdown but it >> seems that the IP address is only involved in establishing new connections >> and removing it doesn't prevent the device from fielding traffic on >> established connections. >> >> Does anybody know how this is done? >> >> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From cameron at harr.org Tue Feb 24 09:58:36 2009 From: cameron at harr.org (Cameron Harr) Date: Tue, 24 Feb 2009 10:58:36 -0700 Subject: [ofa-general] how do I take IB interfaces offline? In-Reply-To: References: Message-ID: <49A4354C.4050904@harr.org> An HTML attachment was scrubbed... URL: From weiny2 at llnl.gov Tue Feb 24 11:14:37 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 24 Feb 2009 11:14:37 -0800 Subject: [ofa-general] how do I take IB interfaces offline? In-Reply-To: <49A4354C.4050904@harr.org> References: <49A4354C.4050904@harr.org> Message-ID: <20090224111437.1f10eaa6.weiny2@llnl.gov> On Tue, 24 Feb 2009 10:58:36 -0700 Cameron Harr wrote: > This may be because you have the SM running. Try /etc/init.d/opensmd stop. If that doesn't work you'll want to find out what is actually using it. When you say RDMA, are you doing iSER or SRP? If that's the case, you'll need to free it up by removing it as a target or just unloading the modules. Stopping the SM will not stop the traffic. Try "ibportstate disable" on the switch/port the HCA is plugged into. This will simulate the port going down. You can then use "enable" to re-enable it. Ira > Cameron > > Daniel Miles wrote: Re: [ofa-general] how do I take IB interfaces offline? Well, that would work, but it fails, telling me the mlx4_ib module is in use. I suspect this is because there are active RDMA connections on it, which is the reason I want to bring it down (I’m doing QA, I need to know what happens if the card goes offline).
> -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From cameron at harr.org Tue Feb 24 15:22:18 2009 From: cameron at harr.org (Cameron Harr) Date: Tue, 24 Feb 2009 16:22:18 -0700 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49A43439.7080405@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net> <49A42BE9.4030603@harr.org> <49A43439.7080405@vlnb.net> Message-ID: <49A4812A.8050202@harr.org> Vladislav Bolkhovitin wrote: >>>> I ran each test 3 times and took the averages. In order to get a >>>> quick look at performance per run, I added a column in the summary >>>> that sums the IOPs for each test with SRPT thread enabled and then >>>> not enabled. Test 4 seems to give the best results. Here's a brief >>>> summary of that summary with just SRPT thread=0: >>>> >>>> Baseline: 356226.39 >>>> Test 1: 371217.6533 >>>> Test 2: 370553.78 >>>> Test 3: 373295.2033 >>>> Test 4: 399385.2233 >>>> Test 5: 393204.5833 >>> Linux CPU scheduler does really impressive job!
>>> >>> Interesting, will something change with: >>> >>> 1. The latest SVN. It has some changes, which might make a difference. >> Sorry for the delay. >> This is with SVN rev 673. I don't hit the high I hit before, but at a >> 1.8% difference (with test 4), it's statistically noise. >> >> Test 1: 390631.5133 >> Test 2: 386125.4133 >> Test 3: 356268.0267 >> Test 4: 392237.7867 >> Test 5: 390012.1467 I just ran again, this time with rev 680 and am a little concerned to see the drop in performance. I verified that debug is not on. I'll try to start another run on 680 to see if I get similar results. Test 1: 368342.41 Test 2: 366787.2067 Test 3: 345334.68 Test 4: 372684.58 Test 5: 372184.8333 From Jie.Cai at cs.anu.edu.au Tue Feb 24 16:44:08 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Wed, 25 Feb 2009 11:44:08 +1100 Subject: [ofa-general] Bandwidth of performance with multirail IB In-Reply-To: <200902240941.58634.cap@nsc.liu.se> References: <20090223211155.730AFE28137@openfabrics.org> <49A378BC.5010806@cs.anu.edu.au> <200902240941.58634.cap@nsc.liu.se> Message-ID: <49A49458.9070003@cs.anu.edu.au> Peter Kjellstrom wrote: > On Tuesday 24 February 2009, Jie Cai wrote: > >> I have implemented a uDAPL program to measure the bandwidth on IB with >> multirail connections. >> >> The HCA used in the cluster is Mellanox ConnectX HCA. Each HCA has two >> ports. >> >> The program utilize the two port on each node of cluster to build >> multirail IB connections. >> >> The peak bandwidth I can get is ~ 1.3 GB/s (not bi-directional), which >> is almost the same as single rail connections. >> > > Assuming you have a 2.5 GT/s pci-express x8 that speed is a result of the bus > not being able to keep up with the HCA. Since the bus is holding even a > single DDR IB port back you see no improvement with two ports. > > I do connect the HCA to a 16x pci-e slot on each node. However, I am trying to drive 2 ports simultaneously.
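Peter's explanation about the bus can be checked with rough arithmetic. The sketch below is an editorial illustration, not something from the thread. It assumes that the HCA links at x8, as Peter assumes, since the card's lane width is not stated here. It also assumes that PCIe 1.x/2.0 and InfiniBand SDR/DDR all use 8b/10b line coding, so that 8 of every 10 transmitted bits carry data. PCIe packet and flow-control overhead is ignored, and that overhead lowers real throughput further. The function names are hypothetical.

```c
#include <assert.h>

/*
 * Usable data rate, in Mb/s, of a link that uses 8b/10b line coding
 * (PCIe 1.x/2.0, InfiniBand SDR/DDR). mt_per_lane is the signalling rate
 * per lane in megatransfers per second (e.g. 2500 for 2.5 GT/s).
 * Packet/protocol overhead is deliberately ignored.
 */
unsigned long coded_data_mbps(unsigned long mt_per_lane, unsigned int lanes)
{
    assert(lanes > 0);
    return mt_per_lane * lanes * 8 / 10;
}

/* Convert megabits per second to (decimal) megabytes per second. */
unsigned long mbps_to_mbytes(unsigned long mbps)
{
    return mbps / 8;
}

/* Data rate of one direction of a PCIe link, in MB/s. */
unsigned long pcie_mbytes(unsigned long mt_per_lane, unsigned int lanes)
{
    return mbps_to_mbytes(coded_data_mbps(mt_per_lane, lanes));
}

/*
 * Data rate needed to saturate `ports` InfiniBand 4X ports, in MB/s.
 * A 4X DDR port signals at 5 Gb/s (5000 MT/s) on each of its 4 lanes.
 */
unsigned long ib_4x_mbytes(unsigned long mt_per_lane, unsigned int ports)
{
    return ports * mbps_to_mbytes(coded_data_mbps(mt_per_lane, 4));
}
```

With these assumptions, pcie_mbytes(2500, 8) gives about 2000 MB/s per direction before protocol overhead. That equals the data rate of a single 4X DDR port, ib_4x_mbytes(5000, 1), and is half of the 4000 MB/s that two ports need. Because the HCA's host bus is listed as PCIe 2.0 at 2.5GT/s, placing the card in a Gen2 x16 slot would not by itself raise the per-lane rate. If the card is x8, it would not widen the link either.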
The workstation I am using is a Sun Ultra 24, and the HCA is a Mellanox ConnectX MHGH28-XTC. The specifications for the HCA and the Ultra 24 are: MHGH28-XTC IB ports: Dual Copper 4X 20Gb/s Host Bus: PCIe 2.0 2.5GT/s Ultra 24 workstation: 1333 MHz frontside bus with DDR2 memory support (up to 10.67 GB per second bandwidth) PCI Express Slots * Two full-length x16 Gen-2 slots (where the HCA is connected) * One full-length x8 slot * One full-length x1 slot So the bus may not be the bottleneck. > To fully drive a DDR IB port you need either 16x pci-express 2.5 GT/s or a 8x > 5 GT/s. For one QDR or two DDR you'll need even more... > > The pci-e slot in the Ultra 24 is PCI Express Gen2 x16. The data transfer rate is 5 Gbps. Will this be sufficient to drive the 2 DDR ports on the MHGH28-XTC? Or are there other possible reasons? > /Peter > > >> Does anyone have similar experience? >> From andy.grover at oracle.com Tue Feb 24 17:30:17 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:17 -0800 Subject: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 Message-ID: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Hi, This patchset against net-next adds support for RDS sockets. RDS is an Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably, and is used currently in Oracle RAC and Exadata products. I've addressed all the issues from comments on take 1. (thanks!) This patchset squashes the changes into the original changeset, but I've also included a tree where the un-squashed changes since last time may be reviewed: git://git.openfabrics.org/~agrover/ofed_1_4/linux-2.6.git rds-broken-out-fixes Major changes since last time include moving to net/rds, and the additional inclusion of iwarp transport support. shortlog for patchseries follows.
Thanks -- Regards -- Andy Andy Grover (26): RDS: Socket interface RDS: Main header file RDS: Congestion-handling code RDS: Transport code RDS: Info and stats RDS: Connection handling RDS: loopback RDS: sysctls RDS: Message parsing RDS: send.c RDS: recv.c RDS: RDMA support RDS/IB: Infiniband transport RDS/IB: Ring-handling code. RDS/IB: Implement RDMA ops using FMRs RDS/IB: Implement IB-specific datagram send. RDS/IB: Receive datagrams via IB RDS/IB: Stats and sysctls RDS: Add iWARP support RDS: Common RDMA transport code RDS: Documentation RDS: Kconfig and Makefile RDS: Add AF and PF #defines for RDS sockets RDS: Add MAINTAINERS entry RDS: Add userspace header RDS: Add RDS to AF key strings Documentation/networking/rds.txt | 356 ++++++++++++++ MAINTAINERS | 6 + include/linux/rds.h | 250 ++++++++++ include/linux/socket.h | 5 +- net/Kconfig | 1 + net/Makefile | 1 + net/core/sock.c | 6 +- net/rds/Kconfig | 13 + net/rds/Makefile | 14 + net/rds/af_rds.c | 586 ++++++++++++++++++++++ net/rds/bind.c | 199 ++++++++ net/rds/cong.c | 402 +++++++++++++++ net/rds/connection.c | 487 ++++++++++++++++++ net/rds/ib.c | 323 ++++++++++++ net/rds/ib.h | 367 ++++++++++++++ net/rds/ib_cm.c | 726 +++++++++++++++++++++++++++ net/rds/ib_rdma.c | 641 ++++++++++++++++++++++++ net/rds/ib_recv.c | 869 +++++++++++++++++++++++++++++++++ net/rds/ib_ring.c | 168 +++++++ net/rds/ib_send.c | 874 +++++++++++++++++++++++++++++++++ net/rds/ib_stats.c | 95 ++++ net/rds/ib_sysctl.c | 137 ++++++ net/rds/info.c | 241 +++++++++ net/rds/info.h | 30 ++ net/rds/iw.c | 333 +++++++++++++ net/rds/iw.h | 395 +++++++++++++++ net/rds/iw_cm.c | 750 ++++++++++++++++++++++++++++ net/rds/iw_rdma.c | 888 +++++++++++++++++++++++++++++++++ net/rds/iw_recv.c | 869 +++++++++++++++++++++++++++++++++ net/rds/iw_ring.c | 169 +++++++ net/rds/iw_send.c | 975 ++++++++++++++++++++++++++++++++++++ net/rds/iw_stats.c | 95 ++++ net/rds/iw_sysctl.c | 137 ++++++ net/rds/loop.c | 188 +++++++ net/rds/loop.h | 9 + net/rds/message.c | 402 
+++++++++++++++ net/rds/page.c | 221 +++++++++ net/rds/rdma.c | 679 ++++++++++++++++++++++++++ net/rds/rdma.h | 84 ++++ net/rds/rdma_transport.c | 214 ++++++++ net/rds/rdma_transport.h | 28 + net/rds/rds.h | 686 ++++++++++++++++++++++++++ net/rds/recv.c | 542 ++++++++++++++++++++ net/rds/send.c | 1003 ++++++++++++++++++++++++++++++++++++++ net/rds/stats.c | 148 ++++++ net/rds/sysctl.c | 122 +++++ net/rds/threads.c | 265 ++++++++++ net/rds/transport.c | 117 +++++ 48 files changed, 16112 insertions(+), 4 deletions(-) create mode 100644 Documentation/networking/rds.txt create mode 100644 include/linux/rds.h create mode 100644 net/rds/Kconfig create mode 100644 net/rds/Makefile create mode 100644 net/rds/af_rds.c create mode 100644 net/rds/bind.c create mode 100644 net/rds/cong.c create mode 100644 net/rds/connection.c create mode 100644 net/rds/ib.c create mode 100644 net/rds/ib.h create mode 100644 net/rds/ib_cm.c create mode 100644 net/rds/ib_rdma.c create mode 100644 net/rds/ib_recv.c create mode 100644 net/rds/ib_ring.c create mode 100644 net/rds/ib_send.c create mode 100644 net/rds/ib_stats.c create mode 100644 net/rds/ib_sysctl.c create mode 100644 net/rds/info.c create mode 100644 net/rds/info.h create mode 100644 net/rds/iw.c create mode 100644 net/rds/iw.h create mode 100644 net/rds/iw_cm.c create mode 100644 net/rds/iw_rdma.c create mode 100644 net/rds/iw_recv.c create mode 100644 net/rds/iw_ring.c create mode 100644 net/rds/iw_send.c create mode 100644 net/rds/iw_stats.c create mode 100644 net/rds/iw_sysctl.c create mode 100644 net/rds/loop.c create mode 100644 net/rds/loop.h create mode 100644 net/rds/message.c create mode 100644 net/rds/page.c create mode 100644 net/rds/rdma.c create mode 100644 net/rds/rdma.h create mode 100644 net/rds/rdma_transport.c create mode 100644 net/rds/rdma_transport.h create mode 100644 net/rds/rds.h create mode 100644 net/rds/recv.c create mode 100644 net/rds/send.c create mode 100644 net/rds/stats.c create mode 100644 
net/rds/sysctl.c create mode 100644 net/rds/threads.c create mode 100644 net/rds/transport.c end From andy.grover at oracle.com Tue Feb 24 17:30:18 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:18 -0800 Subject: [ofa-general] [PATCH 01/26] RDS: Socket interface In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-2-git-send-email-andy.grover@oracle.com> Implement the RDS (Reliable Datagram Sockets) interface. Signed-off-by: Andy Grover --- net/rds/af_rds.c | 586 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ net/rds/bind.c | 199 ++++++++++++++++++ 2 files changed, 785 insertions(+), 0 deletions(-) create mode 100644 net/rds/af_rds.c create mode 100644 net/rds/bind.c diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c new file mode 100644 index 0000000..20cf16f --- /dev/null +++ b/net/rds/af_rds.c @@ -0,0 +1,586 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include + +#include "rds.h" +#include "rdma.h" +#include "rdma_transport.h" + +/* this is just used for stats gathering :/ */ +static DEFINE_SPINLOCK(rds_sock_lock); +static unsigned long rds_sock_count; +static LIST_HEAD(rds_sock_list); +DECLARE_WAIT_QUEUE_HEAD(rds_poll_waitq); + +/* + * This is called as the final descriptor referencing this socket is closed. + * We have to unbind the socket so that another socket can be bound to the + * address it was using. + * + * We have to be careful about racing with the incoming path. sock_orphan() + * sets SOCK_DEAD and we use that as an indicator to the rx path that new + * messages shouldn't be queued. + */ +static int rds_release(struct socket *sock) +{ + struct sock *sk = sock->sk; + struct rds_sock *rs; + unsigned long flags; + + if (sk == NULL) + goto out; + + rs = rds_sk_to_rs(sk); + + sock_orphan(sk); + /* Note - rds_clear_recv_queue grabs rs_recv_lock, so + * that ensures the recv path has completed messing + * with the socket. 
*/ + rds_clear_recv_queue(rs); + rds_cong_remove_socket(rs); + rds_remove_bound(rs); + rds_send_drop_to(rs, NULL); + rds_rdma_drop_keys(rs); + rds_notify_queue_get(rs, NULL); + + spin_lock_irqsave(&rds_sock_lock, flags); + list_del_init(&rs->rs_item); + rds_sock_count--; + spin_unlock_irqrestore(&rds_sock_lock, flags); + + sock->sk = NULL; + sock_put(sk); +out: + return 0; +} + +/* + * Careful not to race with rds_release -> sock_orphan which clears sk_sleep. + * _bh() isn't OK here, we're called from interrupt handlers. It's probably OK + * to wake the waitqueue after sk_sleep is clear as we hold a sock ref, but + * this seems more conservative. + * NB - normally, one would use sk_callback_lock for this, but we can + * get here from interrupts, whereas the network code grabs sk_callback_lock + * with _lock_bh only - so relying on sk_callback_lock introduces livelocks. + */ +void rds_wake_sk_sleep(struct rds_sock *rs) +{ + unsigned long flags; + + read_lock_irqsave(&rs->rs_recv_lock, flags); + __rds_wake_sk_sleep(rds_rs_to_sk(rs)); + read_unlock_irqrestore(&rs->rs_recv_lock, flags); +} + +static int rds_getname(struct socket *sock, struct sockaddr *uaddr, + int *uaddr_len, int peer) +{ + struct sockaddr_in *sin = (struct sockaddr_in *)uaddr; + struct rds_sock *rs = rds_sk_to_rs(sock->sk); + + memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); + + /* racey, don't care */ + if (peer) { + if (!rs->rs_conn_addr) + return -ENOTCONN; + + sin->sin_port = rs->rs_conn_port; + sin->sin_addr.s_addr = rs->rs_conn_addr; + } else { + sin->sin_port = rs->rs_bound_port; + sin->sin_addr.s_addr = rs->rs_bound_addr; + } + + sin->sin_family = AF_INET; + + *uaddr_len = sizeof(*sin); + return 0; +} + +/* + * RDS' poll is without a doubt the least intuitive part of the interface, + * as POLLIN and POLLOUT do not behave entirely as you would expect from + * a network protocol. + * + * POLLIN is asserted if + * - there is data on the receive queue. 
+ * - to signal that a previously congested destination may have become + * uncongested + * - A notification has been queued to the socket (this can be a congestion + * update, or a RDMA completion). + * + * POLLOUT is asserted if there is room on the send queue. This does not mean + * however, that the next sendmsg() call will succeed. If the application tries + * to send to a congested destination, the system call may still fail (and + * return ENOBUFS). + */ +static unsigned int rds_poll(struct file *file, struct socket *sock, + poll_table *wait) +{ + struct sock *sk = sock->sk; + struct rds_sock *rs = rds_sk_to_rs(sk); + unsigned int mask = 0; + unsigned long flags; + + poll_wait(file, sk->sk_sleep, wait); + + poll_wait(file, &rds_poll_waitq, wait); + + read_lock_irqsave(&rs->rs_recv_lock, flags); + if (!rs->rs_cong_monitor) { + /* When a congestion map was updated, we signal POLLIN for + * "historical" reasons. Applications can also poll for + * WRBAND instead. */ + if (rds_cong_updated_since(&rs->rs_cong_track)) + mask |= (POLLIN | POLLRDNORM | POLLWRBAND); + } else { + spin_lock(&rs->rs_lock); + if (rs->rs_cong_notify) + mask |= (POLLIN | POLLRDNORM); + spin_unlock(&rs->rs_lock); + } + if (!list_empty(&rs->rs_recv_queue) + || !list_empty(&rs->rs_notify_queue)) + mask |= (POLLIN | POLLRDNORM); + if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) + mask |= (POLLOUT | POLLWRNORM); + read_unlock_irqrestore(&rs->rs_recv_lock, flags); + + return mask; +} + +static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) +{ + return -ENOIOCTLCMD; +} + +static int rds_cancel_sent_to(struct rds_sock *rs, char __user *optval, + int len) +{ + struct sockaddr_in sin; + int ret = 0; + + /* racing with another thread binding seems ok here */ + if (rs->rs_bound_addr == 0) { + ret = -ENOTCONN; /* XXX not a great errno */ + goto out; + } + + if (len < sizeof(struct sockaddr_in)) { + ret = -EINVAL; + goto out; + } + + if (copy_from_user(&sin, optval, sizeof(sin))) { + 
ret = -EFAULT; + goto out; + } + + rds_send_drop_to(rs, &sin); +out: + return ret; +} + +static int rds_set_bool_option(unsigned char *optvar, char __user *optval, + int optlen) +{ + int value; + + if (optlen < sizeof(int)) + return -EINVAL; + if (get_user(value, (int __user *) optval)) + return -EFAULT; + *optvar = !!value; + return 0; +} + +static int rds_cong_monitor(struct rds_sock *rs, char __user *optval, + int optlen) +{ + int ret; + + ret = rds_set_bool_option(&rs->rs_cong_monitor, optval, optlen); + if (ret == 0) { + if (rs->rs_cong_monitor) { + rds_cong_add_socket(rs); + } else { + rds_cong_remove_socket(rs); + rs->rs_cong_mask = 0; + rs->rs_cong_notify = 0; + } + } + return ret; +} + +static int rds_setsockopt(struct socket *sock, int level, int optname, + char __user *optval, int optlen) +{ + struct rds_sock *rs = rds_sk_to_rs(sock->sk); + int ret; + + if (level != SOL_RDS) { + ret = -ENOPROTOOPT; + goto out; + } + + switch (optname) { + case RDS_CANCEL_SENT_TO: + ret = rds_cancel_sent_to(rs, optval, optlen); + break; + case RDS_GET_MR: + ret = rds_get_mr(rs, optval, optlen); + break; + case RDS_FREE_MR: + ret = rds_free_mr(rs, optval, optlen); + break; + case RDS_RECVERR: + ret = rds_set_bool_option(&rs->rs_recverr, optval, optlen); + break; + case RDS_CONG_MONITOR: + ret = rds_cong_monitor(rs, optval, optlen); + break; + default: + ret = -ENOPROTOOPT; + } +out: + return ret; +} + +static int rds_getsockopt(struct socket *sock, int level, int optname, + char __user *optval, int __user *optlen) +{ + struct rds_sock *rs = rds_sk_to_rs(sock->sk); + int ret = -ENOPROTOOPT, len; + + if (level != SOL_RDS) + goto out; + + if (get_user(len, optlen)) { + ret = -EFAULT; + goto out; + } + + switch (optname) { + case RDS_INFO_FIRST ... 
RDS_INFO_LAST: + ret = rds_info_getsockopt(sock, optname, optval, + optlen); + break; + + case RDS_RECVERR: + if (len < sizeof(int)) + ret = -EINVAL; + else + if (put_user(rs->rs_recverr, (int __user *) optval) + || put_user(sizeof(int), optlen)) + ret = -EFAULT; + else + ret = 0; + break; + default: + break; + } + +out: + return ret; + +} + +static int rds_connect(struct socket *sock, struct sockaddr *uaddr, + int addr_len, int flags) +{ + struct sock *sk = sock->sk; + struct sockaddr_in *sin = (struct sockaddr_in *)uaddr; + struct rds_sock *rs = rds_sk_to_rs(sk); + int ret = 0; + + lock_sock(sk); + + if (addr_len != sizeof(struct sockaddr_in)) { + ret = -EINVAL; + goto out; + } + + if (sin->sin_family != AF_INET) { + ret = -EAFNOSUPPORT; + goto out; + } + + if (sin->sin_addr.s_addr == htonl(INADDR_ANY)) { + ret = -EDESTADDRREQ; + goto out; + } + + rs->rs_conn_addr = sin->sin_addr.s_addr; + rs->rs_conn_port = sin->sin_port; + +out: + release_sock(sk); + return ret; +} + +static struct proto rds_proto = { + .name = "RDS", + .owner = THIS_MODULE, + .obj_size = sizeof(struct rds_sock), +}; + +static struct proto_ops rds_proto_ops = { + .family = AF_RDS, + .owner = THIS_MODULE, + .release = rds_release, + .bind = rds_bind, + .connect = rds_connect, + .socketpair = sock_no_socketpair, + .accept = sock_no_accept, + .getname = rds_getname, + .poll = rds_poll, + .ioctl = rds_ioctl, + .listen = sock_no_listen, + .shutdown = sock_no_shutdown, + .setsockopt = rds_setsockopt, + .getsockopt = rds_getsockopt, + .sendmsg = rds_sendmsg, + .recvmsg = rds_recvmsg, + .mmap = sock_no_mmap, + .sendpage = sock_no_sendpage, +}; + +static int __rds_create(struct socket *sock, struct sock *sk, int protocol) +{ + unsigned long flags; + struct rds_sock *rs; + + sock_init_data(sock, sk); + sock->ops = &rds_proto_ops; + sk->sk_protocol = protocol; + + rs = rds_sk_to_rs(sk); + spin_lock_init(&rs->rs_lock); + rwlock_init(&rs->rs_recv_lock); + INIT_LIST_HEAD(&rs->rs_send_queue); + 
INIT_LIST_HEAD(&rs->rs_recv_queue); + INIT_LIST_HEAD(&rs->rs_notify_queue); + INIT_LIST_HEAD(&rs->rs_cong_list); + spin_lock_init(&rs->rs_rdma_lock); + rs->rs_rdma_keys = RB_ROOT; + + spin_lock_irqsave(&rds_sock_lock, flags); + list_add_tail(&rs->rs_item, &rds_sock_list); + rds_sock_count++; + spin_unlock_irqrestore(&rds_sock_lock, flags); + + return 0; +} + +static int rds_create(struct net *net, struct socket *sock, int protocol) +{ + struct sock *sk; + + if (sock->type != SOCK_SEQPACKET || protocol) + return -ESOCKTNOSUPPORT; + + sk = sk_alloc(net, AF_RDS, GFP_ATOMIC, &rds_proto); + if (!sk) + return -ENOMEM; + + return __rds_create(sock, sk, protocol); +} + +void rds_sock_addref(struct rds_sock *rs) +{ + sock_hold(rds_rs_to_sk(rs)); +} + +void rds_sock_put(struct rds_sock *rs) +{ + sock_put(rds_rs_to_sk(rs)); +} + +static struct net_proto_family rds_family_ops = { + .family = AF_RDS, + .create = rds_create, + .owner = THIS_MODULE, +}; + +static void rds_sock_inc_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + struct rds_sock *rs; + struct sock *sk; + struct rds_incoming *inc; + unsigned long flags; + unsigned int total = 0; + + len /= sizeof(struct rds_info_message); + + spin_lock_irqsave(&rds_sock_lock, flags); + + list_for_each_entry(rs, &rds_sock_list, rs_item) { + sk = rds_rs_to_sk(rs); + read_lock(&rs->rs_recv_lock); + + /* XXX too lazy to maintain counts.. 
*/ + list_for_each_entry(inc, &rs->rs_recv_queue, i_item) { + total++; + if (total <= len) + rds_inc_info_copy(inc, iter, inc->i_saddr, + rs->rs_bound_addr, 1); + } + + read_unlock(&rs->rs_recv_lock); + } + + spin_unlock_irqrestore(&rds_sock_lock, flags); + + lens->nr = total; + lens->each = sizeof(struct rds_info_message); +} + +static void rds_sock_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + struct rds_info_socket sinfo; + struct rds_sock *rs; + unsigned long flags; + + len /= sizeof(struct rds_info_socket); + + spin_lock_irqsave(&rds_sock_lock, flags); + + if (len < rds_sock_count) + goto out; + + list_for_each_entry(rs, &rds_sock_list, rs_item) { + sinfo.sndbuf = rds_sk_sndbuf(rs); + sinfo.rcvbuf = rds_sk_rcvbuf(rs); + sinfo.bound_addr = rs->rs_bound_addr; + sinfo.connected_addr = rs->rs_conn_addr; + sinfo.bound_port = rs->rs_bound_port; + sinfo.connected_port = rs->rs_conn_port; + sinfo.inum = sock_i_ino(rds_rs_to_sk(rs)); + + rds_info_copy(iter, &sinfo, sizeof(sinfo)); + } + +out: + lens->nr = rds_sock_count; + lens->each = sizeof(struct rds_info_socket); + + spin_unlock_irqrestore(&rds_sock_lock, flags); +} + +static void __exit rds_exit(void) +{ + rds_rdma_exit(); + sock_unregister(rds_family_ops.family); + proto_unregister(&rds_proto); + rds_conn_exit(); + rds_cong_exit(); + rds_sysctl_exit(); + rds_threads_exit(); + rds_stats_exit(); + rds_page_exit(); + rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info); + rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info); +} +module_exit(rds_exit); + +static int __init rds_init(void) +{ + int ret; + + ret = rds_conn_init(); + if (ret) + goto out; + ret = rds_threads_init(); + if (ret) + goto out_conn; + ret = rds_sysctl_init(); + if (ret) + goto out_threads; + ret = rds_stats_init(); + if (ret) + goto out_sysctl; + ret = proto_register(&rds_proto, 1); + if (ret) + goto out_stats; + ret = sock_register(&rds_family_ops); + if 
(ret) + goto out_proto; + + rds_info_register_func(RDS_INFO_SOCKETS, rds_sock_info); + rds_info_register_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info); + + /* ib/iwarp transports currently compiled-in */ + ret = rds_rdma_init(); + if (ret) + goto out_sock; + goto out; + +out_sock: + sock_unregister(rds_family_ops.family); +out_proto: + proto_unregister(&rds_proto); +out_stats: + rds_stats_exit(); +out_sysctl: + rds_sysctl_exit(); +out_threads: + rds_threads_exit(); +out_conn: + rds_conn_exit(); + rds_cong_exit(); + rds_page_exit(); +out: + return ret; +} +module_init(rds_init); + +#define DRV_VERSION "4.0" +#define DRV_RELDATE "Feb 12, 2009" + +MODULE_AUTHOR("Oracle Corporation "); +MODULE_DESCRIPTION("RDS: Reliable Datagram Sockets" + " v" DRV_VERSION " (" DRV_RELDATE ")"); +MODULE_VERSION(DRV_VERSION); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_ALIAS_NETPROTO(PF_RDS); diff --git a/net/rds/bind.c b/net/rds/bind.c new file mode 100644 index 0000000..c17cc39 --- /dev/null +++ b/net/rds/bind.c @@ -0,0 +1,199 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include "rds.h" + +/* + * XXX this probably still needs more work.. no INADDR_ANY, and rbtrees aren't + * particularly zippy. + * + * This is now called for every incoming frame so we arguably care much more + * about it than we used to. + */ +static DEFINE_SPINLOCK(rds_bind_lock); +static struct rb_root rds_bind_tree = RB_ROOT; + +static struct rds_sock *rds_bind_tree_walk(__be32 addr, __be16 port, + struct rds_sock *insert) +{ + struct rb_node **p = &rds_bind_tree.rb_node; + struct rb_node *parent = NULL; + struct rds_sock *rs; + u64 cmp; + u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port); + + while (*p) { + parent = *p; + rs = rb_entry(parent, struct rds_sock, rs_bound_node); + + cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) | + be16_to_cpu(rs->rs_bound_port); + + if (needle < cmp) + p = &(*p)->rb_left; + else if (needle > cmp) + p = &(*p)->rb_right; + else + return rs; + } + + if (insert) { + rb_link_node(&insert->rs_bound_node, parent, p); + rb_insert_color(&insert->rs_bound_node, &rds_bind_tree); + } + return NULL; +} + +/* + * Return the rds_sock bound at the given local address. + * + * The rx path can race with rds_release. We notice if rds_release() has + * marked this socket and don't return a rs ref to the rx path. 
+ */ +struct rds_sock *rds_find_bound(__be32 addr, __be16 port) +{ + struct rds_sock *rs; + unsigned long flags; + + spin_lock_irqsave(&rds_bind_lock, flags); + rs = rds_bind_tree_walk(addr, port, NULL); + if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) + rds_sock_addref(rs); + else + rs = NULL; + spin_unlock_irqrestore(&rds_bind_lock, flags); + + rdsdebug("returning rs %p for %pI4:%u\n", rs, &addr, + ntohs(port)); + return rs; +} + +/* returns 0 on success (the bound port is stored in *port) or -EADDRINUSE */ +static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port) +{ + unsigned long flags; + int ret = -EADDRINUSE; + u16 rover, last; + + if (*port != 0) { + rover = be16_to_cpu(*port); + last = rover; + } else { + rover = max_t(u16, net_random(), 2); + last = rover - 1; + } + + spin_lock_irqsave(&rds_bind_lock, flags); + + do { + if (rover == 0) + rover++; + if (rds_bind_tree_walk(addr, cpu_to_be16(rover), rs) == NULL) { + *port = cpu_to_be16(rover); + ret = 0; + break; + } + } while (rover++ != last); + + if (ret == 0) { + rs->rs_bound_addr = addr; + rs->rs_bound_port = *port; + rds_sock_addref(rs); + + rdsdebug("rs %p binding to %pI4:%d\n", + rs, &addr, (int)ntohs(*port)); + } + + spin_unlock_irqrestore(&rds_bind_lock, flags); + + return ret; +} + +void rds_remove_bound(struct rds_sock *rs) +{ + unsigned long flags; + + spin_lock_irqsave(&rds_bind_lock, flags); + + if (rs->rs_bound_addr) { + rdsdebug("rs %p unbinding from %pI4:%d\n", + rs, &rs->rs_bound_addr, + ntohs(rs->rs_bound_port)); + + rb_erase(&rs->rs_bound_node, &rds_bind_tree); + rds_sock_put(rs); + rs->rs_bound_addr = 0; + } + + spin_unlock_irqrestore(&rds_bind_lock, flags); +} + +int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) +{ + struct sock *sk = sock->sk; + struct sockaddr_in *sin = (struct sockaddr_in *)uaddr; + struct rds_sock *rs = rds_sk_to_rs(sk); + struct rds_transport *trans; + int ret = 0; + + lock_sock(sk); + + if (addr_len != sizeof(struct sockaddr_in) || + sin->sin_family != AF_INET ||
+ rs->rs_bound_addr || + sin->sin_addr.s_addr == htonl(INADDR_ANY)) { + ret = -EINVAL; + goto out; + } + + ret = rds_add_bound(rs, sin->sin_addr.s_addr, &sin->sin_port); + if (ret) + goto out; + + trans = rds_trans_get_preferred(sin->sin_addr.s_addr); + if (trans == NULL) { + ret = -EADDRNOTAVAIL; + rds_remove_bound(rs); + goto out; + } + + rs->rs_transport = trans; + ret = 0; + +out: + release_sock(sk); + return ret; +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:19 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:19 -0800 Subject: [ofa-general] [PATCH 02/26] RDS: Main header file In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-3-git-send-email-andy.grover@oracle.com> RDS's main data structure definitions and exported functions. Signed-off-by: Andy Grover --- net/rds/rds.h | 686 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 686 insertions(+), 0 deletions(-) create mode 100644 net/rds/rds.h diff --git a/net/rds/rds.h b/net/rds/rds.h new file mode 100644 index 0000000..0604007 --- /dev/null +++ b/net/rds/rds.h @@ -0,0 +1,686 @@ +#ifndef _RDS_RDS_H +#define _RDS_RDS_H + +#include +#include +#include +#include +#include +#include + +#include "info.h" + +/* + * RDS Network protocol version + */ +#define RDS_PROTOCOL_3_0 0x0300 +#define RDS_PROTOCOL_3_1 0x0301 +#define RDS_PROTOCOL_VERSION RDS_PROTOCOL_3_1 +#define RDS_PROTOCOL_MAJOR(v) ((v) >> 8) +#define RDS_PROTOCOL_MINOR(v) ((v) & 255) +#define RDS_PROTOCOL(maj, min) (((maj) << 8) | min) + +/* + * XXX randomly chosen, but at least seems to be unused: + * # 18464-18768 Unassigned + * We should do better. We want a reserved port to discourage unpriv'ed + * userspace from listening. + */ +#define RDS_PORT 18634 + +#ifdef DEBUG +#define rdsdebug(fmt, args...) 
pr_debug("%s(): " fmt, __func__ , ##args) +#else +/* sigh, pr_debug() causes unused variable warnings */ +static inline void __attribute__ ((format (printf, 1, 2))) +rdsdebug(char *fmt, ...) +{ +} +#endif + +/* XXX is there one of these somewhere? */ +#define ceil(x, y) \ + ({ unsigned long __x = (x), __y = (y); (__x + __y - 1) / __y; }) + +#define RDS_FRAG_SHIFT 12 +#define RDS_FRAG_SIZE ((unsigned int)(1 << RDS_FRAG_SHIFT)) + +#define RDS_CONG_MAP_BYTES (65536 / 8) +#define RDS_CONG_MAP_LONGS (RDS_CONG_MAP_BYTES / sizeof(unsigned long)) +#define RDS_CONG_MAP_PAGES (PAGE_ALIGN(RDS_CONG_MAP_BYTES) / PAGE_SIZE) +#define RDS_CONG_MAP_PAGE_BITS (PAGE_SIZE * 8) + +struct rds_cong_map { + struct rb_node m_rb_node; + __be32 m_addr; + wait_queue_head_t m_waitq; + struct list_head m_conn_list; + unsigned long m_page_addrs[RDS_CONG_MAP_PAGES]; +}; + + +/* + * This is how we will track the connection state: + * A connection is always in one of the following + * states. Updates to the state are atomic and imply + * a memory barrier. 
+ */ +enum { + RDS_CONN_DOWN = 0, + RDS_CONN_CONNECTING, + RDS_CONN_DISCONNECTING, + RDS_CONN_UP, + RDS_CONN_ERROR, +}; + +/* Bits for c_flags */ +#define RDS_LL_SEND_FULL 0 +#define RDS_RECONNECT_PENDING 1 + +struct rds_connection { + struct hlist_node c_hash_node; + __be32 c_laddr; + __be32 c_faddr; + unsigned int c_loopback:1; + struct rds_connection *c_passive; + + struct rds_cong_map *c_lcong; + struct rds_cong_map *c_fcong; + + struct mutex c_send_lock; /* protect send ring */ + struct rds_message *c_xmit_rm; + unsigned long c_xmit_sg; + unsigned int c_xmit_hdr_off; + unsigned int c_xmit_data_off; + unsigned int c_xmit_rdma_sent; + + spinlock_t c_lock; /* protect msg queues */ + u64 c_next_tx_seq; + struct list_head c_send_queue; + struct list_head c_retrans; + + u64 c_next_rx_seq; + + struct rds_transport *c_trans; + void *c_transport_data; + + atomic_t c_state; + unsigned long c_flags; + unsigned long c_reconnect_jiffies; + struct delayed_work c_send_w; + struct delayed_work c_recv_w; + struct delayed_work c_conn_w; + struct work_struct c_down_w; + struct mutex c_cm_lock; /* protect conn state & cm */ + + struct list_head c_map_item; + unsigned long c_map_queued; + unsigned long c_map_offset; + unsigned long c_map_bytes; + + unsigned int c_unacked_packets; + unsigned int c_unacked_bytes; + + /* Protocol version */ + unsigned int c_version; +}; + +#define RDS_FLAG_CONG_BITMAP 0x01 +#define RDS_FLAG_ACK_REQUIRED 0x02 +#define RDS_FLAG_RETRANSMITTED 0x04 +#define RDS_MAX_ADV_CREDIT 127 + +/* + * Maximum space available for extension headers. 
+ */ +#define RDS_HEADER_EXT_SPACE 16 + +struct rds_header { + __be64 h_sequence; + __be64 h_ack; + __be32 h_len; + __be16 h_sport; + __be16 h_dport; + u8 h_flags; + u8 h_credit; + u8 h_padding[4]; + __sum16 h_csum; + + u8 h_exthdr[RDS_HEADER_EXT_SPACE]; +}; + +/* + * Reserved - indicates end of extensions + */ +#define RDS_EXTHDR_NONE 0 + +/* + * This extension header is included in the very + * first message that is sent on a new connection, + * and identifies the protocol level. This will help + * rolling updates if a future change requires breaking + * the protocol. + * NB: This is no longer true for IB, where we do a version + * negotiation during the connection setup phase (protocol + * version information is included in the RDMA CM private data). + */ +#define RDS_EXTHDR_VERSION 1 +struct rds_ext_header_version { + __be32 h_version; +}; + +/* + * This extension header is included in the RDS message + * chasing an RDMA operation. + */ +#define RDS_EXTHDR_RDMA 2 +struct rds_ext_header_rdma { + __be32 h_rdma_rkey; +}; + +/* + * This extension header tells the peer about the + * destination of the requested RDMA + * operation. + */ +#define RDS_EXTHDR_RDMA_DEST 3 +struct rds_ext_header_rdma_dest { + __be32 h_rdma_rkey; + __be32 h_rdma_offset; +}; + +#define __RDS_EXTHDR_MAX 16 /* for now */ + +struct rds_incoming { + atomic_t i_refcount; + struct list_head i_item; + struct rds_connection *i_conn; + struct rds_header i_hdr; + unsigned long i_rx_jiffies; + __be32 i_saddr; + + rds_rdma_cookie_t i_rdma_cookie; +}; + +/* + * m_sock_item and m_conn_item are on lists that are serialized under + * conn->c_lock. m_sock_item has additional meaning in that once it is empty + * the message will not be put back on the retransmit list after being sent. + * messages that are canceled while being sent rely on this. + * + * m_inc is used by loopback so that it can pass an incoming message straight + * back up into the rx path. 
It embeds a wire header which is also used by + * the send path, which is kind of awkward. + * + * m_sock_item indicates the message's presence on a socket's send or receive + * queue. m_rs will point to that socket. + * + * m_daddr is used by cancellation to prune messages to a given destination. + * + * The RDS_MSG_ON_SOCK and RDS_MSG_ON_CONN flags are used to avoid lock + * nesting. As paths iterate over messages on a sock, or conn, they must + * also lock the conn, or sock, to remove the message from those lists too. + * Testing the flag to determine if the message is still on the lists lets + * us avoid testing the list_head directly. That means each path can use + * the message's list_head to keep it on a local list while juggling locks + * without confusing the other path. + * + * m_ack_seq is an optional field set by transports who need a different + * sequence number range to invalidate. They can use this in a callback + * that they pass to rds_send_drop_acked() to see if each message has been + * acked. The HAS_ACK_SEQ flag can be used to detect messages which haven't + * had ack_seq set yet. + */ +#define RDS_MSG_ON_SOCK 1 +#define RDS_MSG_ON_CONN 2 +#define RDS_MSG_HAS_ACK_SEQ 3 +#define RDS_MSG_ACK_REQUIRED 4 +#define RDS_MSG_RETRANSMITTED 5 +#define RDS_MSG_MAPPED 6 +#define RDS_MSG_PAGEVEC 7 + +struct rds_message { + atomic_t m_refcount; + struct list_head m_sock_item; + struct list_head m_conn_item; + struct rds_incoming m_inc; + u64 m_ack_seq; + __be32 m_daddr; + unsigned long m_flags; + + /* Never access m_rs without holding m_rs_lock. + * Lock nesting is + * rm->m_rs_lock + * -> rs->rs_lock + */ + spinlock_t m_rs_lock; + struct rds_sock *m_rs; + struct rds_rdma_op *m_rdma_op; + rds_rdma_cookie_t m_rdma_cookie; + struct rds_mr *m_rdma_mr; + unsigned int m_nents; + unsigned int m_count; + struct scatterlist m_sg[0]; +}; + +/* + * The RDS notifier is used (optionally) to tell the application about + * completed RDMA operations. 
Rather than keeping the whole rds message + * around on the queue, we allocate a small notifier that is put on the + * socket's rs_notify_queue. Notifications are delivered to the application + * through control messages. + */ +struct rds_notifier { + struct list_head n_list; + uint64_t n_user_token; + int n_status; +}; + +/** + * struct rds_transport - transport-specific behavioural hooks + * + * @xmit: .xmit is called by rds_send_xmit() to tell the transport to send + * part of a message. The caller serializes on c_send_lock so this + * doesn't need to be reentrant for a given conn. The header must be + * sent before the data payload. .xmit must be prepared to send a + * message with no data payload. .xmit should return the number of + * bytes that were sent down the connection, including header bytes. + * Returning 0 tells the caller that it doesn't need to perform any + * additional work now. This is usually the case when the transport has + * filled the sending queue for its connection and will handle + * triggering the rds thread to continue the send when space becomes + * available. Returning -EAGAIN tells the caller to retry the send + * immediately. Returning -ENOMEM tells the caller to retry the send at + * some point in the future. + * + * @conn_shutdown: conn_shutdown stops traffic on the given connection. Once + * it returns, the connection cannot call rds_recv_incoming(). + * This will only be called once after conn_connect returns + * non-zero success. The caller serializes this with + * the send and connecting paths (xmit_* and conn_*). The + * transport is responsible for other serialization, including + * rds_recv_incoming(). This is called in process context but + * should try hard not to block. + * + * @xmit_cong_map: This asks the transport to send the local bitmap down the + * given connection. XXX get a better story about the bitmap + * flag and header.
+ */ + +struct rds_transport { + char t_name[TRANSNAMSIZ]; + struct list_head t_item; + struct module *t_owner; + unsigned int t_prefer_loopback:1; + + int (*laddr_check)(__be32 addr); + int (*conn_alloc)(struct rds_connection *conn, gfp_t gfp); + void (*conn_free)(void *data); + int (*conn_connect)(struct rds_connection *conn); + void (*conn_shutdown)(struct rds_connection *conn); + void (*xmit_prepare)(struct rds_connection *conn); + void (*xmit_complete)(struct rds_connection *conn); + int (*xmit)(struct rds_connection *conn, struct rds_message *rm, + unsigned int hdr_off, unsigned int sg, unsigned int off); + int (*xmit_cong_map)(struct rds_connection *conn, + struct rds_cong_map *map, unsigned long offset); + int (*xmit_rdma)(struct rds_connection *conn, struct rds_rdma_op *op); + int (*recv)(struct rds_connection *conn); + int (*inc_copy_to_user)(struct rds_incoming *inc, struct iovec *iov, + size_t size); + void (*inc_purge)(struct rds_incoming *inc); + void (*inc_free)(struct rds_incoming *inc); + + int (*cm_handle_connect)(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event); + int (*cm_initiate_connect)(struct rdma_cm_id *cm_id); + void (*cm_connect_complete)(struct rds_connection *conn, + struct rdma_cm_event *event); + + unsigned int (*stats_info_copy)(struct rds_info_iterator *iter, + unsigned int avail); + void (*exit)(void); + void *(*get_mr)(struct scatterlist *sg, unsigned long nr_sg, + struct rds_sock *rs, u32 *key_ret); + void (*sync_mr)(void *trans_private, int direction); + void (*free_mr)(void *trans_private, int invalidate); + void (*flush_mrs)(void); +}; + +struct rds_sock { + struct sock rs_sk; + + u64 rs_user_addr; + u64 rs_user_bytes; + + /* + * bound_addr used for both incoming and outgoing, no INADDR_ANY + * support. 
+ */ + struct rb_node rs_bound_node; + __be32 rs_bound_addr; + __be32 rs_conn_addr; + __be16 rs_bound_port; + __be16 rs_conn_port; + + /* + * This is only used to communicate the transport between bind and + * initiating connections. All other trans use is referenced through + * the connection. + */ + struct rds_transport *rs_transport; + + /* + * rds_sendmsg caches the conn it used the last time around. + * This helps avoid costly lookups. + */ + struct rds_connection *rs_conn; + + /* flag indicating we were congested or not */ + int rs_congested; + + /* rs_lock protects all these adjacent members before the newline */ + spinlock_t rs_lock; + struct list_head rs_send_queue; + u32 rs_snd_bytes; + int rs_rcv_bytes; + struct list_head rs_notify_queue; /* currently used for failed RDMAs */ + + /* Congestion wake_up. If rs_cong_monitor is set, we use cong_mask + * to decide whether the application should be woken up. + * If not set, we use rs_cong_track to find out whether a cong map + * update arrived. + */ + uint64_t rs_cong_mask; + uint64_t rs_cong_notify; + struct list_head rs_cong_list; + unsigned long rs_cong_track; + + /* + * rs_recv_lock protects the receive queue, and is + * used to serialize with rds_release. + */ + rwlock_t rs_recv_lock; + struct list_head rs_recv_queue; + + /* just for stats reporting */ + struct list_head rs_item; + + /* these have their own lock */ + spinlock_t rs_rdma_lock; + struct rb_root rs_rdma_keys; + + /* Socket options - in case there will be more */ + unsigned char rs_recverr, + rs_cong_monitor; +}; + +static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk) +{ + return container_of(sk, struct rds_sock, rs_sk); +} +static inline struct sock *rds_rs_to_sk(struct rds_sock *rs) +{ + return &rs->rs_sk; +} + +/* + * The stack assigns sk_sndbuf and sk_rcvbuf to twice the specified value + * to account for overhead. We don't account for overhead, we just apply + * the number of payload bytes to the specified value. 
+ */ +static inline int rds_sk_sndbuf(struct rds_sock *rs) +{ + return rds_rs_to_sk(rs)->sk_sndbuf / 2; +} +static inline int rds_sk_rcvbuf(struct rds_sock *rs) +{ + return rds_rs_to_sk(rs)->sk_rcvbuf / 2; +} + +struct rds_statistics { + uint64_t s_conn_reset; + uint64_t s_recv_drop_bad_checksum; + uint64_t s_recv_drop_old_seq; + uint64_t s_recv_drop_no_sock; + uint64_t s_recv_drop_dead_sock; + uint64_t s_recv_deliver_raced; + uint64_t s_recv_delivered; + uint64_t s_recv_queued; + uint64_t s_recv_immediate_retry; + uint64_t s_recv_delayed_retry; + uint64_t s_recv_ack_required; + uint64_t s_recv_rdma_bytes; + uint64_t s_recv_ping; + uint64_t s_send_queue_empty; + uint64_t s_send_queue_full; + uint64_t s_send_sem_contention; + uint64_t s_send_sem_queue_raced; + uint64_t s_send_immediate_retry; + uint64_t s_send_delayed_retry; + uint64_t s_send_drop_acked; + uint64_t s_send_ack_required; + uint64_t s_send_queued; + uint64_t s_send_rdma; + uint64_t s_send_rdma_bytes; + uint64_t s_send_pong; + uint64_t s_page_remainder_hit; + uint64_t s_page_remainder_miss; + uint64_t s_copy_to_user; + uint64_t s_copy_from_user; + uint64_t s_cong_update_queued; + uint64_t s_cong_update_received; + uint64_t s_cong_send_error; + uint64_t s_cong_send_blocked; +}; + +/* af_rds.c */ +void rds_sock_addref(struct rds_sock *rs); +void rds_sock_put(struct rds_sock *rs); +void rds_wake_sk_sleep(struct rds_sock *rs); +static inline void __rds_wake_sk_sleep(struct sock *sk) +{ + wait_queue_head_t *waitq = sk->sk_sleep; + + if (!sock_flag(sk, SOCK_DEAD) && waitq) + wake_up(waitq); +} +extern wait_queue_head_t rds_poll_waitq; + + +/* bind.c */ +int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len); +void rds_remove_bound(struct rds_sock *rs); +struct rds_sock *rds_find_bound(__be32 addr, __be16 port); + +/* cong.c */ +int rds_cong_get_maps(struct rds_connection *conn); +void rds_cong_add_conn(struct rds_connection *conn); +void rds_cong_remove_conn(struct rds_connection *conn); 
+void rds_cong_set_bit(struct rds_cong_map *map, __be16 port); +void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port); +int rds_cong_wait(struct rds_cong_map *map, __be16 port, int nonblock, struct rds_sock *rs); +void rds_cong_queue_updates(struct rds_cong_map *map); +void rds_cong_map_updated(struct rds_cong_map *map, uint64_t); +int rds_cong_updated_since(unsigned long *recent); +void rds_cong_add_socket(struct rds_sock *); +void rds_cong_remove_socket(struct rds_sock *); +void rds_cong_exit(void); +struct rds_message *rds_cong_update_alloc(struct rds_connection *conn); + +/* conn.c */ +int __init rds_conn_init(void); +void rds_conn_exit(void); +struct rds_connection *rds_conn_create(__be32 laddr, __be32 faddr, + struct rds_transport *trans, gfp_t gfp); +struct rds_connection *rds_conn_create_outgoing(__be32 laddr, __be32 faddr, + struct rds_transport *trans, gfp_t gfp); +void rds_conn_destroy(struct rds_connection *conn); +void rds_conn_reset(struct rds_connection *conn); +void rds_conn_drop(struct rds_connection *conn); +void rds_for_each_conn_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens, + int (*visitor)(struct rds_connection *, void *), + size_t item_len); +void __rds_conn_error(struct rds_connection *conn, const char *, ...) + __attribute__ ((format (printf, 2, 3))); +#define rds_conn_error(conn, fmt...) 
\ + __rds_conn_error(conn, KERN_WARNING "RDS: " fmt) + +static inline int +rds_conn_transition(struct rds_connection *conn, int old, int new) +{ + return atomic_cmpxchg(&conn->c_state, old, new) == old; +} + +static inline int +rds_conn_state(struct rds_connection *conn) +{ + return atomic_read(&conn->c_state); +} + +static inline int +rds_conn_up(struct rds_connection *conn) +{ + return atomic_read(&conn->c_state) == RDS_CONN_UP; +} + +static inline int +rds_conn_connecting(struct rds_connection *conn) +{ + return atomic_read(&conn->c_state) == RDS_CONN_CONNECTING; +} + +/* message.c */ +struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp); +struct rds_message *rds_message_copy_from_user(struct iovec *first_iov, + size_t total_len); +struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len); +void rds_message_populate_header(struct rds_header *hdr, __be16 sport, + __be16 dport, u64 seq); +int rds_message_add_extension(struct rds_header *hdr, + unsigned int type, const void *data, unsigned int len); +int rds_message_next_extension(struct rds_header *hdr, + unsigned int *pos, void *buf, unsigned int *buflen); +int rds_message_add_version_extension(struct rds_header *hdr, unsigned int version); +int rds_message_get_version_extension(struct rds_header *hdr, unsigned int *version); +int rds_message_add_rdma_dest_extension(struct rds_header *hdr, u32 r_key, u32 offset); +int rds_message_inc_copy_to_user(struct rds_incoming *inc, + struct iovec *first_iov, size_t size); +void rds_message_inc_purge(struct rds_incoming *inc); +void rds_message_inc_free(struct rds_incoming *inc); +void rds_message_addref(struct rds_message *rm); +void rds_message_put(struct rds_message *rm); +void rds_message_wait(struct rds_message *rm); +void rds_message_unmapped(struct rds_message *rm); + +static inline void rds_message_make_checksum(struct rds_header *hdr) +{ + hdr->h_csum = 0; + hdr->h_csum = ip_fast_csum((void *) hdr, sizeof(*hdr) >> 
2); +} + +static inline int rds_message_verify_checksum(const struct rds_header *hdr) +{ + return !hdr->h_csum || ip_fast_csum((void *) hdr, sizeof(*hdr) >> 2) == 0; +} + + +/* page.c */ +int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes, + gfp_t gfp); +int rds_page_copy_user(struct page *page, unsigned long offset, + void __user *ptr, unsigned long bytes, + int to_user); +#define rds_page_copy_to_user(page, offset, ptr, bytes) \ + rds_page_copy_user(page, offset, ptr, bytes, 1) +#define rds_page_copy_from_user(page, offset, ptr, bytes) \ + rds_page_copy_user(page, offset, ptr, bytes, 0) +void rds_page_exit(void); + +/* recv.c */ +void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn, + __be32 saddr); +void rds_inc_addref(struct rds_incoming *inc); +void rds_inc_put(struct rds_incoming *inc); +void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr, + struct rds_incoming *inc, gfp_t gfp, enum km_type km); +int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, + size_t size, int msg_flags); +void rds_clear_recv_queue(struct rds_sock *rs); +int rds_notify_queue_get(struct rds_sock *rs, struct msghdr *msg); +void rds_inc_info_copy(struct rds_incoming *inc, + struct rds_info_iterator *iter, + __be32 saddr, __be32 daddr, int flip); + +/* send.c */ +int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, + size_t payload_len); +void rds_send_reset(struct rds_connection *conn); +int rds_send_xmit(struct rds_connection *conn); +struct sockaddr_in; +void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest); +typedef int (*is_acked_func)(struct rds_message *rm, uint64_t ack); +void rds_send_drop_acked(struct rds_connection *conn, u64 ack, + is_acked_func is_acked); +int rds_send_acked_before(struct rds_connection *conn, u64 seq); +void rds_send_remove_from_sock(struct list_head *messages, int status); +int rds_send_pong(struct rds_connection *conn, __be16 
dport); +struct rds_message *rds_send_get_message(struct rds_connection *, + struct rds_rdma_op *); + +/* rdma.c */ +void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force); + +/* stats.c */ +DECLARE_PER_CPU(struct rds_statistics, rds_stats); +#define rds_stats_inc_which(which, member) do { \ + per_cpu(which, get_cpu()).member++; \ + put_cpu(); \ +} while (0) +#define rds_stats_inc(member) rds_stats_inc_which(rds_stats, member) +#define rds_stats_add_which(which, member, count) do { \ + per_cpu(which, get_cpu()).member += count; \ + put_cpu(); \ +} while (0) +#define rds_stats_add(member, count) rds_stats_add_which(rds_stats, member, count) +int __init rds_stats_init(void); +void rds_stats_exit(void); +void rds_stats_info_copy(struct rds_info_iterator *iter, + uint64_t *values, char **names, size_t nr); + +/* sysctl.c */ +int __init rds_sysctl_init(void); +void rds_sysctl_exit(void); +extern unsigned long rds_sysctl_sndbuf_min; +extern unsigned long rds_sysctl_sndbuf_default; +extern unsigned long rds_sysctl_sndbuf_max; +extern unsigned long rds_sysctl_reconnect_min_jiffies; +extern unsigned long rds_sysctl_reconnect_max_jiffies; +extern unsigned int rds_sysctl_max_unacked_packets; +extern unsigned int rds_sysctl_max_unacked_bytes; +extern unsigned int rds_sysctl_ping_enable; +extern unsigned long rds_sysctl_trace_flags; +extern unsigned int rds_sysctl_trace_level; + +/* threads.c */ +int __init rds_threads_init(void); +void rds_threads_exit(void); +extern struct workqueue_struct *rds_wq; +void rds_connect_worker(struct work_struct *); +void rds_shutdown_worker(struct work_struct *); +void rds_send_worker(struct work_struct *); +void rds_recv_worker(struct work_struct *); +void rds_connect_complete(struct rds_connection *conn); + +/* transport.c */ +int rds_trans_register(struct rds_transport *trans); +void rds_trans_unregister(struct rds_transport *trans); +struct rds_transport *rds_trans_get_preferred(__be32 addr); +unsigned int 
rds_trans_stats_info_copy(struct rds_info_iterator *iter, + unsigned int avail); +int __init rds_trans_init(void); +void rds_trans_exit(void); + +#endif -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:20 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:20 -0800 Subject: [ofa-general] [PATCH 03/26] RDS: Congestion-handling code In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-4-git-send-email-andy.grover@oracle.com> RDS handles per-socket congestion by updating peers with a complete congestion map (8KB). This code keeps track of these maps for itself and ones received from peers. Signed-off-by: Andy Grover --- net/rds/cong.c | 402 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 402 insertions(+), 0 deletions(-) create mode 100644 net/rds/cong.c diff --git a/net/rds/cong.c b/net/rds/cong.c new file mode 100644 index 0000000..90e6b31 --- /dev/null +++ b/net/rds/cong.c @@ -0,0 +1,402 @@ +/* + * Copyright (c) 2007 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include + +#include "rds.h" + +/* + * This file implements the receive side of the unconventional congestion + * management in RDS. + * + * Messages waiting in the receive queue on the receiving socket are accounted + * against the socket's SO_RCVBUF option value. Only the payload bytes in the + * message are accounted for. If the number of bytes queued equals or exceeds + * rcvbuf then the socket is congested. All sends attempted to this socket's + * address should block or return -EWOULDBLOCK. + * + * Applications are expected to be reasonably tuned such that this situation + * very rarely occurs. An application encountering this "back-pressure" is + * considered a bug. + * + * This is implemented by having each node maintain bitmaps which indicate + * which ports on bound addresses are congested. As the bitmap changes it is + * sent through all the connections which terminate in the local address of the + * bitmap which changed. + * + * The bitmaps are allocated as connections are brought up. This avoids + * allocation in the interrupt handling path which queues messages on sockets. + * The dense bitmaps let transports send the entire bitmap on any bitmap change + * reasonably efficiently. This is much easier to implement than some + * finer-grained communication of per-port congestion. The sender does a very + * inexpensive bit test to test if the port it's about to send to is congested + * or not.
+ */ + +/* + * Interaction with poll is a tad tricky. We want all processes stuck in + * poll to wake up and check whether a congested destination became uncongested. + * The really sad thing is we have no idea which destinations the application + * wants to send to - we don't even know which rds_connections are involved. + * So until we implement a more flexible rds poll interface, we have to make + * do with this: + * We maintain a global counter that is incremented each time a congestion map + * update is received. Each rds socket tracks this value, and if rds_poll + * finds that the saved generation number is smaller than the global generation + * number, it wakes up the process. + */ +static atomic_t rds_cong_generation = ATOMIC_INIT(0); + +/* + * Congestion monitoring + */ +static LIST_HEAD(rds_cong_monitor); +static DEFINE_RWLOCK(rds_cong_monitor_lock); + +/* + * Yes, a global lock. It's used so infrequently that it's worth keeping it + * global to simplify the locking. It's only used in the following + * circumstances: + * + * - on connection buildup to associate a conn with its maps + * - on map changes to inform conns of a new map to send + * + * It's sadly ordered under the socket callback lock and the connection lock. + * Receive paths can mark ports congested from interrupt context so the + * lock masks interrupts. 
+ */ +static DEFINE_SPINLOCK(rds_cong_lock); +static struct rb_root rds_cong_tree = RB_ROOT; + +static struct rds_cong_map *rds_cong_tree_walk(__be32 addr, + struct rds_cong_map *insert) +{ + struct rb_node **p = &rds_cong_tree.rb_node; + struct rb_node *parent = NULL; + struct rds_cong_map *map; + + while (*p) { + parent = *p; + map = rb_entry(parent, struct rds_cong_map, m_rb_node); + + if (addr < map->m_addr) + p = &(*p)->rb_left; + else if (addr > map->m_addr) + p = &(*p)->rb_right; + else + return map; + } + + if (insert) { + rb_link_node(&insert->m_rb_node, parent, p); + rb_insert_color(&insert->m_rb_node, &rds_cong_tree); + } + return NULL; +} + +/* + * There is only ever one bitmap for any address. Connections try and allocate + * these bitmaps in the process getting pointers to them. The bitmaps are only + * ever freed as the module is removed after all connections have been freed. + */ +static struct rds_cong_map *rds_cong_from_addr(__be32 addr) +{ + struct rds_cong_map *map; + struct rds_cong_map *ret = NULL; + unsigned long zp; + unsigned long i; + unsigned long flags; + + map = kzalloc(sizeof(struct rds_cong_map), GFP_KERNEL); + if (map == NULL) + return NULL; + + map->m_addr = addr; + init_waitqueue_head(&map->m_waitq); + INIT_LIST_HEAD(&map->m_conn_list); + + for (i = 0; i < RDS_CONG_MAP_PAGES; i++) { + zp = get_zeroed_page(GFP_KERNEL); + if (zp == 0) + goto out; + map->m_page_addrs[i] = zp; + } + + spin_lock_irqsave(&rds_cong_lock, flags); + ret = rds_cong_tree_walk(addr, map); + spin_unlock_irqrestore(&rds_cong_lock, flags); + + if (ret == NULL) { + ret = map; + map = NULL; + } + +out: + if (map) { + for (i = 0; i < RDS_CONG_MAP_PAGES && map->m_page_addrs[i]; i++) + free_page(map->m_page_addrs[i]); + kfree(map); + } + + rdsdebug("map %p for addr %x\n", ret, be32_to_cpu(addr)); + + return ret; +} + +/* + * Put the conn on its local map's list. This is called when the conn is + * really added to the hash. It's nested under the rds_conn_lock, sadly. 
+ */ +void rds_cong_add_conn(struct rds_connection *conn) +{ + unsigned long flags; + + rdsdebug("conn %p now on map %p\n", conn, conn->c_lcong); + spin_lock_irqsave(&rds_cong_lock, flags); + list_add_tail(&conn->c_map_item, &conn->c_lcong->m_conn_list); + spin_unlock_irqrestore(&rds_cong_lock, flags); +} + +void rds_cong_remove_conn(struct rds_connection *conn) +{ + unsigned long flags; + + rdsdebug("removing conn %p from map %p\n", conn, conn->c_lcong); + spin_lock_irqsave(&rds_cong_lock, flags); + list_del_init(&conn->c_map_item); + spin_unlock_irqrestore(&rds_cong_lock, flags); +} + +int rds_cong_get_maps(struct rds_connection *conn) +{ + conn->c_lcong = rds_cong_from_addr(conn->c_laddr); + conn->c_fcong = rds_cong_from_addr(conn->c_faddr); + + if (conn->c_lcong == NULL || conn->c_fcong == NULL) + return -ENOMEM; + + return 0; +} + +void rds_cong_queue_updates(struct rds_cong_map *map) +{ + struct rds_connection *conn; + unsigned long flags; + + spin_lock_irqsave(&rds_cong_lock, flags); + + list_for_each_entry(conn, &map->m_conn_list, c_map_item) { + if (!test_and_set_bit(0, &conn->c_map_queued)) { + rds_stats_inc(s_cong_update_queued); + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + } + } + + spin_unlock_irqrestore(&rds_cong_lock, flags); +} + +void rds_cong_map_updated(struct rds_cong_map *map, uint64_t portmask) +{ + rdsdebug("waking map %p for %pI4\n", + map, &map->m_addr); + rds_stats_inc(s_cong_update_received); + atomic_inc(&rds_cong_generation); + if (waitqueue_active(&map->m_waitq)) + wake_up(&map->m_waitq); + if (waitqueue_active(&rds_poll_waitq)) + wake_up_all(&rds_poll_waitq); + + if (portmask && !list_empty(&rds_cong_monitor)) { + unsigned long flags; + struct rds_sock *rs; + + read_lock_irqsave(&rds_cong_monitor_lock, flags); + list_for_each_entry(rs, &rds_cong_monitor, rs_cong_list) { + spin_lock(&rs->rs_lock); + rs->rs_cong_notify |= (rs->rs_cong_mask & portmask); + rs->rs_cong_mask &= ~portmask; + spin_unlock(&rs->rs_lock); + if 
(rs->rs_cong_notify) + rds_wake_sk_sleep(rs); + } + read_unlock_irqrestore(&rds_cong_monitor_lock, flags); + } +} + +int rds_cong_updated_since(unsigned long *recent) +{ + unsigned long gen = atomic_read(&rds_cong_generation); + + if (likely(*recent == gen)) + return 0; + *recent = gen; + return 1; +} + +/* + * We're called under the locking that protects the socket's receive buffer + * consumption. This makes it a lot easier for the caller to only call us + * when it knows that an existing set bit needs to be cleared, and vice versa. + * We can't block and we need to deal with concurrent sockets working against + * the same per-address map. + */ +void rds_cong_set_bit(struct rds_cong_map *map, __be16 port) +{ + unsigned long i; + unsigned long off; + + rdsdebug("setting congestion for %pI4:%u in map %p\n", + &map->m_addr, ntohs(port), map); + + i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; + off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; + + generic___set_le_bit(off, (void *)map->m_page_addrs[i]); +} + +void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port) +{ + unsigned long i; + unsigned long off; + + rdsdebug("clearing congestion for %pI4:%u in map %p\n", + &map->m_addr, ntohs(port), map); + + i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; + off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; + + generic___clear_le_bit(off, (void *)map->m_page_addrs[i]); +} + +static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port) +{ + unsigned long i; + unsigned long off; + + i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; + off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; + + return generic_test_le_bit(off, (void *)map->m_page_addrs[i]); +} + +void rds_cong_add_socket(struct rds_sock *rs) +{ + unsigned long flags; + + write_lock_irqsave(&rds_cong_monitor_lock, flags); + if (list_empty(&rs->rs_cong_list)) + list_add(&rs->rs_cong_list, &rds_cong_monitor); + write_unlock_irqrestore(&rds_cong_monitor_lock, flags); +} + +void rds_cong_remove_socket(struct
rds_sock *rs) +{ + unsigned long flags; + struct rds_cong_map *map; + + write_lock_irqsave(&rds_cong_monitor_lock, flags); + list_del_init(&rs->rs_cong_list); + write_unlock_irqrestore(&rds_cong_monitor_lock, flags); + + /* update congestion map for now-closed port */ + spin_lock_irqsave(&rds_cong_lock, flags); + map = rds_cong_tree_walk(rs->rs_bound_addr, NULL); + spin_unlock_irqrestore(&rds_cong_lock, flags); + + if (map && rds_cong_test_bit(map, rs->rs_bound_port)) { + rds_cong_clear_bit(map, rs->rs_bound_port); + rds_cong_queue_updates(map); + } +} + +int rds_cong_wait(struct rds_cong_map *map, __be16 port, int nonblock, + struct rds_sock *rs) +{ + if (!rds_cong_test_bit(map, port)) + return 0; + if (nonblock) { + if (rs && rs->rs_cong_monitor) { + unsigned long flags; + + /* It would have been nice to have an atomic set_bit on + * a uint64_t. */ + spin_lock_irqsave(&rs->rs_lock, flags); + rs->rs_cong_mask |= RDS_CONG_MONITOR_MASK(ntohs(port)); + spin_unlock_irqrestore(&rs->rs_lock, flags); + + /* Test again - a congestion update may have arrived in + * the meantime. */ + if (!rds_cong_test_bit(map, port)) + return 0; + } + rds_stats_inc(s_cong_send_error); + return -ENOBUFS; + } + + rds_stats_inc(s_cong_send_blocked); + rdsdebug("waiting on map %p for port %u\n", map, be16_to_cpu(port)); + + return wait_event_interruptible(map->m_waitq, + !rds_cong_test_bit(map, port)); +} + +void rds_cong_exit(void) +{ + struct rb_node *node; + struct rds_cong_map *map; + unsigned long i; + + while ((node = rb_first(&rds_cong_tree))) { + map = rb_entry(node, struct rds_cong_map, m_rb_node); + rdsdebug("freeing map %p\n", map); + rb_erase(&map->m_rb_node, &rds_cong_tree); + for (i = 0; i < RDS_CONG_MAP_PAGES && map->m_page_addrs[i]; i++) + free_page(map->m_page_addrs[i]); + kfree(map); + } +} + +/* + * Allocate a RDS message containing a congestion update. 
+ */ +struct rds_message *rds_cong_update_alloc(struct rds_connection *conn) +{ + struct rds_cong_map *map = conn->c_lcong; + struct rds_message *rm; + + rm = rds_message_map_pages(map->m_page_addrs, RDS_CONG_MAP_BYTES); + if (!IS_ERR(rm)) + rm->m_inc.i_hdr.h_flags = RDS_FLAG_CONG_BITMAP; + + return rm; +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:21 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:21 -0800 Subject: [ofa-general] [PATCH 04/26] RDS: Transport code In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-5-git-send-email-andy.grover@oracle.com> RDS supports multiple transports. While this initial submission only supports Infiniband transport, this abstraction allows others to be added. We're working on an iWARP transport, and also see UDP over DCB as another possibility. This code handles transport registration. Signed-off-by: Andy Grover --- net/rds/transport.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 117 insertions(+), 0 deletions(-) create mode 100644 net/rds/transport.c diff --git a/net/rds/transport.c b/net/rds/transport.c new file mode 100644 index 0000000..767da61 --- /dev/null +++ b/net/rds/transport.c @@ -0,0 +1,117 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "rds.h" +#include "loop.h" + +static LIST_HEAD(rds_transports); +static DECLARE_RWSEM(rds_trans_sem); + +int rds_trans_register(struct rds_transport *trans) +{ + BUG_ON(strlen(trans->t_name) + 1 > TRANSNAMSIZ); + + down_write(&rds_trans_sem); + + list_add_tail(&trans->t_item, &rds_transports); + printk(KERN_INFO "Registered RDS/%s transport\n", trans->t_name); + + up_write(&rds_trans_sem); + + return 0; +} + +void rds_trans_unregister(struct rds_transport *trans) +{ + down_write(&rds_trans_sem); + + list_del_init(&trans->t_item); + printk(KERN_INFO "Unregistered RDS/%s transport\n", trans->t_name); + + up_write(&rds_trans_sem); +} + +struct rds_transport *rds_trans_get_preferred(__be32 addr) +{ + struct rds_transport *trans; + struct rds_transport *ret = NULL; + + if (IN_LOOPBACK(ntohl(addr))) + return &rds_loop_transport; + + down_read(&rds_trans_sem); + list_for_each_entry(trans, &rds_transports, t_item) { + if (trans->laddr_check(addr) == 0) { + ret = trans; + break; + } + } + up_read(&rds_trans_sem); + + return ret; +} + +/* + * This returns the number of stats entries in the snapshot and only + * copies them using the iter if there is enough space for them. The + * caller passes in the global stats so that we can size and copy while + * holding the lock. 
+ */ +unsigned int rds_trans_stats_info_copy(struct rds_info_iterator *iter, + unsigned int avail) + +{ + struct rds_transport *trans; + unsigned int total = 0; + unsigned int part; + + rds_info_iter_unmap(iter); + down_read(&rds_trans_sem); + + list_for_each_entry(trans, &rds_transports, t_item) { + if (trans->stats_info_copy == NULL) + continue; + + part = trans->stats_info_copy(iter, avail); + avail -= min(avail, part); + total += part; + } + + up_read(&rds_trans_sem); + + return total; +} + -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:22 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:22 -0800 Subject: [ofa-general] [PATCH 05/26] RDS: Info and stats In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-6-git-send-email-andy.grover@oracle.com> RDS currently generates a lot of stats that are accessible via the rds-info utility. This code implements the support for this. Signed-off-by: Andy Grover --- net/rds/info.c | 241 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ net/rds/info.h | 30 +++++++ net/rds/stats.c | 148 ++++++++++++++++++++++++++++++++++ 3 files changed, 419 insertions(+), 0 deletions(-) create mode 100644 net/rds/info.c create mode 100644 net/rds/info.h create mode 100644 net/rds/stats.c diff --git a/net/rds/info.c b/net/rds/info.c new file mode 100644 index 0000000..1d88553 --- /dev/null +++ b/net/rds/info.c @@ -0,0 +1,241 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include + +#include "rds.h" + +/* + * This file implements a getsockopt() call which copies a set of fixed + * sized structs into a user-specified buffer as a means of providing + * read-only information about RDS. + * + * For a given information source there are a given number of fixed sized + * structs at a given time. The structs are only copied if the user-specified + * buffer is big enough. The destination pages that make up the buffer + * are pinned for the duration of the copy. 
+ * + * This gives us the following benefits: + * + * - simple implementation, no copy "position" across multiple calls + * - consistent snapshot of an info source + * - atomic copy works well with whatever locking info source has + * - one portable tool to get rds info across implementations + * - long-lived tool can get info without allocating + * + * at the following costs: + * + * - info source copy must be pinned, may be "large" + */ + +struct rds_info_iterator { + struct page **pages; + void *addr; + unsigned long offset; +}; + +static DEFINE_SPINLOCK(rds_info_lock); +static rds_info_func rds_info_funcs[RDS_INFO_LAST - RDS_INFO_FIRST + 1]; + +void rds_info_register_func(int optname, rds_info_func func) +{ + int offset = optname - RDS_INFO_FIRST; + + BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST); + + spin_lock(&rds_info_lock); + BUG_ON(rds_info_funcs[offset] != NULL); + rds_info_funcs[offset] = func; + spin_unlock(&rds_info_lock); +} + +void rds_info_deregister_func(int optname, rds_info_func func) +{ + int offset = optname - RDS_INFO_FIRST; + + BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST); + + spin_lock(&rds_info_lock); + BUG_ON(rds_info_funcs[offset] != func); + rds_info_funcs[offset] = NULL; + spin_unlock(&rds_info_lock); +} + +/* + * Typically we hold an atomic kmap across multiple rds_info_copy() calls + * because the kmap is so expensive. This must be called before using blocking + * operations while holding the mapping and as the iterator is torn down. + */ +void rds_info_iter_unmap(struct rds_info_iterator *iter) +{ + if (iter->addr != NULL) { + kunmap_atomic(iter->addr, KM_USER0); + iter->addr = NULL; + } +} + +/* + * get_user_pages() called flush_dcache_page() on the pages for us. 
+ */ +void rds_info_copy(struct rds_info_iterator *iter, void *data, + unsigned long bytes) +{ + unsigned long this; + + while (bytes) { + if (iter->addr == NULL) + iter->addr = kmap_atomic(*iter->pages, KM_USER0); + + this = min(bytes, PAGE_SIZE - iter->offset); + + rdsdebug("page %p addr %p offset %lu this %lu data %p " + "bytes %lu\n", *iter->pages, iter->addr, + iter->offset, this, data, bytes); + + memcpy(iter->addr + iter->offset, data, this); + + data += this; + bytes -= this; + iter->offset += this; + + if (iter->offset == PAGE_SIZE) { + kunmap_atomic(iter->addr, KM_USER0); + iter->addr = NULL; + iter->offset = 0; + iter->pages++; + } + } +} + +/* + * @optval points to the userspace buffer that the information snapshot + * will be copied into. + * + * @optlen on input is the size of the buffer in userspace. @optlen + * on output is the size of the requested snapshot in bytes. + * + * This function returns -errno if there is a failure, particularly -ENOSPC + * if the given userspace buffer was not large enough to fit the snapshot. + * On success it returns the positive number of bytes of each array element + * in the snapshot. 
+ */ +int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval, + int __user *optlen) +{ + struct rds_info_iterator iter; + struct rds_info_lengths lens; + unsigned long nr_pages = 0; + unsigned long start; + unsigned long i; + rds_info_func func; + struct page **pages = NULL; + int ret; + int len; + int total; + + if (get_user(len, optlen)) { + ret = -EFAULT; + goto out; + } + + /* check for all kinds of wrapping and the like */ + start = (unsigned long)optval; + if (len < 0 || len + PAGE_SIZE - 1 < len || start + len < start) { + ret = -EINVAL; + goto out; + } + + /* a 0 len call is just trying to probe its length */ + if (len == 0) + goto call_func; + + nr_pages = (PAGE_ALIGN(start + len) - (start & PAGE_MASK)) + >> PAGE_SHIFT; + + pages = kmalloc(nr_pages * sizeof(struct page *), GFP_KERNEL); + if (pages == NULL) { + ret = -ENOMEM; + goto out; + } + down_read(&current->mm->mmap_sem); + ret = get_user_pages(current, current->mm, start, nr_pages, 1, 0, + pages, NULL); + up_read(&current->mm->mmap_sem); + if (ret != nr_pages) { + if (ret > 0) + nr_pages = ret; + else + nr_pages = 0; + ret = -EAGAIN; /* XXX ?
*/ + goto out; + } + + rdsdebug("len %d nr_pages %lu\n", len, nr_pages); + +call_func: + func = rds_info_funcs[optname - RDS_INFO_FIRST]; + if (func == NULL) { + ret = -ENOPROTOOPT; + goto out; + } + + iter.pages = pages; + iter.addr = NULL; + iter.offset = start & (PAGE_SIZE - 1); + + func(sock, len, &iter, &lens); + BUG_ON(lens.each == 0); + + total = lens.nr * lens.each; + + rds_info_iter_unmap(&iter); + + if (total > len) { + len = total; + ret = -ENOSPC; + } else { + len = total; + ret = lens.each; + } + + if (put_user(len, optlen)) + ret = -EFAULT; + +out: + for (i = 0; pages != NULL && i < nr_pages; i++) + put_page(pages[i]); + kfree(pages); + + return ret; +} diff --git a/net/rds/info.h b/net/rds/info.h new file mode 100644 index 0000000..b6c052c --- /dev/null +++ b/net/rds/info.h @@ -0,0 +1,30 @@ +#ifndef _RDS_INFO_H +#define _RDS_INFO_H + +struct rds_info_lengths { + unsigned int nr; + unsigned int each; +}; + +struct rds_info_iterator; + +/* + * These functions must fill in the fields of @lens to reflect the size + * of the available info source. If the snapshot fits in @len then it + * should be copied using @iter. The caller will deduce if it was copied + * or not by comparing the lengths. + */ +typedef void (*rds_info_func)(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens); + +void rds_info_register_func(int optname, rds_info_func func); +void rds_info_deregister_func(int optname, rds_info_func func); +int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval, + int __user *optlen); +void rds_info_copy(struct rds_info_iterator *iter, void *data, + unsigned long bytes); +void rds_info_iter_unmap(struct rds_info_iterator *iter); + + +#endif diff --git a/net/rds/stats.c b/net/rds/stats.c new file mode 100644 index 0000000..6371468 --- /dev/null +++ b/net/rds/stats.c @@ -0,0 +1,148 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include + +#include "rds.h" + +DEFINE_PER_CPU_SHARED_ALIGNED(struct rds_statistics, rds_stats); + +/* :.,$s/unsigned long\>.*\= sizeof(ctr.name)); + strncpy(ctr.name, names[i], sizeof(ctr.name) - 1); + ctr.value = values[i]; + + rds_info_copy(iter, &ctr, sizeof(ctr)); + } +} + +/* + * This gives global counters across all the transports. The strings + * are copied in so that the tool doesn't need knowledge of the specific + * stats that we're exporting. Some are pretty implementation dependent + * and may change over time. That doesn't stop them from being useful. 
+ * + * This is the only function in the chain that knows about the byte granular + * length in userspace. It converts it to number of stat entries that the + * rest of the functions operate in. + */ +static void rds_stats_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + struct rds_statistics stats = {0, }; + uint64_t *src; + uint64_t *sum; + size_t i; + int cpu; + unsigned int avail; + + avail = len / sizeof(struct rds_info_counter); + + if (avail < ARRAY_SIZE(rds_stat_names)) { + avail = 0; + goto trans; + } + + for_each_online_cpu(cpu) { + src = (uint64_t *)&(per_cpu(rds_stats, cpu)); + sum = (uint64_t *)&stats; + for (i = 0; i < sizeof(stats) / sizeof(uint64_t); i++) + *(sum++) += *(src++); + } + + rds_stats_info_copy(iter, (uint64_t *)&stats, rds_stat_names, + ARRAY_SIZE(rds_stat_names)); + avail -= ARRAY_SIZE(rds_stat_names); + +trans: + lens->each = sizeof(struct rds_info_counter); + lens->nr = rds_trans_stats_info_copy(iter, avail) + + ARRAY_SIZE(rds_stat_names); +} + +void rds_stats_exit(void) +{ + rds_info_deregister_func(RDS_INFO_COUNTERS, rds_stats_info); +} + +int __init rds_stats_init(void) +{ + rds_info_register_func(RDS_INFO_COUNTERS, rds_stats_info); + return 0; +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:23 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:23 -0800 Subject: [ofa-general] [PATCH 06/26] RDS: Connection handling In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-7-git-send-email-andy.grover@oracle.com> While arguably the fact that the underlying transport needs a connection to convey RDS's datagrams reliably is not important to RDS proper, the transports implemented so far (IB and TCP) have both been connection-oriented, and so the connection state machine-related code is in the common RDS code.
This patch also includes several work items, to handle connecting, sending, receiving, and shutdown. Signed-off-by: Andy Grover --- net/rds/connection.c | 487 ++++++++++++++++++++++++++++++++++++++++++++++++++ net/rds/threads.c | 265 +++++++++++++++++++++++++++ 2 files changed, 752 insertions(+), 0 deletions(-) create mode 100644 net/rds/connection.c create mode 100644 net/rds/threads.c diff --git a/net/rds/connection.c b/net/rds/connection.c new file mode 100644 index 0000000..273f064 --- /dev/null +++ b/net/rds/connection.c @@ -0,0 +1,487 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "rds.h" +#include "loop.h" +#include "rdma.h" + +#define RDS_CONNECTION_HASH_BITS 12 +#define RDS_CONNECTION_HASH_ENTRIES (1 << RDS_CONNECTION_HASH_BITS) +#define RDS_CONNECTION_HASH_MASK (RDS_CONNECTION_HASH_ENTRIES - 1) + +/* converting this to RCU is a chore for another day.. */ +static DEFINE_SPINLOCK(rds_conn_lock); +static unsigned long rds_conn_count; +static struct hlist_head rds_conn_hash[RDS_CONNECTION_HASH_ENTRIES]; +static struct kmem_cache *rds_conn_slab; + +static struct hlist_head *rds_conn_bucket(__be32 laddr, __be32 faddr) +{ + /* Pass NULL, don't need struct net for hash */ + unsigned long hash = inet_ehashfn(NULL, + be32_to_cpu(laddr), 0, + be32_to_cpu(faddr), 0); + return &rds_conn_hash[hash & RDS_CONNECTION_HASH_MASK]; +} + +#define rds_conn_info_set(var, test, suffix) do { \ + if (test) \ + var |= RDS_INFO_CONNECTION_FLAG_##suffix; \ +} while (0) + +static inline int rds_conn_is_sending(struct rds_connection *conn) +{ + int ret = 0; + + if (!mutex_trylock(&conn->c_send_lock)) + ret = 1; + else + mutex_unlock(&conn->c_send_lock); + + return ret; +} + +static struct rds_connection *rds_conn_lookup(struct hlist_head *head, + __be32 laddr, __be32 faddr, + struct rds_transport *trans) +{ + struct rds_connection *conn, *ret = NULL; + struct hlist_node *pos; + + hlist_for_each_entry(conn, pos, head, c_hash_node) { + if (conn->c_faddr == faddr && conn->c_laddr == laddr && + conn->c_trans == trans) { + ret = conn; + break; + } + } + rdsdebug("returning conn %p for %pI4 -> %pI4\n", ret, + &laddr, &faddr); + return ret; +} + +/* + * This is called by transports as they're bringing down a connection. + * It clears partial message state so that the transport can start sending + * and receiving over this connection again in the future. It is up to + * the transport to have serialized this call with its send and recv. 
+ */ +void rds_conn_reset(struct rds_connection *conn) +{ + rdsdebug("connection %pI4 to %pI4 reset\n", + &conn->c_laddr, &conn->c_faddr); + + rds_stats_inc(s_conn_reset); + rds_send_reset(conn); + conn->c_flags = 0; + + /* Do not clear next_rx_seq here, else we cannot distinguish + * retransmitted packets from new packets, and will hand all + * of them to the application. That is not consistent with the + * reliability guarantees of RDS. */ +} + +/* + * There is only ever one 'conn' for a given pair of addresses in the + * system at a time. They contain messages to be retransmitted and so + * span the lifetime of the actual underlying transport connections. + * + * For now they are not garbage collected once they're created. They + * are torn down as the module is removed, if ever. + */ +static struct rds_connection *__rds_conn_create(__be32 laddr, __be32 faddr, + struct rds_transport *trans, gfp_t gfp, + int is_outgoing) +{ + struct rds_connection *conn, *tmp, *parent = NULL; + struct hlist_head *head = rds_conn_bucket(laddr, faddr); + unsigned long flags; + int ret; + + spin_lock_irqsave(&rds_conn_lock, flags); + conn = rds_conn_lookup(head, laddr, faddr, trans); + if (conn + && conn->c_loopback + && conn->c_trans != &rds_loop_transport + && !is_outgoing) { + /* This is a looped back IB connection, and we're + * called by the code handling the incoming connect. + * We need a second connection object into which we + * can stick the other QP.
*/ + parent = conn; + conn = parent->c_passive; + } + spin_unlock_irqrestore(&rds_conn_lock, flags); + if (conn) + goto out; + + conn = kmem_cache_alloc(rds_conn_slab, gfp); + if (conn == NULL) { + conn = ERR_PTR(-ENOMEM); + goto out; + } + + memset(conn, 0, sizeof(*conn)); + + INIT_HLIST_NODE(&conn->c_hash_node); + conn->c_version = RDS_PROTOCOL_3_0; + conn->c_laddr = laddr; + conn->c_faddr = faddr; + spin_lock_init(&conn->c_lock); + conn->c_next_tx_seq = 1; + + mutex_init(&conn->c_send_lock); + INIT_LIST_HEAD(&conn->c_send_queue); + INIT_LIST_HEAD(&conn->c_retrans); + + ret = rds_cong_get_maps(conn); + if (ret) { + kmem_cache_free(rds_conn_slab, conn); + conn = ERR_PTR(ret); + goto out; + } + + /* + * This is where a connection becomes loopback. If *any* RDS sockets + * can bind to the destination address then we'd rather the messages + * flow through loopback rather than either transport. + */ + if (rds_trans_get_preferred(faddr)) { + conn->c_loopback = 1; + if (is_outgoing && trans->t_prefer_loopback) { + /* "outgoing" connection - and the transport + * says it wants the connection handled by the + * loopback transport. This is what TCP does. + */ + trans = &rds_loop_transport; + } + } + + conn->c_trans = trans; + + ret = trans->conn_alloc(conn, gfp); + if (ret) { + kmem_cache_free(rds_conn_slab, conn); + conn = ERR_PTR(ret); + goto out; + } + + atomic_set(&conn->c_state, RDS_CONN_DOWN); + conn->c_reconnect_jiffies = 0; + INIT_DELAYED_WORK(&conn->c_send_w, rds_send_worker); + INIT_DELAYED_WORK(&conn->c_recv_w, rds_recv_worker); + INIT_DELAYED_WORK(&conn->c_conn_w, rds_connect_worker); + INIT_WORK(&conn->c_down_w, rds_shutdown_worker); + mutex_init(&conn->c_cm_lock); + conn->c_flags = 0; + + rdsdebug("allocated conn %p for %pI4 -> %pI4 over %s %s\n", + conn, &laddr, &faddr, + trans->t_name ? trans->t_name : "[unknown]", + is_outgoing ? 
"(outgoing)" : ""); + + spin_lock_irqsave(&rds_conn_lock, flags); + if (parent == NULL) { + tmp = rds_conn_lookup(head, laddr, faddr, trans); + if (tmp == NULL) + hlist_add_head(&conn->c_hash_node, head); + } else { + tmp = parent->c_passive; + if (!tmp) + parent->c_passive = conn; + } + + if (tmp) { + trans->conn_free(conn->c_transport_data); + kmem_cache_free(rds_conn_slab, conn); + conn = tmp; + } else { + rds_cong_add_conn(conn); + rds_conn_count++; + } + + spin_unlock_irqrestore(&rds_conn_lock, flags); + +out: + return conn; +} + +struct rds_connection *rds_conn_create(__be32 laddr, __be32 faddr, + struct rds_transport *trans, gfp_t gfp) +{ + return __rds_conn_create(laddr, faddr, trans, gfp, 0); +} + +struct rds_connection *rds_conn_create_outgoing(__be32 laddr, __be32 faddr, + struct rds_transport *trans, gfp_t gfp) +{ + return __rds_conn_create(laddr, faddr, trans, gfp, 1); +} + +void rds_conn_destroy(struct rds_connection *conn) +{ + struct rds_message *rm, *rtmp; + + rdsdebug("freeing conn %p for %pI4 -> " + "%pI4\n", conn, &conn->c_laddr, + &conn->c_faddr); + + hlist_del_init(&conn->c_hash_node); + + /* wait for the rds thread to shut it down */ + atomic_set(&conn->c_state, RDS_CONN_ERROR); + cancel_delayed_work(&conn->c_conn_w); + queue_work(rds_wq, &conn->c_down_w); + flush_workqueue(rds_wq); + + /* tear down queued messages */ + list_for_each_entry_safe(rm, rtmp, + &conn->c_send_queue, + m_conn_item) { + list_del_init(&rm->m_conn_item); + BUG_ON(!list_empty(&rm->m_sock_item)); + rds_message_put(rm); + } + if (conn->c_xmit_rm) + rds_message_put(conn->c_xmit_rm); + + conn->c_trans->conn_free(conn->c_transport_data); + + /* + * The congestion maps aren't freed up here. They're + * freed by rds_cong_exit() after all the connections + * have been freed. 
+ */ + rds_cong_remove_conn(conn); + + BUG_ON(!list_empty(&conn->c_retrans)); + kmem_cache_free(rds_conn_slab, conn); + + rds_conn_count--; +} + +static void rds_conn_message_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens, + int want_send) +{ + struct hlist_head *head; + struct hlist_node *pos; + struct list_head *list; + struct rds_connection *conn; + struct rds_message *rm; + unsigned long flags; + unsigned int total = 0; + size_t i; + + len /= sizeof(struct rds_info_message); + + spin_lock_irqsave(&rds_conn_lock, flags); + + for (i = 0, head = rds_conn_hash; i < ARRAY_SIZE(rds_conn_hash); + i++, head++) { + hlist_for_each_entry(conn, pos, head, c_hash_node) { + if (want_send) + list = &conn->c_send_queue; + else + list = &conn->c_retrans; + + spin_lock(&conn->c_lock); + + /* XXX too lazy to maintain counts.. */ + list_for_each_entry(rm, list, m_conn_item) { + total++; + if (total <= len) + rds_inc_info_copy(&rm->m_inc, iter, + conn->c_laddr, + conn->c_faddr, 0); + } + + spin_unlock(&conn->c_lock); + } + } + + spin_unlock_irqrestore(&rds_conn_lock, flags); + + lens->nr = total; + lens->each = sizeof(struct rds_info_message); +} + +static void rds_conn_message_info_send(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + rds_conn_message_info(sock, len, iter, lens, 1); +} + +static void rds_conn_message_info_retrans(struct socket *sock, + unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + rds_conn_message_info(sock, len, iter, lens, 0); +} + +void rds_for_each_conn_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens, + int (*visitor)(struct rds_connection *, void *), + size_t item_len) +{ + uint64_t buffer[(item_len + 7) / 8]; + struct hlist_head *head; + struct hlist_node *pos; + struct hlist_node *tmp; + struct rds_connection *conn; + unsigned 
long flags; + size_t i; + + spin_lock_irqsave(&rds_conn_lock, flags); + + lens->nr = 0; + lens->each = item_len; + + for (i = 0, head = rds_conn_hash; i < ARRAY_SIZE(rds_conn_hash); + i++, head++) { + hlist_for_each_entry_safe(conn, pos, tmp, head, c_hash_node) { + + /* XXX no c_lock usage.. */ + if (!visitor(conn, buffer)) + continue; + + /* We copy as much as we can fit in the buffer, + * but we count all items so that the caller + * can resize the buffer. */ + if (len >= item_len) { + rds_info_copy(iter, buffer, item_len); + len -= item_len; + } + lens->nr++; + } + } + + spin_unlock_irqrestore(&rds_conn_lock, flags); +} + +static int rds_conn_info_visitor(struct rds_connection *conn, + void *buffer) +{ + struct rds_info_connection *cinfo = buffer; + + cinfo->next_tx_seq = conn->c_next_tx_seq; + cinfo->next_rx_seq = conn->c_next_rx_seq; + cinfo->laddr = conn->c_laddr; + cinfo->faddr = conn->c_faddr; + strncpy(cinfo->transport, conn->c_trans->t_name, + sizeof(cinfo->transport)); + cinfo->flags = 0; + + rds_conn_info_set(cinfo->flags, + rds_conn_is_sending(conn), SENDING); + /* XXX Future: return the state rather than these funky bits */ + rds_conn_info_set(cinfo->flags, + atomic_read(&conn->c_state) == RDS_CONN_CONNECTING, + CONNECTING); + rds_conn_info_set(cinfo->flags, + atomic_read(&conn->c_state) == RDS_CONN_UP, + CONNECTED); + return 1; +} + +static void rds_conn_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + rds_for_each_conn_info(sock, len, iter, lens, + rds_conn_info_visitor, + sizeof(struct rds_info_connection)); +} + +int __init rds_conn_init(void) +{ + rds_conn_slab = kmem_cache_create("rds_connection", + sizeof(struct rds_connection), + 0, 0, NULL); + if (rds_conn_slab == NULL) + return -ENOMEM; + + rds_info_register_func(RDS_INFO_CONNECTIONS, rds_conn_info); + rds_info_register_func(RDS_INFO_SEND_MESSAGES, + rds_conn_message_info_send); + 
rds_info_register_func(RDS_INFO_RETRANS_MESSAGES, + rds_conn_message_info_retrans); + + return 0; +} + +void rds_conn_exit(void) +{ + rds_loop_exit(); + + WARN_ON(!hlist_empty(rds_conn_hash)); + + kmem_cache_destroy(rds_conn_slab); + + rds_info_deregister_func(RDS_INFO_CONNECTIONS, rds_conn_info); + rds_info_deregister_func(RDS_INFO_SEND_MESSAGES, + rds_conn_message_info_send); + rds_info_deregister_func(RDS_INFO_RETRANS_MESSAGES, + rds_conn_message_info_retrans); +} + +/* + * Force a disconnect + */ +void rds_conn_drop(struct rds_connection *conn) +{ + atomic_set(&conn->c_state, RDS_CONN_ERROR); + queue_work(rds_wq, &conn->c_down_w); +} + +/* + * An error occurred on the connection + */ +void +__rds_conn_error(struct rds_connection *conn, const char *fmt, ...) +{ + va_list ap; + + va_start(ap, fmt); + vprintk(fmt, ap); + va_end(ap); + + rds_conn_drop(conn); +} diff --git a/net/rds/threads.c b/net/rds/threads.c new file mode 100644 index 0000000..828a1bf --- /dev/null +++ b/net/rds/threads.c @@ -0,0 +1,265 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include + +#include "rds.h" + +/* + * All of connection management is simplified by serializing it through + * work queues that execute in a connection managing thread. + * + * TCP wants to send acks through sendpage() in response to data_ready(), + * but it needs a process context to do so. + * + * The receive paths need to allocate but can't drop packets (!) so we have + * a thread around to block allocating if the receive fast path sees an + * allocation failure. + */ + +/* Grand Unified Theory of connection life cycle: + * At any point in time, the connection can be in one of these states: + * DOWN, CONNECTING, UP, DISCONNECTING, ERROR + * + * The following transitions are possible: + * ANY -> ERROR + * UP -> DISCONNECTING + * ERROR -> DISCONNECTING + * DISCONNECTING -> DOWN + * DOWN -> CONNECTING + * CONNECTING -> UP + * + * Transition to state DISCONNECTING/DOWN: + * - Inside the shutdown worker; synchronizes with xmit path + * through c_send_lock, and with connection management callbacks + * via c_cm_lock. + * + * For receive callbacks, we rely on the underlying transport + * (TCP, IB/RDMA) to provide the necessary synchronisation. 
+ */ +struct workqueue_struct *rds_wq; + +void rds_connect_complete(struct rds_connection *conn) +{ + if (!rds_conn_transition(conn, RDS_CONN_CONNECTING, RDS_CONN_UP)) { + printk(KERN_WARNING "%s: Cannot transition to state UP, " + "current state is %d\n", + __func__, + atomic_read(&conn->c_state)); + atomic_set(&conn->c_state, RDS_CONN_ERROR); + queue_work(rds_wq, &conn->c_down_w); + return; + } + + rdsdebug("conn %p for %pI4 to %pI4 complete\n", + conn, &conn->c_laddr, &conn->c_faddr); + + conn->c_reconnect_jiffies = 0; + set_bit(0, &conn->c_map_queued); + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + queue_delayed_work(rds_wq, &conn->c_recv_w, 0); +} + +/* + * This random exponential backoff is relied on to eventually resolve racing + * connects. + * + * If connect attempts race then both parties drop both connections and come + * here to wait for a random amount of time before trying again. Eventually + * the backoff range will be so much greater than the time it takes to + * establish a connection that one of the pair will establish the connection + * before the other's random delay fires. + * + * Connection attempts that arrive while a connection is already established + * are also considered to be racing connects. This lets a connection from + * a rebooted machine replace an existing stale connection before the transport + * notices that the connection has failed. + * + * We should *always* start with a random backoff; otherwise a broken connection + * will always take several iterations to be re-established. 
+ */ +static void rds_queue_reconnect(struct rds_connection *conn) +{ + unsigned long rand; + + rdsdebug("conn %p for %pI4 to %pI4 reconnect jiffies %lu\n", + conn, &conn->c_laddr, &conn->c_faddr, + conn->c_reconnect_jiffies); + + set_bit(RDS_RECONNECT_PENDING, &conn->c_flags); + if (conn->c_reconnect_jiffies == 0) { + conn->c_reconnect_jiffies = rds_sysctl_reconnect_min_jiffies; + queue_delayed_work(rds_wq, &conn->c_conn_w, 0); + return; + } + + get_random_bytes(&rand, sizeof(rand)); + rdsdebug("%lu delay %lu ceil conn %p for %pI4 -> %pI4\n", + rand % conn->c_reconnect_jiffies, conn->c_reconnect_jiffies, + conn, &conn->c_laddr, &conn->c_faddr); + queue_delayed_work(rds_wq, &conn->c_conn_w, + rand % conn->c_reconnect_jiffies); + + conn->c_reconnect_jiffies = min(conn->c_reconnect_jiffies * 2, + rds_sysctl_reconnect_max_jiffies); +} + +void rds_connect_worker(struct work_struct *work) +{ + struct rds_connection *conn = container_of(work, struct rds_connection, c_conn_w.work); + int ret; + + clear_bit(RDS_RECONNECT_PENDING, &conn->c_flags); + if (rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) { + ret = conn->c_trans->conn_connect(conn); + rdsdebug("conn %p for %pI4 to %pI4 dispatched, ret %d\n", + conn, &conn->c_laddr, &conn->c_faddr, ret); + + if (ret) { + if (rds_conn_transition(conn, RDS_CONN_CONNECTING, RDS_CONN_DOWN)) + rds_queue_reconnect(conn); + else + rds_conn_error(conn, "RDS: connect failed\n"); + } + } +} + +void rds_shutdown_worker(struct work_struct *work) +{ + struct rds_connection *conn = container_of(work, struct rds_connection, c_down_w); + + /* shut it down unless it's down already */ + if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_DOWN)) { + /* + * Quiesce the connection mgmt handlers before we start tearing + * things down. We don't hold the mutex for the entire + * duration of the shutdown operation, else we may be + * deadlocking with the CM handler. 
Instead, the CM event + * handler is supposed to check for state DISCONNECTING + */ + mutex_lock(&conn->c_cm_lock); + if (!rds_conn_transition(conn, RDS_CONN_UP, RDS_CONN_DISCONNECTING) + && !rds_conn_transition(conn, RDS_CONN_ERROR, RDS_CONN_DISCONNECTING)) { + rds_conn_error(conn, "shutdown called in state %d\n", + atomic_read(&conn->c_state)); + mutex_unlock(&conn->c_cm_lock); + return; + } + mutex_unlock(&conn->c_cm_lock); + + mutex_lock(&conn->c_send_lock); + conn->c_trans->conn_shutdown(conn); + rds_conn_reset(conn); + mutex_unlock(&conn->c_send_lock); + + if (!rds_conn_transition(conn, RDS_CONN_DISCONNECTING, RDS_CONN_DOWN)) { + /* This can happen - e.g. when we're in the middle of tearing + * down the connection, and someone unloads the rds module. + * Quite reproducible with loopback connections. + * Mostly harmless. + */ + rds_conn_error(conn, + "%s: failed to transition to state DOWN, " + "current state is %d\n", + __func__, + atomic_read(&conn->c_state)); + return; + } + } + + /* Then reconnect if it's still live. + * The passive side of an IB loopback connection is never added + * to the conn hash, so we never trigger a reconnect on this + * conn - the reconnect is always triggered by the active peer.
*/ + cancel_delayed_work(&conn->c_conn_w); + if (!hlist_unhashed(&conn->c_hash_node)) + rds_queue_reconnect(conn); +} + +void rds_send_worker(struct work_struct *work) +{ + struct rds_connection *conn = container_of(work, struct rds_connection, c_send_w.work); + int ret; + + if (rds_conn_state(conn) == RDS_CONN_UP) { + ret = rds_send_xmit(conn); + rdsdebug("conn %p ret %d\n", conn, ret); + switch (ret) { + case -EAGAIN: + rds_stats_inc(s_send_immediate_retry); + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + break; + case -ENOMEM: + rds_stats_inc(s_send_delayed_retry); + queue_delayed_work(rds_wq, &conn->c_send_w, 2); + default: + break; + } + } +} + +void rds_recv_worker(struct work_struct *work) +{ + struct rds_connection *conn = container_of(work, struct rds_connection, c_recv_w.work); + int ret; + + if (rds_conn_state(conn) == RDS_CONN_UP) { + ret = conn->c_trans->recv(conn); + rdsdebug("conn %p ret %d\n", conn, ret); + switch (ret) { + case -EAGAIN: + rds_stats_inc(s_recv_immediate_retry); + queue_delayed_work(rds_wq, &conn->c_recv_w, 0); + break; + case -ENOMEM: + rds_stats_inc(s_recv_delayed_retry); + queue_delayed_work(rds_wq, &conn->c_recv_w, 2); + default: + break; + } + } +} + +void rds_threads_exit(void) +{ + destroy_workqueue(rds_wq); +} + +int __init rds_threads_init(void) +{ + rds_wq = create_singlethread_workqueue("krdsd"); + if (rds_wq == NULL) + return -ENOMEM; + + return 0; +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:24 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:24 -0800 Subject: [ofa-general] [PATCH 07/26] RDS: loopback In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-8-git-send-email-andy.grover@oracle.com> A simple rds transport to handle loopback connections. 
Signed-off-by: Andy Grover --- net/rds/loop.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ net/rds/loop.h | 9 +++ 2 files changed, 197 insertions(+), 0 deletions(-) create mode 100644 net/rds/loop.c create mode 100644 net/rds/loop.h diff --git a/net/rds/loop.c b/net/rds/loop.c new file mode 100644 index 0000000..4a61997 --- /dev/null +++ b/net/rds/loop.c @@ -0,0 +1,188 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include + +#include "rds.h" +#include "loop.h" + +static DEFINE_SPINLOCK(loop_conns_lock); +static LIST_HEAD(loop_conns); + +/* + * This 'loopback' transport is a special case for flows that originate + * and terminate on the same machine. + * + * Connection build-up notices if the destination address is thought of + * as a local address by a transport. At that time it decides to use the + * loopback transport instead of the bound transport of the sending socket. + * + * The loopback transport's sending path just hands the sent rds_message + * straight to the receiving path via an embedded rds_incoming. + */ + +/* + * Usually a message transits both the sender and receiver's conns as it + * flows to the receiver. In the loopback case, though, the receive path + * is handed the sending conn so the sense of the addresses is reversed. + */ +static int rds_loop_xmit(struct rds_connection *conn, struct rds_message *rm, + unsigned int hdr_off, unsigned int sg, + unsigned int off) +{ + BUG_ON(hdr_off || sg || off); + + rds_inc_init(&rm->m_inc, conn, conn->c_laddr); + rds_message_addref(rm); /* for the inc */ + + rds_recv_incoming(conn, conn->c_laddr, conn->c_faddr, &rm->m_inc, + GFP_KERNEL, KM_USER0); + + rds_send_drop_acked(conn, be64_to_cpu(rm->m_inc.i_hdr.h_sequence), + NULL); + + rds_inc_put(&rm->m_inc); + + return sizeof(struct rds_header) + be32_to_cpu(rm->m_inc.i_hdr.h_len); +} + +static int rds_loop_xmit_cong_map(struct rds_connection *conn, + struct rds_cong_map *map, + unsigned long offset) +{ + unsigned long i; + + BUG_ON(offset); + BUG_ON(map != conn->c_lcong); + + for (i = 0; i < RDS_CONG_MAP_PAGES; i++) { + memcpy((void *)conn->c_fcong->m_page_addrs[i], + (void *)map->m_page_addrs[i], PAGE_SIZE); + } + + rds_cong_map_updated(conn->c_fcong, ~(u64) 0); + + return sizeof(struct rds_header) + RDS_CONG_MAP_BYTES; +} + +/* we need to at least give the thread something to succeed */ +static int rds_loop_recv(struct rds_connection *conn) +{ + 
return 0; +} + +struct rds_loop_connection { + struct list_head loop_node; + struct rds_connection *conn; +}; + +/* + * Even the loopback transport needs to keep track of its connections, + * so it can call rds_conn_destroy() on them on exit. N.B. there are + * 1+ loopback addresses (127.*.*.*) so it's not a bug to have + * multiple loopback conns allocated, although rather useless. + */ +static int rds_loop_conn_alloc(struct rds_connection *conn, gfp_t gfp) +{ + struct rds_loop_connection *lc; + unsigned long flags; + + lc = kzalloc(sizeof(struct rds_loop_connection), GFP_KERNEL); + if (lc == NULL) + return -ENOMEM; + + INIT_LIST_HEAD(&lc->loop_node); + lc->conn = conn; + conn->c_transport_data = lc; + + spin_lock_irqsave(&loop_conns_lock, flags); + list_add_tail(&lc->loop_node, &loop_conns); + spin_unlock_irqrestore(&loop_conns_lock, flags); + + return 0; +} + +static void rds_loop_conn_free(void *arg) +{ + struct rds_loop_connection *lc = arg; + rdsdebug("lc %p\n", lc); + list_del(&lc->loop_node); + kfree(lc); +} + +static int rds_loop_conn_connect(struct rds_connection *conn) +{ + rds_connect_complete(conn); + return 0; +} + +static void rds_loop_conn_shutdown(struct rds_connection *conn) +{ +} + +void rds_loop_exit(void) +{ + struct rds_loop_connection *lc, *_lc; + LIST_HEAD(tmp_list); + + /* avoid calling conn_destroy with irqs off */ + spin_lock_irq(&loop_conns_lock); + list_splice(&loop_conns, &tmp_list); + INIT_LIST_HEAD(&loop_conns); + spin_unlock_irq(&loop_conns_lock); + + list_for_each_entry_safe(lc, _lc, &tmp_list, loop_node) { + WARN_ON(lc->conn->c_passive); + rds_conn_destroy(lc->conn); + } +} + +/* + * This is missing .xmit_* because loop doesn't go through generic + * rds_send_xmit() and doesn't call rds_recv_incoming(). .listen_stop and + * .laddr_check are missing because transport.c doesn't iterate over + * rds_loop_transport. 
+ */ +struct rds_transport rds_loop_transport = { + .xmit = rds_loop_xmit, + .xmit_cong_map = rds_loop_xmit_cong_map, + .recv = rds_loop_recv, + .conn_alloc = rds_loop_conn_alloc, + .conn_free = rds_loop_conn_free, + .conn_connect = rds_loop_conn_connect, + .conn_shutdown = rds_loop_conn_shutdown, + .inc_copy_to_user = rds_message_inc_copy_to_user, + .inc_purge = rds_message_inc_purge, + .inc_free = rds_message_inc_free, + .t_name = "loopback", +}; diff --git a/net/rds/loop.h b/net/rds/loop.h new file mode 100644 index 0000000..f32b093 --- /dev/null +++ b/net/rds/loop.h @@ -0,0 +1,9 @@ +#ifndef _RDS_LOOP_H +#define _RDS_LOOP_H + +/* loop.c */ +extern struct rds_transport rds_loop_transport; + +void rds_loop_exit(void); + +#endif -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:25 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:25 -0800 Subject: [ofa-general] [PATCH 08/26] RDS: sysctls In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-9-git-send-email-andy.grover@oracle.com> RDS exposes a few tunable parameters via sysctls. Signed-off-by: Andy Grover --- net/rds/sysctl.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 122 insertions(+), 0 deletions(-) create mode 100644 net/rds/sysctl.c diff --git a/net/rds/sysctl.c b/net/rds/sysctl.c new file mode 100644 index 0000000..307dc5c --- /dev/null +++ b/net/rds/sysctl.c @@ -0,0 +1,122 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "rds.h" + +static struct ctl_table_header *rds_sysctl_reg_table; + +static unsigned long rds_sysctl_reconnect_min = 1; +static unsigned long rds_sysctl_reconnect_max = ~0UL; + +unsigned long rds_sysctl_reconnect_min_jiffies; +unsigned long rds_sysctl_reconnect_max_jiffies = HZ; + +unsigned int rds_sysctl_max_unacked_packets = 8; +unsigned int rds_sysctl_max_unacked_bytes = (16 << 20); + +unsigned int rds_sysctl_ping_enable = 1; + +static ctl_table rds_sysctl_rds_table[] = { + { + .ctl_name = CTL_UNNUMBERED, + .procname = "reconnect_min_delay_ms", + .data = &rds_sysctl_reconnect_min_jiffies, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, + .extra1 = &rds_sysctl_reconnect_min, + .extra2 = &rds_sysctl_reconnect_max_jiffies, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "reconnect_max_delay_ms", + .data = &rds_sysctl_reconnect_max_jiffies, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, + .extra1 = &rds_sysctl_reconnect_min_jiffies, + .extra2 = &rds_sysctl_reconnect_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_unacked_packets", + .data = &rds_sysctl_max_unacked_packets, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_unacked_bytes", + .data = &rds_sysctl_max_unacked_bytes, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "ping_enable", + .data = &rds_sysctl_ping_enable, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { .ctl_name = 0} +}; + +static struct ctl_path rds_sysctl_path[] = { + { .procname = "net", .ctl_name = CTL_NET, }, + { .procname = "rds", .ctl_name = CTL_UNNUMBERED, }, + { } +}; + + +void rds_sysctl_exit(void) +{ + if (rds_sysctl_reg_table) + 
unregister_sysctl_table(rds_sysctl_reg_table); +} + +int __init rds_sysctl_init(void) +{ + rds_sysctl_reconnect_min = msecs_to_jiffies(1); + rds_sysctl_reconnect_min_jiffies = rds_sysctl_reconnect_min; + + rds_sysctl_reg_table = register_sysctl_paths(rds_sysctl_path, rds_sysctl_rds_table); + if (rds_sysctl_reg_table == NULL) + return -ENOMEM; + return 0; +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:26 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:26 -0800 Subject: [ofa-general] [PATCH 09/26] RDS: Message parsing In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-10-git-send-email-andy.grover@oracle.com> Parsing of newly-received RDS message headers (including ext. headers) and copy-to/from-user routines. page.c implements a per-cpu page remainder cache, to reduce the number of allocations needed for small datagrams. Signed-off-by: Andy Grover --- net/rds/message.c | 402 +++++++++++++++++++++++++++++++++++++++++++++++++++++ net/rds/page.c | 221 +++++++++++++++++++++++++++++ 2 files changed, 623 insertions(+), 0 deletions(-) create mode 100644 net/rds/message.c create mode 100644 net/rds/page.c diff --git a/net/rds/message.c b/net/rds/message.c new file mode 100644 index 0000000..5a15dc8 --- /dev/null +++ b/net/rds/message.c @@ -0,0 +1,402 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include + +#include "rds.h" +#include "rdma.h" + +static DECLARE_WAIT_QUEUE_HEAD(rds_message_flush_waitq); + +static unsigned int rds_exthdr_size[__RDS_EXTHDR_MAX] = { +[RDS_EXTHDR_NONE] = 0, +[RDS_EXTHDR_VERSION] = sizeof(struct rds_ext_header_version), +[RDS_EXTHDR_RDMA] = sizeof(struct rds_ext_header_rdma), +[RDS_EXTHDR_RDMA_DEST] = sizeof(struct rds_ext_header_rdma_dest), +}; + + +void rds_message_addref(struct rds_message *rm) +{ + rdsdebug("addref rm %p ref %d\n", rm, atomic_read(&rm->m_refcount)); + atomic_inc(&rm->m_refcount); +} + +/* + * This relies on dma_map_sg() not touching sg[].page during merging. 
+ */ +static void rds_message_purge(struct rds_message *rm) +{ + unsigned long i; + + if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags))) + return; + + for (i = 0; i < rm->m_nents; i++) { + rdsdebug("putting data page %p\n", (void *)sg_page(&rm->m_sg[i])); + /* XXX will have to put_page for page refs */ + __free_page(sg_page(&rm->m_sg[i])); + } + rm->m_nents = 0; + + if (rm->m_rdma_op) + rds_rdma_free_op(rm->m_rdma_op); + if (rm->m_rdma_mr) + rds_mr_put(rm->m_rdma_mr); +} + +void rds_message_inc_purge(struct rds_incoming *inc) +{ + struct rds_message *rm = container_of(inc, struct rds_message, m_inc); + rds_message_purge(rm); +} + +void rds_message_put(struct rds_message *rm) +{ + rdsdebug("put rm %p ref %d\n", rm, atomic_read(&rm->m_refcount)); + + if (atomic_dec_and_test(&rm->m_refcount)) { + BUG_ON(!list_empty(&rm->m_sock_item)); + BUG_ON(!list_empty(&rm->m_conn_item)); + rds_message_purge(rm); + + kfree(rm); + } +} + +void rds_message_inc_free(struct rds_incoming *inc) +{ + struct rds_message *rm = container_of(inc, struct rds_message, m_inc); + rds_message_put(rm); +} + +void rds_message_populate_header(struct rds_header *hdr, __be16 sport, + __be16 dport, u64 seq) +{ + hdr->h_flags = 0; + hdr->h_sport = sport; + hdr->h_dport = dport; + hdr->h_sequence = cpu_to_be64(seq); + hdr->h_exthdr[0] = RDS_EXTHDR_NONE; +} + +int rds_message_add_extension(struct rds_header *hdr, + unsigned int type, const void *data, unsigned int len) +{ + unsigned int ext_len = sizeof(u8) + len; + unsigned char *dst; + + /* For now, refuse to add more than one extension header */ + if (hdr->h_exthdr[0] != RDS_EXTHDR_NONE) + return 0; + + if (type >= __RDS_EXTHDR_MAX + || len != rds_exthdr_size[type]) + return 0; + + if (ext_len >= RDS_HEADER_EXT_SPACE) + return 0; + dst = hdr->h_exthdr; + + *dst++ = type; + memcpy(dst, data, len); + + dst[len] = RDS_EXTHDR_NONE; + return 1; +} + +/* + * If a message has extension headers, retrieve them here. 
+ * Call like this: + * + * unsigned int pos = 0; + * + * while (1) { + * buflen = sizeof(buffer); + * type = rds_message_next_extension(hdr, &pos, buffer, &buflen); + * if (type == RDS_EXTHDR_NONE) + * break; + * ... + * } + */ +int rds_message_next_extension(struct rds_header *hdr, + unsigned int *pos, void *buf, unsigned int *buflen) +{ + unsigned int offset, ext_type, ext_len; + u8 *src = hdr->h_exthdr; + + offset = *pos; + if (offset >= RDS_HEADER_EXT_SPACE) + goto none; + + /* Get the extension type and length. For now, the + * length is implied by the extension type. */ + ext_type = src[offset++]; + + if (ext_type == RDS_EXTHDR_NONE || ext_type >= __RDS_EXTHDR_MAX) + goto none; + ext_len = rds_exthdr_size[ext_type]; + if (offset + ext_len > RDS_HEADER_EXT_SPACE) + goto none; + + *pos = offset + ext_len; + if (ext_len < *buflen) + *buflen = ext_len; + memcpy(buf, src + offset, *buflen); + return ext_type; + +none: + *pos = RDS_HEADER_EXT_SPACE; + *buflen = 0; + return RDS_EXTHDR_NONE; +} + +int rds_message_add_version_extension(struct rds_header *hdr, unsigned int version) +{ + struct rds_ext_header_version ext_hdr; + + ext_hdr.h_version = cpu_to_be32(version); + return rds_message_add_extension(hdr, RDS_EXTHDR_VERSION, &ext_hdr, sizeof(ext_hdr)); +} + +int rds_message_get_version_extension(struct rds_header *hdr, unsigned int *version) +{ + struct rds_ext_header_version ext_hdr; + unsigned int pos = 0, len = sizeof(ext_hdr); + + /* We assume the version extension is the only one present */ + if (rds_message_next_extension(hdr, &pos, &ext_hdr, &len) != RDS_EXTHDR_VERSION) + return 0; + *version = be32_to_cpu(ext_hdr.h_version); + return 1; +} + +int rds_message_add_rdma_dest_extension(struct rds_header *hdr, u32 r_key, u32 offset) +{ + struct rds_ext_header_rdma_dest ext_hdr; + + ext_hdr.h_rdma_rkey = cpu_to_be32(r_key); + ext_hdr.h_rdma_offset = cpu_to_be32(offset); + return rds_message_add_extension(hdr, RDS_EXTHDR_RDMA_DEST, &ext_hdr, sizeof(ext_hdr)); +} 
+ +struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp) +{ + struct rds_message *rm; + + rm = kzalloc(sizeof(struct rds_message) + + (nents * sizeof(struct scatterlist)), gfp); + if (!rm) + goto out; + + if (nents) + sg_init_table(rm->m_sg, nents); + atomic_set(&rm->m_refcount, 1); + INIT_LIST_HEAD(&rm->m_sock_item); + INIT_LIST_HEAD(&rm->m_conn_item); + spin_lock_init(&rm->m_rs_lock); + +out: + return rm; +} + +struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len) +{ + struct rds_message *rm; + unsigned int i; + + rm = rds_message_alloc(ceil(total_len, PAGE_SIZE), GFP_KERNEL); + if (rm == NULL) + return ERR_PTR(-ENOMEM); + + set_bit(RDS_MSG_PAGEVEC, &rm->m_flags); + rm->m_inc.i_hdr.h_len = cpu_to_be32(total_len); + rm->m_nents = ceil(total_len, PAGE_SIZE); + + for (i = 0; i < rm->m_nents; ++i) { + sg_set_page(&rm->m_sg[i], + virt_to_page(page_addrs[i]), + PAGE_SIZE, 0); + } + + return rm; +} + +struct rds_message *rds_message_copy_from_user(struct iovec *first_iov, + size_t total_len) +{ + unsigned long to_copy; + unsigned long iov_off; + unsigned long sg_off; + struct rds_message *rm; + struct iovec *iov; + struct scatterlist *sg; + int ret; + + rm = rds_message_alloc(ceil(total_len, PAGE_SIZE), GFP_KERNEL); + if (rm == NULL) { + ret = -ENOMEM; + goto out; + } + + rm->m_inc.i_hdr.h_len = cpu_to_be32(total_len); + + /* + * now allocate and copy in the data payload. + */ + sg = rm->m_sg; + iov = first_iov; + iov_off = 0; + sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. 
*/ + + while (total_len) { + if (sg_page(sg) == NULL) { + ret = rds_page_remainder_alloc(sg, total_len, + GFP_HIGHUSER); + if (ret) + goto out; + rm->m_nents++; + sg_off = 0; + } + + while (iov_off == iov->iov_len) { + iov_off = 0; + iov++; + } + + to_copy = min(iov->iov_len - iov_off, sg->length - sg_off); + to_copy = min_t(size_t, to_copy, total_len); + + rdsdebug("copying %lu bytes from user iov [%p, %zu] + %lu to " + "sg [%p, %u, %u] + %lu\n", + to_copy, iov->iov_base, iov->iov_len, iov_off, + (void *)sg_page(sg), sg->offset, sg->length, sg_off); + + ret = rds_page_copy_from_user(sg_page(sg), sg->offset + sg_off, + iov->iov_base + iov_off, + to_copy); + if (ret) + goto out; + + iov_off += to_copy; + total_len -= to_copy; + sg_off += to_copy; + + if (sg_off == sg->length) + sg++; + } + + ret = 0; +out: + if (ret) { + if (rm) + rds_message_put(rm); + rm = ERR_PTR(ret); + } + return rm; +} + +int rds_message_inc_copy_to_user(struct rds_incoming *inc, + struct iovec *first_iov, size_t size) +{ + struct rds_message *rm; + struct iovec *iov; + struct scatterlist *sg; + unsigned long to_copy; + unsigned long iov_off; + unsigned long vec_off; + int copied; + int ret; + u32 len; + + rm = container_of(inc, struct rds_message, m_inc); + len = be32_to_cpu(rm->m_inc.i_hdr.h_len); + + iov = first_iov; + iov_off = 0; + sg = rm->m_sg; + vec_off = 0; + copied = 0; + + while (copied < size && copied < len) { + while (iov_off == iov->iov_len) { + iov_off = 0; + iov++; + } + + to_copy = min(iov->iov_len - iov_off, sg->length - vec_off); + to_copy = min_t(size_t, to_copy, size - copied); + to_copy = min_t(unsigned long, to_copy, len - copied); + + rdsdebug("copying %lu bytes to user iov [%p, %zu] + %lu to " + "sg [%p, %u, %u] + %lu\n", + to_copy, iov->iov_base, iov->iov_len, iov_off, + sg_page(sg), sg->offset, sg->length, vec_off); + + ret = rds_page_copy_to_user(sg_page(sg), sg->offset + vec_off, + iov->iov_base + iov_off, + to_copy); + if (ret) { + copied = ret; + break; + } + + 
iov_off += to_copy; + vec_off += to_copy; + copied += to_copy; + + if (vec_off == sg->length) { + vec_off = 0; + sg++; + } + } + + return copied; +} + +/* + * If the message is still on the send queue, wait until the transport + * is done with it. This is particularly important for RDMA operations. + */ +void rds_message_wait(struct rds_message *rm) +{ + wait_event(rds_message_flush_waitq, + !test_bit(RDS_MSG_MAPPED, &rm->m_flags)); +} + +void rds_message_unmapped(struct rds_message *rm) +{ + clear_bit(RDS_MSG_MAPPED, &rm->m_flags); + if (waitqueue_active(&rds_message_flush_waitq)) + wake_up(&rds_message_flush_waitq); +} + diff --git a/net/rds/page.c b/net/rds/page.c new file mode 100644 index 0000000..c460743 --- /dev/null +++ b/net/rds/page.c @@ -0,0 +1,221 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include + +#include "rds.h" + +struct rds_page_remainder { + struct page *r_page; + unsigned long r_offset; +}; + +DEFINE_PER_CPU(struct rds_page_remainder, rds_page_remainders) ____cacheline_aligned; + +/* + * returns 0 on success or -errno on failure. + * + * We don't have to worry about flush_dcache_page() as this only works + * with private pages. If, say, we were to do directed receive to pinned + * user pages we'd have to worry more about cache coherence. (Though + * the flush_dcache_page() in get_user_pages() would probably be enough). + */ +int rds_page_copy_user(struct page *page, unsigned long offset, + void __user *ptr, unsigned long bytes, + int to_user) +{ + unsigned long ret; + void *addr; + + if (to_user) + rds_stats_add(s_copy_to_user, bytes); + else + rds_stats_add(s_copy_from_user, bytes); + + addr = kmap_atomic(page, KM_USER0); + if (to_user) + ret = __copy_to_user_inatomic(ptr, addr + offset, bytes); + else + ret = __copy_from_user_inatomic(addr + offset, ptr, bytes); + kunmap_atomic(addr, KM_USER0); + + if (ret) { + addr = kmap(page); + if (to_user) + ret = copy_to_user(ptr, addr + offset, bytes); + else + ret = copy_from_user(addr + offset, ptr, bytes); + kunmap(page); + if (ret) + return -EFAULT; + } + + return 0; +} + +/* + * Message allocation uses this to build up regions of a message. + * + * @bytes - the number of bytes needed. + * @gfp - the waiting behaviour of the allocation + * + * @gfp is always ored with __GFP_HIGHMEM. Callers must be prepared to + * kmap the pages, etc. + * + * If @bytes is at least a full page then this just returns a page from + * alloc_page(). 
+ * + * If @bytes is a partial page then this stores the unused region of the + * page in a per-cpu structure. Future partial-page allocations may be + * satisfied from that cached region. This lets us waste less memory on + * small allocations with minimal complexity. It works because the transmit + * path passes read-only page regions down to devices. They hold a page + * reference until they are done with the region. + */ +int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes, + gfp_t gfp) +{ + struct rds_page_remainder *rem; + unsigned long flags; + struct page *page; + int ret; + + gfp |= __GFP_HIGHMEM; + + /* jump straight to allocation if we're trying for a huge page */ + if (bytes >= PAGE_SIZE) { + page = alloc_page(gfp); + if (page == NULL) { + ret = -ENOMEM; + } else { + sg_set_page(scat, page, PAGE_SIZE, 0); + ret = 0; + } + goto out; + } + + rem = &per_cpu(rds_page_remainders, get_cpu()); + local_irq_save(flags); + + while (1) { + /* avoid a tiny region getting stuck by tossing it */ + if (rem->r_page && bytes > (PAGE_SIZE - rem->r_offset)) { + rds_stats_inc(s_page_remainder_miss); + __free_page(rem->r_page); + rem->r_page = NULL; + } + + /* hand out a fragment from the cached page */ + if (rem->r_page && bytes <= (PAGE_SIZE - rem->r_offset)) { + sg_set_page(scat, rem->r_page, bytes, rem->r_offset); + get_page(sg_page(scat)); + + if (rem->r_offset != 0) + rds_stats_inc(s_page_remainder_hit); + + rem->r_offset += bytes; + if (rem->r_offset == PAGE_SIZE) { + __free_page(rem->r_page); + rem->r_page = NULL; + } + ret = 0; + break; + } + + /* alloc if there is nothing for us to use */ + local_irq_restore(flags); + put_cpu(); + + page = alloc_page(gfp); + + rem = &per_cpu(rds_page_remainders, get_cpu()); + local_irq_save(flags); + + if (page == NULL) { + ret = -ENOMEM; + break; + } + + /* did someone race to fill the remainder before us? 
*/ + if (rem->r_page) { + __free_page(page); + continue; + } + + /* otherwise install our page and loop around to alloc */ + rem->r_page = page; + rem->r_offset = 0; + } + + local_irq_restore(flags); + put_cpu(); +out: + rdsdebug("bytes %lu ret %d %p %u %u\n", bytes, ret, + ret ? NULL : sg_page(scat), ret ? 0 : scat->offset, + ret ? 0 : scat->length); + return ret; +} + +static int rds_page_remainder_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + struct rds_page_remainder *rem; + long cpu = (long)hcpu; + + rem = &per_cpu(rds_page_remainders, cpu); + + rdsdebug("cpu %ld action 0x%lx\n", cpu, action); + + switch (action) { + case CPU_DEAD: + if (rem->r_page) + __free_page(rem->r_page); + rem->r_page = NULL; + break; + } + + return 0; +} + +static struct notifier_block rds_page_remainder_nb = { + .notifier_call = rds_page_remainder_cpu_notify, +}; + +void rds_page_exit(void) +{ + int i; + + for_each_possible_cpu(i) + rds_page_remainder_cpu_notify(&rds_page_remainder_nb, + (unsigned long)CPU_DEAD, + (void *)(long)i); +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:27 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:27 -0800 Subject: [ofa-general] [PATCH 10/26] RDS: send.c In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-11-git-send-email-andy.grover@oracle.com> This is the code to send an RDS datagram. Signed-off-by: Andy Grover --- net/rds/send.c | 1003 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 1003 insertions(+), 0 deletions(-) create mode 100644 net/rds/send.c diff --git a/net/rds/send.c b/net/rds/send.c new file mode 100644 index 0000000..1b37364 --- /dev/null +++ b/net/rds/send.c @@ -0,0 +1,1003 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include + +#include "rds.h" +#include "rdma.h" + +/* When transmitting messages in rds_send_xmit, we need to emerge from + * time to time and briefly release the CPU. Otherwise the softlock watchdog + * will kick our shin. + * Also, it seems fairer to not let one busy connection stall all the + * others. + * + * send_batch_count is the number of times we'll loop in send_xmit. Setting + * it to 0 will restore the old behavior (where we looped until we had + * drained the queue). + */ +static int send_batch_count = 64; +module_param(send_batch_count, int, 0444); +MODULE_PARM_DESC(send_batch_count, " batch factor when working the send queue"); + +/* + * Reset the send state. 
Caller must hold c_send_lock when calling here. + */ +void rds_send_reset(struct rds_connection *conn) +{ + struct rds_message *rm, *tmp; + unsigned long flags; + + if (conn->c_xmit_rm) { + /* Tell the user the RDMA op is no longer mapped by the + * transport. This isn't entirely true (it's flushed out + * independently) but as the connection is down, there's + * no ongoing RDMA to/from that memory */ + rds_message_unmapped(conn->c_xmit_rm); + rds_message_put(conn->c_xmit_rm); + conn->c_xmit_rm = NULL; + } + conn->c_xmit_sg = 0; + conn->c_xmit_hdr_off = 0; + conn->c_xmit_data_off = 0; + conn->c_xmit_rdma_sent = 0; + + conn->c_map_queued = 0; + + conn->c_unacked_packets = rds_sysctl_max_unacked_packets; + conn->c_unacked_bytes = rds_sysctl_max_unacked_bytes; + + /* Mark messages as retransmissions, and move them to the send q */ + spin_lock_irqsave(&conn->c_lock, flags); + list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) { + set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); + set_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags); + } + list_splice_init(&conn->c_retrans, &conn->c_send_queue); + spin_unlock_irqrestore(&conn->c_lock, flags); +} + +/* + * We're making the conscious trade-off here to only send one message + * down the connection at a time.
+ * Pro: + * - tx queueing is a simple fifo list + * - reassembly is optional and easily done by transports per conn + * - no per flow rx lookup at all, straight to the socket + * - less per-frag memory and wire overhead + * Con: + * - queued acks can be delayed behind large messages + * Depends: + * - small message latency is higher behind queued large messages + * - large message latency isn't starved by intervening small sends + */ +int rds_send_xmit(struct rds_connection *conn) +{ + struct rds_message *rm; + unsigned long flags; + unsigned int tmp; + unsigned int send_quota = send_batch_count; + struct scatterlist *sg; + int ret = 0; + int was_empty = 0; + LIST_HEAD(to_be_dropped); + + /* + * sendmsg calls here after having queued its message on the send + * queue. We only have one task feeding the connection at a time. If + * another thread is already feeding the queue then we back off. This + * avoids blocking the caller and trading per-connection data between + * caches per message. + * + * The sem holder will issue a retry if they notice that someone queued + * a message after they stopped walking the send queue but before they + * dropped the sem. + */ + if (!mutex_trylock(&conn->c_send_lock)) { + rds_stats_inc(s_send_sem_contention); + ret = -ENOMEM; + goto out; + } + + if (conn->c_trans->xmit_prepare) + conn->c_trans->xmit_prepare(conn); + + /* + * spin trying to push headers and data down the connection until + * the connection doesn't make forward progress. + */ + while (--send_quota) { + /* + * See if we need to send a congestion map update if we're + * between sending messages. The send_sem protects our sole + * use of c_map_offset and _bytes. + * Note this is used only by transports that define a special + * xmit_cong_map function. For all others, we allocate + * a cong_map message and treat it just like any other send.
+ */ + if (conn->c_map_bytes) { + ret = conn->c_trans->xmit_cong_map(conn, conn->c_lcong, + conn->c_map_offset); + if (ret <= 0) + break; + + conn->c_map_offset += ret; + conn->c_map_bytes -= ret; + if (conn->c_map_bytes) + continue; + } + + /* If we're done sending the current message, clear the + * offset and S/G temporaries. + */ + rm = conn->c_xmit_rm; + if (rm != NULL && + conn->c_xmit_hdr_off == sizeof(struct rds_header) && + conn->c_xmit_sg == rm->m_nents) { + conn->c_xmit_rm = NULL; + conn->c_xmit_sg = 0; + conn->c_xmit_hdr_off = 0; + conn->c_xmit_data_off = 0; + conn->c_xmit_rdma_sent = 0; + + /* Release the reference to the previous message. */ + rds_message_put(rm); + rm = NULL; + } + + /* If we're asked to send a cong map update, do so. + */ + if (rm == NULL && test_and_clear_bit(0, &conn->c_map_queued)) { + if (conn->c_trans->xmit_cong_map != NULL) { + conn->c_map_offset = 0; + conn->c_map_bytes = sizeof(struct rds_header) + + RDS_CONG_MAP_BYTES; + continue; + } + + rm = rds_cong_update_alloc(conn); + if (IS_ERR(rm)) { + ret = PTR_ERR(rm); + break; + } + + conn->c_xmit_rm = rm; + } + + /* + * Grab the next message from the send queue, if there is one. + * + * c_xmit_rm holds a ref while we're sending this message down + * the connection. We can use this ref while holding the + * send_sem; rds_send_reset() is serialized with it. + */ + if (rm == NULL) { + unsigned int len; + + spin_lock_irqsave(&conn->c_lock, flags); + + if (!list_empty(&conn->c_send_queue)) { + rm = list_entry(conn->c_send_queue.next, + struct rds_message, + m_conn_item); + rds_message_addref(rm); + + /* + * Move the message from the send queue to the retransmit + * list right away. + */ + list_move_tail(&rm->m_conn_item, &conn->c_retrans); + } + + spin_unlock_irqrestore(&conn->c_lock, flags); + + if (rm == NULL) { + was_empty = 1; + break; + } + + /* Unfortunately, the way InfiniBand deals with + * RDMA to a bad MR key is by moving the entire + * queue pair to error state.
We could possibly + * recover from that, but right now we drop the + * connection. + * Therefore, we never retransmit messages with RDMA ops. + */ + if (rm->m_rdma_op + && test_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags)) { + spin_lock_irqsave(&conn->c_lock, flags); + if (test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) + list_move(&rm->m_conn_item, &to_be_dropped); + spin_unlock_irqrestore(&conn->c_lock, flags); + rds_message_put(rm); + continue; + } + + /* Require an ACK every once in a while */ + len = ntohl(rm->m_inc.i_hdr.h_len); + if (conn->c_unacked_packets == 0 + || conn->c_unacked_bytes < len) { + __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); + + conn->c_unacked_packets = rds_sysctl_max_unacked_packets; + conn->c_unacked_bytes = rds_sysctl_max_unacked_bytes; + rds_stats_inc(s_send_ack_required); + } else { + conn->c_unacked_bytes -= len; + conn->c_unacked_packets--; + } + + conn->c_xmit_rm = rm; + } + + /* + * Try and send an rdma message. Let's see if we can + * keep this simple and require that the transport either + * send the whole rdma or none of it. + */ + if (rm->m_rdma_op && !conn->c_xmit_rdma_sent) { + ret = conn->c_trans->xmit_rdma(conn, rm->m_rdma_op); + if (ret) + break; + conn->c_xmit_rdma_sent = 1; + /* The transport owns the mapped memory for now.
+ * You can't unmap it while it's on the send queue */ + set_bit(RDS_MSG_MAPPED, &rm->m_flags); + } + + if (conn->c_xmit_hdr_off < sizeof(struct rds_header) || + conn->c_xmit_sg < rm->m_nents) { + ret = conn->c_trans->xmit(conn, rm, + conn->c_xmit_hdr_off, + conn->c_xmit_sg, + conn->c_xmit_data_off); + if (ret <= 0) + break; + + if (conn->c_xmit_hdr_off < sizeof(struct rds_header)) { + tmp = min_t(int, ret, + sizeof(struct rds_header) - + conn->c_xmit_hdr_off); + conn->c_xmit_hdr_off += tmp; + ret -= tmp; + } + + sg = &rm->m_sg[conn->c_xmit_sg]; + while (ret) { + tmp = min_t(int, ret, sg->length - + conn->c_xmit_data_off); + conn->c_xmit_data_off += tmp; + ret -= tmp; + if (conn->c_xmit_data_off == sg->length) { + conn->c_xmit_data_off = 0; + sg++; + conn->c_xmit_sg++; + BUG_ON(ret != 0 && + conn->c_xmit_sg == rm->m_nents); + } + } + } + } + + /* Nuke any messages we decided not to retransmit. */ + if (!list_empty(&to_be_dropped)) + rds_send_remove_from_sock(&to_be_dropped, RDS_RDMA_DROPPED); + + if (conn->c_trans->xmit_complete) + conn->c_trans->xmit_complete(conn); + + /* + * We might be racing with another sender who queued a message but + * backed off on noticing that we held the c_send_lock. If we check + * for queued messages after dropping the sem then either we'll + * see the queued message or the queuer will get the sem. If we + * notice the queued message then we trigger an immediate retry. + * + * We need to be careful only to do this when we stopped processing + * the send queue because it was empty. It's the only way we + * stop processing the loop when the transport hasn't taken + * responsibility for forward progress. + */ + mutex_unlock(&conn->c_send_lock); + + if (conn->c_map_bytes || (send_quota == 0 && !was_empty)) { + /* We exhausted the send quota, but there's work left to + * do. Return and (re-)schedule the send worker. 
+ */ + ret = -EAGAIN; + } + + if (ret == 0 && was_empty) { + /* A simple bit test would be way faster than taking the + * spin lock */ + spin_lock_irqsave(&conn->c_lock, flags); + if (!list_empty(&conn->c_send_queue)) { + rds_stats_inc(s_send_sem_queue_raced); + ret = -EAGAIN; + } + spin_unlock_irqrestore(&conn->c_lock, flags); + } +out: + return ret; +} + +static void rds_send_sndbuf_remove(struct rds_sock *rs, struct rds_message *rm) +{ + u32 len = be32_to_cpu(rm->m_inc.i_hdr.h_len); + + assert_spin_locked(&rs->rs_lock); + + BUG_ON(rs->rs_snd_bytes < len); + rs->rs_snd_bytes -= len; + + if (rs->rs_snd_bytes == 0) + rds_stats_inc(s_send_queue_empty); +} + +static inline int rds_send_is_acked(struct rds_message *rm, u64 ack, + is_acked_func is_acked) +{ + if (is_acked) + return is_acked(rm, ack); + return be64_to_cpu(rm->m_inc.i_hdr.h_sequence) <= ack; +} + +/* + * Returns true if there are no messages on the send and retransmit queues + * which have a sequence number greater than or equal to the given sequence + * number. + */ +int rds_send_acked_before(struct rds_connection *conn, u64 seq) +{ + struct rds_message *rm, *tmp; + int ret = 1; + + spin_lock(&conn->c_lock); + + list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) { + if (be64_to_cpu(rm->m_inc.i_hdr.h_sequence) < seq) + ret = 0; + break; + } + + list_for_each_entry_safe(rm, tmp, &conn->c_send_queue, m_conn_item) { + if (be64_to_cpu(rm->m_inc.i_hdr.h_sequence) < seq) + ret = 0; + break; + } + + spin_unlock(&conn->c_lock); + + return ret; +} + +/* + * This is pretty similar to what happens below in the ACK + * handling code - except that we call here as soon as we get + * the IB send completion on the RDMA op and the accompanying + * message. 
+ */ +void rds_rdma_send_complete(struct rds_message *rm, int status) +{ + struct rds_sock *rs = NULL; + struct rds_rdma_op *ro; + struct rds_notifier *notifier; + + spin_lock(&rm->m_rs_lock); + + ro = rm->m_rdma_op; + if (test_bit(RDS_MSG_ON_SOCK, &rm->m_flags) + && ro && ro->r_notify && ro->r_notifier) { + notifier = ro->r_notifier; + rs = rm->m_rs; + sock_hold(rds_rs_to_sk(rs)); + + notifier->n_status = status; + spin_lock(&rs->rs_lock); + list_add_tail(&notifier->n_list, &rs->rs_notify_queue); + spin_unlock(&rs->rs_lock); + + ro->r_notifier = NULL; + } + + spin_unlock(&rm->m_rs_lock); + + if (rs) { + rds_wake_sk_sleep(rs); + sock_put(rds_rs_to_sk(rs)); + } +} + +/* + * This is the same as rds_rdma_send_complete except we + * don't do any locking - we have all the ingredients (message, + * socket, socket lock) and can just move the notifier. + */ +static inline void +__rds_rdma_send_complete(struct rds_sock *rs, struct rds_message *rm, int status) +{ + struct rds_rdma_op *ro; + + ro = rm->m_rdma_op; + if (ro && ro->r_notify && ro->r_notifier) { + ro->r_notifier->n_status = status; + list_add_tail(&ro->r_notifier->n_list, &rs->rs_notify_queue); + ro->r_notifier = NULL; + } + + /* No need to wake the app - caller does this */ +} + +/* + * This is called from the IB send completion when we detect + * a RDMA operation that failed with remote access error. + * So speed is not an issue here.
+ */ +struct rds_message *rds_send_get_message(struct rds_connection *conn, + struct rds_rdma_op *op) +{ + struct rds_message *rm, *tmp, *found = NULL; + unsigned long flags; + + spin_lock_irqsave(&conn->c_lock, flags); + + list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) { + if (rm->m_rdma_op == op) { + atomic_inc(&rm->m_refcount); + found = rm; + goto out; + } + } + + list_for_each_entry_safe(rm, tmp, &conn->c_send_queue, m_conn_item) { + if (rm->m_rdma_op == op) { + atomic_inc(&rm->m_refcount); + found = rm; + break; + } + } + +out: + spin_unlock_irqrestore(&conn->c_lock, flags); + + return found; +} + +/* + * This removes messages from the socket's list if they're on it. The list + * argument must be private to the caller, we must be able to modify it + * without locks. The messages must have a reference held for their + * position on the list. This function will drop that reference after + * removing the messages from the 'messages' list regardless of if it found + * the messages on the socket list or not. + */ +void rds_send_remove_from_sock(struct list_head *messages, int status) +{ + unsigned long flags = 0; /* silence gcc :P */ + struct rds_sock *rs = NULL; + struct rds_message *rm; + + local_irq_save(flags); + while (!list_empty(messages)) { + rm = list_entry(messages->next, struct rds_message, + m_conn_item); + list_del_init(&rm->m_conn_item); + + /* + * If we see this flag cleared then we're *sure* that someone + * else beat us to removing it from the sock. If we race + * with their flag update we'll get the lock and then really + * see that the flag has been cleared. + * + * The message spinlock makes sure nobody clears rm->m_rs + * while we're messing with it. It does not prevent the + * message from being removed from the socket, though. 
+ */ + spin_lock(&rm->m_rs_lock); + if (!test_bit(RDS_MSG_ON_SOCK, &rm->m_flags)) + goto unlock_and_drop; + + if (rs != rm->m_rs) { + if (rs) { + spin_unlock(&rs->rs_lock); + rds_wake_sk_sleep(rs); + sock_put(rds_rs_to_sk(rs)); + } + rs = rm->m_rs; + spin_lock(&rs->rs_lock); + sock_hold(rds_rs_to_sk(rs)); + } + + if (test_and_clear_bit(RDS_MSG_ON_SOCK, &rm->m_flags)) { + struct rds_rdma_op *ro = rm->m_rdma_op; + struct rds_notifier *notifier; + + list_del_init(&rm->m_sock_item); + rds_send_sndbuf_remove(rs, rm); + + if (ro && ro->r_notifier + && (status || ro->r_notify)) { + notifier = ro->r_notifier; + list_add_tail(&notifier->n_list, + &rs->rs_notify_queue); + if (!notifier->n_status) + notifier->n_status = status; + rm->m_rdma_op->r_notifier = NULL; + } + rds_message_put(rm); + rm->m_rs = NULL; + } + +unlock_and_drop: + spin_unlock(&rm->m_rs_lock); + rds_message_put(rm); + } + + if (rs) { + spin_unlock(&rs->rs_lock); + rds_wake_sk_sleep(rs); + sock_put(rds_rs_to_sk(rs)); + } + local_irq_restore(flags); +} + +/* + * Transports call here when they've determined that the receiver queued + * messages up to, and including, the given sequence number. Messages are + * moved to the retrans queue when rds_send_xmit picks them off the send + * queue. This means that in the TCP case, the message may not have been + * assigned the m_ack_seq yet - but that's fine as long as tcp_is_acked + * checks the RDS_MSG_HAS_ACK_SEQ bit. + * + * XXX It's not clear to me how this is safely serialized with socket + * destruction. Maybe it should bail if it sees SOCK_DEAD.
+ */ +void rds_send_drop_acked(struct rds_connection *conn, u64 ack, + is_acked_func is_acked) +{ + struct rds_message *rm, *tmp; + unsigned long flags; + LIST_HEAD(list); + + spin_lock_irqsave(&conn->c_lock, flags); + + list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) { + if (!rds_send_is_acked(rm, ack, is_acked)) + break; + + list_move(&rm->m_conn_item, &list); + clear_bit(RDS_MSG_ON_CONN, &rm->m_flags); + } + + /* order flag updates with spin locks */ + if (!list_empty(&list)) + smp_mb__after_clear_bit(); + + spin_unlock_irqrestore(&conn->c_lock, flags); + + /* now remove the messages from the sock list as needed */ + rds_send_remove_from_sock(&list, RDS_RDMA_SUCCESS); +} + +void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest) +{ + struct rds_message *rm, *tmp; + struct rds_connection *conn; + unsigned long flags; + LIST_HEAD(list); + int wake = 0; + + /* get all the messages we're dropping under the rs lock */ + spin_lock_irqsave(&rs->rs_lock, flags); + + list_for_each_entry_safe(rm, tmp, &rs->rs_send_queue, m_sock_item) { + if (dest && (dest->sin_addr.s_addr != rm->m_daddr || + dest->sin_port != rm->m_inc.i_hdr.h_dport)) + continue; + + wake = 1; + list_move(&rm->m_sock_item, &list); + rds_send_sndbuf_remove(rs, rm); + clear_bit(RDS_MSG_ON_SOCK, &rm->m_flags); + + /* If this is a RDMA operation, notify the app. 
*/ + __rds_rdma_send_complete(rs, rm, RDS_RDMA_CANCELED); + } + + /* order flag updates with the rs lock */ + if (wake) + smp_mb__after_clear_bit(); + + spin_unlock_irqrestore(&rs->rs_lock, flags); + + if (wake) + rds_wake_sk_sleep(rs); + + conn = NULL; + + /* now remove the messages from the conn list as needed */ + list_for_each_entry(rm, &list, m_sock_item) { + /* We do this here rather than in the loop above, so that + * we don't have to nest m_rs_lock under rs->rs_lock */ + spin_lock(&rm->m_rs_lock); + rm->m_rs = NULL; + spin_unlock(&rm->m_rs_lock); + + /* + * If we see this flag cleared then we're *sure* that someone + * else beat us to removing it from the conn. If we race + * with their flag update we'll get the lock and then really + * see that the flag has been cleared. + */ + if (!test_bit(RDS_MSG_ON_CONN, &rm->m_flags)) + continue; + + if (conn != rm->m_inc.i_conn) { + if (conn) + spin_unlock_irqrestore(&conn->c_lock, flags); + conn = rm->m_inc.i_conn; + spin_lock_irqsave(&conn->c_lock, flags); + } + + if (test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) { + list_del_init(&rm->m_conn_item); + rds_message_put(rm); + } + } + + if (conn) + spin_unlock_irqrestore(&conn->c_lock, flags); + + while (!list_empty(&list)) { + rm = list_entry(list.next, struct rds_message, m_sock_item); + list_del_init(&rm->m_sock_item); + + rds_message_wait(rm); + rds_message_put(rm); + } +} + +/* + * we only want this to fire once so we use the callers 'queued'. It's + * possible that another thread can race with us and remove the + * message from the flow with RDS_CANCEL_SENT_TO. 
+ */ +static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn, + struct rds_message *rm, __be16 sport, + __be16 dport, int *queued) +{ + unsigned long flags; + u32 len; + + if (*queued) + goto out; + + len = be32_to_cpu(rm->m_inc.i_hdr.h_len); + + /* this is the only place which holds both the socket's rs_lock + * and the connection's c_lock */ + spin_lock_irqsave(&rs->rs_lock, flags); + + /* + * If there is a little space in sndbuf, we don't queue anything, + * and userspace gets -EAGAIN. But poll() indicates there's send + * room. This can lead to bad behavior (spinning) if snd_bytes isn't + * freed up by incoming acks. So we check the *old* value of + * rs_snd_bytes here to allow the last msg to exceed the buffer, + * and poll() now knows no more data can be sent. + */ + if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) { + rs->rs_snd_bytes += len; + + /* let recv side know we are close to send space exhaustion. + * This is probably not the optimal way to do it, as this + * means we set the flag on *all* messages as soon as our + * throughput hits a certain threshold. 
+ */ + if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2) + __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); + + list_add_tail(&rm->m_sock_item, &rs->rs_send_queue); + set_bit(RDS_MSG_ON_SOCK, &rm->m_flags); + rds_message_addref(rm); + rm->m_rs = rs; + + /* The code ordering is a little weird, but we're + trying to minimize the time we hold c_lock */ + rds_message_populate_header(&rm->m_inc.i_hdr, sport, dport, 0); + rm->m_inc.i_conn = conn; + rds_message_addref(rm); + + spin_lock(&conn->c_lock); + rm->m_inc.i_hdr.h_sequence = cpu_to_be64(conn->c_next_tx_seq++); + list_add_tail(&rm->m_conn_item, &conn->c_send_queue); + set_bit(RDS_MSG_ON_CONN, &rm->m_flags); + spin_unlock(&conn->c_lock); + + rdsdebug("queued msg %p len %d, rs %p bytes %d seq %llu\n", + rm, len, rs, rs->rs_snd_bytes, + (unsigned long long)be64_to_cpu(rm->m_inc.i_hdr.h_sequence)); + + *queued = 1; + } + + spin_unlock_irqrestore(&rs->rs_lock, flags); +out: + return *queued; +} + +static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm, + struct msghdr *msg, int *allocated_mr) +{ + struct cmsghdr *cmsg; + int ret = 0; + + for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) { + if (!CMSG_OK(msg, cmsg)) + return -EINVAL; + + if (cmsg->cmsg_level != SOL_RDS) + continue; + + /* As a side effect, RDMA_DEST and RDMA_MAP will set + * rm->m_rdma_cookie and rm->m_rdma_mr. 
+ */ + switch (cmsg->cmsg_type) { + case RDS_CMSG_RDMA_ARGS: + ret = rds_cmsg_rdma_args(rs, rm, cmsg); + break; + + case RDS_CMSG_RDMA_DEST: + ret = rds_cmsg_rdma_dest(rs, rm, cmsg); + break; + + case RDS_CMSG_RDMA_MAP: + ret = rds_cmsg_rdma_map(rs, rm, cmsg); + if (!ret) + *allocated_mr = 1; + break; + + default: + return -EINVAL; + } + + if (ret) + break; + } + + return ret; +} + +int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, + size_t payload_len) +{ + struct sock *sk = sock->sk; + struct rds_sock *rs = rds_sk_to_rs(sk); + struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name; + __be32 daddr; + __be16 dport; + struct rds_message *rm = NULL; + struct rds_connection *conn; + int ret = 0; + int queued = 0, allocated_mr = 0; + int nonblock = msg->msg_flags & MSG_DONTWAIT; + long timeo = sock_rcvtimeo(sk, nonblock); + + /* Mirror Linux UDP mirror of BSD error message compatibility */ + /* XXX: Perhaps MSG_MORE someday */ + if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT)) { + printk(KERN_INFO "msg_flags 0x%08X\n", msg->msg_flags); + ret = -EOPNOTSUPP; + goto out; + } + + if (msg->msg_namelen) { + /* XXX fail non-unicast destination IPs? */ + if (msg->msg_namelen < sizeof(*usin) || usin->sin_family != AF_INET) { + ret = -EINVAL; + goto out; + } + daddr = usin->sin_addr.s_addr; + dport = usin->sin_port; + } else { + /* We only care about consistency with ->connect() */ + lock_sock(sk); + daddr = rs->rs_conn_addr; + dport = rs->rs_conn_port; + release_sock(sk); + } + + /* racing with another thread binding seems ok here */ + if (daddr == 0 || rs->rs_bound_addr == 0) { + ret = -ENOTCONN; /* XXX not a great errno */ + goto out; + } + + rm = rds_message_copy_from_user(msg->msg_iov, payload_len); + if (IS_ERR(rm)) { + ret = PTR_ERR(rm); + rm = NULL; + goto out; + } + + rm->m_daddr = daddr; + + /* Parse any control messages the user may have included. 
*/ + ret = rds_cmsg_send(rs, rm, msg, &allocated_mr); + if (ret) + goto out; + + /* rds_conn_create has a spinlock that runs with IRQ off. + * Caching the conn in the socket helps a lot. */ + if (rs->rs_conn && rs->rs_conn->c_faddr == daddr) + conn = rs->rs_conn; + else { + conn = rds_conn_create_outgoing(rs->rs_bound_addr, daddr, + rs->rs_transport, + sock->sk->sk_allocation); + if (IS_ERR(conn)) { + ret = PTR_ERR(conn); + goto out; + } + rs->rs_conn = conn; + } + + if ((rm->m_rdma_cookie || rm->m_rdma_op) + && conn->c_trans->xmit_rdma == NULL) { + if (printk_ratelimit()) + printk(KERN_NOTICE "rdma_op %p conn xmit_rdma %p\n", + rm->m_rdma_op, conn->c_trans->xmit_rdma); + ret = -EOPNOTSUPP; + goto out; + } + + /* If the connection is down, trigger a connect. We may + * have scheduled a delayed reconnect however - in this case + * we should not interfere. + */ + if (rds_conn_state(conn) == RDS_CONN_DOWN + && !test_and_set_bit(RDS_RECONNECT_PENDING, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_conn_w, 0); + + ret = rds_cong_wait(conn->c_fcong, dport, nonblock, rs); + if (ret) + goto out; + + while (!rds_send_queue_rm(rs, conn, rm, rs->rs_bound_port, + dport, &queued)) { + rds_stats_inc(s_send_queue_full); + /* XXX make sure this is reasonable */ + if (payload_len > rds_sk_sndbuf(rs)) { + ret = -EMSGSIZE; + goto out; + } + if (nonblock) { + ret = -EAGAIN; + goto out; + } + + timeo = wait_event_interruptible_timeout(*sk->sk_sleep, + rds_send_queue_rm(rs, conn, rm, + rs->rs_bound_port, + dport, + &queued), + timeo); + rdsdebug("sendmsg woke queued %d timeo %ld\n", queued, timeo); + if (timeo > 0 || timeo == MAX_SCHEDULE_TIMEOUT) + continue; + + ret = timeo; + if (ret == 0) + ret = -ETIMEDOUT; + goto out; + } + + /* + * By now we've committed to the send. We reuse rds_send_worker() + * to retry sends in the rds thread if the transport asks us to. 
+ */ + rds_stats_inc(s_send_queued); + + if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags)) + rds_send_worker(&conn->c_send_w.work); + + rds_message_put(rm); + return payload_len; + +out: + /* If the user included a RDMA_MAP cmsg, we allocated a MR on the fly. + * If the sendmsg goes through, we keep the MR. If it fails with EAGAIN + * or in any other way, we need to destroy the MR again */ + if (allocated_mr) + rds_rdma_unuse(rs, rds_rdma_cookie_key(rm->m_rdma_cookie), 1); + + if (rm) + rds_message_put(rm); + return ret; +} + +/* + * Reply to a ping packet. + */ +int +rds_send_pong(struct rds_connection *conn, __be16 dport) +{ + struct rds_message *rm; + unsigned long flags; + int ret = 0; + + rm = rds_message_alloc(0, GFP_ATOMIC); + if (rm == NULL) { + ret = -ENOMEM; + goto out; + } + + rm->m_daddr = conn->c_faddr; + + /* If the connection is down, trigger a connect. We may + * have scheduled a delayed reconnect however - in this case + * we should not interfere. + */ + if (rds_conn_state(conn) == RDS_CONN_DOWN + && !test_and_set_bit(RDS_RECONNECT_PENDING, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_conn_w, 0); + + ret = rds_cong_wait(conn->c_fcong, dport, 1, NULL); + if (ret) + goto out; + + spin_lock_irqsave(&conn->c_lock, flags); + list_add_tail(&rm->m_conn_item, &conn->c_send_queue); + set_bit(RDS_MSG_ON_CONN, &rm->m_flags); + rds_message_addref(rm); + rm->m_inc.i_conn = conn; + + rds_message_populate_header(&rm->m_inc.i_hdr, 0, dport, + conn->c_next_tx_seq); + conn->c_next_tx_seq++; + spin_unlock_irqrestore(&conn->c_lock, flags); + + rds_stats_inc(s_send_queued); + rds_stats_inc(s_send_pong); + + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + rds_message_put(rm); + return 0; + +out: + if (rm) + rds_message_put(rm); + return ret; +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:28 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:28 -0800 Subject: [ofa-general] [PATCH 11/26] RDS: recv.c In-Reply-To: 
<1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-12-git-send-email-andy.grover@oracle.com> Upon receiving a datagram from the transport, RDS parses the headers and potentially queues an ACK. Signed-off-by: Andy Grover --- net/rds/recv.c | 542 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 542 insertions(+), 0 deletions(-) create mode 100644 net/rds/recv.c diff --git a/net/rds/recv.c b/net/rds/recv.c new file mode 100644 index 0000000..f2118c5 --- /dev/null +++ b/net/rds/recv.c @@ -0,0 +1,542 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include + +#include "rds.h" +#include "rdma.h" + +void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn, + __be32 saddr) +{ + atomic_set(&inc->i_refcount, 1); + INIT_LIST_HEAD(&inc->i_item); + inc->i_conn = conn; + inc->i_saddr = saddr; + inc->i_rdma_cookie = 0; +} + +void rds_inc_addref(struct rds_incoming *inc) +{ + rdsdebug("addref inc %p ref %d\n", inc, atomic_read(&inc->i_refcount)); + atomic_inc(&inc->i_refcount); +} + +void rds_inc_put(struct rds_incoming *inc) +{ + rdsdebug("put inc %p ref %d\n", inc, atomic_read(&inc->i_refcount)); + if (atomic_dec_and_test(&inc->i_refcount)) { + BUG_ON(!list_empty(&inc->i_item)); + + inc->i_conn->c_trans->inc_free(inc); + } +} + +static void rds_recv_rcvbuf_delta(struct rds_sock *rs, struct sock *sk, + struct rds_cong_map *map, + int delta, __be16 port) +{ + int now_congested; + + if (delta == 0) + return; + + rs->rs_rcv_bytes += delta; + now_congested = rs->rs_rcv_bytes > rds_sk_rcvbuf(rs); + + rdsdebug("rs %p (%pI4:%u) recv bytes %d buf %d " + "now_cong %d delta %d\n", + rs, &rs->rs_bound_addr, + ntohs(rs->rs_bound_port), rs->rs_rcv_bytes, + rds_sk_rcvbuf(rs), now_congested, delta); + + /* wasn't -> am congested */ + if (!rs->rs_congested && now_congested) { + rs->rs_congested = 1; + rds_cong_set_bit(map, port); + rds_cong_queue_updates(map); + } + /* was -> aren't congested */ + /* Require more free space before reporting uncongested to prevent + bouncing cong/uncong state too often */ + else if (rs->rs_congested && (rs->rs_rcv_bytes < (rds_sk_rcvbuf(rs)/2))) { + rs->rs_congested = 0; + rds_cong_clear_bit(map, port); + rds_cong_queue_updates(map); + } + + /* do nothing if no change in cong state */ +} 
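To illustrate the hysteresis in rds_recv_rcvbuf_delta() above, here is a minimal userspace sketch. It is a model under stated assumptions, not part of the patch: struct rcv_model and rcv_model_delta() are hypothetical names, the congestion map update is reduced to a counter, and no locking is modeled. The socket is marked congested as soon as its queued receive bytes exceed the receive buffer, but is only marked uncongested again once they fall below half the buffer, which keeps the state from bouncing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of rds_recv_rcvbuf_delta(); not kernel code.
 * Field names echo the patch (rs_rcv_bytes, rds_sk_rcvbuf(), rs_congested). */
struct rcv_model {
	int rcv_bytes;   /* bytes queued on the socket for recvmsg */
	int rcvbuf;      /* receive buffer size */
	bool congested;  /* congestion state currently advertised */
	int updates;     /* congestion map updates that would be queued */
};

/* Apply a byte delta; return true if the congestion state changed. */
static bool rcv_model_delta(struct rcv_model *m, int delta)
{
	bool now_congested;

	if (delta == 0)
		return false;

	m->rcv_bytes += delta;
	assert(m->rcv_bytes >= 0);  /* cannot dequeue more than was queued */
	now_congested = m->rcv_bytes > m->rcvbuf;

	if (!m->congested && now_congested) {
		/* wasn't -> am congested: the kernel sets the port bit here */
		m->congested = true;
		m->updates++;
		return true;
	}
	if (m->congested && m->rcv_bytes < m->rcvbuf / 2) {
		/* was -> aren't congested, only once below half the buffer */
		m->congested = false;
		m->updates++;
		return true;
	}
	return false;  /* no change in congestion state */
}
```

For example, with a 1000-byte buffer, queuing 1100 bytes flips the socket to congested; draining to 900 bytes leaves it congested, and only draining below 500 bytes clears it again.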
+ +/* + * Process all extension headers that come with this message. + */ +static void rds_recv_incoming_exthdrs(struct rds_incoming *inc, struct rds_sock *rs) +{ + struct rds_header *hdr = &inc->i_hdr; + unsigned int pos = 0, type, len; + union { + struct rds_ext_header_version version; + struct rds_ext_header_rdma rdma; + struct rds_ext_header_rdma_dest rdma_dest; + } buffer; + + while (1) { + len = sizeof(buffer); + type = rds_message_next_extension(hdr, &pos, &buffer, &len); + if (type == RDS_EXTHDR_NONE) + break; + /* Process extension header here */ + switch (type) { + case RDS_EXTHDR_RDMA: + rds_rdma_unuse(rs, be32_to_cpu(buffer.rdma.h_rdma_rkey), 0); + break; + + case RDS_EXTHDR_RDMA_DEST: + /* We ignore the size for now. We could stash it + * somewhere and use it for error checking. */ + inc->i_rdma_cookie = rds_rdma_make_cookie( + be32_to_cpu(buffer.rdma_dest.h_rdma_rkey), + be32_to_cpu(buffer.rdma_dest.h_rdma_offset)); + + break; + } + } +} + +/* + * The transport must make sure that this is serialized against other + * rx and conn reset on this specific conn. + * + * We currently assert that only one fragmented message will be sent + * down a connection at a time. This lets us reassemble in the conn + * instead of per-flow which means that we don't have to go digging through + * flows to tear down partial reassembly progress on conn failure and + * we save flow lookup and locking for each frag arrival. It does mean + * that small messages will wait behind large ones. Fragmenting at all + * is only to reduce the memory consumption of pre-posted buffers. + * + * The caller passes in saddr and daddr instead of us getting it from the + * conn. This lets loopback, who only has one conn for both directions, + * tell us which roles the addrs in the conn are playing for this message. 
+ */ +void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr, + struct rds_incoming *inc, gfp_t gfp, enum km_type km) +{ + struct rds_sock *rs = NULL; + struct sock *sk; + unsigned long flags; + + inc->i_conn = conn; + inc->i_rx_jiffies = jiffies; + + rdsdebug("conn %p next %llu inc %p seq %llu len %u sport %u dport %u " + "flags 0x%x rx_jiffies %lu\n", conn, + (unsigned long long)conn->c_next_rx_seq, + inc, + (unsigned long long)be64_to_cpu(inc->i_hdr.h_sequence), + be32_to_cpu(inc->i_hdr.h_len), + be16_to_cpu(inc->i_hdr.h_sport), + be16_to_cpu(inc->i_hdr.h_dport), + inc->i_hdr.h_flags, + inc->i_rx_jiffies); + + /* + * Sequence numbers should only increase. Messages get their + * sequence number as they're queued in a sending conn. They + * can be dropped, though, if the sending socket is closed before + * they hit the wire. So sequence numbers can skip forward + * under normal operation. They can also drop back in the conn + * failover case as previously sent messages are resent down the + * new instance of a conn. We drop those, otherwise we have + * to assume that the next valid seq does not come after a + * hole in the fragment stream. + * + * The headers don't give us a way to realize if fragments of + * a message have been dropped. We assume that frags that arrive + * to a flow are part of the current message on the flow that is + * being reassembled. This means that senders can't drop messages + * from the sending conn until all their frags are sent. + * + * XXX we could spend more on the wire to get more robust failure + * detection, arguably worth it to avoid data corruption. 
+ */ + if (be64_to_cpu(inc->i_hdr.h_sequence) < conn->c_next_rx_seq + && (inc->i_hdr.h_flags & RDS_FLAG_RETRANSMITTED)) { + rds_stats_inc(s_recv_drop_old_seq); + goto out; + } + conn->c_next_rx_seq = be64_to_cpu(inc->i_hdr.h_sequence) + 1; + + if (rds_sysctl_ping_enable && inc->i_hdr.h_dport == 0) { + rds_stats_inc(s_recv_ping); + rds_send_pong(conn, inc->i_hdr.h_sport); + goto out; + } + + rs = rds_find_bound(daddr, inc->i_hdr.h_dport); + if (rs == NULL) { + rds_stats_inc(s_recv_drop_no_sock); + goto out; + } + + /* Process extension headers */ + rds_recv_incoming_exthdrs(inc, rs); + + /* We can be racing with rds_release() which marks the socket dead. */ + sk = rds_rs_to_sk(rs); + + /* serialize with rds_release -> sock_orphan */ + write_lock_irqsave(&rs->rs_recv_lock, flags); + if (!sock_flag(sk, SOCK_DEAD)) { + rdsdebug("adding inc %p to rs %p's recv queue\n", inc, rs); + rds_stats_inc(s_recv_queued); + rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, + be32_to_cpu(inc->i_hdr.h_len), + inc->i_hdr.h_dport); + rds_inc_addref(inc); + list_add_tail(&inc->i_item, &rs->rs_recv_queue); + __rds_wake_sk_sleep(sk); + } else { + rds_stats_inc(s_recv_drop_dead_sock); + } + write_unlock_irqrestore(&rs->rs_recv_lock, flags); + +out: + if (rs) + rds_sock_put(rs); +} + +/* + * be very careful here. This is being called as the condition in + * wait_event_*() needs to cope with being called many times. 
+ */ +static int rds_next_incoming(struct rds_sock *rs, struct rds_incoming **inc) +{ + unsigned long flags; + + if (*inc == NULL) { + read_lock_irqsave(&rs->rs_recv_lock, flags); + if (!list_empty(&rs->rs_recv_queue)) { + *inc = list_entry(rs->rs_recv_queue.next, + struct rds_incoming, + i_item); + rds_inc_addref(*inc); + } + read_unlock_irqrestore(&rs->rs_recv_lock, flags); + } + + return *inc != NULL; +} + +static int rds_still_queued(struct rds_sock *rs, struct rds_incoming *inc, + int drop) +{ + struct sock *sk = rds_rs_to_sk(rs); + int ret = 0; + unsigned long flags; + + write_lock_irqsave(&rs->rs_recv_lock, flags); + if (!list_empty(&inc->i_item)) { + ret = 1; + if (drop) { + /* XXX make sure this i_conn is reliable */ + rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, + -be32_to_cpu(inc->i_hdr.h_len), + inc->i_hdr.h_dport); + list_del_init(&inc->i_item); + rds_inc_put(inc); + } + } + write_unlock_irqrestore(&rs->rs_recv_lock, flags); + + rdsdebug("inc %p rs %p still %d dropped %d\n", inc, rs, ret, drop); + return ret; +} + +/* + * Pull errors off the error queue. + * If msghdr is NULL, we will just purge the error queue. + */ +int rds_notify_queue_get(struct rds_sock *rs, struct msghdr *msghdr) +{ + struct rds_notifier *notifier; + struct rds_rdma_notify cmsg; + unsigned int count = 0, max_messages = ~0U; + unsigned long flags; + LIST_HEAD(copy); + int err = 0; + + + /* put_cmsg copies to user space and thus may sleep. We can't do this + * with rs_lock held, so first grab as many notifications as we can stuff + * in the user provided cmsg buffer. We don't try to copy more, to avoid + * losing notifications - except when the buffer is so small that it wouldn't + * even hold a single notification. Then we give him as much of this single + * msg as we can squeeze in, and set MSG_CTRUNC. 
+ */ + if (msghdr) { + max_messages = msghdr->msg_controllen / CMSG_SPACE(sizeof(cmsg)); + if (!max_messages) + max_messages = 1; + } + + spin_lock_irqsave(&rs->rs_lock, flags); + while (!list_empty(&rs->rs_notify_queue) && count < max_messages) { + notifier = list_entry(rs->rs_notify_queue.next, + struct rds_notifier, n_list); + list_move(&notifier->n_list, &copy); + count++; + } + spin_unlock_irqrestore(&rs->rs_lock, flags); + + if (!count) + return 0; + + while (!list_empty(&copy)) { + notifier = list_entry(copy.next, struct rds_notifier, n_list); + + if (msghdr) { + cmsg.user_token = notifier->n_user_token; + cmsg.status = notifier->n_status; + + err = put_cmsg(msghdr, SOL_RDS, RDS_CMSG_RDMA_STATUS, + sizeof(cmsg), &cmsg); + if (err) + break; + } + + list_del_init(&notifier->n_list); + kfree(notifier); + } + + /* If we bailed out because of an error in put_cmsg, + * we may be left with one or more notifications that we + * didn't process. Return them to the head of the list. */ + if (!list_empty(&copy)) { + spin_lock_irqsave(&rs->rs_lock, flags); + list_splice(&copy, &rs->rs_notify_queue); + spin_unlock_irqrestore(&rs->rs_lock, flags); + } + + return err; +} + +/* + * Queue a congestion notification + */ +static int rds_notify_cong(struct rds_sock *rs, struct msghdr *msghdr) +{ + uint64_t notify = rs->rs_cong_notify; + unsigned long flags; + int err; + + err = put_cmsg(msghdr, SOL_RDS, RDS_CMSG_CONG_UPDATE, + sizeof(notify), &notify); + if (err) + return err; + + spin_lock_irqsave(&rs->rs_lock, flags); + rs->rs_cong_notify &= ~notify; + spin_unlock_irqrestore(&rs->rs_lock, flags); + + return 0; +} + +/* + * Receive any control messages.
+ */ +static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg) +{ + int ret = 0; + + if (inc->i_rdma_cookie) { + ret = put_cmsg(msg, SOL_RDS, RDS_CMSG_RDMA_DEST, + sizeof(inc->i_rdma_cookie), &inc->i_rdma_cookie); + if (ret) + return ret; + } + + return 0; +} + +int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, + size_t size, int msg_flags) +{ + struct sock *sk = sock->sk; + struct rds_sock *rs = rds_sk_to_rs(sk); + long timeo; + int ret = 0, nonblock = msg_flags & MSG_DONTWAIT; + struct sockaddr_in *sin; + struct rds_incoming *inc = NULL; + + /* udp_recvmsg()->sock_recvtimeo() gets away without locking too.. */ + timeo = sock_rcvtimeo(sk, nonblock); + + rdsdebug("size %zu flags 0x%x timeo %ld\n", size, msg_flags, timeo); + + if (msg_flags & MSG_OOB) + goto out; + + /* If there are pending notifications, do those - and nothing else */ + if (!list_empty(&rs->rs_notify_queue)) { + ret = rds_notify_queue_get(rs, msg); + goto out; + } + + if (rs->rs_cong_notify) { + ret = rds_notify_cong(rs, msg); + goto out; + } + + while (1) { + if (!rds_next_incoming(rs, &inc)) { + if (nonblock) { + ret = -EAGAIN; + break; + } + + timeo = wait_event_interruptible_timeout(*sk->sk_sleep, + rds_next_incoming(rs, &inc), + timeo); + rdsdebug("recvmsg woke inc %p timeo %ld\n", inc, + timeo); + if (timeo > 0 || timeo == MAX_SCHEDULE_TIMEOUT) + continue; + + ret = timeo; + if (ret == 0) + ret = -ETIMEDOUT; + break; + } + + rdsdebug("copying inc %p from %pI4:%u to user\n", inc, + &inc->i_conn->c_faddr, + ntohs(inc->i_hdr.h_sport)); + ret = inc->i_conn->c_trans->inc_copy_to_user(inc, msg->msg_iov, + size); + if (ret < 0) + break; + + /* + * if the message we just copied isn't at the head of the + * recv queue then someone else raced us to return it, try + * to get the next message. 
+ */ + if (!rds_still_queued(rs, inc, !(msg_flags & MSG_PEEK))) { + rds_inc_put(inc); + inc = NULL; + rds_stats_inc(s_recv_deliver_raced); + continue; + } + + if (ret < be32_to_cpu(inc->i_hdr.h_len)) { + if (msg_flags & MSG_TRUNC) + ret = be32_to_cpu(inc->i_hdr.h_len); + msg->msg_flags |= MSG_TRUNC; + } + + if (rds_cmsg_recv(inc, msg)) { + ret = -EFAULT; + goto out; + } + + rds_stats_inc(s_recv_delivered); + + sin = (struct sockaddr_in *)msg->msg_name; + if (sin) { + sin->sin_family = AF_INET; + sin->sin_port = inc->i_hdr.h_sport; + sin->sin_addr.s_addr = inc->i_saddr; + memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); + } + break; + } + + if (inc) + rds_inc_put(inc); + +out: + return ret; +} + +/* + * The socket is being shut down and we're asked to drop messages that were + * queued for recvmsg. The caller has unbound the socket so the receive path + * won't queue any more incoming fragments or messages on the socket. + */ +void rds_clear_recv_queue(struct rds_sock *rs) +{ + struct sock *sk = rds_rs_to_sk(rs); + struct rds_incoming *inc, *tmp; + unsigned long flags; + + write_lock_irqsave(&rs->rs_recv_lock, flags); + list_for_each_entry_safe(inc, tmp, &rs->rs_recv_queue, i_item) { + rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, + -be32_to_cpu(inc->i_hdr.h_len), + inc->i_hdr.h_dport); + list_del_init(&inc->i_item); + rds_inc_put(inc); + } + write_unlock_irqrestore(&rs->rs_recv_lock, flags); +} + +/* + * inc->i_saddr isn't used here because it is only set in the receive + * path. 
+ */ +void rds_inc_info_copy(struct rds_incoming *inc, + struct rds_info_iterator *iter, + __be32 saddr, __be32 daddr, int flip) +{ + struct rds_info_message minfo; + + minfo.seq = be64_to_cpu(inc->i_hdr.h_sequence); + minfo.len = be32_to_cpu(inc->i_hdr.h_len); + + if (flip) { + minfo.laddr = daddr; + minfo.faddr = saddr; + minfo.lport = inc->i_hdr.h_dport; + minfo.fport = inc->i_hdr.h_sport; + } else { + minfo.laddr = saddr; + minfo.faddr = daddr; + minfo.lport = inc->i_hdr.h_sport; + minfo.fport = inc->i_hdr.h_dport; + } + + rds_info_copy(iter, &minfo, sizeof(minfo)); +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:30 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:30 -0800 Subject: [ofa-general] [PATCH 13/26] RDS/IB: Infiniband transport In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-14-git-send-email-andy.grover@oracle.com> Registers as an RDS transport and an IB client, and uses IB CM API to allocate ids, queue pairs, and the rest of that fun stuff. Signed-off-by: Andy Grover --- net/rds/ib.c | 323 ++++++++++++++++++++++++ net/rds/ib.h | 367 ++++++++++++++++++++++++++++ net/rds/ib_cm.c | 726 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 1416 insertions(+), 0 deletions(-) create mode 100644 net/rds/ib.c create mode 100644 net/rds/ib.h create mode 100644 net/rds/ib_cm.c diff --git a/net/rds/ib.c b/net/rds/ib.c new file mode 100644 index 0000000..06a7b79 --- /dev/null +++ b/net/rds/ib.c @@ -0,0 +1,323 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include +#include +#include +#include +#include + +#include "rds.h" +#include "ib.h" + +unsigned int fmr_pool_size = RDS_FMR_POOL_SIZE; +unsigned int fmr_message_size = RDS_FMR_SIZE + 1; /* +1 allows for unaligned MRs */ + +module_param(fmr_pool_size, int, 0444); +MODULE_PARM_DESC(fmr_pool_size, " Max number of fmr per HCA"); +module_param(fmr_message_size, int, 0444); +MODULE_PARM_DESC(fmr_message_size, " Max size of a RDMA transfer"); + +struct list_head rds_ib_devices; + +DEFINE_SPINLOCK(ib_nodev_conns_lock); +LIST_HEAD(ib_nodev_conns); + +void rds_ib_add_one(struct ib_device *device) +{ + struct rds_ib_device *rds_ibdev; + struct ib_device_attr *dev_attr; + + /* Only handle IB (no iWARP) devices */ + if (device->node_type != RDMA_NODE_IB_CA) + return; + + dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL); + if (!dev_attr) + return; + + if (ib_query_device(device, dev_attr)) { + rdsdebug("Query device failed for %s\n", device->name); + goto free_attr; + } + + rds_ibdev = kmalloc(sizeof *rds_ibdev, GFP_KERNEL); + if (!rds_ibdev) + goto free_attr; + + spin_lock_init(&rds_ibdev->spinlock); + + rds_ibdev->max_wrs = dev_attr->max_qp_wr; + rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE); + + rds_ibdev->fmr_page_shift = max(9, ffs(dev_attr->page_size_cap) - 1); + rds_ibdev->fmr_page_size = 1 << rds_ibdev->fmr_page_shift; + rds_ibdev->fmr_page_mask = ~((u64) rds_ibdev->fmr_page_size - 1); + rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32; + rds_ibdev->max_fmrs = dev_attr->max_fmr ? 
+ min_t(unsigned int, dev_attr->max_fmr, fmr_pool_size) : + fmr_pool_size; + + rds_ibdev->dev = device; + rds_ibdev->pd = ib_alloc_pd(device); + if (IS_ERR(rds_ibdev->pd)) + goto free_dev; + + rds_ibdev->mr = ib_get_dma_mr(rds_ibdev->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(rds_ibdev->mr)) + goto err_pd; + + rds_ibdev->mr_pool = rds_ib_create_mr_pool(rds_ibdev); + if (IS_ERR(rds_ibdev->mr_pool)) { + rds_ibdev->mr_pool = NULL; + goto err_mr; + } + + INIT_LIST_HEAD(&rds_ibdev->ipaddr_list); + INIT_LIST_HEAD(&rds_ibdev->conn_list); + list_add_tail(&rds_ibdev->list, &rds_ib_devices); + + ib_set_client_data(device, &rds_ib_client, rds_ibdev); + + goto free_attr; + +err_mr: + ib_dereg_mr(rds_ibdev->mr); +err_pd: + ib_dealloc_pd(rds_ibdev->pd); +free_dev: + kfree(rds_ibdev); +free_attr: + kfree(dev_attr); +} + +void rds_ib_remove_one(struct ib_device *device) +{ + struct rds_ib_device *rds_ibdev; + struct rds_ib_ipaddr *i_ipaddr, *i_next; + + rds_ibdev = ib_get_client_data(device, &rds_ib_client); + if (!rds_ibdev) + return; + + list_for_each_entry_safe(i_ipaddr, i_next, &rds_ibdev->ipaddr_list, list) { + list_del(&i_ipaddr->list); + kfree(i_ipaddr); + } + + rds_ib_remove_conns(rds_ibdev); + + if (rds_ibdev->mr_pool) + rds_ib_destroy_mr_pool(rds_ibdev->mr_pool); + + ib_dereg_mr(rds_ibdev->mr); + + while (ib_dealloc_pd(rds_ibdev->pd)) { + rdsdebug("Failed to dealloc pd %p\n", rds_ibdev->pd); + msleep(1); + } + + list_del(&rds_ibdev->list); + kfree(rds_ibdev); +} + +struct ib_client rds_ib_client = { + .name = "rds_ib", + .add = rds_ib_add_one, + .remove = rds_ib_remove_one +}; + +static int rds_ib_conn_info_visitor(struct rds_connection *conn, + void *buffer) +{ + struct rds_info_rdma_connection *iinfo = buffer; + struct rds_ib_connection *ic; + + /* We will only ever look at IB transports */ + if (conn->c_trans != &rds_ib_transport) + return 0; + + iinfo->src_addr = conn->c_laddr; + iinfo->dst_addr = conn->c_faddr; + + memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid)); 
+ memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid)); + if (rds_conn_state(conn) == RDS_CONN_UP) { + struct rds_ib_device *rds_ibdev; + struct rdma_dev_addr *dev_addr; + + ic = conn->c_transport_data; + dev_addr = &ic->i_cm_id->route.addr.dev_addr; + + ib_addr_get_sgid(dev_addr, (union ib_gid *) &iinfo->src_gid); + ib_addr_get_dgid(dev_addr, (union ib_gid *) &iinfo->dst_gid); + + rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client); + iinfo->max_send_wr = ic->i_send_ring.w_nr; + iinfo->max_recv_wr = ic->i_recv_ring.w_nr; + iinfo->max_send_sge = rds_ibdev->max_sge; + rds_ib_get_mr_info(rds_ibdev, iinfo); + } + return 1; +} + +static void rds_ib_ic_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + rds_for_each_conn_info(sock, len, iter, lens, + rds_ib_conn_info_visitor, + sizeof(struct rds_info_rdma_connection)); +} + + +/* + * Early RDS/IB was built to only bind to an address if there is an IPoIB + * device with that address set. + * + * If it were me, I'd advocate for something more flexible. Sending and + * receiving should be device-agnostic. Transports would try and maintain + * connections between peers who have messages queued. Userspace would be + * allowed to influence which paths have priority. We could call userspace + * asserting this policy "routing". + */ +static int rds_ib_laddr_check(__be32 addr) +{ + int ret; + struct rdma_cm_id *cm_id; + struct sockaddr_in sin; + + /* Create a CMA ID and try to bind it. This catches both + * IB and iWARP capable NICs. + */ + cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP); + if (!cm_id) + return -EADDRNOTAVAIL; + + memset(&sin, 0, sizeof(sin)); + sin.sin_family = AF_INET; + sin.sin_addr.s_addr = addr; + + /* rdma_bind_addr will only succeed for IB & iWARP devices */ + ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin); + /* due to this, we will claim to support iWARP devices unless we + check node_type. 
*/ + if (ret || cm_id->device->node_type != RDMA_NODE_IB_CA) + ret = -EADDRNOTAVAIL; + + rdsdebug("addr %pI4 ret %d node type %d\n", + &addr, ret, + cm_id->device ? cm_id->device->node_type : -1); + + rdma_destroy_id(cm_id); + + return ret; +} + +void rds_ib_exit(void) +{ + rds_info_deregister_func(RDS_INFO_IB_CONNECTIONS, rds_ib_ic_info); + rds_ib_remove_nodev_conns(); + ib_unregister_client(&rds_ib_client); + rds_ib_sysctl_exit(); + rds_ib_recv_exit(); + rds_trans_unregister(&rds_ib_transport); +} + +struct rds_transport rds_ib_transport = { + .laddr_check = rds_ib_laddr_check, + .xmit_complete = rds_ib_xmit_complete, + .xmit = rds_ib_xmit, + .xmit_cong_map = NULL, + .xmit_rdma = rds_ib_xmit_rdma, + .recv = rds_ib_recv, + .conn_alloc = rds_ib_conn_alloc, + .conn_free = rds_ib_conn_free, + .conn_connect = rds_ib_conn_connect, + .conn_shutdown = rds_ib_conn_shutdown, + .inc_copy_to_user = rds_ib_inc_copy_to_user, + .inc_purge = rds_ib_inc_purge, + .inc_free = rds_ib_inc_free, + .cm_initiate_connect = rds_ib_cm_initiate_connect, + .cm_handle_connect = rds_ib_cm_handle_connect, + .cm_connect_complete = rds_ib_cm_connect_complete, + .stats_info_copy = rds_ib_stats_info_copy, + .exit = rds_ib_exit, + .get_mr = rds_ib_get_mr, + .sync_mr = rds_ib_sync_mr, + .free_mr = rds_ib_free_mr, + .flush_mrs = rds_ib_flush_mrs, + .t_owner = THIS_MODULE, + .t_name = "infiniband", +}; + +int __init rds_ib_init(void) +{ + int ret; + + INIT_LIST_HEAD(&rds_ib_devices); + + ret = ib_register_client(&rds_ib_client); + if (ret) + goto out; + + ret = rds_ib_sysctl_init(); + if (ret) + goto out_ibreg; + + ret = rds_ib_recv_init(); + if (ret) + goto out_sysctl; + + ret = rds_trans_register(&rds_ib_transport); + if (ret) + goto out_recv; + + rds_info_register_func(RDS_INFO_IB_CONNECTIONS, rds_ib_ic_info); + + goto out; + +out_recv: + rds_ib_recv_exit(); +out_sysctl: + rds_ib_sysctl_exit(); +out_ibreg: + ib_unregister_client(&rds_ib_client); +out: + return ret; +} + +MODULE_LICENSE("GPL"); + 
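The ib.h header that follows packs two counters into a single atomic_t. Send credits live in the low 16 bits and posted receive credits live in the high 16 bits (IB_GET_SEND_CREDITS / IB_GET_POST_CREDITS), so both can be read and updated together with cmpxchg instead of a spinlock. The sketch below is a hypothetical user-space illustration of that idea using C11 <stdatomic.h>. The helper names add_send_credits, add_posted_credits and grab_credits are invented here. The sketch is much simpler than the patch's own rds_ib_send_grab_credits, and it assumes neither counter ever exceeds 0xffff.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical mirrors of the ib.h packing macros: low 16 bits hold
 * send credits, high 16 bits hold posted receive credits. */
#define GET_SEND_CREDITS(v) ((v) & 0xffffu)
#define GET_POST_CREDITS(v) ((v) >> 16)
#define SET_SEND_CREDITS(v) ((v) & 0xffffu)
#define SET_POST_CREDITS(v) ((v) << 16)

/* Credits granted by the peer; assumes the total stays below 0x10000. */
static void add_send_credits(_Atomic uint32_t *credits, uint32_t n)
{
	atomic_fetch_add(credits, SET_SEND_CREDITS(n));
}

/* Receive buffers we posted and still need to advertise to the peer. */
static void add_posted_credits(_Atomic uint32_t *credits, uint32_t n)
{
	atomic_fetch_add(credits, SET_POST_CREDITS(n));
}

/*
 * Take up to 'wanted' send credits and, in the same atomic step, collect
 * all pending posted credits so the caller can advertise them.
 * Returns the number of send credits actually taken.
 */
static uint32_t grab_credits(_Atomic uint32_t *credits, uint32_t wanted,
			     uint32_t *adv_posted)
{
	uint32_t oldval = atomic_load(credits);
	uint32_t avail, posted, got, newval;

	do {
		avail = GET_SEND_CREDITS(oldval);
		posted = GET_POST_CREDITS(oldval);
		got = wanted < avail ? wanted : avail;
		/* Keep the remaining send credits; clear the posted credits. */
		newval = SET_SEND_CREDITS(avail - got);
		/* On failure, oldval is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(credits, &oldval, newval));

	*adv_posted = posted;
	return got;
}
```

Suppose you add 10 send credits and 3 posted credits. Then grab_credits(&c, 4, &adv) returns 4, sets adv to 3, and leaves 6 send credits and 0 posted credits in the word. The compare-and-swap retry loop is what lets both fields change together without taking a lock.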
diff --git a/net/rds/ib.h b/net/rds/ib.h new file mode 100644 index 0000000..8be563a --- /dev/null +++ b/net/rds/ib.h @@ -0,0 +1,367 @@ +#ifndef _RDS_IB_H +#define _RDS_IB_H + +#include +#include +#include "rds.h" +#include "rdma_transport.h" + +#define RDS_FMR_SIZE 256 +#define RDS_FMR_POOL_SIZE 4096 + +#define RDS_IB_MAX_SGE 8 +#define RDS_IB_RECV_SGE 2 + +#define RDS_IB_DEFAULT_RECV_WR 1024 +#define RDS_IB_DEFAULT_SEND_WR 256 + +#define RDS_IB_SUPPORTED_PROTOCOLS 0x00000003 /* minor versions supported */ + +extern struct list_head rds_ib_devices; + +/* + * IB posts RDS_FRAG_SIZE fragments of pages to the receive queues to + * try and minimize the amount of memory tied up in both the device and + * socket receive queues. + */ +/* page offset of the final full frag that fits in the page */ +#define RDS_PAGE_LAST_OFF (((PAGE_SIZE / RDS_FRAG_SIZE) - 1) * RDS_FRAG_SIZE) +struct rds_page_frag { + struct list_head f_item; + struct page *f_page; + unsigned long f_offset; + dma_addr_t f_mapped; +}; + +struct rds_ib_incoming { + struct list_head ii_frags; + struct rds_incoming ii_inc; +}; + +struct rds_ib_connect_private { + /* Add new fields at the end, and don't permute existing fields.
+ */ + __be32 dp_saddr; + __be32 dp_daddr; + u8 dp_protocol_major; + u8 dp_protocol_minor; + __be16 dp_protocol_minor_mask; /* bitmask */ + __be32 dp_reserved1; + __be64 dp_ack_seq; + __be32 dp_credit; /* non-zero enables flow ctl */ +}; + +struct rds_ib_send_work { + struct rds_message *s_rm; + struct rds_rdma_op *s_op; + struct ib_send_wr s_wr; + struct ib_sge s_sge[RDS_IB_MAX_SGE]; + unsigned long s_queued; +}; + +struct rds_ib_recv_work { + struct rds_ib_incoming *r_ibinc; + struct rds_page_frag *r_frag; + struct ib_recv_wr r_wr; + struct ib_sge r_sge[2]; +}; + +struct rds_ib_work_ring { + u32 w_nr; + u32 w_alloc_ptr; + u32 w_alloc_ctr; + u32 w_free_ptr; + atomic_t w_free_ctr; +}; + +struct rds_ib_device; + +struct rds_ib_connection { + + struct list_head ib_node; + struct rds_ib_device *rds_ibdev; + struct rds_connection *conn; + + /* alphabet soup, IBTA style */ + struct rdma_cm_id *i_cm_id; + struct ib_pd *i_pd; + struct ib_mr *i_mr; + struct ib_cq *i_send_cq; + struct ib_cq *i_recv_cq; + + /* tx */ + struct rds_ib_work_ring i_send_ring; + struct rds_message *i_rm; + struct rds_header *i_send_hdrs; + u64 i_send_hdrs_dma; + struct rds_ib_send_work *i_sends; + + /* rx */ + struct mutex i_recv_mutex; + struct rds_ib_work_ring i_recv_ring; + struct rds_ib_incoming *i_ibinc; + u32 i_recv_data_rem; + struct rds_header *i_recv_hdrs; + u64 i_recv_hdrs_dma; + struct rds_ib_recv_work *i_recvs; + struct rds_page_frag i_frag; + u64 i_ack_recv; /* last ACK received */ + + /* sending acks */ + unsigned long i_ack_flags; + u64 i_ack_next; /* next ACK to send */ + struct rds_header *i_ack; + struct ib_send_wr i_ack_wr; + struct ib_sge i_ack_sge; + u64 i_ack_dma; + unsigned long i_ack_queued; + + /* Flow control related information + * + * Our algorithm uses a pair of variables that we need to access + * atomically - one for the send credits, and one for the posted + * recv credits we need to transfer to remote.
+ * Rather than protect them using a slow spinlock, we put both into + * a single atomic_t and update it using cmpxchg + */ + atomic_t i_credits; + + /* Protocol version specific information */ + unsigned int i_flowctl:1; /* enable/disable flow ctl */ + + /* Batched completions */ + unsigned int i_unsignaled_wrs; + long i_unsignaled_bytes; +}; + +/* This assumes that atomic_t is at least 32 bits */ +#define IB_GET_SEND_CREDITS(v) ((v) & 0xffff) +#define IB_GET_POST_CREDITS(v) ((v) >> 16) +#define IB_SET_SEND_CREDITS(v) ((v) & 0xffff) +#define IB_SET_POST_CREDITS(v) ((v) << 16) + +struct rds_ib_ipaddr { + struct list_head list; + __be32 ipaddr; +}; + +struct rds_ib_device { + struct list_head list; + struct list_head ipaddr_list; + struct list_head conn_list; + struct ib_device *dev; + struct ib_pd *pd; + struct ib_mr *mr; + struct rds_ib_mr_pool *mr_pool; + int fmr_page_shift; + int fmr_page_size; + u64 fmr_page_mask; + unsigned int fmr_max_remaps; + unsigned int max_fmrs; + int max_sge; + unsigned int max_wrs; + spinlock_t spinlock; /* protect the above */ +}; + +/* bits for i_ack_flags */ +#define IB_ACK_IN_FLIGHT 0 +#define IB_ACK_REQUESTED 1 + +/* Magic WR_ID for ACKs */ +#define RDS_IB_ACK_WR_ID (~(u64) 0) + +struct rds_ib_statistics { + uint64_t s_ib_connect_raced; + uint64_t s_ib_listen_closed_stale; + uint64_t s_ib_tx_cq_call; + uint64_t s_ib_tx_cq_event; + uint64_t s_ib_tx_ring_full; + uint64_t s_ib_tx_throttle; + uint64_t s_ib_tx_sg_mapping_failure; + uint64_t s_ib_tx_stalled; + uint64_t s_ib_tx_credit_updates; + uint64_t s_ib_rx_cq_call; + uint64_t s_ib_rx_cq_event; + uint64_t s_ib_rx_ring_empty; + uint64_t s_ib_rx_refill_from_cq; + uint64_t s_ib_rx_refill_from_thread; + uint64_t s_ib_rx_alloc_limit; + uint64_t s_ib_rx_credit_updates; + uint64_t s_ib_ack_sent; + uint64_t s_ib_ack_send_failure; + uint64_t s_ib_ack_send_delayed; + uint64_t s_ib_ack_send_piggybacked; + uint64_t s_ib_ack_received; + uint64_t s_ib_rdma_mr_alloc; + uint64_t s_ib_rdma_mr_free; 
+ uint64_t s_ib_rdma_mr_used; + uint64_t s_ib_rdma_mr_pool_flush; + uint64_t s_ib_rdma_mr_pool_wait; + uint64_t s_ib_rdma_mr_pool_depleted; +}; + +extern struct workqueue_struct *rds_ib_wq; + +/* + * Fake ib_dma_sync_sg_for_{cpu,device} as long as ib_verbs.h + * doesn't define it. + */ +static inline void rds_ib_dma_sync_sg_for_cpu(struct ib_device *dev, + struct scatterlist *sg, unsigned int sg_dma_len, int direction) +{ + unsigned int i; + + for (i = 0; i < sg_dma_len; ++i) { + ib_dma_sync_single_for_cpu(dev, + ib_sg_dma_address(dev, &sg[i]), + ib_sg_dma_len(dev, &sg[i]), + direction); + } +} +#define ib_dma_sync_sg_for_cpu rds_ib_dma_sync_sg_for_cpu + +static inline void rds_ib_dma_sync_sg_for_device(struct ib_device *dev, + struct scatterlist *sg, unsigned int sg_dma_len, int direction) +{ + unsigned int i; + + for (i = 0; i < sg_dma_len; ++i) { + ib_dma_sync_single_for_device(dev, + ib_sg_dma_address(dev, &sg[i]), + ib_sg_dma_len(dev, &sg[i]), + direction); + } +} +#define ib_dma_sync_sg_for_device rds_ib_dma_sync_sg_for_device + + +/* ib.c */ +extern struct rds_transport rds_ib_transport; +extern void rds_ib_add_one(struct ib_device *device); +extern void rds_ib_remove_one(struct ib_device *device); +extern struct ib_client rds_ib_client; + +extern unsigned int fmr_pool_size; +extern unsigned int fmr_message_size; + +extern spinlock_t ib_nodev_conns_lock; +extern struct list_head ib_nodev_conns; + +/* ib_cm.c */ +int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp); +void rds_ib_conn_free(void *arg); +int rds_ib_conn_connect(struct rds_connection *conn); +void rds_ib_conn_shutdown(struct rds_connection *conn); +void rds_ib_state_change(struct sock *sk); +int __init rds_ib_listen_init(void); +void rds_ib_listen_stop(void); +void __rds_ib_conn_error(struct rds_connection *conn, const char *, ...); +int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event); +int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id); +void 
rds_ib_cm_connect_complete(struct rds_connection *conn, + struct rdma_cm_event *event); + + +#define rds_ib_conn_error(conn, fmt...) \ + __rds_ib_conn_error(conn, KERN_WARNING "RDS/IB: " fmt) + +/* ib_rdma.c */ +int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr); +int rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); +void rds_ib_remove_nodev_conns(void); +void rds_ib_remove_conns(struct rds_ib_device *rds_ibdev); +struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *); +void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev, struct rds_info_rdma_connection *iinfo); +void rds_ib_destroy_mr_pool(struct rds_ib_mr_pool *); +void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents, + struct rds_sock *rs, u32 *key_ret); +void rds_ib_sync_mr(void *trans_private, int dir); +void rds_ib_free_mr(void *trans_private, int invalidate); +void rds_ib_flush_mrs(void); + +/* ib_recv.c */ +int __init rds_ib_recv_init(void); +void rds_ib_recv_exit(void); +int rds_ib_recv(struct rds_connection *conn); +int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, + gfp_t page_gfp, int prefill); +void rds_ib_inc_purge(struct rds_incoming *inc); +void rds_ib_inc_free(struct rds_incoming *inc); +int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *iov, + size_t size); +void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context); +void rds_ib_recv_init_ring(struct rds_ib_connection *ic); +void rds_ib_recv_clear_ring(struct rds_ib_connection *ic); +void rds_ib_recv_init_ack(struct rds_ib_connection *ic); +void rds_ib_attempt_ack(struct rds_ib_connection *ic); +void rds_ib_ack_send_complete(struct rds_ib_connection *ic); +u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic); + +/* ib_ring.c */ +void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr); +void rds_ib_ring_resize(struct rds_ib_work_ring *ring, u32 nr); +u32 rds_ib_ring_alloc(struct rds_ib_work_ring *ring, u32 val, u32 *pos); 
+void rds_ib_ring_free(struct rds_ib_work_ring *ring, u32 val); +void rds_ib_ring_unalloc(struct rds_ib_work_ring *ring, u32 val); +int rds_ib_ring_empty(struct rds_ib_work_ring *ring); +int rds_ib_ring_low(struct rds_ib_work_ring *ring); +u32 rds_ib_ring_oldest(struct rds_ib_work_ring *ring); +u32 rds_ib_ring_completed(struct rds_ib_work_ring *ring, u32 wr_id, u32 oldest); +extern wait_queue_head_t rds_ib_ring_empty_wait; + +/* ib_send.c */ +void rds_ib_xmit_complete(struct rds_connection *conn); +int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, + unsigned int hdr_off, unsigned int sg, unsigned int off); +void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context); +void rds_ib_send_init_ring(struct rds_ib_connection *ic); +void rds_ib_send_clear_ring(struct rds_ib_connection *ic); +int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op); +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits); +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted); +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, u32 wanted, + u32 *adv_credits, int need_posted); + +/* ib_stats.c */ +DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats); +#define rds_ib_stats_inc(member) rds_stats_inc_which(rds_ib_stats, member) +unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter, + unsigned int avail); + +/* ib_sysctl.c */ +int __init rds_ib_sysctl_init(void); +void rds_ib_sysctl_exit(void); +extern unsigned long rds_ib_sysctl_max_send_wr; +extern unsigned long rds_ib_sysctl_max_recv_wr; +extern unsigned long rds_ib_sysctl_max_unsig_wrs; +extern unsigned long rds_ib_sysctl_max_unsig_bytes; +extern unsigned long rds_ib_sysctl_max_recv_allocation; +extern unsigned int rds_ib_sysctl_flow_control; +extern ctl_table rds_ib_sysctl_table[]; + +/* + * Helper functions for getting/setting the header and data SGEs in + * RDS packets (not RDMA) + */ +static inline struct ib_sge * 
+rds_ib_header_sge(struct rds_ib_connection *ic, struct ib_sge *sge) +{ + return &sge[0]; +} + +static inline struct ib_sge * +rds_ib_data_sge(struct rds_ib_connection *ic, struct ib_sge *sge) +{ + return &sge[1]; +} + +static inline void rds_ib_set_64bit(u64 *ptr, u64 val) +{ +#if BITS_PER_LONG == 64 + *ptr = val; +#else + set_64bit(ptr, val); +#endif +} + +#endif diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c new file mode 100644 index 0000000..0532237 --- /dev/null +++ b/net/rds/ib_cm.c @@ -0,0 +1,726 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "rds.h" +#include "ib.h" + +/* + * Set the selected protocol version + */ +static void rds_ib_set_protocol(struct rds_connection *conn, unsigned int version) +{ + conn->c_version = version; +} + +/* + * Set up flow control + */ +static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (rds_ib_sysctl_flow_control && credits != 0) { + /* We're doing flow control */ + ic->i_flowctl = 1; + rds_ib_send_add_credits(conn, credits); + } else { + ic->i_flowctl = 0; + } +} + +/* + * Tune RNR behavior. Without flow control, we use a rather + * low timeout, but not the absolute minimum - this should + * be tunable. + * + * We already set the RNR retry count to 7 (which is the + * smallest infinite number :-) above. + * If flow control is off, we want to change this back to 0 + * so that we learn quickly when our credit accounting is + * buggy. + * + * Caller passes in a qp_attr pointer - don't waste stack space + * by allocating this twice. + */ +static void +rds_ib_tune_rnr(struct rds_ib_connection *ic, struct ib_qp_attr *attr) +{ + int ret; + + attr->min_rnr_timer = IB_RNR_TIMER_000_32; + ret = ib_modify_qp(ic->i_cm_id->qp, attr, IB_QP_MIN_RNR_TIMER); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_MIN_RNR_TIMER): err=%d\n", -ret); +} + +/* + * Connection established. + * We get here for both outgoing and incoming connections.
+ */ +void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_event *event) +{ + const struct rds_ib_connect_private *dp = NULL; + struct rds_ib_connection *ic = conn->c_transport_data; + struct rds_ib_device *rds_ibdev; + struct ib_qp_attr qp_attr; + int err; + + if (event->param.conn.private_data_len) { + dp = event->param.conn.private_data; + + rds_ib_set_protocol(conn, + RDS_PROTOCOL(dp->dp_protocol_major, + dp->dp_protocol_minor)); + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); + } + + printk(KERN_NOTICE "RDS/IB: connected to %pI4 version %u.%u%s\n", + &conn->c_laddr, + RDS_PROTOCOL_MAJOR(conn->c_version), + RDS_PROTOCOL_MINOR(conn->c_version), + ic->i_flowctl ? ", flow control" : ""); + + /* Tune RNR behavior */ + rds_ib_tune_rnr(ic, &qp_attr); + + qp_attr.qp_state = IB_QPS_RTS; + err = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE); + if (err) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, RTS): err=%d\n", err); + + /* update ib_device with this local ipaddr & conn */ + rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client); + err = rds_ib_update_ipaddr(rds_ibdev, conn->c_laddr); + if (err) + printk(KERN_ERR "rds_ib_update_ipaddr failed (%d)\n", err); + err = rds_ib_add_conn(rds_ibdev, conn); + if (err) + printk(KERN_ERR "rds_ib_add_conn failed (%d)\n", err); + + /* If the peer gave us the last packet it saw, process this as if + * we had received a regular ACK. */ + if (dp && dp->dp_ack_seq) + rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL); + + rds_connect_complete(conn); +} + +static void rds_ib_cm_fill_conn_param(struct rds_connection *conn, + struct rdma_conn_param *conn_param, + struct rds_ib_connect_private *dp, + u32 protocol_version) +{ + memset(conn_param, 0, sizeof(struct rdma_conn_param)); + /* XXX tune these? 
*/ + conn_param->responder_resources = 1; + conn_param->initiator_depth = 1; + conn_param->retry_count = 7; + conn_param->rnr_retry_count = 7; + + if (dp) { + struct rds_ib_connection *ic = conn->c_transport_data; + + memset(dp, 0, sizeof(*dp)); + dp->dp_saddr = conn->c_laddr; + dp->dp_daddr = conn->c_faddr; + dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version); + dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version); + dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS); + dp->dp_ack_seq = rds_ib_piggyb_ack(ic); + + /* Advertise flow control */ + if (ic->i_flowctl) { + unsigned int credits; + + credits = IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)); + dp->dp_credit = cpu_to_be32(credits); + atomic_sub(IB_SET_POST_CREDITS(credits), &ic->i_credits); + } + + conn_param->private_data = dp; + conn_param->private_data_len = sizeof(*dp); + } +} + +static void rds_ib_cq_event_handler(struct ib_event *event, void *data) +{ + rdsdebug("event %u data %p\n", event->event, data); +} + +static void rds_ib_qp_event_handler(struct ib_event *event, void *data) +{ + struct rds_connection *conn = data; + struct rds_ib_connection *ic = conn->c_transport_data; + + rdsdebug("conn %p ic %p event %u\n", conn, ic, event->event); + + switch (event->event) { + case IB_EVENT_COMM_EST: + rdma_notify(ic->i_cm_id, IB_EVENT_COMM_EST); + break; + default: + printk(KERN_WARNING "RDS/ib: unhandled QP event %u " + "on connection to %pI4\n", event->event, + &conn->c_faddr); + break; + } +} + +/* + * This needs to be very careful to not leave IS_ERR pointers around for + * cleanup to trip over. 
+ */ +static int rds_ib_setup_qp(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct ib_device *dev = ic->i_cm_id->device; + struct ib_qp_init_attr attr; + struct rds_ib_device *rds_ibdev; + int ret; + + /* rds_ib_add_one creates a rds_ib_device object per IB device, + * and allocates a protection domain, memory range and FMR pool + * for each. If that fails for any reason, it will not register + * the rds_ibdev at all. + */ + rds_ibdev = ib_get_client_data(dev, &rds_ib_client); + if (rds_ibdev == NULL) { + if (printk_ratelimit()) + printk(KERN_NOTICE "RDS/IB: No client_data for device %s\n", + dev->name); + return -EOPNOTSUPP; + } + + if (rds_ibdev->max_wrs < ic->i_send_ring.w_nr + 1) + rds_ib_ring_resize(&ic->i_send_ring, rds_ibdev->max_wrs - 1); + if (rds_ibdev->max_wrs < ic->i_recv_ring.w_nr + 1) + rds_ib_ring_resize(&ic->i_recv_ring, rds_ibdev->max_wrs - 1); + + /* Protection domain and memory range */ + ic->i_pd = rds_ibdev->pd; + ic->i_mr = rds_ibdev->mr; + + ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler, + rds_ib_cq_event_handler, conn, + ic->i_send_ring.w_nr + 1, 0); + if (IS_ERR(ic->i_send_cq)) { + ret = PTR_ERR(ic->i_send_cq); + ic->i_send_cq = NULL; + rdsdebug("ib_create_cq send failed: %d\n", ret); + goto out; + } + + ic->i_recv_cq = ib_create_cq(dev, rds_ib_recv_cq_comp_handler, + rds_ib_cq_event_handler, conn, + ic->i_recv_ring.w_nr, 0); + if (IS_ERR(ic->i_recv_cq)) { + ret = PTR_ERR(ic->i_recv_cq); + ic->i_recv_cq = NULL; + rdsdebug("ib_create_cq recv failed: %d\n", ret); + goto out; + } + + ret = ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP); + if (ret) { + rdsdebug("ib_req_notify_cq send failed: %d\n", ret); + goto out; + } + + ret = ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED); + if (ret) { + rdsdebug("ib_req_notify_cq recv failed: %d\n", ret); + goto out; + } + + /* XXX negotiate max send/recv with remote? 
*/ + memset(&attr, 0, sizeof(attr)); + attr.event_handler = rds_ib_qp_event_handler; + attr.qp_context = conn; + /* + 1 to allow for the single ack message */ + attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1; + attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1; + attr.cap.max_send_sge = rds_ibdev->max_sge; + attr.cap.max_recv_sge = RDS_IB_RECV_SGE; + attr.sq_sig_type = IB_SIGNAL_REQ_WR; + attr.qp_type = IB_QPT_RC; + attr.send_cq = ic->i_send_cq; + attr.recv_cq = ic->i_recv_cq; + + /* + * XXX this can fail if max_*_wr is too large? Are we supposed + * to back off until we get a value that the hardware can support? + */ + ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, &attr); + if (ret) { + rdsdebug("rdma_create_qp failed: %d\n", ret); + goto out; + } + + ic->i_send_hdrs = ib_dma_alloc_coherent(dev, + ic->i_send_ring.w_nr * + sizeof(struct rds_header), + &ic->i_send_hdrs_dma, GFP_KERNEL); + if (ic->i_send_hdrs == NULL) { + ret = -ENOMEM; + rdsdebug("ib_dma_alloc_coherent send failed\n"); + goto out; + } + + ic->i_recv_hdrs = ib_dma_alloc_coherent(dev, + ic->i_recv_ring.w_nr * + sizeof(struct rds_header), + &ic->i_recv_hdrs_dma, GFP_KERNEL); + if (ic->i_recv_hdrs == NULL) { + ret = -ENOMEM; + rdsdebug("ib_dma_alloc_coherent recv failed\n"); + goto out; + } + + ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header), + &ic->i_ack_dma, GFP_KERNEL); + if (ic->i_ack == NULL) { + ret = -ENOMEM; + rdsdebug("ib_dma_alloc_coherent ack failed\n"); + goto out; + } + + ic->i_sends = vmalloc(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work)); + if (ic->i_sends == NULL) { + ret = -ENOMEM; + rdsdebug("send allocation failed\n"); + goto out; + } + rds_ib_send_init_ring(ic); + + ic->i_recvs = vmalloc(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work)); + if (ic->i_recvs == NULL) { + ret = -ENOMEM; + rdsdebug("recv allocation failed\n"); + goto out; + } + + rds_ib_recv_init_ring(ic); + rds_ib_recv_init_ack(ic); + + /* Post receive buffers - as a side effect, this will 
update + * the posted credit count. */ + rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1); + + rdsdebug("conn %p pd %p mr %p cq %p %p\n", conn, ic->i_pd, ic->i_mr, + ic->i_send_cq, ic->i_recv_cq); + +out: + return ret; +} + +static u32 rds_ib_protocol_compatible(const struct rds_ib_connect_private *dp) +{ + u16 common; + u32 version = 0; + + /* rdma_cm private data is odd - when there is any private data in the + * request, we will be given a pretty large buffer without telling us the + * original size. The only way to tell the difference is by looking at + * the contents, which are initialized to zero. + * If the protocol version fields aren't set, this is a connection attempt + * from an older version. This could be 3.0 or 2.0 - we can't tell. + * We really should have changed this for OFED 1.3 :-( */ + if (dp->dp_protocol_major == 0) + return RDS_PROTOCOL_3_0; + + common = be16_to_cpu(dp->dp_protocol_minor_mask) & RDS_IB_SUPPORTED_PROTOCOLS; + if (dp->dp_protocol_major == 3 && common) { + version = RDS_PROTOCOL_3_0; + while ((common >>= 1) != 0) + version++; + } else if (printk_ratelimit()) { + printk(KERN_NOTICE "RDS: Connection from %pI4 using " + "incompatible protocol version %u.%u\n", + &dp->dp_saddr, + dp->dp_protocol_major, + dp->dp_protocol_minor); + } + return version; +} + +int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event) +{ + __be64 lguid = cm_id->route.path_rec->sgid.global.interface_id; + __be64 fguid = cm_id->route.path_rec->dgid.global.interface_id; + const struct rds_ib_connect_private *dp = event->param.conn.private_data; + struct rds_ib_connect_private dp_rep; + struct rds_connection *conn = NULL; + struct rds_ib_connection *ic = NULL; + struct rdma_conn_param conn_param; + u32 version; + int err, destroy = 1; + + /* Check whether the remote protocol version matches ours.
*/ + version = rds_ib_protocol_compatible(dp); + if (!version) + goto out; + + rdsdebug("saddr %pI4 daddr %pI4 RDSv%u.%u lguid 0x%llx fguid " + "0x%llx\n", &dp->dp_saddr, &dp->dp_daddr, + RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version), + (unsigned long long)be64_to_cpu(lguid), + (unsigned long long)be64_to_cpu(fguid)); + + conn = rds_conn_create(dp->dp_daddr, dp->dp_saddr, &rds_ib_transport, + GFP_KERNEL); + if (IS_ERR(conn)) { + rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn)); + conn = NULL; + goto out; + } + + /* + * The connection request may occur while the + * previous connection still exists, e.g. in case of failover. + * But as connections may be initiated simultaneously + * by both hosts, we have a random backoff mechanism - + * see the comment above rds_queue_reconnect() + */ + mutex_lock(&conn->c_cm_lock); + if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) { + if (rds_conn_state(conn) == RDS_CONN_UP) { + rdsdebug("incoming connect while connecting\n"); + rds_conn_drop(conn); + rds_ib_stats_inc(s_ib_listen_closed_stale); + } else + if (rds_conn_state(conn) == RDS_CONN_CONNECTING) { + /* Wait and see - our connect may still be succeeding */ + rds_ib_stats_inc(s_ib_connect_raced); + } + mutex_unlock(&conn->c_cm_lock); + goto out; + } + + ic = conn->c_transport_data; + + rds_ib_set_protocol(conn, version); + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); + + /* If the peer gave us the last packet it saw, process this as if + * we had received a regular ACK. */ + if (dp->dp_ack_seq) + rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL); + + BUG_ON(cm_id->context); + BUG_ON(ic->i_cm_id); + + ic->i_cm_id = cm_id; + cm_id->context = conn; + + /* We got halfway through setting up the ib_connection; if we + * fail now, we have to take the long route out of this mess.
*/ + destroy = 0; + + err = rds_ib_setup_qp(conn); + if (err) { + rds_ib_conn_error(conn, "rds_ib_setup_qp failed (%d)\n", err); + goto out; + } + + rds_ib_cm_fill_conn_param(conn, &conn_param, &dp_rep, version); + + /* rdma_accept() calls rdma_reject() internally if it fails */ + err = rdma_accept(cm_id, &conn_param); + mutex_unlock(&conn->c_cm_lock); + if (err) { + rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err); + goto out; + } + + return 0; + +out: + rdma_reject(cm_id, NULL, 0); + return destroy; +} + + +int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id) +{ + struct rds_connection *conn = cm_id->context; + struct rds_ib_connection *ic = conn->c_transport_data; + struct rdma_conn_param conn_param; + struct rds_ib_connect_private dp; + int ret; + + /* If the peer doesn't do protocol negotiation, we must + * default to RDSv3.0 */ + rds_ib_set_protocol(conn, RDS_PROTOCOL_3_0); + ic->i_flowctl = rds_ib_sysctl_flow_control; /* advertise flow control */ + + ret = rds_ib_setup_qp(conn); + if (ret) { + rds_ib_conn_error(conn, "rds_ib_setup_qp failed (%d)\n", ret); + goto out; + } + + rds_ib_cm_fill_conn_param(conn, &conn_param, &dp, RDS_PROTOCOL_VERSION); + + ret = rdma_connect(cm_id, &conn_param); + if (ret) + rds_ib_conn_error(conn, "rdma_connect failed (%d)\n", ret); + +out: + /* Beware - returning non-zero tells the rdma_cm to destroy + * the cm_id. We should certainly not do it as long as we still + * "own" the cm_id. 
*/ + if (ret) { + if (ic->i_cm_id == cm_id) + ret = 0; + } + return ret; +} + +int rds_ib_conn_connect(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct sockaddr_in src, dest; + int ret; + + /* XXX I wonder what effect the port space has */ + /* delegate cm event handler to rdma_transport */ + ic->i_cm_id = rdma_create_id(rds_rdma_cm_event_handler, conn, + RDMA_PS_TCP); + if (IS_ERR(ic->i_cm_id)) { + ret = PTR_ERR(ic->i_cm_id); + ic->i_cm_id = NULL; + rdsdebug("rdma_create_id() failed: %d\n", ret); + goto out; + } + + rdsdebug("created cm id %p for conn %p\n", ic->i_cm_id, conn); + + src.sin_family = AF_INET; + src.sin_addr.s_addr = (__force u32)conn->c_laddr; + src.sin_port = (__force u16)htons(0); + + dest.sin_family = AF_INET; + dest.sin_addr.s_addr = (__force u32)conn->c_faddr; + dest.sin_port = (__force u16)htons(RDS_PORT); + + ret = rdma_resolve_addr(ic->i_cm_id, (struct sockaddr *)&src, + (struct sockaddr *)&dest, + RDS_RDMA_RESOLVE_TIMEOUT_MS); + if (ret) { + rdsdebug("addr resolve failed for cm id %p: %d\n", ic->i_cm_id, + ret); + rdma_destroy_id(ic->i_cm_id); + ic->i_cm_id = NULL; + } + +out: + return ret; +} + +/* + * This is so careful about only cleaning up resources that were built up + * so that it can be called at any point during startup. In fact it + * can be called multiple times for a given connection. + */ +void rds_ib_conn_shutdown(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + int err = 0; + + rdsdebug("cm %p pd %p cq %p %p qp %p\n", ic->i_cm_id, + ic->i_pd, ic->i_send_cq, ic->i_recv_cq, + ic->i_cm_id ? ic->i_cm_id->qp : NULL); + + if (ic->i_cm_id) { + struct ib_device *dev = ic->i_cm_id->device; + + rdsdebug("disconnecting cm %p\n", ic->i_cm_id); + err = rdma_disconnect(ic->i_cm_id); + if (err) { + /* Actually this may happen quite frequently, when + * an outgoing connect raced with an incoming connect.
+ */ + rdsdebug("failed to disconnect, cm: %p err %d\n", + ic->i_cm_id, err); + } + + wait_event(rds_ib_ring_empty_wait, + rds_ib_ring_empty(&ic->i_send_ring) && + rds_ib_ring_empty(&ic->i_recv_ring)); + + if (ic->i_send_hdrs) + ib_dma_free_coherent(dev, + ic->i_send_ring.w_nr * + sizeof(struct rds_header), + ic->i_send_hdrs, + ic->i_send_hdrs_dma); + + if (ic->i_recv_hdrs) + ib_dma_free_coherent(dev, + ic->i_recv_ring.w_nr * + sizeof(struct rds_header), + ic->i_recv_hdrs, + ic->i_recv_hdrs_dma); + + if (ic->i_ack) + ib_dma_free_coherent(dev, sizeof(struct rds_header), + ic->i_ack, ic->i_ack_dma); + + if (ic->i_sends) + rds_ib_send_clear_ring(ic); + if (ic->i_recvs) + rds_ib_recv_clear_ring(ic); + + if (ic->i_cm_id->qp) + rdma_destroy_qp(ic->i_cm_id); + if (ic->i_send_cq) + ib_destroy_cq(ic->i_send_cq); + if (ic->i_recv_cq) + ib_destroy_cq(ic->i_recv_cq); + rdma_destroy_id(ic->i_cm_id); + + /* + * Move connection back to the nodev list. + */ + if (ic->rds_ibdev) { + + spin_lock_irq(&ic->rds_ibdev->spinlock); + BUG_ON(list_empty(&ic->ib_node)); + list_del(&ic->ib_node); + spin_unlock_irq(&ic->rds_ibdev->spinlock); + + spin_lock_irq(&ib_nodev_conns_lock); + list_add_tail(&ic->ib_node, &ib_nodev_conns); + spin_unlock_irq(&ib_nodev_conns_lock); + ic->rds_ibdev = NULL; + } + + ic->i_cm_id = NULL; + ic->i_pd = NULL; + ic->i_mr = NULL; + ic->i_send_cq = NULL; + ic->i_recv_cq = NULL; + ic->i_send_hdrs = NULL; + ic->i_recv_hdrs = NULL; + ic->i_ack = NULL; + } + BUG_ON(ic->rds_ibdev); + + /* Clear pending transmit */ + if (ic->i_rm) { + rds_message_put(ic->i_rm); + ic->i_rm = NULL; + } + + /* Clear the ACK state */ + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + rds_ib_set_64bit(&ic->i_ack_next, 0); + ic->i_ack_recv = 0; + + /* Clear flow control state */ + ic->i_flowctl = 0; + atomic_set(&ic->i_credits, 0); + + rds_ib_ring_init(&ic->i_send_ring, rds_ib_sysctl_max_send_wr); + rds_ib_ring_init(&ic->i_recv_ring, rds_ib_sysctl_max_recv_wr); + + if (ic->i_ibinc) { + 
rds_inc_put(&ic->i_ibinc->ii_inc); + ic->i_ibinc = NULL; + } + + vfree(ic->i_sends); + ic->i_sends = NULL; + vfree(ic->i_recvs); + ic->i_recvs = NULL; +} + +int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp) +{ + struct rds_ib_connection *ic; + unsigned long flags; + + /* XXX too lazy? */ + ic = kzalloc(sizeof(struct rds_ib_connection), GFP_KERNEL); + if (ic == NULL) + return -ENOMEM; + + INIT_LIST_HEAD(&ic->ib_node); + mutex_init(&ic->i_recv_mutex); + + /* + * rds_ib_conn_shutdown() waits for these to be emptied so they + * must be initialized before it can be called. + */ + rds_ib_ring_init(&ic->i_send_ring, rds_ib_sysctl_max_send_wr); + rds_ib_ring_init(&ic->i_recv_ring, rds_ib_sysctl_max_recv_wr); + + ic->conn = conn; + conn->c_transport_data = ic; + + spin_lock_irqsave(&ib_nodev_conns_lock, flags); + list_add_tail(&ic->ib_node, &ib_nodev_conns); + spin_unlock_irqrestore(&ib_nodev_conns_lock, flags); + + + rdsdebug("conn %p conn ic %p\n", conn, conn->c_transport_data); + return 0; +} + +void rds_ib_conn_free(void *arg) +{ + struct rds_ib_connection *ic = arg; + rdsdebug("ic %p\n", ic); + list_del(&ic->ib_node); + kfree(ic); +} + + +/* + * An error occurred on the connection + */ +void +__rds_ib_conn_error(struct rds_connection *conn, const char *fmt, ...) +{ + va_list ap; + + rds_conn_drop(conn); + + va_start(ap, fmt); + vprintk(fmt, ap); + va_end(ap); +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:31 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:31 -0800 Subject: [ofa-general] [PATCH 14/26] RDS/IB: Ring-handling code. 
In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-15-git-send-email-andy.grover@oracle.com> Signed-off-by: Andy Grover --- net/rds/ib_ring.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 168 insertions(+), 0 deletions(-) create mode 100644 net/rds/ib_ring.c diff --git a/net/rds/ib_ring.c b/net/rds/ib_ring.c new file mode 100644 index 0000000..99a6cca --- /dev/null +++ b/net/rds/ib_ring.c @@ -0,0 +1,168 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include + +#include "rds.h" +#include "ib.h" + +/* + * Locking for IB rings. + * We assume that allocation is always protected by a mutex + * in the caller (this is a valid assumption for the current + * implementation). + * + * Freeing always happens in an interrupt, and hence only + * races with allocations, but not with other free()s. + * + * The interaction between allocation and freeing is that + * the alloc code has to determine the number of free entries. + * To this end, we maintain two counters; an allocation counter + * and a free counter. Both are allowed to run freely, and wrap + * around. + * The number of used entries is always (alloc_ctr - free_ctr) % NR. + * + * The current implementation makes free_ctr atomic. When the + * caller finds an allocation fails, it should set an "alloc fail" + * bit and retry the allocation. The "alloc fail" bit essentially tells + * the CQ completion handlers to wake it up after freeing some + * more entries. + */ + +/* + * This only happens on shutdown. + */ +DECLARE_WAIT_QUEUE_HEAD(rds_ib_ring_empty_wait); + +void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr) +{ + memset(ring, 0, sizeof(*ring)); + ring->w_nr = nr; + rdsdebug("ring %p nr %u\n", ring, ring->w_nr); +} + +static inline u32 __rds_ib_ring_used(struct rds_ib_work_ring *ring) +{ + u32 diff; + + /* This assumes that atomic_t has at least as many bits as u32 */ + diff = ring->w_alloc_ctr - (u32) atomic_read(&ring->w_free_ctr); + BUG_ON(diff > ring->w_nr); + + return diff; +} + +void rds_ib_ring_resize(struct rds_ib_work_ring *ring, u32 nr) +{ + /* We only ever get called from the connection setup code, + * prior to creating the QP. 
*/ + BUG_ON(__rds_ib_ring_used(ring)); + ring->w_nr = nr; +} + +static int __rds_ib_ring_empty(struct rds_ib_work_ring *ring) +{ + return __rds_ib_ring_used(ring) == 0; +} + +u32 rds_ib_ring_alloc(struct rds_ib_work_ring *ring, u32 val, u32 *pos) +{ + u32 ret = 0, avail; + + avail = ring->w_nr - __rds_ib_ring_used(ring); + + rdsdebug("ring %p val %u next %u free %u\n", ring, val, + ring->w_alloc_ptr, avail); + + if (val && avail) { + ret = min(val, avail); + *pos = ring->w_alloc_ptr; + + ring->w_alloc_ptr = (ring->w_alloc_ptr + ret) % ring->w_nr; + ring->w_alloc_ctr += ret; + } + + return ret; +} + +void rds_ib_ring_free(struct rds_ib_work_ring *ring, u32 val) +{ + ring->w_free_ptr = (ring->w_free_ptr + val) % ring->w_nr; + atomic_add(val, &ring->w_free_ctr); + + if (__rds_ib_ring_empty(ring) && + waitqueue_active(&rds_ib_ring_empty_wait)) + wake_up(&rds_ib_ring_empty_wait); +} + +void rds_ib_ring_unalloc(struct rds_ib_work_ring *ring, u32 val) +{ + ring->w_alloc_ptr = (ring->w_alloc_ptr - val) % ring->w_nr; + ring->w_alloc_ctr -= val; +} + +int rds_ib_ring_empty(struct rds_ib_work_ring *ring) +{ + return __rds_ib_ring_empty(ring); +} + +int rds_ib_ring_low(struct rds_ib_work_ring *ring) +{ + return __rds_ib_ring_used(ring) <= (ring->w_nr >> 2); +} + +/* + * returns the oldest alloced ring entry. This will be the next one + * freed. This can't be called if there are none allocated. + */ +u32 rds_ib_ring_oldest(struct rds_ib_work_ring *ring) +{ + return ring->w_free_ptr; +} + +/* + * returns the number of completed work requests. 
+ */ + +u32 rds_ib_ring_completed(struct rds_ib_work_ring *ring, u32 wr_id, u32 oldest) +{ + u32 ret; + + if (oldest <= (unsigned long long)wr_id) + ret = (unsigned long long)wr_id - oldest + 1; + else + ret = ring->w_nr - oldest + (unsigned long long)wr_id + 1; + + rdsdebug("ring %p ret %u wr_id %u oldest %u\n", ring, ret, + wr_id, oldest); + return ret; +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:29 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:29 -0800 Subject: [ofa-general] [PATCH 12/26] RDS: RDMA support In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-13-git-send-email-andy.grover@oracle.com> Some transports may support RDMA features. This handles the non-transport-specific parts, like pinning user pages and tracking mapped regions. Signed-off-by: Andy Grover --- net/rds/rdma.c | 679 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ net/rds/rdma.h | 84 +++++++ 2 files changed, 763 insertions(+), 0 deletions(-) create mode 100644 net/rds/rdma.c create mode 100644 net/rds/rdma.h diff --git a/net/rds/rdma.c b/net/rds/rdma.c new file mode 100644 index 0000000..eaeeb91 --- /dev/null +++ b/net/rds/rdma.c @@ -0,0 +1,679 @@ +/* + * Copyright (c) 2007 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include /* for DMA_*_DEVICE */ + +#include "rdma.h" + +/* + * XXX + * - build with sparse + * - should we limit the size of a mr region? let transport return failure? + * - should we detect duplicate keys on a socket? hmm. + * - an rdma is an mlock, apply rlimit? + */ + +/* + * get the number of pages by looking at the page indices that the start and + * end addresses fall in. + * + * Returns 0 if the vec is invalid. It is invalid if the number of bytes + * causes the address to wrap or overflows an unsigned int. This comes + * from being stored in the 'length' member of 'struct scatterlist'. 
+ */ +static unsigned int rds_pages_in_vec(struct rds_iovec *vec) +{ + if ((vec->addr + vec->bytes <= vec->addr) || + (vec->bytes > (u64)UINT_MAX)) + return 0; + + return ((vec->addr + vec->bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) - + (vec->addr >> PAGE_SHIFT); +} + +static struct rds_mr *rds_mr_tree_walk(struct rb_root *root, u64 key, + struct rds_mr *insert) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct rds_mr *mr; + + while (*p) { + parent = *p; + mr = rb_entry(parent, struct rds_mr, r_rb_node); + + if (key < mr->r_key) + p = &(*p)->rb_left; + else if (key > mr->r_key) + p = &(*p)->rb_right; + else + return mr; + } + + if (insert) { + rb_link_node(&insert->r_rb_node, parent, p); + rb_insert_color(&insert->r_rb_node, root); + atomic_inc(&insert->r_refcount); + } + return NULL; +} + +/* + * Destroy the transport-specific part of a MR. + */ +static void rds_destroy_mr(struct rds_mr *mr) +{ + struct rds_sock *rs = mr->r_sock; + void *trans_private = NULL; + unsigned long flags; + + rdsdebug("RDS: destroy mr key is %x refcnt %u\n", + mr->r_key, atomic_read(&mr->r_refcount)); + + if (test_and_set_bit(RDS_MR_DEAD, &mr->r_state)) + return; + + spin_lock_irqsave(&rs->rs_rdma_lock, flags); + if (!RB_EMPTY_NODE(&mr->r_rb_node)) + rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys); + trans_private = mr->r_trans_private; + mr->r_trans_private = NULL; + spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); + + if (trans_private) + mr->r_trans->free_mr(trans_private, mr->r_invalidate); +} + +void __rds_put_mr_final(struct rds_mr *mr) +{ + rds_destroy_mr(mr); + kfree(mr); +} + +/* + * By the time this is called we can't have any more ioctls called on + * the socket so we don't need to worry about racing with others. 
+ */ +void rds_rdma_drop_keys(struct rds_sock *rs) +{ + struct rds_mr *mr; + struct rb_node *node; + + /* Release any MRs associated with this socket */ + while ((node = rb_first(&rs->rs_rdma_keys))) { + mr = container_of(node, struct rds_mr, r_rb_node); + if (mr->r_trans == rs->rs_transport) + mr->r_invalidate = 0; + rds_mr_put(mr); + } + + if (rs->rs_transport && rs->rs_transport->flush_mrs) + rs->rs_transport->flush_mrs(); +} + +/* + * Helper function to pin user pages. + */ +static int rds_pin_pages(unsigned long user_addr, unsigned int nr_pages, + struct page **pages, int write) +{ + int ret; + + down_read(&current->mm->mmap_sem); + ret = get_user_pages(current, current->mm, user_addr, + nr_pages, write, 0, pages, NULL); + up_read(&current->mm->mmap_sem); + + if (0 <= ret && (unsigned) ret < nr_pages) { + while (ret--) + put_page(pages[ret]); + ret = -EFAULT; + } + + return ret; +} + +static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args, + u64 *cookie_ret, struct rds_mr **mr_ret) +{ + struct rds_mr *mr = NULL, *found; + unsigned int nr_pages; + struct page **pages = NULL; + struct scatterlist *sg; + void *trans_private; + unsigned long flags; + rds_rdma_cookie_t cookie; + unsigned int nents; + long i; + int ret; + + if (rs->rs_bound_addr == 0) { + ret = -ENOTCONN; /* XXX not a great errno */ + goto out; + } + + if (rs->rs_transport->get_mr == NULL) { + ret = -EOPNOTSUPP; + goto out; + } + + nr_pages = rds_pages_in_vec(&args->vec); + if (nr_pages == 0) { + ret = -EINVAL; + goto out; + } + + rdsdebug("RDS: get_mr addr %llx len %llu nr_pages %u\n", + args->vec.addr, args->vec.bytes, nr_pages); + + /* XXX clamp nr_pages to limit the size of this alloc?
*/ + pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL); + if (pages == NULL) { + ret = -ENOMEM; + goto out; + } + + mr = kzalloc(sizeof(struct rds_mr), GFP_KERNEL); + if (mr == NULL) { + ret = -ENOMEM; + goto out; + } + + atomic_set(&mr->r_refcount, 1); + RB_CLEAR_NODE(&mr->r_rb_node); + mr->r_trans = rs->rs_transport; + mr->r_sock = rs; + + if (args->flags & RDS_RDMA_USE_ONCE) + mr->r_use_once = 1; + if (args->flags & RDS_RDMA_INVALIDATE) + mr->r_invalidate = 1; + if (args->flags & RDS_RDMA_READWRITE) + mr->r_write = 1; + + /* + * Pin the pages that make up the user buffer and transfer the page + * pointers to the mr's sg array. We check to see if we've mapped + * the whole region after transferring the partial page references + * to the sg array so that we can have one page ref cleanup path. + * + * For now we have no flag that tells us whether the mapping is + * r/o or r/w. We need to assume r/w, or we'll do a lot of RDMA to + * the zero page. + */ + ret = rds_pin_pages(args->vec.addr & PAGE_MASK, nr_pages, pages, 1); + if (ret < 0) + goto out; + + nents = ret; + sg = kcalloc(nents, sizeof(*sg), GFP_KERNEL); + if (sg == NULL) { + ret = -ENOMEM; + goto out; + } + WARN_ON(!nents); + sg_init_table(sg, nents); + + /* Stick all pages into the scatterlist */ + for (i = 0 ; i < nents; i++) + sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0); + + rdsdebug("RDS: trans_private nents is %u\n", nents); + + /* Obtain a transport specific MR. If this succeeds, the + * s/g list is now owned by the MR. + * Note that dma_map() implies that pending writes are + * flushed to RAM, so no dma_sync is needed here. 
*/ + trans_private = rs->rs_transport->get_mr(sg, nents, rs, + &mr->r_key); + + if (IS_ERR(trans_private)) { + for (i = 0 ; i < nents; i++) + put_page(sg_page(&sg[i])); + kfree(sg); + ret = PTR_ERR(trans_private); + goto out; + } + + mr->r_trans_private = trans_private; + + rdsdebug("RDS: get_mr put_user key is %x cookie_addr %p\n", + mr->r_key, (void *)(unsigned long) args->cookie_addr); + + /* The user may pass us an unaligned address, but we can only + * map page aligned regions. So we keep the offset, and build + * a 64bit cookie containing and pass that + * around. */ + cookie = rds_rdma_make_cookie(mr->r_key, args->vec.addr & ~PAGE_MASK); + if (cookie_ret) + *cookie_ret = cookie; + + if (args->cookie_addr && put_user(cookie, (u64 __user *)(unsigned long) args->cookie_addr)) { + ret = -EFAULT; + goto out; + } + + /* Inserting the new MR into the rbtree bumps its + * reference count. */ + spin_lock_irqsave(&rs->rs_rdma_lock, flags); + found = rds_mr_tree_walk(&rs->rs_rdma_keys, mr->r_key, mr); + spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); + + BUG_ON(found && found != mr); + + rdsdebug("RDS: get_mr key is %x\n", mr->r_key); + if (mr_ret) { + atomic_inc(&mr->r_refcount); + *mr_ret = mr; + } + + ret = 0; +out: + kfree(pages); + if (mr) + rds_mr_put(mr); + return ret; +} + +int rds_get_mr(struct rds_sock *rs, char __user *optval, int optlen) +{ + struct rds_get_mr_args args; + + if (optlen != sizeof(struct rds_get_mr_args)) + return -EINVAL; + + if (copy_from_user(&args, (struct rds_get_mr_args __user *)optval, + sizeof(struct rds_get_mr_args))) + return -EFAULT; + + return __rds_rdma_map(rs, &args, NULL, NULL); +} + +/* + * Free the MR indicated by the given R_Key + */ +int rds_free_mr(struct rds_sock *rs, char __user *optval, int optlen) +{ + struct rds_free_mr_args args; + struct rds_mr *mr; + unsigned long flags; + + if (optlen != sizeof(struct rds_free_mr_args)) + return -EINVAL; + + if (copy_from_user(&args, (struct rds_free_mr_args __user *)optval, + 
sizeof(struct rds_free_mr_args))) + return -EFAULT; + + /* Special case - a null cookie means flush all unused MRs */ + if (args.cookie == 0) { + if (!rs->rs_transport || !rs->rs_transport->flush_mrs) + return -EINVAL; + rs->rs_transport->flush_mrs(); + return 0; + } + + /* Look up the MR given its R_key and remove it from the rbtree + * so nobody else finds it. + * This should also prevent races with rds_rdma_unuse. + */ + spin_lock_irqsave(&rs->rs_rdma_lock, flags); + mr = rds_mr_tree_walk(&rs->rs_rdma_keys, rds_rdma_cookie_key(args.cookie), NULL); + if (mr) { + rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys); + RB_CLEAR_NODE(&mr->r_rb_node); + if (args.flags & RDS_RDMA_INVALIDATE) + mr->r_invalidate = 1; + } + spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); + + if (!mr) + return -EINVAL; + + /* + * call rds_destroy_mr() ourselves so that we're sure it's done by the time + * we return. If we let rds_mr_put() do it it might not happen until + * someone else drops their ref. + */ + rds_destroy_mr(mr); + rds_mr_put(mr); + return 0; +} + +/* + * This is called when we receive an extension header that + * tells us this MR was used. It allows us to implement + * use_once semantics + */ +void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force) +{ + struct rds_mr *mr; + unsigned long flags; + int zot_me = 0; + + spin_lock_irqsave(&rs->rs_rdma_lock, flags); + mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL); + if (mr && (mr->r_use_once || force)) { + rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys); + RB_CLEAR_NODE(&mr->r_rb_node); + zot_me = 1; + } else if (mr) + atomic_inc(&mr->r_refcount); + spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); + + /* May have to issue a dma_sync on this memory region. + * Note we could avoid this if the operation was a RDMA READ, + * but at this point we can't tell. 
*/ + if (mr != NULL) { + if (mr->r_trans->sync_mr) + mr->r_trans->sync_mr(mr->r_trans_private, DMA_FROM_DEVICE); + + /* If the MR was marked as invalidate, this will + * trigger an async flush. */ + if (zot_me) + rds_destroy_mr(mr); + rds_mr_put(mr); + } +} + +void rds_rdma_free_op(struct rds_rdma_op *ro) +{ + unsigned int i; + + for (i = 0; i < ro->r_nents; i++) { + struct page *page = sg_page(&ro->r_sg[i]); + + /* Mark page dirty if it was possibly modified, which + * is the case for a RDMA_READ which copies from remote + * to local memory */ + if (!ro->r_write) + set_page_dirty(page); + put_page(page); + } + + kfree(ro->r_notifier); + kfree(ro); +} + +/* + * args is a pointer to an in-kernel copy in the sendmsg cmsg. + */ +static struct rds_rdma_op *rds_rdma_prepare(struct rds_sock *rs, + struct rds_rdma_args *args) +{ + struct rds_iovec vec; + struct rds_rdma_op *op = NULL; + unsigned int nr_pages; + unsigned int max_pages; + unsigned int nr_bytes; + struct page **pages = NULL; + struct rds_iovec __user *local_vec; + struct scatterlist *sg; + unsigned int nr; + unsigned int i, j; + int ret; + + + if (rs->rs_bound_addr == 0) { + ret = -ENOTCONN; /* XXX not a great errno */ + goto out; + } + + if (args->nr_local > (u64)UINT_MAX) { + ret = -EMSGSIZE; + goto out; + } + + nr_pages = 0; + max_pages = 0; + + local_vec = (struct rds_iovec __user *)(unsigned long) args->local_vec_addr; + + /* figure out the number of pages in the vector */ + for (i = 0; i < args->nr_local; i++) { + if (copy_from_user(&vec, &local_vec[i], + sizeof(struct rds_iovec))) { + ret = -EFAULT; + goto out; + } + + nr = rds_pages_in_vec(&vec); + if (nr == 0) { + ret = -EINVAL; + goto out; + } + + max_pages = max(nr, max_pages); + nr_pages += nr; + } + + pages = kcalloc(max_pages, sizeof(struct page *), GFP_KERNEL); + if (pages == NULL) { + ret = -ENOMEM; + goto out; + } + + op = kzalloc(offsetof(struct rds_rdma_op, r_sg[nr_pages]), GFP_KERNEL); + if (op == NULL) { + ret = -ENOMEM; + goto out; + } 
+ + op->r_write = !!(args->flags & RDS_RDMA_READWRITE); + op->r_fence = !!(args->flags & RDS_RDMA_FENCE); + op->r_notify = !!(args->flags & RDS_RDMA_NOTIFY_ME); + op->r_recverr = rs->rs_recverr; + WARN_ON(!nr_pages); + sg_init_table(op->r_sg, nr_pages); + + if (op->r_notify || op->r_recverr) { + /* We allocate an uninitialized notifier here, because + * we don't want to do that in the completion handler. We + * would have to use GFP_ATOMIC there, and don't want to deal + * with failed allocations. + */ + op->r_notifier = kmalloc(sizeof(struct rds_notifier), GFP_KERNEL); + if (!op->r_notifier) { + ret = -ENOMEM; + goto out; + } + op->r_notifier->n_user_token = args->user_token; + op->r_notifier->n_status = RDS_RDMA_SUCCESS; + } + + /* The cookie contains the R_Key of the remote memory region, and + * optionally an offset into it. This is how we implement RDMA into + * unaligned memory. + * When setting up the RDMA, we need to add that offset to the + * destination address (which is really an offset into the MR) + * FIXME: We may want to move this into ib_rdma.c + */ + op->r_key = rds_rdma_cookie_key(args->cookie); + op->r_remote_addr = args->remote_vec.addr + rds_rdma_cookie_offset(args->cookie); + + nr_bytes = 0; + + rdsdebug("RDS: rdma prepare nr_local %llu rva %llx rkey %x\n", + (unsigned long long)args->nr_local, + (unsigned long long)args->remote_vec.addr, + op->r_key); + + for (i = 0; i < args->nr_local; i++) { + if (copy_from_user(&vec, &local_vec[i], + sizeof(struct rds_iovec))) { + ret = -EFAULT; + goto out; + } + + nr = rds_pages_in_vec(&vec); + if (nr == 0) { + ret = -EINVAL; + goto out; + } + + rs->rs_user_addr = vec.addr; + rs->rs_user_bytes = vec.bytes; + + /* did the user change the vec under us? */ + if (nr > max_pages || op->r_nents + nr > nr_pages) { + ret = -EINVAL; + goto out; + } + /* If it's a WRITE operation, we want to pin the pages for reading. + * If it's a READ operation, we need to pin the pages for writing. 
+ */ + ret = rds_pin_pages(vec.addr & PAGE_MASK, nr, pages, !op->r_write); + if (ret < 0) + goto out; + + rdsdebug("RDS: nr_bytes %u nr %u vec.bytes %llu vec.addr %llx\n", + nr_bytes, nr, vec.bytes, vec.addr); + + nr_bytes += vec.bytes; + + for (j = 0; j < nr; j++) { + unsigned int offset = vec.addr & ~PAGE_MASK; + + sg = &op->r_sg[op->r_nents + j]; + sg_set_page(sg, pages[j], + min_t(unsigned int, vec.bytes, PAGE_SIZE - offset), + offset); + + rdsdebug("RDS: sg->offset %x sg->len %x vec.addr %llx vec.bytes %llu\n", + sg->offset, sg->length, vec.addr, vec.bytes); + + vec.addr += sg->length; + vec.bytes -= sg->length; + } + + op->r_nents += nr; + } + + + if (nr_bytes > args->remote_vec.bytes) { + rdsdebug("RDS nr_bytes %u remote_bytes %u do not match\n", + nr_bytes, + (unsigned int) args->remote_vec.bytes); + ret = -EINVAL; + goto out; + } + op->r_bytes = nr_bytes; + + ret = 0; +out: + kfree(pages); + if (ret) { + if (op) + rds_rdma_free_op(op); + op = ERR_PTR(ret); + } + return op; +} + +/* + * The application asks for a RDMA transfer. + * Extract all arguments and set up the rdma_op + */ +int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg) +{ + struct rds_rdma_op *op; + + if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct rds_rdma_args)) + || rm->m_rdma_op != NULL) + return -EINVAL; + + op = rds_rdma_prepare(rs, CMSG_DATA(cmsg)); + if (IS_ERR(op)) + return PTR_ERR(op); + rds_stats_inc(s_send_rdma); + rm->m_rdma_op = op; + return 0; +} + +/* + * The application wants us to pass an RDMA destination (aka MR) + * to the remote + */ +int rds_cmsg_rdma_dest(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg) +{ + unsigned long flags; + struct rds_mr *mr; + u32 r_key; + int err = 0; + + if (cmsg->cmsg_len < CMSG_LEN(sizeof(rds_rdma_cookie_t)) + || rm->m_rdma_cookie != 0) + return -EINVAL; + + memcpy(&rm->m_rdma_cookie, CMSG_DATA(cmsg), sizeof(rm->m_rdma_cookie)); + + /* We are reusing a previously mapped MR here. 
Most likely, the + * application has written to the buffer, so we need to explicitly + * flush those writes to RAM. Otherwise the HCA may not see them + * when doing a DMA from that buffer. + */ + r_key = rds_rdma_cookie_key(rm->m_rdma_cookie); + + spin_lock_irqsave(&rs->rs_rdma_lock, flags); + mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL); + if (mr == NULL) + err = -EINVAL; /* invalid r_key */ + else + atomic_inc(&mr->r_refcount); + spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); + + if (mr) { + mr->r_trans->sync_mr(mr->r_trans_private, DMA_TO_DEVICE); + rm->m_rdma_mr = mr; + } + return err; +} + +/* + * The application passes us an address range it wants to enable RDMA + * to/from. We map the area, and save the pair + * in rm->m_rdma_cookie. This causes it to be sent along to the peer + * in an extension header. + */ +int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg) +{ + if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct rds_get_mr_args)) + || rm->m_rdma_cookie != 0) + return -EINVAL; + + return __rds_rdma_map(rs, CMSG_DATA(cmsg), &rm->m_rdma_cookie, &rm->m_rdma_mr); +} diff --git a/net/rds/rdma.h b/net/rds/rdma.h new file mode 100644 index 0000000..4255120 --- /dev/null +++ b/net/rds/rdma.h @@ -0,0 +1,84 @@ +#ifndef _RDS_RDMA_H +#define _RDS_RDMA_H + +#include +#include +#include + +#include "rds.h" + +struct rds_mr { + struct rb_node r_rb_node; + atomic_t r_refcount; + u32 r_key; + + /* A copy of the creation flags */ + unsigned int r_use_once:1; + unsigned int r_invalidate:1; + unsigned int r_write:1; + + /* This is for RDS_MR_DEAD. + * It would be nice & consistent to make this part of the above + * bit field here, but we need to use test_and_set_bit. 
+ */ + unsigned long r_state; + struct rds_sock *r_sock; /* back pointer to the socket that owns us */ + struct rds_transport *r_trans; + void *r_trans_private; +}; + +/* Flags for mr->r_state */ +#define RDS_MR_DEAD 0 + +struct rds_rdma_op { + u32 r_key; + u64 r_remote_addr; + unsigned int r_write:1; + unsigned int r_fence:1; + unsigned int r_notify:1; + unsigned int r_recverr:1; + unsigned int r_mapped:1; + struct rds_notifier *r_notifier; + unsigned int r_bytes; + unsigned int r_nents; + unsigned int r_count; + struct scatterlist r_sg[0]; +}; + +static inline rds_rdma_cookie_t rds_rdma_make_cookie(u32 r_key, u32 offset) +{ + return r_key | (((u64) offset) << 32); +} + +static inline u32 rds_rdma_cookie_key(rds_rdma_cookie_t cookie) +{ + return cookie; +} + +static inline u32 rds_rdma_cookie_offset(rds_rdma_cookie_t cookie) +{ + return cookie >> 32; +} + +int rds_get_mr(struct rds_sock *rs, char __user *optval, int optlen); +int rds_free_mr(struct rds_sock *rs, char __user *optval, int optlen); +void rds_rdma_drop_keys(struct rds_sock *rs); +int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg); +int rds_cmsg_rdma_dest(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg); +int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg); +int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg); +void rds_rdma_free_op(struct rds_rdma_op *ro); +void rds_rdma_send_complete(struct rds_message *rm, int); + +extern void __rds_put_mr_final(struct rds_mr *mr); +static inline void rds_mr_put(struct rds_mr *mr) +{ + if (atomic_dec_and_test(&mr->r_refcount)) + __rds_put_mr_final(mr); +} + +#endif -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:32 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:32 -0800 Subject: [ofa-general] [PATCH 15/26] RDS/IB: Implement RDMA ops using FMRs In-Reply-To: 
<1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-16-git-send-email-andy.grover@oracle.com> Signed-off-by: Andy Grover --- net/rds/ib_rdma.c | 641 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 641 insertions(+), 0 deletions(-) create mode 100644 net/rds/ib_rdma.c diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c new file mode 100644 index 0000000..69a6289 --- /dev/null +++ b/net/rds/ib_rdma.c @@ -0,0 +1,641 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include + +#include "rds.h" +#include "rdma.h" +#include "ib.h" + + +/* + * This is stored as mr->r_trans_private. + */ +struct rds_ib_mr { + struct rds_ib_device *device; + struct rds_ib_mr_pool *pool; + struct ib_fmr *fmr; + struct list_head list; + unsigned int remap_count; + + struct scatterlist *sg; + unsigned int sg_len; + u64 *dma; + int sg_dma_len; +}; + +/* + * Our own little FMR pool + */ +struct rds_ib_mr_pool { + struct mutex flush_lock; /* serialize fmr invalidate */ + struct work_struct flush_worker; /* flush worker */ + + spinlock_t list_lock; /* protect variables below */ + atomic_t item_count; /* total # of MRs */ + atomic_t dirty_count; /* # of dirty MRs */ + struct list_head drop_list; /* MRs that have reached their max_maps limit */ + struct list_head free_list; /* unused MRs */ + struct list_head clean_list; /* unused & unmapped MRs */ + atomic_t free_pinned; /* memory pinned by free MRs */ + unsigned long max_items; + unsigned long max_items_soft; + unsigned long max_free_pinned; + struct ib_fmr_attr fmr_attr; +}; + +static int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all); +static void rds_ib_teardown_mr(struct rds_ib_mr *ibmr); +static void rds_ib_mr_pool_flush_worker(struct work_struct *work); + +static struct rds_ib_device *rds_ib_get_device(__be32 ipaddr) +{ + struct rds_ib_device *rds_ibdev; + struct rds_ib_ipaddr *i_ipaddr; + + list_for_each_entry(rds_ibdev, &rds_ib_devices, list) { + spin_lock_irq(&rds_ibdev->spinlock); + list_for_each_entry(i_ipaddr, &rds_ibdev->ipaddr_list, list) { + if (i_ipaddr->ipaddr == ipaddr) { + spin_unlock_irq(&rds_ibdev->spinlock); + return rds_ibdev; + } + } + spin_unlock_irq(&rds_ibdev->spinlock); + } + + return NULL; +} + +static int rds_ib_add_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr) +{ + struct rds_ib_ipaddr *i_ipaddr; + + i_ipaddr = kmalloc(sizeof *i_ipaddr, GFP_KERNEL); + if (!i_ipaddr) + return -ENOMEM; + + i_ipaddr->ipaddr = ipaddr; + +
spin_lock_irq(&rds_ibdev->spinlock); + list_add_tail(&i_ipaddr->list, &rds_ibdev->ipaddr_list); + spin_unlock_irq(&rds_ibdev->spinlock); + + return 0; +} + +static void rds_ib_remove_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr) +{ + struct rds_ib_ipaddr *i_ipaddr, *next; + + spin_lock_irq(&rds_ibdev->spinlock); + list_for_each_entry_safe(i_ipaddr, next, &rds_ibdev->ipaddr_list, list) { + if (i_ipaddr->ipaddr == ipaddr) { + list_del(&i_ipaddr->list); + kfree(i_ipaddr); + break; + } + } + spin_unlock_irq(&rds_ibdev->spinlock); +} + +int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr) +{ + struct rds_ib_device *rds_ibdev_old; + + rds_ibdev_old = rds_ib_get_device(ipaddr); + if (rds_ibdev_old) + rds_ib_remove_ipaddr(rds_ibdev_old, ipaddr); + + return rds_ib_add_ipaddr(rds_ibdev, ipaddr); +} + +int rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + /* conn was previously on the nodev_conns_list */ + spin_lock_irq(&ib_nodev_conns_lock); + BUG_ON(list_empty(&ib_nodev_conns)); + BUG_ON(list_empty(&ic->ib_node)); + list_del(&ic->ib_node); + spin_unlock_irq(&ib_nodev_conns_lock); + + spin_lock_irq(&rds_ibdev->spinlock); + list_add_tail(&ic->ib_node, &rds_ibdev->conn_list); + spin_unlock_irq(&rds_ibdev->spinlock); + + ic->rds_ibdev = rds_ibdev; + + return 0; +} + +void rds_ib_remove_nodev_conns(void) +{ + struct rds_ib_connection *ic, *_ic; + LIST_HEAD(tmp_list); + + /* avoid calling conn_destroy with irqs off */ + spin_lock_irq(&ib_nodev_conns_lock); + list_splice(&ib_nodev_conns, &tmp_list); + INIT_LIST_HEAD(&ib_nodev_conns); + spin_unlock_irq(&ib_nodev_conns_lock); + + list_for_each_entry_safe(ic, _ic, &tmp_list, ib_node) { + if (ic->conn->c_passive) + rds_conn_destroy(ic->conn->c_passive); + rds_conn_destroy(ic->conn); + } +} + +void rds_ib_remove_conns(struct rds_ib_device *rds_ibdev) +{ + struct rds_ib_connection *ic, *_ic; + LIST_HEAD(tmp_list); + + 
/* avoid calling conn_destroy with irqs off */ + spin_lock_irq(&rds_ibdev->spinlock); + list_splice(&rds_ibdev->conn_list, &tmp_list); + INIT_LIST_HEAD(&rds_ibdev->conn_list); + spin_unlock_irq(&rds_ibdev->spinlock); + + list_for_each_entry_safe(ic, _ic, &tmp_list, ib_node) { + if (ic->conn->c_passive) + rds_conn_destroy(ic->conn->c_passive); + rds_conn_destroy(ic->conn); + } +} + +struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev) +{ + struct rds_ib_mr_pool *pool; + + pool = kzalloc(sizeof(*pool), GFP_KERNEL); + if (!pool) + return ERR_PTR(-ENOMEM); + + INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->drop_list); + INIT_LIST_HEAD(&pool->clean_list); + mutex_init(&pool->flush_lock); + spin_lock_init(&pool->list_lock); + INIT_WORK(&pool->flush_worker, rds_ib_mr_pool_flush_worker); + + pool->fmr_attr.max_pages = fmr_message_size; + pool->fmr_attr.max_maps = rds_ibdev->fmr_max_remaps; + pool->fmr_attr.page_shift = rds_ibdev->fmr_page_shift; + pool->max_free_pinned = rds_ibdev->max_fmrs * fmr_message_size / 4; + + /* We never allow more than max_items MRs to be allocated. + * When we exceed more than max_items_soft, we start freeing + * items more aggressively. 
+ * Make sure that max_items > max_items_soft > max_items / 2 + */ + pool->max_items_soft = rds_ibdev->max_fmrs * 3 / 4; + pool->max_items = rds_ibdev->max_fmrs; + + return pool; +} + +void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev, struct rds_info_rdma_connection *iinfo) +{ + struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; + + iinfo->rdma_mr_max = pool->max_items; + iinfo->rdma_mr_size = pool->fmr_attr.max_pages; +} + +void rds_ib_destroy_mr_pool(struct rds_ib_mr_pool *pool) +{ + flush_workqueue(rds_wq); + rds_ib_flush_mr_pool(pool, 1); + BUG_ON(atomic_read(&pool->item_count)); + BUG_ON(atomic_read(&pool->free_pinned)); + kfree(pool); +} + +static inline struct rds_ib_mr *rds_ib_reuse_fmr(struct rds_ib_mr_pool *pool) +{ + struct rds_ib_mr *ibmr = NULL; + unsigned long flags; + + spin_lock_irqsave(&pool->list_lock, flags); + if (!list_empty(&pool->clean_list)) { + ibmr = list_entry(pool->clean_list.next, struct rds_ib_mr, list); + list_del_init(&ibmr->list); + } + spin_unlock_irqrestore(&pool->list_lock, flags); + + return ibmr; +} + +static struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev) +{ + struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; + struct rds_ib_mr *ibmr = NULL; + int err = 0, iter = 0; + + while (1) { + ibmr = rds_ib_reuse_fmr(pool); + if (ibmr) + return ibmr; + + /* No clean MRs - now we have the choice of either + * allocating a fresh MR up to the limit imposed by the + * driver, or flush any dirty unused MRs. + * We try to avoid stalling in the send path if possible, + * so we allocate as long as we're allowed to. + * + * We're fussy with enforcing the FMR limit, though. If the driver + * tells us we can't use more than N fmrs, we shouldn't start + * arguing with it */ + if (atomic_inc_return(&pool->item_count) <= pool->max_items) + break; + + atomic_dec(&pool->item_count); + + if (++iter > 2) { + rds_ib_stats_inc(s_ib_rdma_mr_pool_depleted); + return ERR_PTR(-EAGAIN); + } + + /* We do have some empty MRs. 
Flush them out. */ + rds_ib_stats_inc(s_ib_rdma_mr_pool_wait); + rds_ib_flush_mr_pool(pool, 0); + } + + ibmr = kzalloc(sizeof(*ibmr), GFP_KERNEL); + if (!ibmr) { + err = -ENOMEM; + goto out_no_cigar; + } + + ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd, + (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE), + &pool->fmr_attr); + if (IS_ERR(ibmr->fmr)) { + err = PTR_ERR(ibmr->fmr); + ibmr->fmr = NULL; + printk(KERN_WARNING "RDS/IB: ib_alloc_fmr failed (err=%d)\n", err); + goto out_no_cigar; + } + + rds_ib_stats_inc(s_ib_rdma_mr_alloc); + return ibmr; + +out_no_cigar: + if (ibmr) { + if (ibmr->fmr) + ib_dealloc_fmr(ibmr->fmr); + kfree(ibmr); + } + atomic_dec(&pool->item_count); + return ERR_PTR(err); +} + +static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, + struct scatterlist *sg, unsigned int nents) +{ + struct ib_device *dev = rds_ibdev->dev; + struct scatterlist *scat = sg; + u64 io_addr = 0; + u64 *dma_pages; + u32 len; + int page_cnt, sg_dma_len; + int i, j; + int ret; + + sg_dma_len = ib_dma_map_sg(dev, sg, nents, + DMA_BIDIRECTIONAL); + if (unlikely(!sg_dma_len)) { + printk(KERN_WARNING "RDS/IB: dma_map_sg failed!\n"); + return -EBUSY; + } + + len = 0; + page_cnt = 0; + + for (i = 0; i < sg_dma_len; ++i) { + unsigned int dma_len = ib_sg_dma_len(dev, &scat[i]); + u64 dma_addr = ib_sg_dma_address(dev, &scat[i]); + + if (dma_addr & ~rds_ibdev->fmr_page_mask) { + if (i > 0) + return -EINVAL; + else + ++page_cnt; + } + if ((dma_addr + dma_len) & ~rds_ibdev->fmr_page_mask) { + if (i < sg_dma_len - 1) + return -EINVAL; + else + ++page_cnt; + } + + len += dma_len; + } + + page_cnt += len >> rds_ibdev->fmr_page_shift; + if (page_cnt > fmr_message_size) + return -EINVAL; + + dma_pages = kmalloc(sizeof(u64) * page_cnt, GFP_ATOMIC); + if (!dma_pages) + return -ENOMEM; + + page_cnt = 0; + for (i = 0; i < sg_dma_len; ++i) { + unsigned int dma_len = ib_sg_dma_len(dev, &scat[i]); + u64 dma_addr = ib_sg_dma_address(dev, 
&scat[i]); + + for (j = 0; j < dma_len; j += rds_ibdev->fmr_page_size) + dma_pages[page_cnt++] = + (dma_addr & rds_ibdev->fmr_page_mask) + j; + } + + ret = ib_map_phys_fmr(ibmr->fmr, + dma_pages, page_cnt, io_addr); + if (ret) + goto out; + + /* Success - we successfully remapped the MR, so we can + * safely tear down the old mapping. */ + rds_ib_teardown_mr(ibmr); + + ibmr->sg = scat; + ibmr->sg_len = nents; + ibmr->sg_dma_len = sg_dma_len; + ibmr->remap_count++; + + rds_ib_stats_inc(s_ib_rdma_mr_used); + ret = 0; + +out: + kfree(dma_pages); + + return ret; +} + +void rds_ib_sync_mr(void *trans_private, int direction) +{ + struct rds_ib_mr *ibmr = trans_private; + struct rds_ib_device *rds_ibdev = ibmr->device; + + switch (direction) { + case DMA_FROM_DEVICE: + ib_dma_sync_sg_for_cpu(rds_ibdev->dev, ibmr->sg, + ibmr->sg_dma_len, DMA_BIDIRECTIONAL); + break; + case DMA_TO_DEVICE: + ib_dma_sync_sg_for_device(rds_ibdev->dev, ibmr->sg, + ibmr->sg_dma_len, DMA_BIDIRECTIONAL); + break; + } +} + +static void __rds_ib_teardown_mr(struct rds_ib_mr *ibmr) +{ + struct rds_ib_device *rds_ibdev = ibmr->device; + + if (ibmr->sg_dma_len) { + ib_dma_unmap_sg(rds_ibdev->dev, + ibmr->sg, ibmr->sg_len, + DMA_BIDIRECTIONAL); + ibmr->sg_dma_len = 0; + } + + /* Release the s/g list */ + if (ibmr->sg_len) { + unsigned int i; + + for (i = 0; i < ibmr->sg_len; ++i) { + struct page *page = sg_page(&ibmr->sg[i]); + + /* FIXME we need a way to tell a r/w MR + * from a r/o MR */ + set_page_dirty(page); + put_page(page); + } + kfree(ibmr->sg); + + ibmr->sg = NULL; + ibmr->sg_len = 0; + } +} + +static void rds_ib_teardown_mr(struct rds_ib_mr *ibmr) +{ + unsigned int pinned = ibmr->sg_len; + + __rds_ib_teardown_mr(ibmr); + if (pinned) { + struct rds_ib_device *rds_ibdev = ibmr->device; + struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; + + atomic_sub(pinned, &pool->free_pinned); + } +} + +static inline unsigned int rds_ib_flush_goal(struct rds_ib_mr_pool *pool, int free_all) +{ + unsigned int 
item_count; + + item_count = atomic_read(&pool->item_count); + if (free_all) + return item_count; + + return 0; +} + +/* + * Flush our pool of MRs. + * At a minimum, all currently unused MRs are unmapped. + * If the number of MRs allocated exceeds the limit, we also try + * to free as many MRs as needed to get back to this limit. + */ +static int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all) +{ + struct rds_ib_mr *ibmr, *next; + LIST_HEAD(unmap_list); + LIST_HEAD(fmr_list); + unsigned long unpinned = 0; + unsigned long flags; + unsigned int nfreed = 0, ncleaned = 0, free_goal; + int ret = 0; + + rds_ib_stats_inc(s_ib_rdma_mr_pool_flush); + + mutex_lock(&pool->flush_lock); + + spin_lock_irqsave(&pool->list_lock, flags); + /* Get the list of all MRs to be dropped. Ordering matters - + * we want to put drop_list ahead of free_list. */ + list_splice_init(&pool->free_list, &unmap_list); + list_splice_init(&pool->drop_list, &unmap_list); + if (free_all) + list_splice_init(&pool->clean_list, &unmap_list); + spin_unlock_irqrestore(&pool->list_lock, flags); + + free_goal = rds_ib_flush_goal(pool, free_all); + + if (list_empty(&unmap_list)) + goto out; + + /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ + list_for_each_entry(ibmr, &unmap_list, list) + list_add(&ibmr->fmr->list, &fmr_list); + ret = ib_unmap_fmr(&fmr_list); + if (ret) + printk(KERN_WARNING "RDS/IB: ib_unmap_fmr failed (err=%d)\n", ret); + + /* Now we can destroy the DMA mapping and unpin any pages */ + list_for_each_entry_safe(ibmr, next, &unmap_list, list) { + unpinned += ibmr->sg_len; + __rds_ib_teardown_mr(ibmr); + if (nfreed < free_goal || ibmr->remap_count >= pool->fmr_attr.max_maps) { + rds_ib_stats_inc(s_ib_rdma_mr_free); + list_del(&ibmr->list); + ib_dealloc_fmr(ibmr->fmr); + kfree(ibmr); + nfreed++; + } + ncleaned++; + } + + spin_lock_irqsave(&pool->list_lock, flags); + list_splice(&unmap_list, &pool->clean_list); + spin_unlock_irqrestore(&pool->list_lock, 
flags); + + atomic_sub(unpinned, &pool->free_pinned); + atomic_sub(ncleaned, &pool->dirty_count); + atomic_sub(nfreed, &pool->item_count); + +out: + mutex_unlock(&pool->flush_lock); + return ret; +} + +static void rds_ib_mr_pool_flush_worker(struct work_struct *work) +{ + struct rds_ib_mr_pool *pool = container_of(work, struct rds_ib_mr_pool, flush_worker); + + rds_ib_flush_mr_pool(pool, 0); +} + +void rds_ib_free_mr(void *trans_private, int invalidate) +{ + struct rds_ib_mr *ibmr = trans_private; + struct rds_ib_device *rds_ibdev = ibmr->device; + struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; + unsigned long flags; + + rdsdebug("RDS/IB: free_mr nents %u\n", ibmr->sg_len); + + /* Return it to the pool's free list */ + spin_lock_irqsave(&pool->list_lock, flags); + if (ibmr->remap_count >= pool->fmr_attr.max_maps) + list_add(&ibmr->list, &pool->drop_list); + else + list_add(&ibmr->list, &pool->free_list); + + atomic_add(ibmr->sg_len, &pool->free_pinned); + atomic_inc(&pool->dirty_count); + spin_unlock_irqrestore(&pool->list_lock, flags); + + /* If we've pinned too many pages, request a flush */ + if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned + || atomic_read(&pool->dirty_count) >= pool->max_items / 10) + queue_work(rds_wq, &pool->flush_worker); + + if (invalidate) { + if (likely(!in_interrupt())) { + rds_ib_flush_mr_pool(pool, 0); + } else { + /* We get here if the user created a MR marked + * as use_once and invalidate at the same time. 
*/ + queue_work(rds_wq, &pool->flush_worker); + } + } +} + +void rds_ib_flush_mrs(void) +{ + struct rds_ib_device *rds_ibdev; + + list_for_each_entry(rds_ibdev, &rds_ib_devices, list) { + struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; + + if (pool) + rds_ib_flush_mr_pool(pool, 0); + } +} + +void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents, + struct rds_sock *rs, u32 *key_ret) +{ + struct rds_ib_device *rds_ibdev; + struct rds_ib_mr *ibmr = NULL; + int ret; + + rds_ibdev = rds_ib_get_device(rs->rs_bound_addr); + if (!rds_ibdev) { + ret = -ENODEV; + goto out; + } + + if (!rds_ibdev->mr_pool) { + ret = -ENODEV; + goto out; + } + + ibmr = rds_ib_alloc_fmr(rds_ibdev); + if (IS_ERR(ibmr)) + return ibmr; + + ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents); + if (ret == 0) + *key_ret = ibmr->fmr->rkey; + else + printk(KERN_WARNING "RDS/IB: map_fmr failed (errno=%d)\n", ret); + + ibmr->device = rds_ibdev; + + out: + if (ret) { + if (ibmr) + rds_ib_free_mr(ibmr, 0); + ibmr = ERR_PTR(ret); + } + return ibmr; +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:33 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:33 -0800 Subject: [ofa-general] [PATCH 16/26] RDS/IB: Implement IB-specific datagram send. In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-17-git-send-email-andy.grover@oracle.com> Specific to IB is a credits-based flow control mechanism, in addition to the expected usage of the IB API to package outgoing data into work requests. 
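The credits-based flow control described in this commit message can be modelled outside the kernel. What follows is a minimal user-space sketch, not the RDS implementation. It assumes a hypothetical layout in which one 32-bit atomic holds two 16-bit counters: send credits in the low half and posted (not yet advertised) credits in the high half. It uses C11 `<stdatomic.h>` in place of the kernel's `atomic_t`/`atomic_cmpxchg`. The `CR_*` macros and the `credits_*` helpers are illustrative names only. The sketch also omits parts of the real send path in this patch, such as the `need_posted` case and setting `RDS_LL_SEND_FULL` when credits run short.

```c
/*
 * Hypothetical user-space model of credit-based flow control.
 * Assumption: one 32-bit atomic packs two 16-bit counters,
 * send credits (low half) and posted credits (high half).
 * Names and layout are illustrative, not the RDS kernel API.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define CR_GET_SEND(v) ((uint32_t)(v) & 0xffffu)
#define CR_GET_POST(v) ((uint32_t)(v) >> 16)
#define CR_SET_SEND(n) ((uint32_t)(n) & 0xffffu)
#define CR_SET_POST(n) ((uint32_t)(n) << 16)

static _Atomic uint32_t credits;

/* Receive path: the peer granted us n more send credits. */
static void credits_add_send(uint32_t n)
{
	atomic_fetch_add(&credits, CR_SET_SEND(n));
}

/* Receive path: we posted n fresh receive buffers to advertise. */
static void credits_add_posted(uint32_t n)
{
	atomic_fetch_add(&credits, CR_SET_POST(n));
}

/*
 * Send path: take up to `wanted` send credits and, if any were granted,
 * claim all pending posted credits so the outgoing header can advertise
 * them. Both halves change in a single compare-and-swap, so no lock is
 * shared with the receive path.
 */
static uint32_t credits_grab(uint32_t wanted, uint32_t *advertise)
{
	uint32_t oldval = atomic_load(&credits);
	uint32_t newval, avail, posted, got;

	assert(wanted <= 0xffffu);
	do {
		avail = CR_GET_SEND(oldval);
		posted = CR_GET_POST(oldval);
		*advertise = 0;

		/* With nothing to advertise, hold back the last credit so
		 * that a credit update can still be sent later. */
		if (avail && !posted)
			avail--;

		got = avail < wanted ? avail : wanted;
		newval = oldval - CR_SET_SEND(got);
		if (got && posted) {
			*advertise = posted;
			newval -= CR_SET_POST(posted);
		}
		/* On failure, oldval is reloaded and the loop recomputes. */
	} while (!atomic_compare_exchange_weak(&credits, &oldval, newval));

	return got;
}
```

For example, after `credits_add_send(4)` and `credits_add_posted(3)`, the call `credits_grab(2, &adv)` returns 2 and sets `adv` to 3. A later `credits_grab(5, &adv)` returns only 1, because the last remaining credit is withheld while nothing is posted. Packing both counters into one word is what allows the send path to consume a send credit and the posted credits together, without racing the receive path.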
Signed-off-by: Andy Grover --- net/rds/ib_send.c | 874 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 874 insertions(+), 0 deletions(-) create mode 100644 net/rds/ib_send.c diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c new file mode 100644 index 0000000..cb6c52c --- /dev/null +++ b/net/rds/ib_send.c @@ -0,0 +1,874 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include +#include + +#include "rds.h" +#include "rdma.h" +#include "ib.h" + +static void rds_ib_send_rdma_complete(struct rds_message *rm, + int wc_status) +{ + int notify_status; + + switch (wc_status) { + case IB_WC_WR_FLUSH_ERR: + return; + + case IB_WC_SUCCESS: + notify_status = RDS_RDMA_SUCCESS; + break; + + case IB_WC_REM_ACCESS_ERR: + notify_status = RDS_RDMA_REMOTE_ERROR; + break; + + default: + notify_status = RDS_RDMA_OTHER_ERROR; + break; + } + rds_rdma_send_complete(rm, notify_status); +} + +static void rds_ib_send_unmap_rdma(struct rds_ib_connection *ic, + struct rds_rdma_op *op) +{ + if (op->r_mapped) { + ib_dma_unmap_sg(ic->i_cm_id->device, + op->r_sg, op->r_nents, + op->r_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE); + op->r_mapped = 0; + } +} + +static void rds_ib_send_unmap_rm(struct rds_ib_connection *ic, + struct rds_ib_send_work *send, + int wc_status) +{ + struct rds_message *rm = send->s_rm; + + rdsdebug("ic %p send %p rm %p\n", ic, send, rm); + + ib_dma_unmap_sg(ic->i_cm_id->device, + rm->m_sg, rm->m_nents, + DMA_TO_DEVICE); + + if (rm->m_rdma_op != NULL) { + rds_ib_send_unmap_rdma(ic, rm->m_rdma_op); + + /* If the user asked for a completion notification on this + * message, we can implement three different semantics: + * 1. Notify when we received the ACK on the RDS message + * that was queued with the RDMA. This provides reliable + * notification of RDMA status at the expense of a one-way + * packet delay. + * 2. Notify when the IB stack gives us the completion event for + * the RDMA operation. + * 3. Notify when the IB stack gives us the completion event for + * the accompanying RDS messages. + * Here, we implement approach #3. To implement approach #2, + * call rds_rdma_send_complete from the cq_handler. To implement #1, + * don't call rds_rdma_send_complete at all, and fall back to the notify + * handling in the ACK processing code. 
+ * + * Note: There's no need to explicitly sync any RDMA buffers using + * ib_dma_sync_sg_for_cpu - the completion for the RDMA + * operation itself unmapped the RDMA buffers, which takes care + * of synching. + */ + rds_ib_send_rdma_complete(rm, wc_status); + + if (rm->m_rdma_op->r_write) + rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes); + else + rds_stats_add(s_recv_rdma_bytes, rm->m_rdma_op->r_bytes); + } + + /* If anyone waited for this message to get flushed out, wake + * them up now */ + rds_message_unmapped(rm); + + rds_message_put(rm); + send->s_rm = NULL; +} + +void rds_ib_send_init_ring(struct rds_ib_connection *ic) +{ + struct rds_ib_send_work *send; + u32 i; + + for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) { + struct ib_sge *sge; + + send->s_rm = NULL; + send->s_op = NULL; + + send->s_wr.wr_id = i; + send->s_wr.sg_list = send->s_sge; + send->s_wr.num_sge = 1; + send->s_wr.opcode = IB_WR_SEND; + send->s_wr.send_flags = 0; + send->s_wr.ex.imm_data = 0; + + sge = rds_ib_data_sge(ic, send->s_sge); + sge->lkey = ic->i_mr->lkey; + + sge = rds_ib_header_sge(ic, send->s_sge); + sge->addr = ic->i_send_hdrs_dma + (i * sizeof(struct rds_header)); + sge->length = sizeof(struct rds_header); + sge->lkey = ic->i_mr->lkey; + } +} + +void rds_ib_send_clear_ring(struct rds_ib_connection *ic) +{ + struct rds_ib_send_work *send; + u32 i; + + for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) { + if (send->s_wr.opcode == 0xdead) + continue; + if (send->s_rm) + rds_ib_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR); + if (send->s_op) + rds_ib_send_unmap_rdma(ic, send->s_op); + } +} + +/* + * The _oldest/_free ring operations here race cleanly with the alloc/unalloc + * operations performed in the send path. As the sender allocs and potentially + * unallocs the next free entry in the ring it doesn't alter which is + * the next to be freed, which is what this is concerned with. 
+ */ +void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context) +{ + struct rds_connection *conn = context; + struct rds_ib_connection *ic = conn->c_transport_data; + struct ib_wc wc; + struct rds_ib_send_work *send; + u32 completed; + u32 oldest; + u32 i = 0; + int ret; + + rdsdebug("cq %p conn %p\n", cq, conn); + rds_ib_stats_inc(s_ib_tx_cq_call); + ret = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + if (ret) + rdsdebug("ib_req_notify_cq send failed: %d\n", ret); + + while (ib_poll_cq(cq, 1, &wc) > 0) { + rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", + (unsigned long long)wc.wr_id, wc.status, wc.byte_len, + be32_to_cpu(wc.ex.imm_data)); + rds_ib_stats_inc(s_ib_tx_cq_event); + + if (wc.wr_id == RDS_IB_ACK_WR_ID) { + if (ic->i_ack_queued + HZ/2 < jiffies) + rds_ib_stats_inc(s_ib_tx_stalled); + rds_ib_ack_send_complete(ic); + continue; + } + + oldest = rds_ib_ring_oldest(&ic->i_send_ring); + + completed = rds_ib_ring_completed(&ic->i_send_ring, wc.wr_id, oldest); + + for (i = 0; i < completed; i++) { + send = &ic->i_sends[oldest]; + + /* In the error case, wc.opcode sometimes contains garbage */ + switch (send->s_wr.opcode) { + case IB_WR_SEND: + if (send->s_rm) + rds_ib_send_unmap_rm(ic, send, wc.status); + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_READ: + /* Nothing to be done - the SG list will be unmapped + * when the SEND completes. */ + break; + default: + if (printk_ratelimit()) + printk(KERN_NOTICE + "RDS/IB: %s: unexpected opcode 0x%x in WR!\n", + __func__, send->s_wr.opcode); + break; + } + + send->s_wr.opcode = 0xdead; + send->s_wr.num_sge = 1; + if (send->s_queued + HZ/2 < jiffies) + rds_ib_stats_inc(s_ib_tx_stalled); + + /* If a RDMA operation produced an error, signal this right + * away. If we don't, the subsequent SEND that goes with this + * RDMA will be canceled with ERR_WFLUSH, and the application + * will never learn that the RDMA failed.
*/ + if (unlikely(wc.status == IB_WC_REM_ACCESS_ERR && send->s_op)) { + struct rds_message *rm; + + rm = rds_send_get_message(conn, send->s_op); + if (rm) + rds_ib_send_rdma_complete(rm, wc.status); + } + + oldest = (oldest + 1) % ic->i_send_ring.w_nr; + } + + rds_ib_ring_free(&ic->i_send_ring, completed); + + if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags) + || test_bit(0, &conn->c_map_queued)) + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + + /* We expect errors as the qp is drained during shutdown */ + if (wc.status != IB_WC_SUCCESS && rds_conn_up(conn)) { + rds_ib_conn_error(conn, + "send completion on %pI4 " + "had status %u, disconnecting and reconnecting\n", + &conn->c_faddr, wc.status); + } + } +} + +/* + * This is the main function for allocating credits when sending + * messages. + * + * Conceptually, we have two counters: + * - send credits: this tells us how many WRs we're allowed + * to submit without overrunning the receiver's queue. For + * each SEND WR we post, we decrement this by one. + * + * - posted credits: this tells us how many WRs we recently + * posted to the receive queue. This value is transferred + * to the peer as a "credit update" in a RDS header field. + * Every time we transmit credits to the peer, we subtract + * the amount of transferred credits from this counter. + * + * It is essential that we avoid situations where both sides have + * exhausted their send credits, and are unable to send new credits + * to the peer. We achieve this by requiring that we send at least + * one credit update to the peer before exhausting our credits. + * When new credits arrive, we subtract one credit that is withheld + * until we've posted new buffers and are ready to transmit these + * credits (see rds_ib_send_add_credits below). + * + * The RDS send code is essentially single-threaded; rds_send_xmit + * grabs c_send_lock to ensure exclusive access to the send ring.
+ * However, the ACK sending code is independent and can race with + * message SENDs. + * + * In the send path, we need to update the counters for send credits + * and the counter of posted buffers atomically - when we use the + * last available credit, we cannot allow another thread to race us + * and grab the posted credits counter. Hence, we have to use a + * spinlock to protect the credit counter, or use atomics. + * + * Spinlocks shared between the send and the receive path are bad, + * because they create unnecessary delays. An early implementation + * using a spinlock showed a 5% degradation in throughput at some + * loads. + * + * This implementation avoids spinlocks completely, putting both + * counters into a single atomic, and updating that atomic using + * atomic_add (in the receive path, when receiving fresh credits), + * and using atomic_cmpxchg when updating the two counters. + */ +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, + u32 wanted, u32 *adv_credits, int need_posted) +{ + unsigned int avail, posted, got = 0, advertise; + long oldval, newval; + + *adv_credits = 0; + if (!ic->i_flowctl) + return wanted; + +try_again: + advertise = 0; + oldval = newval = atomic_read(&ic->i_credits); + posted = IB_GET_POST_CREDITS(oldval); + avail = IB_GET_SEND_CREDITS(oldval); + + rdsdebug("rds_ib_send_grab_credits(%u): credits=%u posted=%u\n", + wanted, avail, posted); + + /* The last credit must be used to send a credit update. */ + if (avail && !posted) + avail--; + + if (avail < wanted) { + struct rds_connection *conn = ic->i_cm_id->context; + + /* Oops, there aren't that many credits left! */ + set_bit(RDS_LL_SEND_FULL, &conn->c_flags); + got = avail; + } else { + /* Sometimes you get what you want, lalala. */ + got = wanted; + } + newval -= IB_SET_SEND_CREDITS(got); + + /* + * If need_posted is non-zero, then the caller wants + * the posted regardless of whether any send credits are + * available. 
+ */ + if (posted && (got || need_posted)) { + advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT); + newval -= IB_SET_POST_CREDITS(advertise); + } + + /* Finally bill everything */ + if (atomic_cmpxchg(&ic->i_credits, oldval, newval) != oldval) + goto try_again; + + *adv_credits = advertise; + return got; +} + +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (credits == 0) + return; + + rdsdebug("rds_ib_send_add_credits(%u): current=%u%s\n", + credits, + IB_GET_SEND_CREDITS(atomic_read(&ic->i_credits)), + test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ? ", ll_send_full" : ""); + + atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits); + if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + + WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384); + + rds_ib_stats_inc(s_ib_rx_credit_updates); +} + +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (posted == 0) + return; + + atomic_add(IB_SET_POST_CREDITS(posted), &ic->i_credits); + + /* Decide whether to send an update to the peer now. + * If we would send a credit update for every single buffer we + * post, we would end up with an ACK storm (ACK arrives, + * consumes buffer, we refill the ring, send ACK to remote + * advertising the newly posted buffer... ad inf) + * + * Performance pretty much depends on how often we send + * credit updates - too frequent updates mean lots of ACKs. + * Too infrequent updates, and the peer will run out of + * credits and have to throttle. + * For the time being, 16 seems to be a good compromise.
+ */ + if (IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)) >= 16) + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); +} + +static inline void +rds_ib_xmit_populate_wr(struct rds_ib_connection *ic, + struct rds_ib_send_work *send, unsigned int pos, + unsigned long buffer, unsigned int length, + int send_flags) +{ + struct ib_sge *sge; + + WARN_ON(pos != send - ic->i_sends); + + send->s_wr.send_flags = send_flags; + send->s_wr.opcode = IB_WR_SEND; + send->s_wr.num_sge = 2; + send->s_wr.next = NULL; + send->s_queued = jiffies; + send->s_op = NULL; + + if (length != 0) { + sge = rds_ib_data_sge(ic, send->s_sge); + sge->addr = buffer; + sge->length = length; + sge->lkey = ic->i_mr->lkey; + + sge = rds_ib_header_sge(ic, send->s_sge); + } else { + /* We're sending a packet with no payload. There is only + * one SGE */ + send->s_wr.num_sge = 1; + sge = &send->s_sge[0]; + } + + sge->addr = ic->i_send_hdrs_dma + (pos * sizeof(struct rds_header)); + sge->length = sizeof(struct rds_header); + sge->lkey = ic->i_mr->lkey; +} + +/* + * This can be called multiple times for a given message. The first time + * we see a message we map its scatterlist into the IB device so that + * we can provide that mapped address to the IB scatter gather entries + * in the IB work requests. We translate the scatterlist into a series + * of work requests that fragment the message. These work requests complete + * in order so we pass ownership of the message to the completion handler + * once we send the final fragment. + * + * The RDS core uses the c_send_lock to only enter this function once + * per connection. This makes sure that the tx ring alloc/unalloc pairs + * don't get out of sync and confuse the ring. 
+ */ +int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, + unsigned int hdr_off, unsigned int sg, unsigned int off) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct ib_device *dev = ic->i_cm_id->device; + struct rds_ib_send_work *send = NULL; + struct rds_ib_send_work *first; + struct rds_ib_send_work *prev; + struct ib_send_wr *failed_wr; + struct scatterlist *scat; + u32 pos; + u32 i; + u32 work_alloc; + u32 credit_alloc; + u32 posted; + u32 adv_credits = 0; + int send_flags = 0; + int sent; + int ret; + int flow_controlled = 0; + + BUG_ON(off % RDS_FRAG_SIZE); + BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header)); + + /* FIXME we may overallocate here */ + if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) + i = 1; + else + i = ceil(be32_to_cpu(rm->m_inc.i_hdr.h_len), RDS_FRAG_SIZE); + + work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, i, &pos); + if (work_alloc == 0) { + set_bit(RDS_LL_SEND_FULL, &conn->c_flags); + rds_ib_stats_inc(s_ib_tx_ring_full); + ret = -ENOMEM; + goto out; + } + + credit_alloc = work_alloc; + if (ic->i_flowctl) { + credit_alloc = rds_ib_send_grab_credits(ic, work_alloc, &posted, 0); + adv_credits += posted; + if (credit_alloc < work_alloc) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc); + work_alloc = credit_alloc; + flow_controlled++; + } + if (work_alloc == 0) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + rds_ib_stats_inc(s_ib_tx_throttle); + ret = -ENOMEM; + goto out; + } + } + + /* map the message the first time we see it */ + if (ic->i_rm == NULL) { + /* + printk(KERN_NOTICE "rds_ib_xmit prep msg dport=%u flags=0x%x len=%d\n", + be16_to_cpu(rm->m_inc.i_hdr.h_dport), + rm->m_inc.i_hdr.h_flags, + be32_to_cpu(rm->m_inc.i_hdr.h_len)); + */ + if (rm->m_nents) { + rm->m_count = ib_dma_map_sg(dev, + rm->m_sg, rm->m_nents, DMA_TO_DEVICE); + rdsdebug("ic %p mapping rm %p: %d\n", ic, rm, rm->m_count); + if (rm->m_count == 0) { + 
rds_ib_stats_inc(s_ib_tx_sg_mapping_failure); + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + ret = -ENOMEM; /* XXX ? */ + goto out; + } + } else { + rm->m_count = 0; + } + + ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs; + ic->i_unsignaled_bytes = rds_ib_sysctl_max_unsig_bytes; + rds_message_addref(rm); + ic->i_rm = rm; + + /* Finalize the header */ + if (test_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags)) + rm->m_inc.i_hdr.h_flags |= RDS_FLAG_ACK_REQUIRED; + if (test_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags)) + rm->m_inc.i_hdr.h_flags |= RDS_FLAG_RETRANSMITTED; + + /* If it has a RDMA op, tell the peer we did it. This is + * used by the peer to release use-once RDMA MRs. */ + if (rm->m_rdma_op) { + struct rds_ext_header_rdma ext_hdr; + + ext_hdr.h_rdma_rkey = cpu_to_be32(rm->m_rdma_op->r_key); + rds_message_add_extension(&rm->m_inc.i_hdr, + RDS_EXTHDR_RDMA, &ext_hdr, sizeof(ext_hdr)); + } + if (rm->m_rdma_cookie) { + rds_message_add_rdma_dest_extension(&rm->m_inc.i_hdr, + rds_rdma_cookie_key(rm->m_rdma_cookie), + rds_rdma_cookie_offset(rm->m_rdma_cookie)); + } + + /* Note - rds_ib_piggyb_ack clears the ACK_REQUIRED bit, so + * we should not do this unless we have a chance of at least + * sticking the header into the send ring. Which is why we + * should call rds_ib_ring_alloc first. */ + rm->m_inc.i_hdr.h_ack = cpu_to_be64(rds_ib_piggyb_ack(ic)); + rds_message_make_checksum(&rm->m_inc.i_hdr); + + /* + * Update adv_credits since we reset the ACK_REQUIRED bit. + */ + rds_ib_send_grab_credits(ic, 0, &posted, 1); + adv_credits += posted; + BUG_ON(adv_credits > 255); + } else if (ic->i_rm != rm) + BUG(); + + send = &ic->i_sends[pos]; + first = send; + prev = NULL; + scat = &rm->m_sg[sg]; + sent = 0; + i = 0; + + /* Sometimes you want to put a fence between an RDMA + * READ and the following SEND. + * We could either do this all the time + * or when requested by the user. Right now, we let + * the application choose. 
+ */ + if (rm->m_rdma_op && rm->m_rdma_op->r_fence) + send_flags = IB_SEND_FENCE; + + /* + * We could be copying the header into the unused tail of the page. + * That would need to be changed in the future when those pages might + * be mapped userspace pages or page cache pages. So instead we always + * use a second sge and our long-lived ring of mapped headers. We send + * the header after the data so that the data payload can be aligned on + * the receiver. + */ + + /* handle a 0-len message */ + if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) { + rds_ib_xmit_populate_wr(ic, send, pos, 0, 0, send_flags); + goto add_header; + } + + /* if there's data reference it with a chain of work reqs */ + for (; i < work_alloc && scat != &rm->m_sg[rm->m_count]; i++) { + unsigned int len; + + send = &ic->i_sends[pos]; + + len = min(RDS_FRAG_SIZE, ib_sg_dma_len(dev, scat) - off); + rds_ib_xmit_populate_wr(ic, send, pos, + ib_sg_dma_address(dev, scat) + off, len, + send_flags); + + /* + * We want to delay signaling completions just enough to get + * the batching benefits but not so much that we create dead time + * on the wire. + */ + if (ic->i_unsignaled_wrs-- == 0) { + ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs; + send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + } + + ic->i_unsignaled_bytes -= len; + if (ic->i_unsignaled_bytes <= 0) { + ic->i_unsignaled_bytes = rds_ib_sysctl_max_unsig_bytes; + send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + } + + /* + * Always signal the last one if we're stopping due to flow control. + */ + if (flow_controlled && i == (work_alloc-1)) + send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + rdsdebug("send %p wr %p num_sge %u next %p\n", send, + &send->s_wr, send->s_wr.num_sge, send->s_wr.next); + + sent += len; + off += len; + if (off == ib_sg_dma_len(dev, scat)) { + scat++; + off = 0; + } + +add_header: + /* Tack on the header after the data. 
The header SGE should already + * have been set up to point to the right header buffer. */ + memcpy(&ic->i_send_hdrs[pos], &rm->m_inc.i_hdr, sizeof(struct rds_header)); + + if (0) { + struct rds_header *hdr = &ic->i_send_hdrs[pos]; + + printk(KERN_NOTICE "send WR dport=%u flags=0x%x len=%d\n", + be16_to_cpu(hdr->h_dport), + hdr->h_flags, + be32_to_cpu(hdr->h_len)); + } + if (adv_credits) { + struct rds_header *hdr = &ic->i_send_hdrs[pos]; + + /* add credit and redo the header checksum */ + hdr->h_credit = adv_credits; + rds_message_make_checksum(hdr); + adv_credits = 0; + rds_ib_stats_inc(s_ib_tx_credit_updates); + } + + if (prev) + prev->s_wr.next = &send->s_wr; + prev = send; + + pos = (pos + 1) % ic->i_send_ring.w_nr; + } + + /* Account the RDS header in the number of bytes we sent, but just once. + * The caller has no concept of fragmentation. */ + if (hdr_off == 0) + sent += sizeof(struct rds_header); + + /* if we finished the message then send completion owns it */ + if (scat == &rm->m_sg[rm->m_count]) { + prev->s_rm = ic->i_rm; + prev->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + ic->i_rm = NULL; + } + + if (i < work_alloc) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i); + work_alloc = i; + } + if (ic->i_flowctl && i < credit_alloc) + rds_ib_send_add_credits(conn, credit_alloc - i); + + /* XXX need to worry about failed_wr and partial sends. 
*/ + failed_wr = &first->s_wr; + ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr); + rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic, + first, &first->s_wr, ret, failed_wr); + BUG_ON(failed_wr != &first->s_wr); + if (ret) { + printk(KERN_WARNING "RDS/IB: ib_post_send to %pI4 " + "returned %d\n", &conn->c_faddr, ret); + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + if (prev->s_rm) { + ic->i_rm = prev->s_rm; + prev->s_rm = NULL; + } + /* Finesse this later */ + BUG(); + goto out; + } + + ret = sent; +out: + BUG_ON(adv_credits); + return ret; +} + +int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct rds_ib_send_work *send = NULL; + struct rds_ib_send_work *first; + struct rds_ib_send_work *prev; + struct ib_send_wr *failed_wr; + struct rds_ib_device *rds_ibdev; + struct scatterlist *scat; + unsigned long len; + u64 remote_addr = op->r_remote_addr; + u32 pos; + u32 work_alloc; + u32 i; + u32 j; + int sent; + int ret; + int num_sge; + + rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client); + + /* map the message the first time we see it */ + if (!op->r_mapped) { + op->r_count = ib_dma_map_sg(ic->i_cm_id->device, + op->r_sg, op->r_nents, (op->r_write) ? + DMA_TO_DEVICE : DMA_FROM_DEVICE); + rdsdebug("ic %p mapping op %p: %d\n", ic, op, op->r_count); + if (op->r_count == 0) { + rds_ib_stats_inc(s_ib_tx_sg_mapping_failure); + ret = -ENOMEM; /* XXX ? */ + goto out; + } + + op->r_mapped = 1; + } + + /* + * Instead of knowing how to return a partial rdma read/write we insist that there + * be enough work requests to send the entire message. 
+ */ + i = ceil(op->r_count, rds_ibdev->max_sge); + + work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, i, &pos); + if (work_alloc != i) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + rds_ib_stats_inc(s_ib_tx_ring_full); + ret = -ENOMEM; + goto out; + } + + send = &ic->i_sends[pos]; + first = send; + prev = NULL; + scat = &op->r_sg[0]; + sent = 0; + num_sge = op->r_count; + + for (i = 0; i < work_alloc && scat != &op->r_sg[op->r_count]; i++) { + send->s_wr.send_flags = 0; + send->s_queued = jiffies; + /* + * We want to delay signaling completions just enough to get + * the batching benefits but not so much that we create dead time on the wire. + */ + if (ic->i_unsignaled_wrs-- == 0) { + ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs; + send->s_wr.send_flags = IB_SEND_SIGNALED; + } + + send->s_wr.opcode = op->r_write ? IB_WR_RDMA_WRITE : IB_WR_RDMA_READ; + send->s_wr.wr.rdma.remote_addr = remote_addr; + send->s_wr.wr.rdma.rkey = op->r_key; + send->s_op = op; + + if (num_sge > rds_ibdev->max_sge) { + send->s_wr.num_sge = rds_ibdev->max_sge; + num_sge -= rds_ibdev->max_sge; + } else { + send->s_wr.num_sge = num_sge; + } + + send->s_wr.next = NULL; + + if (prev) + prev->s_wr.next = &send->s_wr; + + for (j = 0; j < send->s_wr.num_sge && scat != &op->r_sg[op->r_count]; j++) { + len = ib_sg_dma_len(ic->i_cm_id->device, scat); + send->s_sge[j].addr = + ib_sg_dma_address(ic->i_cm_id->device, scat); + send->s_sge[j].length = len; + send->s_sge[j].lkey = ic->i_mr->lkey; + + sent += len; + rdsdebug("ic %p sent %d remote_addr %llu\n", ic, sent, remote_addr); + + remote_addr += len; + scat++; + } + + rdsdebug("send %p wr %p num_sge %u next %p\n", send, + &send->s_wr, send->s_wr.num_sge, send->s_wr.next); + + prev = send; + if (++send == &ic->i_sends[ic->i_send_ring.w_nr]) + send = ic->i_sends; + } + + /* if we finished the message then send completion owns it */ + if (scat == &op->r_sg[op->r_count]) + prev->s_wr.send_flags = IB_SEND_SIGNALED; + + if (i < work_alloc) 
{ + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i); + work_alloc = i; + } + + failed_wr = &first->s_wr; + ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr); + rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic, + first, &first->s_wr, ret, failed_wr); + BUG_ON(failed_wr != &first->s_wr); + if (ret) { + printk(KERN_WARNING "RDS/IB: rdma ib_post_send to %pI4 " + "returned %d\n", &conn->c_faddr, ret); + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + goto out; + } + + if (unlikely(failed_wr != &first->s_wr)) { + printk(KERN_WARNING "RDS/IB: ib_post_send() rc=%d, but failed_wqe updated!\n", ret); + BUG_ON(failed_wr != &first->s_wr); + } + + +out: + return ret; +} + +void rds_ib_xmit_complete(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + /* We may have a pending ACK or window update we were unable + * to send previously (due to flow control). Try again. */ + rds_ib_attempt_ack(ic); +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:34 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:34 -0800 Subject: [ofa-general] [PATCH 17/26] RDS/IB: Receive datagrams via IB In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-18-git-send-email-andy.grover@oracle.com> Header parsing, ring refill. It puts the incoming data into an rds_incoming struct, which is passed up to rds-core. Signed-off-by: Andy Grover --- net/rds/ib_recv.c | 869 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 869 insertions(+), 0 deletions(-) create mode 100644 net/rds/ib_recv.c diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c new file mode 100644 index 0000000..5061b55 --- /dev/null +++ b/net/rds/ib_recv.c @@ -0,0 +1,869 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include + +#include "rds.h" +#include "ib.h" + +static struct kmem_cache *rds_ib_incoming_slab; +static struct kmem_cache *rds_ib_frag_slab; +static atomic_t rds_ib_allocation = ATOMIC_INIT(0); + +static void rds_ib_frag_drop_page(struct rds_page_frag *frag) +{ + rdsdebug("frag %p page %p\n", frag, frag->f_page); + __free_page(frag->f_page); + frag->f_page = NULL; +} + +static void rds_ib_frag_free(struct rds_page_frag *frag) +{ + rdsdebug("frag %p page %p\n", frag, frag->f_page); + BUG_ON(frag->f_page != NULL); + kmem_cache_free(rds_ib_frag_slab, frag); +} + +/* + * We map a page at a time. 
Its fragments are posted in order. This + * is called in fragment order as the fragments get send completion events. + * Only the last frag in the page performs the unmapping. + * + * It's OK for ring cleanup to call this in whatever order it likes because + * DMA is not in flight and so we can unmap while other ring entries still + * hold page references in their frags. + */ +static void rds_ib_recv_unmap_page(struct rds_ib_connection *ic, + struct rds_ib_recv_work *recv) +{ + struct rds_page_frag *frag = recv->r_frag; + + rdsdebug("recv %p frag %p page %p\n", recv, frag, frag->f_page); + if (frag->f_mapped) + ib_dma_unmap_page(ic->i_cm_id->device, + frag->f_mapped, + RDS_FRAG_SIZE, DMA_FROM_DEVICE); + frag->f_mapped = 0; +} + +void rds_ib_recv_init_ring(struct rds_ib_connection *ic) +{ + struct rds_ib_recv_work *recv; + u32 i; + + for (i = 0, recv = ic->i_recvs; i < ic->i_recv_ring.w_nr; i++, recv++) { + struct ib_sge *sge; + + recv->r_ibinc = NULL; + recv->r_frag = NULL; + + recv->r_wr.next = NULL; + recv->r_wr.wr_id = i; + recv->r_wr.sg_list = recv->r_sge; + recv->r_wr.num_sge = RDS_IB_RECV_SGE; + + sge = rds_ib_data_sge(ic, recv->r_sge); + sge->addr = 0; + sge->length = RDS_FRAG_SIZE; + sge->lkey = ic->i_mr->lkey; + + sge = rds_ib_header_sge(ic, recv->r_sge); + sge->addr = ic->i_recv_hdrs_dma + (i * sizeof(struct rds_header)); + sge->length = sizeof(struct rds_header); + sge->lkey = ic->i_mr->lkey; + } +} + +static void rds_ib_recv_clear_one(struct rds_ib_connection *ic, + struct rds_ib_recv_work *recv) +{ + if (recv->r_ibinc) { + rds_inc_put(&recv->r_ibinc->ii_inc); + recv->r_ibinc = NULL; + } + if (recv->r_frag) { + rds_ib_recv_unmap_page(ic, recv); + if (recv->r_frag->f_page) + rds_ib_frag_drop_page(recv->r_frag); + rds_ib_frag_free(recv->r_frag); + recv->r_frag = NULL; + } +} + +void rds_ib_recv_clear_ring(struct rds_ib_connection *ic) +{ + u32 i; + + for (i = 0; i < ic->i_recv_ring.w_nr; i++) + rds_ib_recv_clear_one(ic, &ic->i_recvs[i]); + + if 
(ic->i_frag.f_page) + rds_ib_frag_drop_page(&ic->i_frag); +} + +static int rds_ib_recv_refill_one(struct rds_connection *conn, + struct rds_ib_recv_work *recv, + gfp_t kptr_gfp, gfp_t page_gfp) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + dma_addr_t dma_addr; + struct ib_sge *sge; + int ret = -ENOMEM; + + if (recv->r_ibinc == NULL) { + if (atomic_read(&rds_ib_allocation) >= rds_ib_sysctl_max_recv_allocation) { + rds_ib_stats_inc(s_ib_rx_alloc_limit); + goto out; + } + recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab, + kptr_gfp); + if (recv->r_ibinc == NULL) + goto out; + atomic_inc(&rds_ib_allocation); + INIT_LIST_HEAD(&recv->r_ibinc->ii_frags); + rds_inc_init(&recv->r_ibinc->ii_inc, conn, conn->c_faddr); + } + + if (recv->r_frag == NULL) { + recv->r_frag = kmem_cache_alloc(rds_ib_frag_slab, kptr_gfp); + if (recv->r_frag == NULL) + goto out; + INIT_LIST_HEAD(&recv->r_frag->f_item); + recv->r_frag->f_page = NULL; + } + + if (ic->i_frag.f_page == NULL) { + ic->i_frag.f_page = alloc_page(page_gfp); + if (ic->i_frag.f_page == NULL) + goto out; + ic->i_frag.f_offset = 0; + } + + dma_addr = ib_dma_map_page(ic->i_cm_id->device, + ic->i_frag.f_page, + ic->i_frag.f_offset, + RDS_FRAG_SIZE, + DMA_FROM_DEVICE); + if (ib_dma_mapping_error(ic->i_cm_id->device, dma_addr)) + goto out; + + /* + * Once we get the RDS_PAGE_LAST_OFF frag then rds_ib_frag_unmap() + * must be called on this recv. This happens as completions hit + * in order or on connection shutdown. 
+ */ + recv->r_frag->f_page = ic->i_frag.f_page; + recv->r_frag->f_offset = ic->i_frag.f_offset; + recv->r_frag->f_mapped = dma_addr; + + sge = rds_ib_data_sge(ic, recv->r_sge); + sge->addr = dma_addr; + sge->length = RDS_FRAG_SIZE; + + sge = rds_ib_header_sge(ic, recv->r_sge); + sge->addr = ic->i_recv_hdrs_dma + (recv - ic->i_recvs) * sizeof(struct rds_header); + sge->length = sizeof(struct rds_header); + + get_page(recv->r_frag->f_page); + + if (ic->i_frag.f_offset < RDS_PAGE_LAST_OFF) { + ic->i_frag.f_offset += RDS_FRAG_SIZE; + } else { + put_page(ic->i_frag.f_page); + ic->i_frag.f_page = NULL; + ic->i_frag.f_offset = 0; + } + + ret = 0; +out: + return ret; +} + +/* + * This tries to allocate and post unused work requests after making sure that + * they have all the allocations they need to queue received fragments into + * sockets. The i_recv_mutex is held here so that ring_alloc and _unalloc + * pairs don't go unmatched. + * + * -1 is returned if posting fails due to temporary resource exhaustion. + */ +int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, + gfp_t page_gfp, int prefill) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct rds_ib_recv_work *recv; + struct ib_recv_wr *failed_wr; + unsigned int posted = 0; + int ret = 0; + u32 pos; + + while ((prefill || rds_conn_up(conn)) + && rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) { + if (pos >= ic->i_recv_ring.w_nr) { + printk(KERN_NOTICE "Argh - ring alloc returned pos=%u\n", + pos); + ret = -EINVAL; + break; + } + + recv = &ic->i_recvs[pos]; + ret = rds_ib_recv_refill_one(conn, recv, kptr_gfp, page_gfp); + if (ret) { + ret = -1; + break; + } + + /* XXX when can this fail? 
*/ + ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr); + rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv, + recv->r_ibinc, recv->r_frag->f_page, + (long) recv->r_frag->f_mapped, ret); + if (ret) { + rds_ib_conn_error(conn, "recv post on " + "%pI4 returned %d, disconnecting and " + "reconnecting\n", &conn->c_faddr, + ret); + ret = -1; + break; + } + + posted++; + } + + /* We're doing flow control - update the window. */ + if (ic->i_flowctl && posted) + rds_ib_advertise_credits(conn, posted); + + if (ret) + rds_ib_ring_unalloc(&ic->i_recv_ring, 1); + return ret; +} + +void rds_ib_inc_purge(struct rds_incoming *inc) +{ + struct rds_ib_incoming *ibinc; + struct rds_page_frag *frag; + struct rds_page_frag *pos; + + ibinc = container_of(inc, struct rds_ib_incoming, ii_inc); + rdsdebug("purging ibinc %p inc %p\n", ibinc, inc); + + list_for_each_entry_safe(frag, pos, &ibinc->ii_frags, f_item) { + list_del_init(&frag->f_item); + rds_ib_frag_drop_page(frag); + rds_ib_frag_free(frag); + } +} + +void rds_ib_inc_free(struct rds_incoming *inc) +{ + struct rds_ib_incoming *ibinc; + + ibinc = container_of(inc, struct rds_ib_incoming, ii_inc); + + rds_ib_inc_purge(inc); + rdsdebug("freeing ibinc %p inc %p\n", ibinc, inc); + BUG_ON(!list_empty(&ibinc->ii_frags)); + kmem_cache_free(rds_ib_incoming_slab, ibinc); + atomic_dec(&rds_ib_allocation); + BUG_ON(atomic_read(&rds_ib_allocation) < 0); +} + +int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *first_iov, + size_t size) +{ + struct rds_ib_incoming *ibinc; + struct rds_page_frag *frag; + struct iovec *iov = first_iov; + unsigned long to_copy; + unsigned long frag_off = 0; + unsigned long iov_off = 0; + int copied = 0; + int ret; + u32 len; + + ibinc = container_of(inc, struct rds_ib_incoming, ii_inc); + frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item); + len = be32_to_cpu(inc->i_hdr.h_len); + + while (copied < size && copied < len) { + if (frag_off == RDS_FRAG_SIZE) { + 
frag = list_entry(frag->f_item.next, + struct rds_page_frag, f_item); + frag_off = 0; + } + while (iov_off == iov->iov_len) { + iov_off = 0; + iov++; + } + + to_copy = min(iov->iov_len - iov_off, RDS_FRAG_SIZE - frag_off); + to_copy = min_t(size_t, to_copy, size - copied); + to_copy = min_t(unsigned long, to_copy, len - copied); + + rdsdebug("%lu bytes to user [%p, %zu] + %lu from frag " + "[%p, %lu] + %lu\n", + to_copy, iov->iov_base, iov->iov_len, iov_off, + frag->f_page, frag->f_offset, frag_off); + + /* XXX needs + offset for multiple recvs per page */ + ret = rds_page_copy_to_user(frag->f_page, + frag->f_offset + frag_off, + iov->iov_base + iov_off, + to_copy); + if (ret) { + copied = ret; + break; + } + + iov_off += to_copy; + frag_off += to_copy; + copied += to_copy; + } + + return copied; +} + +/* ic starts out kzalloc()ed */ +void rds_ib_recv_init_ack(struct rds_ib_connection *ic) +{ + struct ib_send_wr *wr = &ic->i_ack_wr; + struct ib_sge *sge = &ic->i_ack_sge; + + sge->addr = ic->i_ack_dma; + sge->length = sizeof(struct rds_header); + sge->lkey = ic->i_mr->lkey; + + wr->sg_list = sge; + wr->num_sge = 1; + wr->opcode = IB_WR_SEND; + wr->wr_id = RDS_IB_ACK_WR_ID; + wr->send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; +} + +/* + * You'd think that with reliable IB connections you wouldn't need to ack + * messages that have been received. The problem is that IB hardware generates + * an ack message before it has DMAed the message into memory. This creates a + * potential message loss if the HCA is disabled for any reason between when it + * sends the ack and before the message is DMAed and processed. This is only a + * potential issue if another HCA is available for fail-over. + * + * When the remote host receives our ack they'll free the sent message from + * their send queue. To decrease the latency of this we always send an ack + * immediately after we've received messages. + * + * For simplicity, we only have one ack in flight at a time. 
This puts + * pressure on senders to have deep enough send queues to absorb the latency of + * a single ack frame being in flight. This might not be good enough. + * + * This is implemented by having a long-lived send_wr and sge which point to a + * statically allocated ack frame. This ack wr does not fall under the ring + * accounting that the tx and rx wrs do. The QP attribute specifically makes + * room for it beyond the ring size. Send completion notices its special + * wr_id and avoids working with the ring in that case. + */ +static void rds_ib_set_ack(struct rds_ib_connection *ic, u64 seq, + int ack_required) +{ + rds_ib_set_64bit(&ic->i_ack_next, seq); + if (ack_required) { + smp_mb__before_clear_bit(); + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + } +} + +static u64 rds_ib_get_ack(struct rds_ib_connection *ic) +{ + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + smp_mb__after_clear_bit(); + + return ic->i_ack_next; +} + +static void rds_ib_send_ack(struct rds_ib_connection *ic, unsigned int adv_credits) +{ + struct rds_header *hdr = ic->i_ack; + struct ib_send_wr *failed_wr; + u64 seq; + int ret; + + seq = rds_ib_get_ack(ic); + + rdsdebug("send_ack: ic %p ack %llu\n", ic, (unsigned long long) seq); + rds_message_populate_header(hdr, 0, 0, 0); + hdr->h_ack = cpu_to_be64(seq); + hdr->h_credit = adv_credits; + rds_message_make_checksum(hdr); + ic->i_ack_queued = jiffies; + + ret = ib_post_send(ic->i_cm_id->qp, &ic->i_ack_wr, &failed_wr); + if (unlikely(ret)) { + /* Failed to send. Release the WR, and + * force another ACK. + */ + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + + rds_ib_stats_inc(s_ib_ack_send_failure); + /* Need to finesse this later. */ + BUG(); + } else + rds_ib_stats_inc(s_ib_ack_sent); +} + +/* + * There are 3 ways of getting acknowledgements to the peer: + * 1. We call rds_ib_attempt_ack from the recv completion handler + * to send an ACK-only frame.
+ * However, there can be only one such frame in the send queue + * at any time, so we may have to postpone it. + * 2. When another (data) packet is transmitted while there's + * an ACK in the queue, we piggyback the ACK sequence number + * on the data packet. + * 3. If the ACK WR is done sending, we get called from the + * send queue completion handler, and check whether there's + * another ACK pending (postponed because the WR was on the + * queue). If so, we transmit it. + * + * We maintain 2 variables: + * - i_ack_flags, which keeps track of whether the ACK WR + * is currently in the send queue or not (IB_ACK_IN_FLIGHT) + * - i_ack_next, which is the last sequence number we received + * + * Potentially, send queue and receive queue handlers can run concurrently. + * + * Reconnecting complicates this picture just slightly. When we + * reconnect, we may be seeing duplicate packets. The peer + * is retransmitting them, because it hasn't seen an ACK for + * them. It is important that we ACK these. + * + * ACK mitigation adds a header flag "ACK_REQUIRED"; any packet with + * this flag set *MUST* be acknowledged immediately. + */ + +/* + * When we get here, we're called from the recv queue handler. + * Check whether we ought to transmit an ACK. + */ +void rds_ib_attempt_ack(struct rds_ib_connection *ic) +{ + unsigned int adv_credits; + + if (!test_bit(IB_ACK_REQUESTED, &ic->i_ack_flags)) + return; + + if (test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { + rds_ib_stats_inc(s_ib_ack_send_delayed); + return; + } + + /* Can we get a send credit? */ + if (!rds_ib_send_grab_credits(ic, 1, &adv_credits, 0)) { + rds_ib_stats_inc(s_ib_tx_throttle); + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + return; + } + + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + rds_ib_send_ack(ic, adv_credits); +} + +/* + * We get here from the send completion handler, when the + * adapter tells us the ACK frame was sent. 
+ */ +void rds_ib_ack_send_complete(struct rds_ib_connection *ic) +{ + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + rds_ib_attempt_ack(ic); +} + +/* + * This is called by the regular xmit code when it wants to piggyback + * an ACK on an outgoing frame. + */ +u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic) +{ + if (test_and_clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags)) + rds_ib_stats_inc(s_ib_ack_send_piggybacked); + return rds_ib_get_ack(ic); +} + +/* + * It's kind of lame that we're copying from the posted receive pages into + * long-lived bitmaps. We could have posted the bitmaps and rdma written into + * them. But receiving new congestion bitmaps should be a *rare* event, so + * hopefully we won't need to invest that complexity in making it more + * efficient. By copying we can share a simpler core with TCP which has to + * copy. + */ +static void rds_ib_cong_recv(struct rds_connection *conn, + struct rds_ib_incoming *ibinc) +{ + struct rds_cong_map *map; + unsigned int map_off; + unsigned int map_page; + struct rds_page_frag *frag; + unsigned long frag_off; + unsigned long to_copy; + unsigned long copied; + uint64_t uncongested = 0; + void *addr; + + /* catch completely corrupt packets */ + if (be32_to_cpu(ibinc->ii_inc.i_hdr.h_len) != RDS_CONG_MAP_BYTES) + return; + + map = conn->c_fcong; + map_page = 0; + map_off = 0; + + frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item); + frag_off = 0; + + copied = 0; + + while (copied < RDS_CONG_MAP_BYTES) { + uint64_t *src, *dst; + unsigned int k; + + to_copy = min(RDS_FRAG_SIZE - frag_off, PAGE_SIZE - map_off); + BUG_ON(to_copy & 7); /* Must be 64bit aligned. */ + + addr = kmap_atomic(frag->f_page, KM_SOFTIRQ0); + + src = addr + frag_off; + dst = (void *)map->m_page_addrs[map_page] + map_off; + for (k = 0; k < to_copy; k += 8) { + /* Record ports that became uncongested, ie + * bits that changed from 1 (congested) to 0.
*/ + uncongested |= ~(*src) & *dst; + *dst++ = *src++; + } + kunmap_atomic(addr, KM_SOFTIRQ0); + + copied += to_copy; + + map_off += to_copy; + if (map_off == PAGE_SIZE) { + map_off = 0; + map_page++; + } + + frag_off += to_copy; + if (frag_off == RDS_FRAG_SIZE) { + frag = list_entry(frag->f_item.next, + struct rds_page_frag, f_item); + frag_off = 0; + } + } + + /* the congestion map is in little endian order */ + uncongested = le64_to_cpu(uncongested); + + rds_cong_map_updated(map, uncongested); +} + +/* + * Rings are posted with all the allocations they'll need to queue the + * incoming message to the receiving socket so this can't fail. + * All fragments start with a header, so we can make sure we're not receiving + * garbage, and we can tell a small 8 byte fragment from an ACK frame. + */ +struct rds_ib_ack_state { + u64 ack_next; + u64 ack_recv; + unsigned int ack_required:1; + unsigned int ack_next_valid:1; + unsigned int ack_recv_valid:1; +}; + +static void rds_ib_process_recv(struct rds_connection *conn, + struct rds_ib_recv_work *recv, u32 byte_len, + struct rds_ib_ack_state *state) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct rds_ib_incoming *ibinc = ic->i_ibinc; + struct rds_header *ihdr, *hdr; + + /* XXX shut down the connection if port 0,0 are seen? */ + + rdsdebug("ic %p ibinc %p recv %p byte len %u\n", ic, ibinc, recv, + byte_len); + + if (byte_len < sizeof(struct rds_header)) { + rds_ib_conn_error(conn, "incoming message " + "from %pI4 didn't include a " + "header, disconnecting and " + "reconnecting\n", + &conn->c_faddr); + return; + } + byte_len -= sizeof(struct rds_header); + + ihdr = &ic->i_recv_hdrs[recv - ic->i_recvs]; + + /* Validate the checksum.
*/ + if (!rds_message_verify_checksum(ihdr)) { + rds_ib_conn_error(conn, "incoming message " + "from %pI4 has corrupted header - " + "forcing a reconnect\n", + &conn->c_faddr); + rds_stats_inc(s_recv_drop_bad_checksum); + return; + } + + /* Process the ACK sequence which comes with every packet */ + state->ack_recv = be64_to_cpu(ihdr->h_ack); + state->ack_recv_valid = 1; + + /* Process the credits update if there was one */ + if (ihdr->h_credit) + rds_ib_send_add_credits(conn, ihdr->h_credit); + + if (ihdr->h_sport == 0 && ihdr->h_dport == 0 && byte_len == 0) { + /* This is an ACK-only packet. It gets special + * treatment here because, historically, ACKs + * were rather special beasts. + */ + rds_ib_stats_inc(s_ib_ack_received); + + /* + * Usually the frags make their way on to incs and are then freed as + * the inc is freed. We don't go that route, so we have to drop the + * page ref ourselves. We can't just leave the page on the recv + * because that confuses the dma mapping of pages and each recv's use + * of a partial page. We can leave the frag, though, it will be + * reused. + * + * FIXME: Fold this into the code path below. + */ + rds_ib_frag_drop_page(recv->r_frag); + return; + } + + /* + * If we don't already have an inc on the connection then this + * fragment has a header and starts a message; copy its header + * into the inc and save the inc so we can hang upcoming fragments + * off its list.
+ */ + if (ibinc == NULL) { + ibinc = recv->r_ibinc; + recv->r_ibinc = NULL; + ic->i_ibinc = ibinc; + + hdr = &ibinc->ii_inc.i_hdr; + memcpy(hdr, ihdr, sizeof(*hdr)); + ic->i_recv_data_rem = be32_to_cpu(hdr->h_len); + + rdsdebug("ic %p ibinc %p rem %u flag 0x%x\n", ic, ibinc, + ic->i_recv_data_rem, hdr->h_flags); + } else { + hdr = &ibinc->ii_inc.i_hdr; + /* We can't just use memcmp here; fragments of a + * single message may carry different ACKs */ + if (hdr->h_sequence != ihdr->h_sequence + || hdr->h_len != ihdr->h_len + || hdr->h_sport != ihdr->h_sport + || hdr->h_dport != ihdr->h_dport) { + rds_ib_conn_error(conn, + "fragment header mismatch; forcing reconnect\n"); + return; + } + } + + list_add_tail(&recv->r_frag->f_item, &ibinc->ii_frags); + recv->r_frag = NULL; + + if (ic->i_recv_data_rem > RDS_FRAG_SIZE) + ic->i_recv_data_rem -= RDS_FRAG_SIZE; + else { + ic->i_recv_data_rem = 0; + ic->i_ibinc = NULL; + + if (ibinc->ii_inc.i_hdr.h_flags == RDS_FLAG_CONG_BITMAP) + rds_ib_cong_recv(conn, ibinc); + else { + rds_recv_incoming(conn, conn->c_faddr, conn->c_laddr, + &ibinc->ii_inc, GFP_ATOMIC, + KM_SOFTIRQ0); + state->ack_next = be64_to_cpu(hdr->h_sequence); + state->ack_next_valid = 1; + } + + /* Evaluate the ACK_REQUIRED flag *after* we received + * the complete frame, and after bumping the next_rx + * sequence. */ + if (hdr->h_flags & RDS_FLAG_ACK_REQUIRED) { + rds_stats_inc(s_recv_ack_required); + state->ack_required = 1; + } + + rds_inc_put(&ibinc->ii_inc); + } +} + +/* + * Plucking the oldest entry from the ring can be done concurrently with + * the thread refilling the ring. Each ring operation is protected by + * spinlocks and the transient state of refilling doesn't change the + * recording of which entry is oldest. + * + * This relies on IB only calling one cq comp_handler for each cq so that + * there will only be one caller of rds_recv_incoming() per RDS connection. 
+ */ +void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context) +{ + struct rds_connection *conn = context; + struct rds_ib_connection *ic = conn->c_transport_data; + struct ib_wc wc; + struct rds_ib_ack_state state = { 0, }; + struct rds_ib_recv_work *recv; + + rdsdebug("conn %p cq %p\n", conn, cq); + + rds_ib_stats_inc(s_ib_rx_cq_call); + + ib_req_notify_cq(cq, IB_CQ_SOLICITED); + + while (ib_poll_cq(cq, 1, &wc) > 0) { + rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", + (unsigned long long)wc.wr_id, wc.status, wc.byte_len, + be32_to_cpu(wc.ex.imm_data)); + rds_ib_stats_inc(s_ib_rx_cq_event); + + recv = &ic->i_recvs[rds_ib_ring_oldest(&ic->i_recv_ring)]; + + rds_ib_recv_unmap_page(ic, recv); + + /* + * Also process recvs in connecting state because it is possible + * to get a recv completion _before_ the rdmacm ESTABLISHED + * event is processed. + */ + if (rds_conn_up(conn) || rds_conn_connecting(conn)) { + /* We expect errors as the qp is drained during shutdown */ + if (wc.status == IB_WC_SUCCESS) { + rds_ib_process_recv(conn, recv, wc.byte_len, &state); + } else { + rds_ib_conn_error(conn, "recv completion on " + "%pI4 had status %u, disconnecting and " + "reconnecting\n", &conn->c_faddr, + wc.status); + } + } + + rds_ib_ring_free(&ic->i_recv_ring, 1); + } + + if (state.ack_next_valid) + rds_ib_set_ack(ic, state.ack_next, state.ack_required); + if (state.ack_recv_valid && state.ack_recv > ic->i_ack_recv) { + rds_send_drop_acked(conn, state.ack_recv, NULL); + ic->i_ack_recv = state.ack_recv; + } + if (rds_conn_up(conn)) + rds_ib_attempt_ack(ic); + + /* If we ever end up with a really empty receive ring, we're + * in deep trouble, as the sender will definitely see RNR + * timeouts. */ + if (rds_ib_ring_empty(&ic->i_recv_ring)) + rds_ib_stats_inc(s_ib_rx_ring_empty); + + /* + * If the ring is running low, then schedule the thread to refill. 
*/ + if (rds_ib_ring_low(&ic->i_recv_ring)) + queue_delayed_work(rds_wq, &conn->c_recv_w, 0); +} + +int rds_ib_recv(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + int ret = 0; + + rdsdebug("conn %p\n", conn); + + /* + * If we get a temporary posting failure in this context then + * we're really low and we want the caller to back off for a bit. + */ + mutex_lock(&ic->i_recv_mutex); + if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 0)) + ret = -ENOMEM; + else + rds_ib_stats_inc(s_ib_rx_refill_from_thread); + mutex_unlock(&ic->i_recv_mutex); + + if (rds_conn_up(conn)) + rds_ib_attempt_ack(ic); + + return ret; +} + +int __init rds_ib_recv_init(void) +{ + struct sysinfo si; + int ret = -ENOMEM; + + /* Default to one third of all available RAM for recv memory */ + si_meminfo(&si); + rds_ib_sysctl_max_recv_allocation = si.totalram / 3 * PAGE_SIZE / RDS_FRAG_SIZE; + + rds_ib_incoming_slab = kmem_cache_create("rds_ib_incoming", + sizeof(struct rds_ib_incoming), + 0, 0, NULL); + if (rds_ib_incoming_slab == NULL) + goto out; + + rds_ib_frag_slab = kmem_cache_create("rds_ib_frag", + sizeof(struct rds_page_frag), + 0, 0, NULL); + if (rds_ib_frag_slab == NULL) + kmem_cache_destroy(rds_ib_incoming_slab); + else + ret = 0; +out: + return ret; +} + +void rds_ib_recv_exit(void) +{ + kmem_cache_destroy(rds_ib_incoming_slab); + kmem_cache_destroy(rds_ib_frag_slab); +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:37 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:37 -0800 Subject: [ofa-general] [PATCH 20/26] RDS: Common RDMA transport code In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-21-git-send-email-andy.grover@oracle.com> Although most of IB and iWARP are separated from each other, there is some common code required to handle their shared CM listen port.
This code listens for CM events and then dispatches the event to the appropriate transport, either IB or iWARP. Signed-off-by: Andy Grover --- net/rds/rdma_transport.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++ net/rds/rdma_transport.h | 28 ++++++ 2 files changed, 242 insertions(+), 0 deletions(-) create mode 100644 net/rds/rdma_transport.c create mode 100644 net/rds/rdma_transport.h diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c new file mode 100644 index 0000000..7b19024 --- /dev/null +++ b/net/rds/rdma_transport.c @@ -0,0 +1,214 @@ +/* + * Copyright (c) 2009 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include + +#include "rdma_transport.h" + +static struct rdma_cm_id *rds_iw_listen_id; + +int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event) +{ + /* this can be null in the listening path */ + struct rds_connection *conn = cm_id->context; + struct rds_transport *trans; + int ret = 0; + + rdsdebug("conn %p id %p handling event %u\n", conn, cm_id, + event->event); + + if (cm_id->device->node_type == RDMA_NODE_RNIC) + trans = &rds_iw_transport; + else + trans = &rds_ib_transport; + + /* Prevent shutdown from tearing down the connection + * while we're executing. */ + if (conn) { + mutex_lock(&conn->c_cm_lock); + + /* If the connection is being shut down, bail out + * right away. We return 0 so cm_id doesn't get + * destroyed prematurely */ + if (rds_conn_state(conn) == RDS_CONN_DISCONNECTING) { + /* Reject incoming connections while we're tearing + * down an existing one. */ + if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) + ret = 1; + goto out; + } + } + + switch (event->event) { + case RDMA_CM_EVENT_CONNECT_REQUEST: + ret = trans->cm_handle_connect(cm_id, event); + break; + + case RDMA_CM_EVENT_ADDR_RESOLVED: + /* XXX do we need to clean up if this fails? 
*/ + ret = rdma_resolve_route(cm_id, + RDS_RDMA_RESOLVE_TIMEOUT_MS); + break; + + case RDMA_CM_EVENT_ROUTE_RESOLVED: + /* XXX worry about racing with listen acceptance */ + ret = trans->cm_initiate_connect(cm_id); + break; + + case RDMA_CM_EVENT_ESTABLISHED: + trans->cm_connect_complete(conn, event); + break; + + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + case RDMA_CM_EVENT_DEVICE_REMOVAL: + case RDMA_CM_EVENT_ADDR_CHANGE: + if (conn) + rds_conn_drop(conn); + break; + + case RDMA_CM_EVENT_DISCONNECTED: + printk(KERN_WARNING "RDS/IW: DISCONNECT event - dropping connection " + "%pI4->%pI4\n", &conn->c_laddr, + &conn->c_faddr); + rds_conn_drop(conn); + break; + + default: + /* things like device disconnect? */ + printk(KERN_ERR "unknown event %u\n", event->event); + BUG(); + break; + } + +out: + if (conn) + mutex_unlock(&conn->c_cm_lock); + + rdsdebug("id %p event %u handling ret %d\n", cm_id, event->event, ret); + + return ret; +} + +static int __init rds_rdma_listen_init(void) +{ + struct sockaddr_in sin; + struct rdma_cm_id *cm_id; + int ret; + + cm_id = rdma_create_id(rds_rdma_cm_event_handler, NULL, RDMA_PS_TCP); + if (IS_ERR(cm_id)) { + ret = PTR_ERR(cm_id); + printk(KERN_ERR "RDS/IW: failed to setup listener, " + "rdma_create_id() returned %d\n", ret); + goto out; + } + + sin.sin_family = AF_INET; + sin.sin_addr.s_addr = (__force u32)htonl(INADDR_ANY); + sin.sin_port = (__force u16)htons(RDS_PORT); + + /* + * XXX I bet this binds the cm_id to a device. If we want to support + * fail-over we'll have to take this into consideration.
+ */ + ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin); + if (ret) { + printk(KERN_ERR "RDS/IW: failed to setup listener, " + "rdma_bind_addr() returned %d\n", ret); + goto out; + } + + ret = rdma_listen(cm_id, 128); + if (ret) { + printk(KERN_ERR "RDS/IW: failed to setup listener, " + "rdma_listen() returned %d\n", ret); + goto out; + } + + rdsdebug("cm %p listening on port %u\n", cm_id, RDS_PORT); + + rds_iw_listen_id = cm_id; + cm_id = NULL; +out: + if (cm_id) + rdma_destroy_id(cm_id); + return ret; +} + +static void rds_rdma_listen_stop(void) +{ + if (rds_iw_listen_id) { + rdsdebug("cm %p\n", rds_iw_listen_id); + rdma_destroy_id(rds_iw_listen_id); + rds_iw_listen_id = NULL; + } +} + +int __init rds_rdma_init(void) +{ + int ret; + + ret = rds_rdma_listen_init(); + if (ret) + goto out; + + ret = rds_iw_init(); + if (ret) + goto err_iw_init; + + ret = rds_ib_init(); + if (ret) + goto err_ib_init; + + goto out; + +err_ib_init: + rds_iw_exit(); +err_iw_init: + rds_rdma_listen_stop(); +out: + return ret; +} + +void rds_rdma_exit(void) +{ + /* stop listening first to ensure no new connections are attempted */ + rds_rdma_listen_stop(); + rds_ib_exit(); + rds_iw_exit(); +} + diff --git a/net/rds/rdma_transport.h b/net/rds/rdma_transport.h new file mode 100644 index 0000000..2f2c7d9 --- /dev/null +++ b/net/rds/rdma_transport.h @@ -0,0 +1,28 @@ +#ifndef _RDMA_TRANSPORT_H +#define _RDMA_TRANSPORT_H + +#include +#include +#include "rds.h" + +#define RDS_RDMA_RESOLVE_TIMEOUT_MS 5000 + +int rds_rdma_conn_connect(struct rds_connection *conn); +int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event); + +/* from rdma_transport.c */ +int rds_rdma_init(void); +void rds_rdma_exit(void); + +/* from ib.c */ +extern struct rds_transport rds_ib_transport; +int rds_ib_init(void); +void rds_ib_exit(void); + +/* from iw.c */ +extern struct rds_transport rds_iw_transport; +int rds_iw_init(void); +void rds_iw_exit(void); + +#endif -- 1.5.6.3 From 
andy.grover at oracle.com Tue Feb 24 17:30:35 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:35 -0800 Subject: [ofa-general] [PATCH 18/26] RDS/IB: Stats and sysctls In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-19-git-send-email-andy.grover@oracle.com> IB-specific stats and sysctls. Signed-off-by: Andy Grover --- net/rds/ib_stats.c | 95 +++++++++++++++++++++++++++++++++++ net/rds/ib_sysctl.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 232 insertions(+), 0 deletions(-) create mode 100644 net/rds/ib_stats.c create mode 100644 net/rds/ib_sysctl.c diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c new file mode 100644 index 0000000..02e3e3d --- /dev/null +++ b/net/rds/ib_stats.c @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include + +#include "rds.h" +#include "ib.h" + +DEFINE_PER_CPU(struct rds_ib_statistics, rds_ib_stats) ____cacheline_aligned; + +static char *rds_ib_stat_names[] = { + "ib_connect_raced", + "ib_listen_closed_stale", + "ib_tx_cq_call", + "ib_tx_cq_event", + "ib_tx_ring_full", + "ib_tx_throttle", + "ib_tx_sg_mapping_failure", + "ib_tx_stalled", + "ib_tx_credit_updates", + "ib_rx_cq_call", + "ib_rx_cq_event", + "ib_rx_ring_empty", + "ib_rx_refill_from_cq", + "ib_rx_refill_from_thread", + "ib_rx_alloc_limit", + "ib_rx_credit_updates", + "ib_ack_sent", + "ib_ack_send_failure", + "ib_ack_send_delayed", + "ib_ack_send_piggybacked", + "ib_ack_received", + "ib_rdma_mr_alloc", + "ib_rdma_mr_free", + "ib_rdma_mr_used", + "ib_rdma_mr_pool_flush", + "ib_rdma_mr_pool_wait", + "ib_rdma_mr_pool_depleted", +}; + +unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter, + unsigned int avail) +{ + struct rds_ib_statistics stats = {0, }; + uint64_t *src; + uint64_t *sum; + size_t i; + int cpu; + + if (avail < ARRAY_SIZE(rds_ib_stat_names)) + goto out; + + for_each_online_cpu(cpu) { + src = (uint64_t *)&(per_cpu(rds_ib_stats, cpu)); + sum = (uint64_t *)&stats; + for (i = 0; i < sizeof(stats) / sizeof(uint64_t); i++) + *(sum++) += *(src++); + } + + rds_stats_info_copy(iter, (uint64_t *)&stats, rds_ib_stat_names, + ARRAY_SIZE(rds_ib_stat_names)); +out: + return ARRAY_SIZE(rds_ib_stat_names); +} diff --git a/net/rds/ib_sysctl.c b/net/rds/ib_sysctl.c new file 
mode 100644 index 0000000..d87830d --- /dev/null +++ b/net/rds/ib_sysctl.c @@ -0,0 +1,137 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "ib.h" + +static struct ctl_table_header *rds_ib_sysctl_hdr; + +unsigned long rds_ib_sysctl_max_send_wr = RDS_IB_DEFAULT_SEND_WR; +unsigned long rds_ib_sysctl_max_recv_wr = RDS_IB_DEFAULT_RECV_WR; +unsigned long rds_ib_sysctl_max_recv_allocation = (128 * 1024 * 1024) / RDS_FRAG_SIZE; +static unsigned long rds_ib_sysctl_max_wr_min = 1; +/* hardware will fail CQ creation long before this */ +static unsigned long rds_ib_sysctl_max_wr_max = (u32)~0; + +unsigned long rds_ib_sysctl_max_unsig_wrs = 16; +static unsigned long rds_ib_sysctl_max_unsig_wr_min = 1; +static unsigned long rds_ib_sysctl_max_unsig_wr_max = 64; + +unsigned long rds_ib_sysctl_max_unsig_bytes = (16 << 20); +static unsigned long rds_ib_sysctl_max_unsig_bytes_min = 1; +static unsigned long rds_ib_sysctl_max_unsig_bytes_max = ~0UL; + +unsigned int rds_ib_sysctl_flow_control = 1; + +ctl_table rds_ib_sysctl_table[] = { + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_send_wr", + .data = &rds_ib_sysctl_max_send_wr, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + .extra1 = &rds_ib_sysctl_max_wr_min, + .extra2 = &rds_ib_sysctl_max_wr_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_recv_wr", + .data = &rds_ib_sysctl_max_recv_wr, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + .extra1 = &rds_ib_sysctl_max_wr_min, + .extra2 = &rds_ib_sysctl_max_wr_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_unsignaled_wr", + .data = &rds_ib_sysctl_max_unsig_wrs, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + .extra1 = &rds_ib_sysctl_max_unsig_wr_min, + .extra2 = &rds_ib_sysctl_max_unsig_wr_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_unsignaled_bytes", + .data = &rds_ib_sysctl_max_unsig_bytes, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = 
&proc_doulongvec_minmax, + .extra1 = &rds_ib_sysctl_max_unsig_bytes_min, + .extra2 = &rds_ib_sysctl_max_unsig_bytes_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_recv_allocation", + .data = &rds_ib_sysctl_max_recv_allocation, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "flow_control", + .data = &rds_ib_sysctl_flow_control, + .maxlen = sizeof(rds_ib_sysctl_flow_control), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { .ctl_name = 0} +}; + +static struct ctl_path rds_ib_sysctl_path[] = { + { .procname = "net", .ctl_name = CTL_NET, }, + { .procname = "rds", .ctl_name = CTL_UNNUMBERED, }, + { .procname = "ib", .ctl_name = CTL_UNNUMBERED, }, + { } +}; + +void rds_ib_sysctl_exit(void) +{ + if (rds_ib_sysctl_hdr) + unregister_sysctl_table(rds_ib_sysctl_hdr); +} + +int __init rds_ib_sysctl_init(void) +{ + rds_ib_sysctl_hdr = register_sysctl_paths(rds_ib_sysctl_path, rds_ib_sysctl_table); + if (rds_ib_sysctl_hdr == NULL) + return -ENOMEM; + return 0; +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:38 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:38 -0800 Subject: [ofa-general] [PATCH 21/26] RDS: Documentation In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-22-git-send-email-andy.grover@oracle.com> This file documents the specifics of the RDS sockets API, as well as covering some of the details of its internal implementation. 
Signed-off-by: Andy Grover --- Documentation/networking/rds.txt | 356 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 356 insertions(+), 0 deletions(-) create mode 100644 Documentation/networking/rds.txt diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt new file mode 100644 index 0000000..c67077c --- /dev/null +++ b/Documentation/networking/rds.txt @@ -0,0 +1,356 @@ + +Overview +======== + +This readme tries to provide some background on the hows and whys of RDS, +and will hopefully help you find your way around the code. + +In addition, please see this email about RDS origins: +http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html + +RDS Architecture +================ + +RDS provides reliable, ordered datagram delivery by using a single +reliable connection between any two nodes in the cluster. This allows +applications to use a single socket to talk to any other process in the +cluster - so in a cluster with N processes you need N sockets, in contrast +to N*N if you use a connection-oriented socket transport like TCP. + +RDS is not InfiniBand-specific; it was designed to support different +transports. The implementation has previously supported RDS over TCP as +well as IB. Work is in progress to support RDS over iWARP, and it may be +possible to use RDS over UDP in the future, using DCE to guarantee no +dropped packets on Ethernet. + +The high-level semantics of RDS from the application's point of view are: + + * Addressing + RDS uses IPv4 addresses and 16-bit port numbers to identify + the end point of a connection. All socket operations that involve + passing addresses between kernel and user space generally + use a struct sockaddr_in. + + The fact that IPv4 addresses are used does not mean the underlying + transport has to be IP-based. In fact, RDS over IB uses a + reliable IB connection; the IP address is used exclusively to + locate the remote node's GID (by ARPing for the given IP).
+ + The port space is entirely independent of UDP, TCP or any other + protocol. + + * Socket interface + RDS sockets work *mostly* as you would expect from a BSD + socket. The next section will cover the details. At any rate, + all I/O is performed through the standard BSD socket API. + Some additions like zerocopy support are implemented through + control messages, while other extensions use the getsockopt/ + setsockopt calls. + + Sockets must be bound before you can send or receive data. + This is needed because binding also selects a transport and + attaches it to the socket. Once bound, the transport assignment + does not change. RDS will tolerate IPs moving around (e.g. in + an active-active HA scenario), but only as long as the address + doesn't move to a different transport. + + * sysctls + RDS supports a number of sysctls in /proc/sys/net/rds. + + +Socket Interface +================ + + AF_RDS, PF_RDS, SOL_RDS + These constants haven't been assigned yet, because RDS isn't in + mainline yet. Currently, the kernel module assigns values to them + and publishes these to user space through two sysctl files: + /proc/sys/net/rds/pf_rds + /proc/sys/net/rds/sol_rds + + fd = socket(PF_RDS, SOCK_SEQPACKET, 0); + This creates a new, unbound RDS socket. + + setsockopt(SOL_SOCKET): send and receive buffer size + RDS honors the send and receive buffer size socket options. + You are not allowed to queue more than SO_SNDBUF bytes to + a socket. A message is queued when sendmsg is called, and + it leaves the queue when the remote system acknowledges + its arrival. + + The SO_RCVBUF option controls the maximum receive queue length. + This is a soft limit rather than a hard limit - RDS will + continue to accept and queue incoming messages, even if that + takes the queue length over the limit. However, it will also + mark the port as "congested" and send a congestion update to + the source node. The source node is supposed to throttle any + processes sending to this congested port.
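The soft receive limit and congestion updates described above can be modeled in a few lines. This is a minimal user-space sketch, not RDS code: the `port_rx` structure and the `port_rx_enqueue`, `port_rx_dequeue` and `newly_uncongested` helpers are hypothetical names. The map arithmetic assumes, as in the kernel's congestion map, that a set bit means "congested", and it skips the little-endian conversion the kernel applies to the received map.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-port receive state; these names are illustrative
 * and do not come from the RDS sources. */
struct port_rx {
	size_t rcvbuf;   /* soft limit: the socket's receive buffer size */
	size_t queued;   /* bytes waiting for recvmsg() */
	bool congested;  /* this port's bit in the local congestion map */
};

/* A message arrives: it is always queued, even past the limit.
 * Returns true when the port has just become congested, i.e. when a
 * congestion update should be sent to the source node. */
static bool port_rx_enqueue(struct port_rx *p, size_t len)
{
	p->queued += len;
	if (!p->congested && p->queued > p->rcvbuf) {
		p->congested = true;
		return true;
	}
	return false;
}

/* recvmsg() consumed a message (len must not exceed p->queued).
 * Returns true when the queue has just dropped below the limit, i.e.
 * when an "uncongested" update should go to all peers. */
static bool port_rx_dequeue(struct port_rx *p, size_t len)
{
	p->queued -= len;
	if (p->congested && p->queued < p->rcvbuf) {
		p->congested = false;
		return true;
	}
	return false;
}

/* Sender side: copy a received congestion map over the old one and
 * return a summary of bits that went from 1 (congested) to 0, using
 * the same expression as rds_ib_cong_recv(): ~new & old. Like the
 * kernel code, all words are ORed together, so the result only hints
 * that some port became uncongested; it is not a per-port list. */
static uint64_t newly_uncongested(uint64_t *old_map, const uint64_t *new_map,
				  size_t words)
{
	uint64_t uncongested = 0;

	for (size_t i = 0; i < words; i++) {
		uncongested |= ~new_map[i] & old_map[i];
		old_map[i] = new_map[i];
	}
	return uncongested;
}
```

With a 100-byte limit, queuing 60 bytes and then 60 more crosses the limit and triggers exactly one congestion update; a further enqueue does not repeat it, and draining back below 100 bytes triggers one uncongested update.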
+ + bind(fd, &sockaddr_in, ...) + This binds the socket to a local IP address and port, and a + transport. + + sendmsg(fd, ...) + Sends a message to the indicated recipient. The kernel will + transparently establish the underlying reliable connection + if it isn't up yet. + + An attempt to send a message that exceeds SO_SNDBUF will + return EMSGSIZE. + + An attempt to send a message that would take the total number + of queued bytes over the SO_SNDBUF threshold will return + EAGAIN. + + An attempt to send a message to a destination that is marked + as "congested" will return ENOBUFS. + + recvmsg(fd, ...) + Receives a message that was queued to this socket. The socket's + receive queue accounting is adjusted, and if the queue length + drops below SO_RCVBUF, the port is marked uncongested, and + a congestion update is sent to all peers. + + Applications can ask the RDS kernel module to receive + notifications via control messages (for instance, there is a + notification when a congestion update arrives, or when an RDMA + operation completes). These notifications are received through + the msg.msg_control buffer of struct msghdr. The format of the + messages is described in the manpages. + + poll(fd) + RDS supports the poll interface to allow the application + to implement async I/O. + + POLLIN handling is pretty straightforward. When there's an + incoming message queued to the socket, or a pending notification, + we signal POLLIN. + + POLLOUT is a little harder. Since you can essentially send + to any destination, RDS will always signal POLLOUT as long as + there's room on the send queue (i.e. the number of bytes queued + is less than the sendbuf size). + + However, the kernel will refuse to accept messages to + a destination marked congested - in this case you will loop + forever if you rely on poll to tell you what to do.
+ This isn't a trivial problem, but applications can deal with + this by using congestion notifications, and by checking for + ENOBUFS errors returned by sendmsg. + + setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) + This allows the application to discard all messages queued to a + specific destination on this particular socket. + + An application can use this to cancel outstanding messages if + it detects a timeout. For instance, if it tried to send a message, + and the remote host is unreachable, RDS will keep trying forever. + The application may decide it's not worth it, and cancel the + operation. In this case, it would use RDS_CANCEL_SENT_TO to + nuke any pending messages. + + +RDMA for RDS +============ + + see rds-rdma(7) manpage (available in rds-tools) + + +Congestion Notifications +======================== + + see rds(7) manpage + + +RDS Protocol +============ + + Message header + + The message header is a 'struct rds_header' (see rds.h): + Fields: + h_sequence: + per-packet sequence number + h_ack: + piggybacked acknowledgment of last packet received + h_len: + length of data, not including header + h_sport: + source port + h_dport: + destination port + h_flags: + CONG_BITMAP - this is a congestion update bitmap + ACK_REQUIRED - receiver must ack this packet + RETRANSMITTED - packet has previously been sent + h_credit: + indicates to the other end of the connection that + it has more credits available (i.e. there is + more send room) + h_padding[4]: + unused, for future use + h_csum: + header checksum + h_exthdr: + optional data can be passed here. This is currently used for + passing RDMA-related information. + + ACK and retransmit handling + + One might think that with reliable IB connections you wouldn't need + to ack messages that have been received. The problem is that IB + hardware generates an ack message before it has DMAed the message + into memory.
This creates a potential message loss if the HCA is + disabled for any reason between when it sends the ack and before + the message is DMAed and processed. This is only a potential issue + if another HCA is available for fail-over. + + Sending an ack immediately would allow the sender to free the sent + message from its send queue quickly, but could generate excessive + ack traffic. RDS piggybacks acks on sent data + packets. Ack-only packets are reduced by only allowing one to be + in flight at a time, and by the sender only asking for acks when + its send buffers start to fill up. All retransmissions are also + acked. + + Flow Control + + RDS's IB transport uses a credit-based mechanism to verify that + there is space in the peer's receive buffers for more data. This + eliminates the need for hardware retries on the connection. + + Congestion + + Messages waiting in the receive queue on the receiving socket + are accounted against the socket's SO_RCVBUF option value. Only + the payload bytes in the message are accounted for. If the + number of bytes queued equals or exceeds rcvbuf then the socket + is congested. All sends attempted to this socket's address + should block or return -EWOULDBLOCK. + + Applications are expected to be reasonably tuned such that this + situation very rarely occurs. An application that encounters this + "back-pressure" is considered buggy. + + This is implemented by having each node maintain bitmaps which + indicate which ports on bound addresses are congested. As the + bitmap changes, it is sent through all the connections which + terminate in the local address of the bitmap which changed. + + The bitmaps are allocated as connections are brought up. This + avoids allocation in the interrupt handling path which queues + messages on sockets. The dense bitmaps let transports send the + entire bitmap on any bitmap change reasonably efficiently.
This + is much easier to implement than some finer-grained + communication of per-port congestion. The sender does a very + inexpensive bit test to check whether the port it's about to send to + is congested or not. + + +RDS Transport Layer +=================== + + As mentioned above, RDS is not IB-specific. Its code is divided + into a general RDS layer and a transport layer. + + The general layer handles the socket API, congestion handling, + loopback, stats, usermem pinning, and the connection state machine. + + The transport layer handles the details of the transport. The IB + transport, for example, handles all the queue pairs, work requests, + CM event handlers, and other Infiniband details. + + +RDS Kernel Structures +===================== + + struct rds_message + aka possibly "rds_outgoing", the generic RDS layer copies data to + be sent and sets header fields as needed, based on the socket API. + This is then queued for the individual connection and sent by the + connection's transport. + struct rds_incoming + a generic struct referring to incoming data that can be handed from + the transport to the general code and queued by the general code + while the socket is awoken. It is then passed back to the transport + code to handle the actual copy-to-user. + struct rds_socket + per-socket information + struct rds_connection + per-connection information + struct rds_transport + pointers to transport-specific functions + struct rds_statistics + non-transport-specific statistics + struct rds_cong_map + wraps the raw congestion bitmap, contains rbnode, waitq, etc. + +Connection management +===================== + + Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and + ERROR states. + + The first time an attempt is made by an RDS socket to send data to + a node, a connection is allocated and connected. That connection is + then maintained forever -- if there are transport errors, the + connection will be dropped and re-established.
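The connection lifecycle described above can be pictured as a small state machine. What follows is a minimal, hypothetical sketch in C, not the RDS kernel code: the enum names, the events and the helpers conn_next() and conn_state_name() are all invented for illustration. The exact transitions, in particular routing both error recovery and orderly shutdown through DISCONNECTING, are assumptions layered on the five states the text lists.

```c
#include <assert.h>

/*
 * Hypothetical, simplified model of the connection states listed above.
 * These names are illustrative; they are not the kernel's identifiers.
 */
enum conn_state {
	CONN_DOWN,
	CONN_CONNECTING,
	CONN_UP,
	CONN_DISCONNECTING,
	CONN_ERROR,
};

/* Hypothetical events that drive the model. */
enum conn_event {
	EV_SEND,           /* a socket queues a message for this peer */
	EV_CONNECTED,      /* the transport finished connecting */
	EV_TRANSPORT_ERR,  /* the transport reported an error */
	EV_SHUTDOWN_START, /* teardown of the connection begins */
	EV_SHUTDOWN_DONE,  /* teardown finished */
};

/*
 * Return the state that follows `s` when `ev` occurs. Events that do not
 * apply to the current state leave it unchanged. There is no terminal
 * state: in this model an error leads back to DOWN, and the next send
 * (EV_SEND) starts a fresh connect.
 */
static enum conn_state conn_next(enum conn_state s, enum conn_event ev)
{
	switch (ev) {
	case EV_SEND:
		/* The first send to a peer allocates and connects. */
		return s == CONN_DOWN ? CONN_CONNECTING : s;
	case EV_CONNECTED:
		return s == CONN_CONNECTING ? CONN_UP : s;
	case EV_TRANSPORT_ERR:
		return (s == CONN_UP || s == CONN_CONNECTING) ? CONN_ERROR : s;
	case EV_SHUTDOWN_START:
		/* Covers both error recovery and an orderly shutdown. */
		return (s == CONN_ERROR || s == CONN_UP) ? CONN_DISCONNECTING : s;
	case EV_SHUTDOWN_DONE:
		return s == CONN_DISCONNECTING ? CONN_DOWN : s;
	}
	return s;
}

/* Map a state to the name used in the text, e.g. CONN_UP -> "UP". */
static const char *conn_state_name(enum conn_state s)
{
	static const char *const names[] = {
		"DOWN", "CONNECTING", "UP", "DISCONNECTING", "ERROR",
	};

	assert((unsigned int)s < sizeof(names) / sizeof(names[0]));
	return names[s];
}
```

Feeding a send, a connect, a transport error, a shutdown and another send through conn_next() visits DOWN, CONNECTING, UP, ERROR, DISCONNECTING, DOWN and then CONNECTING again; this loop back to DOWN is how the model expresses "maintained forever". A real implementation would also have to make each transition atomic, since the send path, the receive path and the shutdown path can race on the same connection.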
+ + Dropping a connection while packets are queued will cause queued or + partially-sent datagrams to be retransmitted when the connection is + re-established. + + +The send path +============= + + rds_sendmsg() + struct rds_message built from incoming data + CMSGs parsed (e.g. RDMA ops) + transport connection allocated and connected if not already + rds_message placed on send queue + send worker awoken + rds_send_worker() + calls rds_send_xmit() until queue is empty + rds_send_xmit() + transmits congestion map if one is pending + may set ACK_REQUIRED + calls transport to send either non-RDMA or RDMA message + (RDMA ops never retransmitted) + rds_ib_xmit() + allocs work requests from send ring + adds any new send credits available to peer (h_credit) + maps the rds_message's sg list + piggybacks ack + populates work requests + post send to connection's queue pair + +The recv path +============= + + rds_ib_recv_cq_comp_handler() + looks at receive completions + unmaps recv buffer from device + no errors, call rds_ib_process_recv() + refill recv ring + rds_ib_process_recv() + validate header checksum + copy header to rds_ib_incoming struct if start of a new datagram + add to ibinc's fraglist + if completed datagram: + update cong map if datagram was cong update + call rds_recv_incoming() otherwise + note if ack is required + rds_recv_incoming() + drop duplicate packets + respond to pings + find the sock associated with this datagram + add to sock queue + wake up sock + do some congestion calculations + rds_recvmsg() + copy data into user iovec + handle CMSGs + return to application + + -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:39 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:39 -0800 Subject: [ofa-general] [PATCH 22/26] RDS: Kconfig and Makefile In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: 
<1235525443-9007-23-git-send-email-andy.grover@oracle.com> Add RDS Kconfig and Makefile, and modify net/'s to add us to the build. Signed-off-by: Andy Grover --- net/Kconfig | 1 + net/Makefile | 1 + net/rds/Kconfig | 13 +++++++++++++ net/rds/Makefile | 14 ++++++++++++++ 4 files changed, 29 insertions(+), 0 deletions(-) create mode 100644 net/rds/Kconfig create mode 100644 net/rds/Makefile diff --git a/net/Kconfig b/net/Kconfig index a12bae0..6b39ede 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -171,6 +171,7 @@ endif source "net/dccp/Kconfig" source "net/sctp/Kconfig" +source "net/rds/Kconfig" source "net/tipc/Kconfig" source "net/atm/Kconfig" source "net/802/Kconfig" diff --git a/net/Makefile b/net/Makefile index 0fcce89..9e00a55 100644 --- a/net/Makefile +++ b/net/Makefile @@ -49,6 +49,7 @@ obj-y += 8021q/ endif obj-$(CONFIG_IP_DCCP) += dccp/ obj-$(CONFIG_IP_SCTP) += sctp/ +obj-$(CONFIG_RDS) += rds/ obj-y += wireless/ obj-$(CONFIG_MAC80211) += mac80211/ obj-$(CONFIG_TIPC) += tipc/ diff --git a/net/rds/Kconfig b/net/rds/Kconfig new file mode 100644 index 0000000..63bd370 --- /dev/null +++ b/net/rds/Kconfig @@ -0,0 +1,13 @@ + +config RDS + tristate "Reliable Datagram Sockets (RDS) (EXPERIMENTAL)" + depends on INET && INFINIBAND_IPOIB && EXPERIMENTAL + ---help--- + RDS provides reliable, sequenced delivery of datagrams + over Infiniband. 
+ +config RDS_DEBUG + bool "Debugging messages" + depends on RDS + default n + diff --git a/net/rds/Makefile b/net/rds/Makefile new file mode 100644 index 0000000..51f2758 --- /dev/null +++ b/net/rds/Makefile @@ -0,0 +1,14 @@ +obj-$(CONFIG_RDS) += rds.o +rds-y := af_rds.o bind.o cong.o connection.o info.o message.o \ + recv.o send.o stats.o sysctl.o threads.o transport.o \ + loop.o page.o rdma.o \ + rdma_transport.o \ + ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ + ib_sysctl.o ib_rdma.o \ + iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \ + iw_sysctl.o iw_rdma.o + +ifeq ($(CONFIG_RDS_DEBUG), y) +EXTRA_CFLAGS += -DDEBUG +endif + -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:40 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:40 -0800 Subject: [ofa-general] [PATCH 23/26] RDS: Add AF and PF #defines for RDS sockets In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-24-git-send-email-andy.grover@oracle.com> RDS is a reliable datagram protocol used for IPC on Oracle database clusters. This adds address and protocol family numbers for it. Signed-off-by: Andy Grover --- include/linux/socket.h | 5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/include/linux/socket.h b/include/linux/socket.h index 20fc4bb..3cdc041 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -191,7 +191,8 @@ struct ucred { #define AF_RXRPC 33 /* RxRPC sockets */ #define AF_ISDN 34 /* mISDN sockets */ #define AF_PHONET 35 /* Phonet sockets */ -#define AF_MAX 36 /* For now.. */ +#define AF_RDS 36 /* RDS sockets */ +#define AF_MAX 37 /* For now.. */ /* Protocol families, same as address families. 
*/ #define PF_UNSPEC AF_UNSPEC @@ -229,6 +230,7 @@ struct ucred { #define PF_RXRPC AF_RXRPC #define PF_ISDN AF_ISDN #define PF_PHONET AF_PHONET +#define PF_RDS AF_RDS #define PF_MAX AF_MAX /* Maximum queue length specifiable by listen. */ @@ -298,6 +300,7 @@ struct ucred { #define SOL_PPPOL2TP 273 #define SOL_BLUETOOTH 274 #define SOL_PNPIPE 275 +#define SOL_RDS 276 /* IPX options */ #define IPX_TYPE 1 -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:42 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:42 -0800 Subject: [ofa-general] [PATCH 25/26] RDS: Add userspace header In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-26-git-send-email-andy.grover@oracle.com> Applications include this header in order to use RDS sockets. Signed-off-by: Andy Grover --- include/linux/rds.h | 250 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 250 insertions(+), 0 deletions(-) create mode 100644 include/linux/rds.h diff --git a/include/linux/rds.h b/include/linux/rds.h new file mode 100644 index 0000000..d91dc91 --- /dev/null +++ b/include/linux/rds.h @@ -0,0 +1,250 @@ +/* + * Copyright (c) 2008 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef _LINUX_RDS_H +#define _LINUX_RDS_H + +#include + +/* These sparse annotated types shouldn't be in any user + * visible header file. We should clean this up rather + * than kludging around them. */ +#ifndef __KERNEL__ +#define __be16 u_int16_t +#define __be32 u_int32_t +#define __be64 u_int64_t +#endif + +#define RDS_IB_ABI_VERSION 0x301 + +/* + * setsockopt/getsockopt for SOL_RDS + */ +#define RDS_CANCEL_SENT_TO 1 +#define RDS_GET_MR 2 +#define RDS_FREE_MR 3 +/* deprecated: RDS_BARRIER 4 */ +#define RDS_RECVERR 5 +#define RDS_CONG_MONITOR 6 + +/* + * Control message types for SOL_RDS. + * + * RDS_CMSG_RDMA_ARGS (sendmsg) + * Request an RDMA transfer to/from the specified + * memory ranges. + * The cmsg_data is a struct rds_rdma_args. + * RDS_CMSG_RDMA_DEST (recvmsg, sendmsg) + * Kernel informs application about intended + * source/destination of an RDMA transfer + * RDS_CMSG_RDMA_MAP (sendmsg) + * Application asks kernel to map the given + * memory range into an IB MR, and send the + * R_Key along in an RDS extension header. + * The cmsg_data is a struct rds_get_mr_args, + * the same as for the GET_MR setsockopt. + * RDS_CMSG_RDMA_STATUS (recvmsg) + * Returns the status of a completed RDMA operation.
+ */ +#define RDS_CMSG_RDMA_ARGS 1 +#define RDS_CMSG_RDMA_DEST 2 +#define RDS_CMSG_RDMA_MAP 3 +#define RDS_CMSG_RDMA_STATUS 4 +#define RDS_CMSG_CONG_UPDATE 5 + +#define RDS_INFO_FIRST 10000 +#define RDS_INFO_COUNTERS 10000 +#define RDS_INFO_CONNECTIONS 10001 +/* 10002 aka RDS_INFO_FLOWS is deprecated */ +#define RDS_INFO_SEND_MESSAGES 10003 +#define RDS_INFO_RETRANS_MESSAGES 10004 +#define RDS_INFO_RECV_MESSAGES 10005 +#define RDS_INFO_SOCKETS 10006 +#define RDS_INFO_TCP_SOCKETS 10007 +#define RDS_INFO_IB_CONNECTIONS 10008 +#define RDS_INFO_CONNECTION_STATS 10009 +#define RDS_INFO_IWARP_CONNECTIONS 10010 +#define RDS_INFO_LAST 10010 + +struct rds_info_counter { + u_int8_t name[32]; + u_int64_t value; +} __attribute__((packed)); + +#define RDS_INFO_CONNECTION_FLAG_SENDING 0x01 +#define RDS_INFO_CONNECTION_FLAG_CONNECTING 0x02 +#define RDS_INFO_CONNECTION_FLAG_CONNECTED 0x04 + +#define TRANSNAMSIZ 16 + +struct rds_info_connection { + u_int64_t next_tx_seq; + u_int64_t next_rx_seq; + __be32 laddr; + __be32 faddr; + u_int8_t transport[TRANSNAMSIZ]; /* null term ascii */ + u_int8_t flags; +} __attribute__((packed)); + +struct rds_info_flow { + __be32 laddr; + __be32 faddr; + u_int32_t bytes; + __be16 lport; + __be16 fport; +} __attribute__((packed)); + +#define RDS_INFO_MESSAGE_FLAG_ACK 0x01 +#define RDS_INFO_MESSAGE_FLAG_FAST_ACK 0x02 + +struct rds_info_message { + u_int64_t seq; + u_int32_t len; + __be32 laddr; + __be32 faddr; + __be16 lport; + __be16 fport; + u_int8_t flags; +} __attribute__((packed)); + +struct rds_info_socket { + u_int32_t sndbuf; + __be32 bound_addr; + __be32 connected_addr; + __be16 bound_port; + __be16 connected_port; + u_int32_t rcvbuf; + u_int64_t inum; +} __attribute__((packed)); + +#define RDS_IB_GID_LEN 16 +struct rds_info_rdma_connection { + __be32 src_addr; + __be32 dst_addr; + uint8_t src_gid[RDS_IB_GID_LEN]; + uint8_t dst_gid[RDS_IB_GID_LEN]; + + uint32_t max_send_wr; + uint32_t max_recv_wr; + uint32_t max_send_sge; + uint32_t 
rdma_mr_max; + uint32_t rdma_mr_size; +}; + +/* + * Congestion monitoring. + * Congestion control in RDS happens at the host connection + * level by exchanging a bitmap marking congested ports. + * By default, a process sleeping in poll() is always woken + * up when the congestion map is updated. + * With explicit monitoring, an application can have more + * fine-grained control. + * The application installs a 64bit mask value in the socket, + * where each bit corresponds to a group of ports. + * When a congestion update arrives, RDS checks the set of + * ports that are now uncongested against the bit mask + * installed in the socket, and if they overlap, we queue a + * cong_notification on the socket. + * + * To install the congestion monitor bitmask, use RDS_CONG_MONITOR + * with the 64bit mask. + * Congestion updates are received via RDS_CMSG_CONG_UPDATE + * control messages. + * + * The correspondence between bits and ports is + * 1 << (portnum % 64) + */ +#define RDS_CONG_MONITOR_SIZE 64 +#define RDS_CONG_MONITOR_BIT(port) (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE) +#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port)) + +/* + * RDMA related types + */ + +/* + * This encapsulates a remote memory location. + * In the current implementation, it contains the R_Key + * of the remote memory region, and the offset into it + * (so that the application does not have to worry about + * alignment).
+ */ +typedef u_int64_t rds_rdma_cookie_t; + +struct rds_iovec { + u_int64_t addr; + u_int64_t bytes; +}; + +struct rds_get_mr_args { + struct rds_iovec vec; + u_int64_t cookie_addr; + uint64_t flags; +}; + +struct rds_free_mr_args { + rds_rdma_cookie_t cookie; + u_int64_t flags; +}; + +struct rds_rdma_args { + rds_rdma_cookie_t cookie; + struct rds_iovec remote_vec; + u_int64_t local_vec_addr; + u_int64_t nr_local; + u_int64_t flags; + u_int64_t user_token; +}; + +struct rds_rdma_notify { + u_int64_t user_token; + int32_t status; +}; + +#define RDS_RDMA_SUCCESS 0 +#define RDS_RDMA_REMOTE_ERROR 1 +#define RDS_RDMA_CANCELED 2 +#define RDS_RDMA_DROPPED 3 +#define RDS_RDMA_OTHER_ERROR 4 + +/* + * Common set of flags for all RDMA related structs + */ +#define RDS_RDMA_READWRITE 0x0001 +#define RDS_RDMA_FENCE 0x0002 /* use FENCE for immediate send */ +#define RDS_RDMA_INVALIDATE 0x0004 /* invalidate R_Key after freeing MR */ +#define RDS_RDMA_USE_ONCE 0x0008 /* free MR after use */ +#define RDS_RDMA_DONTWAIT 0x0010 /* Don't wait in SET_BARRIER */ +#define RDS_RDMA_NOTIFY_ME 0x0020 /* Notify when operation completes */ + +#endif /* _LINUX_RDS_H */ -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:41 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:41 -0800 Subject: [ofa-general] [PATCH 24/26] RDS: Add MAINTAINERS entry In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-25-git-send-email-andy.grover@oracle.com> Signed-off-by: Andy Grover --- MAINTAINERS | 6 ++++++ 1 files changed, 6 insertions(+), 0 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 59fd2d1..fd68b34 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3639,6 +3639,12 @@ M: florian.fainelli at telecomint.eu L: netdev at vger.kernel.org S: Maintained +RDS - RELIABLE DATAGRAM SOCKETS +P: Andy Grover +M: andy.grover at oracle.com +L: rds-devel at
oss.oracle.com +S: Supported + READ-COPY UPDATE (RCU) P: Dipankar Sarma M: dipankar at in.ibm.com -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:36 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:36 -0800 Subject: [ofa-general] [PATCH 19/26] RDS: Add iWARP support In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-20-git-send-email-andy.grover@oracle.com> Support for iWARP NICs is implemented as a separate RDS transport from IB. The code, however, is very similar to IB (it was basically forked), so let's keep it in one changeset. The reason for this duplication is that despite its similarity to IB, there are a number of places where it has different semantics. iWARP zero-copy support is still under development, and giving it its own sandbox ensures that IB code isn't disrupted while iWARP changes. Over time these transports will re-converge.
Signed-off-by: Andy Grover --- net/rds/iw.c | 333 ++++++++++++++++++ net/rds/iw.h | 395 +++++++++++++++++++++ net/rds/iw_cm.c | 750 +++++++++++++++++++++++++++++++++++++++ net/rds/iw_rdma.c | 888 ++++++++++++++++++++++++++++++++++++++++++++++ net/rds/iw_recv.c | 869 +++++++++++++++++++++++++++++++++++++++++++++ net/rds/iw_ring.c | 169 +++++++++ net/rds/iw_send.c | 975 +++++++++++++++++++++++++++++++++++++++++++++++++++ net/rds/iw_stats.c | 95 +++++ net/rds/iw_sysctl.c | 137 +++++++ 9 files changed, 4611 insertions(+), 0 deletions(-) create mode 100644 net/rds/iw.c create mode 100644 net/rds/iw.h create mode 100644 net/rds/iw_cm.c create mode 100644 net/rds/iw_rdma.c create mode 100644 net/rds/iw_recv.c create mode 100644 net/rds/iw_ring.c create mode 100644 net/rds/iw_send.c create mode 100644 net/rds/iw_stats.c create mode 100644 net/rds/iw_sysctl.c diff --git a/net/rds/iw.c b/net/rds/iw.c new file mode 100644 index 0000000..1b56905 --- /dev/null +++ b/net/rds/iw.c @@ -0,0 +1,333 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include + +#include "rds.h" +#include "iw.h" + +unsigned int fastreg_pool_size = RDS_FASTREG_POOL_SIZE; +unsigned int fastreg_message_size = RDS_FASTREG_SIZE + 1; /* +1 allows for unaligned MRs */ + +module_param(fastreg_pool_size, int, 0444); +MODULE_PARM_DESC(fastreg_pool_size, " Max number of fastreg MRs per device"); +module_param(fastreg_message_size, int, 0444); +MODULE_PARM_DESC(fastreg_message_size, " Max size of a RDMA transfer (fastreg MRs)"); + +struct list_head rds_iw_devices; + +DEFINE_SPINLOCK(iw_nodev_conns_lock); +LIST_HEAD(iw_nodev_conns); + +void rds_iw_add_one(struct ib_device *device) +{ + struct rds_iw_device *rds_iwdev; + struct ib_device_attr *dev_attr; + + /* Only handle iwarp devices */ + if (device->node_type != RDMA_NODE_RNIC) + return; + + dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL); + if (!dev_attr) + return; + + if (ib_query_device(device, dev_attr)) { + rdsdebug("Query device failed for %s\n", device->name); + goto free_attr; + } + + rds_iwdev = kmalloc(sizeof *rds_iwdev, GFP_KERNEL); + if (!rds_iwdev) + goto free_attr; + + spin_lock_init(&rds_iwdev->spinlock); + + rds_iwdev->dma_local_lkey = !!(dev_attr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY); + rds_iwdev->max_wrs = dev_attr->max_qp_wr; + rds_iwdev->max_sge = min(dev_attr->max_sge, RDS_IW_MAX_SGE); + + rds_iwdev->page_shift = max(PAGE_SHIFT, ffs(dev_attr->page_size_cap) - 1); + + rds_iwdev->dev = device; + 
rds_iwdev->pd = ib_alloc_pd(device); + if (IS_ERR(rds_iwdev->pd)) + goto free_dev; + + if (!rds_iwdev->dma_local_lkey) { + if (device->node_type != RDMA_NODE_RNIC) { + rds_iwdev->mr = ib_get_dma_mr(rds_iwdev->pd, + IB_ACCESS_LOCAL_WRITE); + } else { + rds_iwdev->mr = ib_get_dma_mr(rds_iwdev->pd, + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE | + IB_ACCESS_LOCAL_WRITE); + } + if (IS_ERR(rds_iwdev->mr)) + goto err_pd; + } else + rds_iwdev->mr = NULL; + + rds_iwdev->mr_pool = rds_iw_create_mr_pool(rds_iwdev); + if (IS_ERR(rds_iwdev->mr_pool)) { + rds_iwdev->mr_pool = NULL; + goto err_mr; + } + + INIT_LIST_HEAD(&rds_iwdev->cm_id_list); + INIT_LIST_HEAD(&rds_iwdev->conn_list); + list_add_tail(&rds_iwdev->list, &rds_iw_devices); + + ib_set_client_data(device, &rds_iw_client, rds_iwdev); + + goto free_attr; + +err_mr: + if (rds_iwdev->mr) + ib_dereg_mr(rds_iwdev->mr); +err_pd: + ib_dealloc_pd(rds_iwdev->pd); +free_dev: + kfree(rds_iwdev); +free_attr: + kfree(dev_attr); +} + +void rds_iw_remove_one(struct ib_device *device) +{ + struct rds_iw_device *rds_iwdev; + struct rds_iw_cm_id *i_cm_id, *next; + + rds_iwdev = ib_get_client_data(device, &rds_iw_client); + if (!rds_iwdev) + return; + + spin_lock_irq(&rds_iwdev->spinlock); + list_for_each_entry_safe(i_cm_id, next, &rds_iwdev->cm_id_list, list) { + list_del(&i_cm_id->list); + kfree(i_cm_id); + } + spin_unlock_irq(&rds_iwdev->spinlock); + + rds_iw_remove_conns(rds_iwdev); + + if (rds_iwdev->mr_pool) + rds_iw_destroy_mr_pool(rds_iwdev->mr_pool); + + if (rds_iwdev->mr) + ib_dereg_mr(rds_iwdev->mr); + + while (ib_dealloc_pd(rds_iwdev->pd)) { + rdsdebug("Failed to dealloc pd %p\n", rds_iwdev->pd); + msleep(1); + } + + list_del(&rds_iwdev->list); + kfree(rds_iwdev); +} + +struct ib_client rds_iw_client = { + .name = "rds_iw", + .add = rds_iw_add_one, + .remove = rds_iw_remove_one +}; + +static int rds_iw_conn_info_visitor(struct rds_connection *conn, + void *buffer) +{ + struct rds_info_rdma_connection *iinfo = buffer; + 
struct rds_iw_connection *ic; + + /* We will only ever look at iWARP transports */ + if (conn->c_trans != &rds_iw_transport) + return 0; + + iinfo->src_addr = conn->c_laddr; + iinfo->dst_addr = conn->c_faddr; + + memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid)); + memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid)); + if (rds_conn_state(conn) == RDS_CONN_UP) { + struct rds_iw_device *rds_iwdev; + struct rdma_dev_addr *dev_addr; + + ic = conn->c_transport_data; + dev_addr = &ic->i_cm_id->route.addr.dev_addr; + + ib_addr_get_sgid(dev_addr, (union ib_gid *) &iinfo->src_gid); + ib_addr_get_dgid(dev_addr, (union ib_gid *) &iinfo->dst_gid); + + rds_iwdev = ib_get_client_data(ic->i_cm_id->device, &rds_iw_client); + iinfo->max_send_wr = ic->i_send_ring.w_nr; + iinfo->max_recv_wr = ic->i_recv_ring.w_nr; + iinfo->max_send_sge = rds_iwdev->max_sge; + rds_iw_get_mr_info(rds_iwdev, iinfo); + } + return 1; +} + +static void rds_iw_ic_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + rds_for_each_conn_info(sock, len, iter, lens, + rds_iw_conn_info_visitor, + sizeof(struct rds_info_rdma_connection)); +} + + +/* + * Early RDS/IB was built to only bind to an address if there is an IPoIB + * device with that address set. + * + * If it were me, I'd advocate for something more flexible. Sending and + * receiving should be device-agnostic. Transports would try and maintain + * connections between peers who have messages queued. Userspace would be + * allowed to influence which paths have priority. We could call userspace + * asserting this policy "routing". + */ +static int rds_iw_laddr_check(__be32 addr) +{ + int ret; + struct rdma_cm_id *cm_id; + struct sockaddr_in sin; + + /* Create a CMA ID and try to bind it. This catches both + * IB and iWARP capable NICs.
+ */ + cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP); + if (IS_ERR(cm_id)) + return -EADDRNOTAVAIL; + + memset(&sin, 0, sizeof(sin)); + sin.sin_family = AF_INET; + sin.sin_addr.s_addr = addr; + + /* rdma_bind_addr will only succeed for IB & iWARP devices */ + ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin); + /* due to this, we will claim to support IB devices unless we + check node_type. */ + if (ret || cm_id->device->node_type != RDMA_NODE_RNIC) + ret = -EADDRNOTAVAIL; + + rdsdebug("addr %pI4 ret %d node type %d\n", + &addr, ret, + cm_id->device ? cm_id->device->node_type : -1); + + rdma_destroy_id(cm_id); + + return ret; +} + +void rds_iw_exit(void) +{ + rds_info_deregister_func(RDS_INFO_IWARP_CONNECTIONS, rds_iw_ic_info); + rds_iw_remove_nodev_conns(); + ib_unregister_client(&rds_iw_client); + rds_iw_sysctl_exit(); + rds_iw_recv_exit(); + rds_trans_unregister(&rds_iw_transport); +} + +struct rds_transport rds_iw_transport = { + .laddr_check = rds_iw_laddr_check, + .xmit_complete = rds_iw_xmit_complete, + .xmit = rds_iw_xmit, + .xmit_cong_map = NULL, + .xmit_rdma = rds_iw_xmit_rdma, + .recv = rds_iw_recv, + .conn_alloc = rds_iw_conn_alloc, + .conn_free = rds_iw_conn_free, + .conn_connect = rds_iw_conn_connect, + .conn_shutdown = rds_iw_conn_shutdown, + .inc_copy_to_user = rds_iw_inc_copy_to_user, + .inc_purge = rds_iw_inc_purge, + .inc_free = rds_iw_inc_free, + .cm_initiate_connect = rds_iw_cm_initiate_connect, + .cm_handle_connect = rds_iw_cm_handle_connect, + .cm_connect_complete = rds_iw_cm_connect_complete, + .stats_info_copy = rds_iw_stats_info_copy, + .exit = rds_iw_exit, + .get_mr = rds_iw_get_mr, + .sync_mr = rds_iw_sync_mr, + .free_mr = rds_iw_free_mr, + .flush_mrs = rds_iw_flush_mrs, + .t_owner = THIS_MODULE, + .t_name = "iwarp", + .t_prefer_loopback = 1, +}; + +int __init rds_iw_init(void) +{ + int ret; + + INIT_LIST_HEAD(&rds_iw_devices); + + ret = ib_register_client(&rds_iw_client); + if (ret) + goto out; + + ret = rds_iw_sysctl_init(); + if (ret) +
goto out_ibreg; + + ret = rds_iw_recv_init(); + if (ret) + goto out_sysctl; + + ret = rds_trans_register(&rds_iw_transport); + if (ret) + goto out_recv; + + rds_info_register_func(RDS_INFO_IWARP_CONNECTIONS, rds_iw_ic_info); + + goto out; + +out_recv: + rds_iw_recv_exit(); +out_sysctl: + rds_iw_sysctl_exit(); +out_ibreg: + ib_unregister_client(&rds_iw_client); +out: + return ret; +} + +MODULE_LICENSE("GPL"); + diff --git a/net/rds/iw.h b/net/rds/iw.h new file mode 100644 index 0000000..0ddda34 --- /dev/null +++ b/net/rds/iw.h @@ -0,0 +1,395 @@ +#ifndef _RDS_IW_H +#define _RDS_IW_H + +#include +#include +#include "rds.h" +#include "rdma_transport.h" + +#define RDS_FASTREG_SIZE 20 +#define RDS_FASTREG_POOL_SIZE 2048 + +#define RDS_IW_MAX_SGE 8 +#define RDS_IW_RECV_SGE 2 + +#define RDS_IW_DEFAULT_RECV_WR 1024 +#define RDS_IW_DEFAULT_SEND_WR 256 + +#define RDS_IW_SUPPORTED_PROTOCOLS 0x00000003 /* minor versions supported */ + +extern struct list_head rds_iw_devices; + +/* + * IB posts RDS_FRAG_SIZE fragments of pages to the receive queues to + * try to minimize the amount of memory tied up in both the device and + * socket receive queues. + */ +/* page offset of the final full frag that fits in the page */ +#define RDS_PAGE_LAST_OFF (((PAGE_SIZE / RDS_FRAG_SIZE) - 1) * RDS_FRAG_SIZE) +struct rds_page_frag { + struct list_head f_item; + struct page *f_page; + unsigned long f_offset; + dma_addr_t f_mapped; +}; + +struct rds_iw_incoming { + struct list_head ii_frags; + struct rds_incoming ii_inc; +}; + +struct rds_iw_connect_private { + /* Add new fields at the end, and don't permute existing fields.
*/ + __be32 dp_saddr; + __be32 dp_daddr; + u8 dp_protocol_major; + u8 dp_protocol_minor; + __be16 dp_protocol_minor_mask; /* bitmask */ + __be32 dp_reserved1; + __be64 dp_ack_seq; + __be32 dp_credit; /* non-zero enables flow ctl */ +}; + +struct rds_iw_scatterlist { + struct scatterlist *list; + unsigned int len; + int dma_len; + unsigned int dma_npages; + unsigned int bytes; +}; + +struct rds_iw_mapping { + spinlock_t m_lock; /* protect the mapping struct */ + struct list_head m_list; + struct rds_iw_mr *m_mr; + uint32_t m_rkey; + struct rds_iw_scatterlist m_sg; +}; + +struct rds_iw_send_work { + struct rds_message *s_rm; + + /* We should really put these into a union: */ + struct rds_rdma_op *s_op; + struct rds_iw_mapping *s_mapping; + struct ib_mr *s_mr; + struct ib_fast_reg_page_list *s_page_list; + unsigned char s_remap_count; + + struct ib_send_wr s_wr; + struct ib_sge s_sge[RDS_IW_MAX_SGE]; + unsigned long s_queued; +}; + +struct rds_iw_recv_work { + struct rds_iw_incoming *r_iwinc; + struct rds_page_frag *r_frag; + struct ib_recv_wr r_wr; + struct ib_sge r_sge[2]; +}; + +struct rds_iw_work_ring { + u32 w_nr; + u32 w_alloc_ptr; + u32 w_alloc_ctr; + u32 w_free_ptr; + atomic_t w_free_ctr; +}; + +struct rds_iw_device; + +struct rds_iw_connection { + + struct list_head iw_node; + struct rds_iw_device *rds_iwdev; + struct rds_connection *conn; + + /* alphabet soup, IBTA style */ + struct rdma_cm_id *i_cm_id; + struct ib_pd *i_pd; + struct ib_mr *i_mr; + struct ib_cq *i_send_cq; + struct ib_cq *i_recv_cq; + + /* tx */ + struct rds_iw_work_ring i_send_ring; + struct rds_message *i_rm; + struct rds_header *i_send_hdrs; + u64 i_send_hdrs_dma; + struct rds_iw_send_work *i_sends; + + /* rx */ + struct mutex i_recv_mutex; + struct rds_iw_work_ring i_recv_ring; + struct rds_iw_incoming *i_iwinc; + u32 i_recv_data_rem; + struct rds_header *i_recv_hdrs; + u64 i_recv_hdrs_dma; + struct rds_iw_recv_work *i_recvs; + struct rds_page_frag i_frag; + u64 i_ack_recv; /* last ACK 
received */ + + /* sending acks */ + unsigned long i_ack_flags; + u64 i_ack_next; /* next ACK to send */ + struct rds_header *i_ack; + struct ib_send_wr i_ack_wr; + struct ib_sge i_ack_sge; + u64 i_ack_dma; + unsigned long i_ack_queued; + + /* Flow control related information + * + * Our algorithm uses a pair of variables that we need to access + * atomically - one for the send credits, and one for the posted + * recv credits we need to transfer to the remote peer. + * Rather than protect them using a slow spinlock, we put both into + * a single atomic_t and update it using cmpxchg. + */ + atomic_t i_credits; + + /* Protocol version specific information */ + unsigned int i_flowctl:1; /* enable/disable flow ctl */ + unsigned int i_dma_local_lkey:1; + unsigned int i_fastreg_posted:1; /* fastreg posted on this connection */ + /* Batched completions */ + unsigned int i_unsignaled_wrs; + long i_unsignaled_bytes; +}; + +/* This assumes that atomic_t is at least 32 bits */ +#define IB_GET_SEND_CREDITS(v) ((v) & 0xffff) +#define IB_GET_POST_CREDITS(v) ((v) >> 16) +#define IB_SET_SEND_CREDITS(v) ((v) & 0xffff) +#define IB_SET_POST_CREDITS(v) ((v) << 16) + +struct rds_iw_cm_id { + struct list_head list; + struct rdma_cm_id *cm_id; +}; + +struct rds_iw_device { + struct list_head list; + struct list_head cm_id_list; + struct list_head conn_list; + struct ib_device *dev; + struct ib_pd *pd; + struct ib_mr *mr; + struct rds_iw_mr_pool *mr_pool; + int page_shift; + int max_sge; + unsigned int max_wrs; + unsigned int dma_local_lkey:1; + spinlock_t spinlock; /* protect the above */ +}; + +/* bits for i_ack_flags */ +#define IB_ACK_IN_FLIGHT 0 +#define IB_ACK_REQUESTED 1 + +/* Magic WR_ID for ACKs */ +#define RDS_IW_ACK_WR_ID ((u64)0xffffffffffffffffULL) +#define RDS_IW_FAST_REG_WR_ID ((u64)0xefefefefefefefefULL) +#define RDS_IW_LOCAL_INV_WR_ID ((u64)0xdfdfdfdfdfdfdfdfULL) + +struct rds_iw_statistics { + uint64_t s_iw_connect_raced; + uint64_t s_iw_listen_closed_stale; + uint64_t s_iw_tx_cq_call; +
uint64_t s_iw_tx_cq_event; + uint64_t s_iw_tx_ring_full; + uint64_t s_iw_tx_throttle; + uint64_t s_iw_tx_sg_mapping_failure; + uint64_t s_iw_tx_stalled; + uint64_t s_iw_tx_credit_updates; + uint64_t s_iw_rx_cq_call; + uint64_t s_iw_rx_cq_event; + uint64_t s_iw_rx_ring_empty; + uint64_t s_iw_rx_refill_from_cq; + uint64_t s_iw_rx_refill_from_thread; + uint64_t s_iw_rx_alloc_limit; + uint64_t s_iw_rx_credit_updates; + uint64_t s_iw_ack_sent; + uint64_t s_iw_ack_send_failure; + uint64_t s_iw_ack_send_delayed; + uint64_t s_iw_ack_send_piggybacked; + uint64_t s_iw_ack_received; + uint64_t s_iw_rdma_mr_alloc; + uint64_t s_iw_rdma_mr_free; + uint64_t s_iw_rdma_mr_used; + uint64_t s_iw_rdma_mr_pool_flush; + uint64_t s_iw_rdma_mr_pool_wait; + uint64_t s_iw_rdma_mr_pool_depleted; +}; + +extern struct workqueue_struct *rds_iw_wq; + +/* + * Fake ib_dma_sync_sg_for_{cpu,device} as long as ib_verbs.h + * doesn't define it. + */ +static inline void rds_iw_dma_sync_sg_for_cpu(struct ib_device *dev, + struct scatterlist *sg, unsigned int sg_dma_len, int direction) +{ + unsigned int i; + + for (i = 0; i < sg_dma_len; ++i) { + ib_dma_sync_single_for_cpu(dev, + ib_sg_dma_address(dev, &sg[i]), + ib_sg_dma_len(dev, &sg[i]), + direction); + } +} +#define ib_dma_sync_sg_for_cpu rds_iw_dma_sync_sg_for_cpu + +static inline void rds_iw_dma_sync_sg_for_device(struct ib_device *dev, + struct scatterlist *sg, unsigned int sg_dma_len, int direction) +{ + unsigned int i; + + for (i = 0; i < sg_dma_len; ++i) { + ib_dma_sync_single_for_device(dev, + ib_sg_dma_address(dev, &sg[i]), + ib_sg_dma_len(dev, &sg[i]), + direction); + } +} +#define ib_dma_sync_sg_for_device rds_iw_dma_sync_sg_for_device + +static inline u32 rds_iw_local_dma_lkey(struct rds_iw_connection *ic) +{ + return ic->i_dma_local_lkey ? 
ic->i_cm_id->device->local_dma_lkey : ic->i_mr->lkey; +} + +/* ib.c */ +extern struct rds_transport rds_iw_transport; +extern void rds_iw_add_one(struct ib_device *device); +extern void rds_iw_remove_one(struct ib_device *device); +extern struct ib_client rds_iw_client; + +extern unsigned int fastreg_pool_size; +extern unsigned int fastreg_message_size; + +extern spinlock_t iw_nodev_conns_lock; +extern struct list_head iw_nodev_conns; + +/* ib_cm.c */ +int rds_iw_conn_alloc(struct rds_connection *conn, gfp_t gfp); +void rds_iw_conn_free(void *arg); +int rds_iw_conn_connect(struct rds_connection *conn); +void rds_iw_conn_shutdown(struct rds_connection *conn); +void rds_iw_state_change(struct sock *sk); +int __init rds_iw_listen_init(void); +void rds_iw_listen_stop(void); +void __rds_iw_conn_error(struct rds_connection *conn, const char *, ...); +int rds_iw_cm_handle_connect(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event); +int rds_iw_cm_initiate_connect(struct rdma_cm_id *cm_id); +void rds_iw_cm_connect_complete(struct rds_connection *conn, + struct rdma_cm_event *event); + + +#define rds_iw_conn_error(conn, fmt...) 
\ + __rds_iw_conn_error(conn, KERN_WARNING "RDS/IW: " fmt) + +/* ib_rdma.c */ +int rds_iw_update_cm_id(struct rds_iw_device *rds_iwdev, struct rdma_cm_id *cm_id); +int rds_iw_add_conn(struct rds_iw_device *rds_iwdev, struct rds_connection *conn); +void rds_iw_remove_nodev_conns(void); +void rds_iw_remove_conns(struct rds_iw_device *rds_iwdev); +struct rds_iw_mr_pool *rds_iw_create_mr_pool(struct rds_iw_device *); +void rds_iw_get_mr_info(struct rds_iw_device *rds_iwdev, struct rds_info_rdma_connection *iinfo); +void rds_iw_destroy_mr_pool(struct rds_iw_mr_pool *); +void *rds_iw_get_mr(struct scatterlist *sg, unsigned long nents, + struct rds_sock *rs, u32 *key_ret); +void rds_iw_sync_mr(void *trans_private, int dir); +void rds_iw_free_mr(void *trans_private, int invalidate); +void rds_iw_flush_mrs(void); +void rds_iw_remove_cm_id(struct rds_iw_device *rds_iwdev, struct rdma_cm_id *cm_id); + +/* ib_recv.c */ +int __init rds_iw_recv_init(void); +void rds_iw_recv_exit(void); +int rds_iw_recv(struct rds_connection *conn); +int rds_iw_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, + gfp_t page_gfp, int prefill); +void rds_iw_inc_purge(struct rds_incoming *inc); +void rds_iw_inc_free(struct rds_incoming *inc); +int rds_iw_inc_copy_to_user(struct rds_incoming *inc, struct iovec *iov, + size_t size); +void rds_iw_recv_cq_comp_handler(struct ib_cq *cq, void *context); +void rds_iw_recv_init_ring(struct rds_iw_connection *ic); +void rds_iw_recv_clear_ring(struct rds_iw_connection *ic); +void rds_iw_recv_init_ack(struct rds_iw_connection *ic); +void rds_iw_attempt_ack(struct rds_iw_connection *ic); +void rds_iw_ack_send_complete(struct rds_iw_connection *ic); +u64 rds_iw_piggyb_ack(struct rds_iw_connection *ic); + +/* ib_ring.c */ +void rds_iw_ring_init(struct rds_iw_work_ring *ring, u32 nr); +void rds_iw_ring_resize(struct rds_iw_work_ring *ring, u32 nr); +u32 rds_iw_ring_alloc(struct rds_iw_work_ring *ring, u32 val, u32 *pos); +void rds_iw_ring_free(struct 
rds_iw_work_ring *ring, u32 val); +void rds_iw_ring_unalloc(struct rds_iw_work_ring *ring, u32 val); +int rds_iw_ring_empty(struct rds_iw_work_ring *ring); +int rds_iw_ring_low(struct rds_iw_work_ring *ring); +u32 rds_iw_ring_oldest(struct rds_iw_work_ring *ring); +u32 rds_iw_ring_completed(struct rds_iw_work_ring *ring, u32 wr_id, u32 oldest); +extern wait_queue_head_t rds_iw_ring_empty_wait; + +/* ib_send.c */ +void rds_iw_xmit_complete(struct rds_connection *conn); +int rds_iw_xmit(struct rds_connection *conn, struct rds_message *rm, + unsigned int hdr_off, unsigned int sg, unsigned int off); +void rds_iw_send_cq_comp_handler(struct ib_cq *cq, void *context); +void rds_iw_send_init_ring(struct rds_iw_connection *ic); +void rds_iw_send_clear_ring(struct rds_iw_connection *ic); +int rds_iw_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op); +void rds_iw_send_add_credits(struct rds_connection *conn, unsigned int credits); +void rds_iw_advertise_credits(struct rds_connection *conn, unsigned int posted); +int rds_iw_send_grab_credits(struct rds_iw_connection *ic, u32 wanted, + u32 *adv_credits, int need_posted); + +/* ib_stats.c */ +DECLARE_PER_CPU(struct rds_iw_statistics, rds_iw_stats); +#define rds_iw_stats_inc(member) rds_stats_inc_which(rds_iw_stats, member) +unsigned int rds_iw_stats_info_copy(struct rds_info_iterator *iter, + unsigned int avail); + +/* ib_sysctl.c */ +int __init rds_iw_sysctl_init(void); +void rds_iw_sysctl_exit(void); +extern unsigned long rds_iw_sysctl_max_send_wr; +extern unsigned long rds_iw_sysctl_max_recv_wr; +extern unsigned long rds_iw_sysctl_max_unsig_wrs; +extern unsigned long rds_iw_sysctl_max_unsig_bytes; +extern unsigned long rds_iw_sysctl_max_recv_allocation; +extern unsigned int rds_iw_sysctl_flow_control; +extern ctl_table rds_iw_sysctl_table[]; + +/* + * Helper functions for getting/setting the header and data SGEs in + * RDS packets (not RDMA) + */ +static inline struct ib_sge * +rds_iw_header_sge(struct 
rds_iw_connection *ic, struct ib_sge *sge) +{ + return &sge[0]; +} + +static inline struct ib_sge * +rds_iw_data_sge(struct rds_iw_connection *ic, struct ib_sge *sge) +{ + return &sge[1]; +} + +static inline void rds_iw_set_64bit(u64 *ptr, u64 val) +{ +#if BITS_PER_LONG == 64 + *ptr = val; +#else + set_64bit(ptr, val); +#endif +} + +#endif diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c new file mode 100644 index 0000000..57ecb3d --- /dev/null +++ b/net/rds/iw_cm.c @@ -0,0 +1,750 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "rds.h" +#include "iw.h" + +/* + * Set the selected protocol version + */ +static void rds_iw_set_protocol(struct rds_connection *conn, unsigned int version) +{ + conn->c_version = version; +} + +/* + * Set up flow control + */ +static void rds_iw_set_flow_control(struct rds_connection *conn, u32 credits) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + + if (rds_iw_sysctl_flow_control && credits != 0) { + /* We're doing flow control */ + ic->i_flowctl = 1; + rds_iw_send_add_credits(conn, credits); + } else { + ic->i_flowctl = 0; + } +} + +/* + * Connection established. + * We get here for both outgoing and incoming connections. + */ +void rds_iw_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_event *event) +{ + const struct rds_iw_connect_private *dp = NULL; + struct rds_iw_connection *ic = conn->c_transport_data; + struct rds_iw_device *rds_iwdev; + int err; + + if (event->param.conn.private_data_len) { + dp = event->param.conn.private_data; + + rds_iw_set_protocol(conn, + RDS_PROTOCOL(dp->dp_protocol_major, + dp->dp_protocol_minor)); + rds_iw_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); + } + + /* update ib_device with this local ipaddr & conn */ + rds_iwdev = ib_get_client_data(ic->i_cm_id->device, &rds_iw_client); + err = rds_iw_update_cm_id(rds_iwdev, ic->i_cm_id); + if (err) + printk(KERN_ERR "rds_iw_update_cm_id failed (%d)\n", err); + err = rds_iw_add_conn(rds_iwdev, conn); + if (err) + printk(KERN_ERR "rds_iw_add_conn failed (%d)\n", err); + + /* If the peer gave us the last packet it saw, process this as if + * we had received a regular ACK. */ + if (dp && dp->dp_ack_seq) + rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL); + + printk(KERN_NOTICE "RDS/IW: connected to %pI4<->%pI4 version %u.%u%s\n", + &conn->c_laddr, &conn->c_faddr, + RDS_PROTOCOL_MAJOR(conn->c_version), + RDS_PROTOCOL_MINOR(conn->c_version), + ic->i_flowctl ?
", flow control" : ""); + + rds_connect_complete(conn); +} + +static void rds_iw_cm_fill_conn_param(struct rds_connection *conn, + struct rdma_conn_param *conn_param, + struct rds_iw_connect_private *dp, + u32 protocol_version) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + + memset(conn_param, 0, sizeof(struct rdma_conn_param)); + /* XXX tune these? */ + conn_param->responder_resources = 1; + conn_param->initiator_depth = 1; + + if (dp) { + memset(dp, 0, sizeof(*dp)); + dp->dp_saddr = conn->c_laddr; + dp->dp_daddr = conn->c_faddr; + dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version); + dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version); + dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IW_SUPPORTED_PROTOCOLS); + dp->dp_ack_seq = rds_iw_piggyb_ack(ic); + + /* Advertise flow control */ + if (ic->i_flowctl) { + unsigned int credits; + + credits = IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)); + dp->dp_credit = cpu_to_be32(credits); + atomic_sub(IB_SET_POST_CREDITS(credits), &ic->i_credits); + } + + conn_param->private_data = dp; + conn_param->private_data_len = sizeof(*dp); + } +} + +static void rds_iw_cq_event_handler(struct ib_event *event, void *data) +{ + rdsdebug("event %u data %p\n", event->event, data); +} + +static void rds_iw_qp_event_handler(struct ib_event *event, void *data) +{ + struct rds_connection *conn = data; + struct rds_iw_connection *ic = conn->c_transport_data; + + rdsdebug("conn %p ic %p event %u\n", conn, ic, event->event); + + switch (event->event) { + case IB_EVENT_COMM_EST: + rdma_notify(ic->i_cm_id, IB_EVENT_COMM_EST); + break; + case IB_EVENT_QP_REQ_ERR: + case IB_EVENT_QP_FATAL: + default: + rds_iw_conn_error(conn, "RDS/IW: Fatal QP Event %u - connection %pI4->%pI4...reconnecting\n", + event->event, &conn->c_laddr, + &conn->c_faddr); + break; + } +} + +/* + * Create a QP + */ +static int rds_iw_init_qp_attrs(struct ib_qp_init_attr *attr, + struct rds_iw_device *rds_iwdev, + struct rds_iw_work_ring 
*send_ring, + void (*send_cq_handler)(struct ib_cq *, void *), + struct rds_iw_work_ring *recv_ring, + void (*recv_cq_handler)(struct ib_cq *, void *), + void *context) +{ + struct ib_device *dev = rds_iwdev->dev; + unsigned int send_size, recv_size; + int ret; + + /* The offset of 1 is to accommodate the additional ACK WR. */ + send_size = min_t(unsigned int, rds_iwdev->max_wrs, rds_iw_sysctl_max_send_wr + 1); + recv_size = min_t(unsigned int, rds_iwdev->max_wrs, rds_iw_sysctl_max_recv_wr + 1); + rds_iw_ring_resize(send_ring, send_size - 1); + rds_iw_ring_resize(recv_ring, recv_size - 1); + + memset(attr, 0, sizeof(*attr)); + attr->event_handler = rds_iw_qp_event_handler; + attr->qp_context = context; + attr->cap.max_send_wr = send_size; + attr->cap.max_recv_wr = recv_size; + attr->cap.max_send_sge = rds_iwdev->max_sge; + attr->cap.max_recv_sge = RDS_IW_RECV_SGE; + attr->sq_sig_type = IB_SIGNAL_REQ_WR; + attr->qp_type = IB_QPT_RC; + + attr->send_cq = ib_create_cq(dev, send_cq_handler, + rds_iw_cq_event_handler, + context, send_size, 0); + if (IS_ERR(attr->send_cq)) { + ret = PTR_ERR(attr->send_cq); + attr->send_cq = NULL; + rdsdebug("ib_create_cq send failed: %d\n", ret); + goto out; + } + + attr->recv_cq = ib_create_cq(dev, recv_cq_handler, + rds_iw_cq_event_handler, + context, recv_size, 0); + if (IS_ERR(attr->recv_cq)) { + ret = PTR_ERR(attr->recv_cq); + attr->recv_cq = NULL; + rdsdebug("ib_create_cq recv failed: %d\n", ret); + goto out; + } + + ret = ib_req_notify_cq(attr->send_cq, IB_CQ_NEXT_COMP); + if (ret) { + rdsdebug("ib_req_notify_cq send failed: %d\n", ret); + goto out; + } + + ret = ib_req_notify_cq(attr->recv_cq, IB_CQ_SOLICITED); + if (ret) { + rdsdebug("ib_req_notify_cq recv failed: %d\n", ret); + goto out; + } + +out: + if (ret) { + if (attr->send_cq) + ib_destroy_cq(attr->send_cq); + if (attr->recv_cq) + ib_destroy_cq(attr->recv_cq); + } + return ret; +} + +/* + * This needs to be very careful to not leave IS_ERR pointers around for + * cleanup to
trip over. + */ +static int rds_iw_setup_qp(struct rds_connection *conn) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + struct ib_device *dev = ic->i_cm_id->device; + struct ib_qp_init_attr attr; + struct rds_iw_device *rds_iwdev; + int ret; + + /* rds_iw_add_one creates a rds_iw_device object per IB device, + * and allocates a protection domain, memory range and MR pool + * for each. If that fails for any reason, it will not register + * the rds_iwdev at all. + */ + rds_iwdev = ib_get_client_data(dev, &rds_iw_client); + if (rds_iwdev == NULL) { + if (printk_ratelimit()) + printk(KERN_NOTICE "RDS/IW: No client_data for device %s\n", + dev->name); + return -EOPNOTSUPP; + } + + /* Protection domain and memory range */ + ic->i_pd = rds_iwdev->pd; + ic->i_mr = rds_iwdev->mr; + + ret = rds_iw_init_qp_attrs(&attr, rds_iwdev, + &ic->i_send_ring, rds_iw_send_cq_comp_handler, + &ic->i_recv_ring, rds_iw_recv_cq_comp_handler, + conn); + if (ret < 0) + goto out; + + ic->i_send_cq = attr.send_cq; + ic->i_recv_cq = attr.recv_cq; + + /* + * XXX this can fail if max_*_wr is too large? Are we supposed + * to back off until we get a value that the hardware can support? 
+ */ + ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, &attr); + if (ret) { + rdsdebug("rdma_create_qp failed: %d\n", ret); + goto out; + } + + ic->i_send_hdrs = ib_dma_alloc_coherent(dev, + ic->i_send_ring.w_nr * + sizeof(struct rds_header), + &ic->i_send_hdrs_dma, GFP_KERNEL); + if (ic->i_send_hdrs == NULL) { + ret = -ENOMEM; + rdsdebug("ib_dma_alloc_coherent send failed\n"); + goto out; + } + + ic->i_recv_hdrs = ib_dma_alloc_coherent(dev, + ic->i_recv_ring.w_nr * + sizeof(struct rds_header), + &ic->i_recv_hdrs_dma, GFP_KERNEL); + if (ic->i_recv_hdrs == NULL) { + ret = -ENOMEM; + rdsdebug("ib_dma_alloc_coherent recv failed\n"); + goto out; + } + + ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header), + &ic->i_ack_dma, GFP_KERNEL); + if (ic->i_ack == NULL) { + ret = -ENOMEM; + rdsdebug("ib_dma_alloc_coherent ack failed\n"); + goto out; + } + + ic->i_sends = vmalloc(ic->i_send_ring.w_nr * sizeof(struct rds_iw_send_work)); + if (ic->i_sends == NULL) { + ret = -ENOMEM; + rdsdebug("send allocation failed\n"); + goto out; + } + rds_iw_send_init_ring(ic); + + ic->i_recvs = vmalloc(ic->i_recv_ring.w_nr * sizeof(struct rds_iw_recv_work)); + if (ic->i_recvs == NULL) { + ret = -ENOMEM; + rdsdebug("recv allocation failed\n"); + goto out; + } + + rds_iw_recv_init_ring(ic); + rds_iw_recv_init_ack(ic); + + /* Post receive buffers - as a side effect, this will update + * the posted credit count. */ + rds_iw_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1); + + rdsdebug("conn %p pd %p mr %p cq %p %p\n", conn, ic->i_pd, ic->i_mr, + ic->i_send_cq, ic->i_recv_cq); + +out: + return ret; +} + +static u32 rds_iw_protocol_compatible(const struct rds_iw_connect_private *dp) +{ + u16 common; + u32 version = 0; + + /* rdma_cm private data is odd - when there is any private data in the + * request, we will be given a pretty large buffer without telling us the + * original size. The only way to tell the difference is by looking at + * the contents, which are initialized to zero. 
+ * If the protocol version fields aren't set, this is a connection attempt + * from an older version. This could be 3.0 or 2.0 - we can't tell. + * We really should have changed this for OFED 1.3 :-( */ + if (dp->dp_protocol_major == 0) + return RDS_PROTOCOL_3_0; + + common = be16_to_cpu(dp->dp_protocol_minor_mask) & RDS_IW_SUPPORTED_PROTOCOLS; + if (dp->dp_protocol_major == 3 && common) { + version = RDS_PROTOCOL_3_0; + while ((common >>= 1) != 0) + version++; + } else if (printk_ratelimit()) { + printk(KERN_NOTICE "RDS: Connection from %pI4 using " + "incompatible protocol version %u.%u\n", + &dp->dp_saddr, + dp->dp_protocol_major, + dp->dp_protocol_minor); + } + return version; +} + +int rds_iw_cm_handle_connect(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event) +{ + const struct rds_iw_connect_private *dp = event->param.conn.private_data; + struct rds_iw_connect_private dp_rep; + struct rds_connection *conn = NULL; + struct rds_iw_connection *ic = NULL; + struct rdma_conn_param conn_param; + struct rds_iw_device *rds_iwdev; + u32 version; + int err, destroy = 1; + + /* Check whether the remote protocol version matches ours. */ + version = rds_iw_protocol_compatible(dp); + if (!version) + goto out; + + rdsdebug("saddr %pI4 daddr %pI4 RDSv%u.%u\n", + &dp->dp_saddr, &dp->dp_daddr, + RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version)); + + conn = rds_conn_create(dp->dp_daddr, dp->dp_saddr, &rds_iw_transport, + GFP_KERNEL); + if (IS_ERR(conn)) { + rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn)); + conn = NULL; + goto out; + } + + /* + * The connection request may occur while the + * previous connection still exists, e.g. in case of failover.
+ * But as connections may be initiated simultaneously + * by both hosts, we have a random backoff mechanism - + * see the comment above rds_queue_reconnect() + */ + mutex_lock(&conn->c_cm_lock); + if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) { + if (rds_conn_state(conn) == RDS_CONN_UP) { + rdsdebug("incoming connect while connecting\n"); + rds_conn_drop(conn); + rds_iw_stats_inc(s_iw_listen_closed_stale); + } else + if (rds_conn_state(conn) == RDS_CONN_CONNECTING) { + /* Wait and see - our connect may still be succeeding */ + rds_iw_stats_inc(s_iw_connect_raced); + } + mutex_unlock(&conn->c_cm_lock); + goto out; + } + + ic = conn->c_transport_data; + + rds_iw_set_protocol(conn, version); + rds_iw_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); + + /* If the peer gave us the last packet it saw, process this as if + * we had received a regular ACK. */ + if (dp->dp_ack_seq) + rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL); + + BUG_ON(cm_id->context); + BUG_ON(ic->i_cm_id); + + ic->i_cm_id = cm_id; + cm_id->context = conn; + + rds_iwdev = ib_get_client_data(cm_id->device, &rds_iw_client); + ic->i_dma_local_lkey = rds_iwdev->dma_local_lkey; + + /* We got halfway through setting up the ib_connection, if we + * fail now, we have to take the long route out of this mess. 
*/ + destroy = 0; + + err = rds_iw_setup_qp(conn); + if (err) { + rds_iw_conn_error(conn, "rds_iw_setup_qp failed (%d)\n", err); + mutex_unlock(&conn->c_cm_lock); + goto out; + } + + rds_iw_cm_fill_conn_param(conn, &conn_param, &dp_rep, version); + + /* rdma_accept() calls rdma_reject() internally if it fails */ + err = rdma_accept(cm_id, &conn_param); + mutex_unlock(&conn->c_cm_lock); + if (err) { + rds_iw_conn_error(conn, "rdma_accept failed (%d)\n", err); + goto out; + } + + return 0; + +out: + rdma_reject(cm_id, NULL, 0); + return destroy; +} + + +int rds_iw_cm_initiate_connect(struct rdma_cm_id *cm_id) +{ + struct rds_connection *conn = cm_id->context; + struct rds_iw_connection *ic = conn->c_transport_data; + struct rdma_conn_param conn_param; + struct rds_iw_connect_private dp; + int ret; + + /* If the peer doesn't do protocol negotiation, we must + * default to RDSv3.0 */ + rds_iw_set_protocol(conn, RDS_PROTOCOL_3_0); + ic->i_flowctl = rds_iw_sysctl_flow_control; /* advertise flow control */ + + ret = rds_iw_setup_qp(conn); + if (ret) { + rds_iw_conn_error(conn, "rds_iw_setup_qp failed (%d)\n", ret); + goto out; + } + + rds_iw_cm_fill_conn_param(conn, &conn_param, &dp, RDS_PROTOCOL_VERSION); + + ret = rdma_connect(cm_id, &conn_param); + if (ret) + rds_iw_conn_error(conn, "rdma_connect failed (%d)\n", ret); + +out: + /* Beware - returning non-zero tells the rdma_cm to destroy + * the cm_id. We should certainly not do it as long as we still + * "own" the cm_id.
*/ + if (ret) { + struct rds_iw_connection *ic = conn->c_transport_data; + + if (ic->i_cm_id == cm_id) + ret = 0; + } + return ret; +} + +int rds_iw_conn_connect(struct rds_connection *conn) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + struct rds_iw_device *rds_iwdev; + struct sockaddr_in src, dest; + int ret; + + /* XXX I wonder what effect the port space has */ + /* delegate cm event handler to rdma_transport */ + ic->i_cm_id = rdma_create_id(rds_rdma_cm_event_handler, conn, + RDMA_PS_TCP); + if (IS_ERR(ic->i_cm_id)) { + ret = PTR_ERR(ic->i_cm_id); + ic->i_cm_id = NULL; + rdsdebug("rdma_create_id() failed: %d\n", ret); + goto out; + } + + rdsdebug("created cm id %p for conn %p\n", ic->i_cm_id, conn); + + src.sin_family = AF_INET; + src.sin_addr.s_addr = (__force u32)conn->c_laddr; + src.sin_port = (__force u16)htons(0); + + /* First, bind to the local address and device. */ + ret = rdma_bind_addr(ic->i_cm_id, (struct sockaddr *) &src); + if (ret) { + rdsdebug("rdma_bind_addr(%pI4) failed: %d\n", + &conn->c_laddr, ret); + rdma_destroy_id(ic->i_cm_id); + ic->i_cm_id = NULL; + goto out; + } + + rds_iwdev = ib_get_client_data(ic->i_cm_id->device, &rds_iw_client); + ic->i_dma_local_lkey = rds_iwdev->dma_local_lkey; + + dest.sin_family = AF_INET; + dest.sin_addr.s_addr = (__force u32)conn->c_faddr; + dest.sin_port = (__force u16)htons(RDS_PORT); + + ret = rdma_resolve_addr(ic->i_cm_id, (struct sockaddr *)&src, + (struct sockaddr *)&dest, + RDS_RDMA_RESOLVE_TIMEOUT_MS); + if (ret) { + rdsdebug("addr resolve failed for cm id %p: %d\n", ic->i_cm_id, + ret); + rdma_destroy_id(ic->i_cm_id); + ic->i_cm_id = NULL; + } + +out: + return ret; +} + +/* + * This is so careful about only cleaning up resources that were built up + * so that it can be called at any point during startup. In fact it + * can be called multiple times for a given connection.
+ */ +void rds_iw_conn_shutdown(struct rds_connection *conn) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + int err = 0; + struct ib_qp_attr qp_attr; + + rdsdebug("cm %p pd %p cq %p %p qp %p\n", ic->i_cm_id, + ic->i_pd, ic->i_send_cq, ic->i_recv_cq, + ic->i_cm_id ? ic->i_cm_id->qp : NULL); + + if (ic->i_cm_id) { + struct ib_device *dev = ic->i_cm_id->device; + + rdsdebug("disconnecting cm %p\n", ic->i_cm_id); + err = rdma_disconnect(ic->i_cm_id); + if (err) { + /* Actually this may happen quite frequently, when + * an outgoing connect raced with an incoming connect. + */ + rdsdebug("rds_iw_conn_shutdown: failed to disconnect," + " cm: %p err %d\n", ic->i_cm_id, err); + } + + if (ic->i_cm_id->qp) { + qp_attr.qp_state = IB_QPS_ERR; + ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE); + } + + wait_event(rds_iw_ring_empty_wait, + rds_iw_ring_empty(&ic->i_send_ring) && + rds_iw_ring_empty(&ic->i_recv_ring)); + + if (ic->i_send_hdrs) + ib_dma_free_coherent(dev, + ic->i_send_ring.w_nr * + sizeof(struct rds_header), + ic->i_send_hdrs, + ic->i_send_hdrs_dma); + + if (ic->i_recv_hdrs) + ib_dma_free_coherent(dev, + ic->i_recv_ring.w_nr * + sizeof(struct rds_header), + ic->i_recv_hdrs, + ic->i_recv_hdrs_dma); + + if (ic->i_ack) + ib_dma_free_coherent(dev, sizeof(struct rds_header), + ic->i_ack, ic->i_ack_dma); + + if (ic->i_sends) + rds_iw_send_clear_ring(ic); + if (ic->i_recvs) + rds_iw_recv_clear_ring(ic); + + if (ic->i_cm_id->qp) + rdma_destroy_qp(ic->i_cm_id); + if (ic->i_send_cq) + ib_destroy_cq(ic->i_send_cq); + if (ic->i_recv_cq) + ib_destroy_cq(ic->i_recv_cq); + + /* + * If associated with an rds_iw_device: + * Move connection back to the nodev list. + * Remove cm_id from the device cm_id list. 
+ */ + if (ic->rds_iwdev) { + + spin_lock_irq(&ic->rds_iwdev->spinlock); + BUG_ON(list_empty(&ic->iw_node)); + list_del(&ic->iw_node); + spin_unlock_irq(&ic->rds_iwdev->spinlock); + + spin_lock_irq(&iw_nodev_conns_lock); + list_add_tail(&ic->iw_node, &iw_nodev_conns); + spin_unlock_irq(&iw_nodev_conns_lock); + rds_iw_remove_cm_id(ic->rds_iwdev, ic->i_cm_id); + ic->rds_iwdev = NULL; + } + + rdma_destroy_id(ic->i_cm_id); + + ic->i_cm_id = NULL; + ic->i_pd = NULL; + ic->i_mr = NULL; + ic->i_send_cq = NULL; + ic->i_recv_cq = NULL; + ic->i_send_hdrs = NULL; + ic->i_recv_hdrs = NULL; + ic->i_ack = NULL; + } + BUG_ON(ic->rds_iwdev); + + /* Clear pending transmit */ + if (ic->i_rm) { + rds_message_put(ic->i_rm); + ic->i_rm = NULL; + } + + /* Clear the ACK state */ + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + rds_iw_set_64bit(&ic->i_ack_next, 0); + ic->i_ack_recv = 0; + + /* Clear flow control state */ + ic->i_flowctl = 0; + atomic_set(&ic->i_credits, 0); + + rds_iw_ring_init(&ic->i_send_ring, rds_iw_sysctl_max_send_wr); + rds_iw_ring_init(&ic->i_recv_ring, rds_iw_sysctl_max_recv_wr); + + if (ic->i_iwinc) { + rds_inc_put(&ic->i_iwinc->ii_inc); + ic->i_iwinc = NULL; + } + + vfree(ic->i_sends); + ic->i_sends = NULL; + vfree(ic->i_recvs); + ic->i_recvs = NULL; + rdsdebug("shutdown complete\n"); +} + +int rds_iw_conn_alloc(struct rds_connection *conn, gfp_t gfp) +{ + struct rds_iw_connection *ic; + unsigned long flags; + + /* XXX too lazy? */ + ic = kzalloc(sizeof(struct rds_iw_connection), GFP_KERNEL); + if (ic == NULL) + return -ENOMEM; + + INIT_LIST_HEAD(&ic->iw_node); + mutex_init(&ic->i_recv_mutex); + + /* + * rds_iw_conn_shutdown() waits for these to be emptied so they + * must be initialized before it can be called. 
+ */ + rds_iw_ring_init(&ic->i_send_ring, rds_iw_sysctl_max_send_wr); + rds_iw_ring_init(&ic->i_recv_ring, rds_iw_sysctl_max_recv_wr); + + ic->conn = conn; + conn->c_transport_data = ic; + + spin_lock_irqsave(&iw_nodev_conns_lock, flags); + list_add_tail(&ic->iw_node, &iw_nodev_conns); + spin_unlock_irqrestore(&iw_nodev_conns_lock, flags); + + + rdsdebug("conn %p conn ic %p\n", conn, conn->c_transport_data); + return 0; +} + +void rds_iw_conn_free(void *arg) +{ + struct rds_iw_connection *ic = arg; + rdsdebug("ic %p\n", ic); + list_del(&ic->iw_node); + kfree(ic); +} + +/* + * An error occurred on the connection + */ +void +__rds_iw_conn_error(struct rds_connection *conn, const char *fmt, ...) +{ + va_list ap; + + rds_conn_drop(conn); + + va_start(ap, fmt); + vprintk(fmt, ap); + va_end(ap); +} diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c new file mode 100644 index 0000000..1c02a8f --- /dev/null +++ b/net/rds/iw_rdma.c @@ -0,0 +1,888 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include + +#include "rds.h" +#include "rdma.h" +#include "iw.h" + + +/* + * This is stored as mr->r_trans_private. + */ +struct rds_iw_mr { + struct rds_iw_device *device; + struct rds_iw_mr_pool *pool; + struct rdma_cm_id *cm_id; + + struct ib_mr *mr; + struct ib_fast_reg_page_list *page_list; + + struct rds_iw_mapping mapping; + unsigned char remap_count; +}; + +/* + * Our own little MR pool + */ +struct rds_iw_mr_pool { + struct rds_iw_device *device; /* back ptr to the device that owns us */ + + struct mutex flush_lock; /* serialize fmr invalidate */ + struct work_struct flush_worker; /* flush worker */ + + spinlock_t list_lock; /* protect variables below */ + atomic_t item_count; /* total # of MRs */ + atomic_t dirty_count; /* # of dirty MRs */ + struct list_head dirty_list; /* dirty mappings */ + struct list_head clean_list; /* unused & unmapped MRs */ + atomic_t free_pinned; /* memory pinned by free MRs */ + unsigned long max_message_size; /* in pages */ + unsigned long max_items; + unsigned long max_items_soft; + unsigned long max_free_pinned; + int max_pages; +}; + +static int rds_iw_flush_mr_pool(struct rds_iw_mr_pool *pool, int free_all); +static void rds_iw_mr_pool_flush_worker(struct work_struct *work); +static int rds_iw_init_fastreg(struct rds_iw_mr_pool *pool, struct rds_iw_mr *ibmr); +static int rds_iw_map_fastreg(struct rds_iw_mr_pool *pool, + struct rds_iw_mr *ibmr, + struct scatterlist *sg, unsigned int nents); +static void
rds_iw_free_fastreg(struct rds_iw_mr_pool *pool, struct rds_iw_mr *ibmr); +static unsigned int rds_iw_unmap_fastreg_list(struct rds_iw_mr_pool *pool, + struct list_head *unmap_list, + struct list_head *kill_list); +static void rds_iw_destroy_fastreg(struct rds_iw_mr_pool *pool, struct rds_iw_mr *ibmr); + +static int rds_iw_get_device(struct rds_sock *rs, struct rds_iw_device **rds_iwdev, struct rdma_cm_id **cm_id) +{ + struct rds_iw_device *iwdev; + struct rds_iw_cm_id *i_cm_id; + + *rds_iwdev = NULL; + *cm_id = NULL; + + list_for_each_entry(iwdev, &rds_iw_devices, list) { + spin_lock_irq(&iwdev->spinlock); + list_for_each_entry(i_cm_id, &iwdev->cm_id_list, list) { + struct sockaddr_in *src_addr, *dst_addr; + + src_addr = (struct sockaddr_in *)&i_cm_id->cm_id->route.addr.src_addr; + dst_addr = (struct sockaddr_in *)&i_cm_id->cm_id->route.addr.dst_addr; + + rdsdebug("local ipaddr = %x port %d, " + "remote ipaddr = %x port %d" + "..looking for %x port %d, " + "remote ipaddr = %x port %d\n", + src_addr->sin_addr.s_addr, + src_addr->sin_port, + dst_addr->sin_addr.s_addr, + dst_addr->sin_port, + rs->rs_bound_addr, + rs->rs_bound_port, + rs->rs_conn_addr, + rs->rs_conn_port); +#ifdef WORKING_TUPLE_DETECTION + if (src_addr->sin_addr.s_addr == rs->rs_bound_addr && + src_addr->sin_port == rs->rs_bound_port && + dst_addr->sin_addr.s_addr == rs->rs_conn_addr && + dst_addr->sin_port == rs->rs_conn_port) { +#else + /* FIXME - needs to compare the local and remote + * ipaddr/port tuple, but the ipaddr is the only + * available information in the rds_sock (as the rest are + * zeroed). It doesn't appear to be properly populated + * during connection setup...
+ */ + if (src_addr->sin_addr.s_addr == rs->rs_bound_addr) { +#endif + spin_unlock_irq(&iwdev->spinlock); + *rds_iwdev = iwdev; + *cm_id = i_cm_id->cm_id; + return 0; + } + } + spin_unlock_irq(&iwdev->spinlock); + } + + return 1; +} + +static int rds_iw_add_cm_id(struct rds_iw_device *rds_iwdev, struct rdma_cm_id *cm_id) +{ + struct rds_iw_cm_id *i_cm_id; + + i_cm_id = kmalloc(sizeof *i_cm_id, GFP_KERNEL); + if (!i_cm_id) + return -ENOMEM; + + i_cm_id->cm_id = cm_id; + + spin_lock_irq(&rds_iwdev->spinlock); + list_add_tail(&i_cm_id->list, &rds_iwdev->cm_id_list); + spin_unlock_irq(&rds_iwdev->spinlock); + + return 0; +} + +void rds_iw_remove_cm_id(struct rds_iw_device *rds_iwdev, struct rdma_cm_id *cm_id) +{ + struct rds_iw_cm_id *i_cm_id; + + spin_lock_irq(&rds_iwdev->spinlock); + list_for_each_entry(i_cm_id, &rds_iwdev->cm_id_list, list) { + if (i_cm_id->cm_id == cm_id) { + list_del(&i_cm_id->list); + kfree(i_cm_id); + break; + } + } + spin_unlock_irq(&rds_iwdev->spinlock); +} + + +int rds_iw_update_cm_id(struct rds_iw_device *rds_iwdev, struct rdma_cm_id *cm_id) +{ + struct sockaddr_in *src_addr, *dst_addr; + struct rds_iw_device *rds_iwdev_old; + struct rds_sock rs; + struct rdma_cm_id *pcm_id; + int rc; + + src_addr = (struct sockaddr_in *)&cm_id->route.addr.src_addr; + dst_addr = (struct sockaddr_in *)&cm_id->route.addr.dst_addr; + + rs.rs_bound_addr = src_addr->sin_addr.s_addr; + rs.rs_bound_port = src_addr->sin_port; + rs.rs_conn_addr = dst_addr->sin_addr.s_addr; + rs.rs_conn_port = dst_addr->sin_port; + + rc = rds_iw_get_device(&rs, &rds_iwdev_old, &pcm_id); + if (rc) + rds_iw_remove_cm_id(rds_iwdev, cm_id); + + return rds_iw_add_cm_id(rds_iwdev, cm_id); +} + +int rds_iw_add_conn(struct rds_iw_device *rds_iwdev, struct rds_connection *conn) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + + /* conn was previously on the nodev_conns_list */ + spin_lock_irq(&iw_nodev_conns_lock); + BUG_ON(list_empty(&iw_nodev_conns)); + 
BUG_ON(list_empty(&ic->iw_node)); + list_del(&ic->iw_node); + spin_unlock_irq(&iw_nodev_conns_lock); + + spin_lock_irq(&rds_iwdev->spinlock); + list_add_tail(&ic->iw_node, &rds_iwdev->conn_list); + spin_unlock_irq(&rds_iwdev->spinlock); + + ic->rds_iwdev = rds_iwdev; + + return 0; +} + +void rds_iw_remove_nodev_conns(void) +{ + struct rds_iw_connection *ic, *_ic; + LIST_HEAD(tmp_list); + + /* avoid calling conn_destroy with irqs off */ + spin_lock_irq(&iw_nodev_conns_lock); + list_splice(&iw_nodev_conns, &tmp_list); + INIT_LIST_HEAD(&iw_nodev_conns); + spin_unlock_irq(&iw_nodev_conns_lock); + + list_for_each_entry_safe(ic, _ic, &tmp_list, iw_node) { + if (ic->conn->c_passive) + rds_conn_destroy(ic->conn->c_passive); + rds_conn_destroy(ic->conn); + } +} + +void rds_iw_remove_conns(struct rds_iw_device *rds_iwdev) +{ + struct rds_iw_connection *ic, *_ic; + LIST_HEAD(tmp_list); + + /* avoid calling conn_destroy with irqs off */ + spin_lock_irq(&rds_iwdev->spinlock); + list_splice(&rds_iwdev->conn_list, &tmp_list); + INIT_LIST_HEAD(&rds_iwdev->conn_list); + spin_unlock_irq(&rds_iwdev->spinlock); + + list_for_each_entry_safe(ic, _ic, &tmp_list, iw_node) { + if (ic->conn->c_passive) + rds_conn_destroy(ic->conn->c_passive); + rds_conn_destroy(ic->conn); + } +} + +static void rds_iw_set_scatterlist(struct rds_iw_scatterlist *sg, + struct scatterlist *list, unsigned int sg_len) +{ + sg->list = list; + sg->len = sg_len; + sg->dma_len = 0; + sg->dma_npages = 0; + sg->bytes = 0; +} + +static u64 *rds_iw_map_scatterlist(struct rds_iw_device *rds_iwdev, + struct rds_iw_scatterlist *sg, + unsigned int dma_page_shift) +{ + struct ib_device *dev = rds_iwdev->dev; + u64 *dma_pages = NULL; + u64 dma_mask; + unsigned int dma_page_size; + int i, j, ret; + + dma_page_size = 1 << dma_page_shift; + dma_mask = dma_page_size - 1; + + WARN_ON(sg->dma_len); + + sg->dma_len = ib_dma_map_sg(dev, sg->list, sg->len, DMA_BIDIRECTIONAL); + if (unlikely(!sg->dma_len)) { + printk(KERN_WARNING 
"RDS/IW: dma_map_sg failed!\n"); + return ERR_PTR(-EBUSY); + } + + sg->bytes = 0; + sg->dma_npages = 0; + + ret = -EINVAL; + for (i = 0; i < sg->dma_len; ++i) { + unsigned int dma_len = ib_sg_dma_len(dev, &sg->list[i]); + u64 dma_addr = ib_sg_dma_address(dev, &sg->list[i]); + u64 end_addr; + + sg->bytes += dma_len; + + end_addr = dma_addr + dma_len; + if (dma_addr & dma_mask) { + if (i > 0) + goto out_unmap; + dma_addr &= ~dma_mask; + } + if (end_addr & dma_mask) { + if (i < sg->dma_len - 1) + goto out_unmap; + end_addr = (end_addr + dma_mask) & ~dma_mask; + } + + sg->dma_npages += (end_addr - dma_addr) >> dma_page_shift; + } + + /* Now gather the dma addrs into one list */ + if (sg->dma_npages > fastreg_message_size) + goto out_unmap; + + dma_pages = kmalloc(sizeof(u64) * sg->dma_npages, GFP_ATOMIC); + if (!dma_pages) { + ret = -ENOMEM; + goto out_unmap; + } + + for (i = j = 0; i < sg->dma_len; ++i) { + unsigned int dma_len = ib_sg_dma_len(dev, &sg->list[i]); + u64 dma_addr = ib_sg_dma_address(dev, &sg->list[i]); + u64 end_addr; + + end_addr = dma_addr + dma_len; + dma_addr &= ~dma_mask; + for (; dma_addr < end_addr; dma_addr += dma_page_size) + dma_pages[j++] = dma_addr; + BUG_ON(j > sg->dma_npages); + } + + return dma_pages; + +out_unmap: + ib_dma_unmap_sg(rds_iwdev->dev, sg->list, sg->len, DMA_BIDIRECTIONAL); + sg->dma_len = 0; + kfree(dma_pages); + return ERR_PTR(ret); +} + + +struct rds_iw_mr_pool *rds_iw_create_mr_pool(struct rds_iw_device *rds_iwdev) +{ + struct rds_iw_mr_pool *pool; + + pool = kzalloc(sizeof(*pool), GFP_KERNEL); + if (!pool) { + printk(KERN_WARNING "RDS/IW: rds_iw_create_mr_pool alloc error\n"); + return ERR_PTR(-ENOMEM); + } + + pool->device = rds_iwdev; + INIT_LIST_HEAD(&pool->dirty_list); + INIT_LIST_HEAD(&pool->clean_list); + mutex_init(&pool->flush_lock); + spin_lock_init(&pool->list_lock); + INIT_WORK(&pool->flush_worker, rds_iw_mr_pool_flush_worker); + + pool->max_message_size = fastreg_message_size; + pool->max_items = 
fastreg_pool_size; + pool->max_free_pinned = pool->max_items * pool->max_message_size / 4; + pool->max_pages = fastreg_message_size; + + /* We never allow more than max_items MRs to be allocated. + * When we exceed more than max_items_soft, we start freeing + * items more aggressively. + * Make sure that max_items > max_items_soft > max_items / 2 + */ + pool->max_items_soft = pool->max_items * 3 / 4; + + return pool; +} + +void rds_iw_get_mr_info(struct rds_iw_device *rds_iwdev, struct rds_info_rdma_connection *iinfo) +{ + struct rds_iw_mr_pool *pool = rds_iwdev->mr_pool; + + iinfo->rdma_mr_max = pool->max_items; + iinfo->rdma_mr_size = pool->max_pages; +} + +void rds_iw_destroy_mr_pool(struct rds_iw_mr_pool *pool) +{ + flush_workqueue(rds_wq); + rds_iw_flush_mr_pool(pool, 1); + BUG_ON(atomic_read(&pool->item_count)); + BUG_ON(atomic_read(&pool->free_pinned)); + kfree(pool); +} + +static inline struct rds_iw_mr *rds_iw_reuse_fmr(struct rds_iw_mr_pool *pool) +{ + struct rds_iw_mr *ibmr = NULL; + unsigned long flags; + + spin_lock_irqsave(&pool->list_lock, flags); + if (!list_empty(&pool->clean_list)) { + ibmr = list_entry(pool->clean_list.next, struct rds_iw_mr, mapping.m_list); + list_del_init(&ibmr->mapping.m_list); + } + spin_unlock_irqrestore(&pool->list_lock, flags); + + return ibmr; +} + +static struct rds_iw_mr *rds_iw_alloc_mr(struct rds_iw_device *rds_iwdev) +{ + struct rds_iw_mr_pool *pool = rds_iwdev->mr_pool; + struct rds_iw_mr *ibmr = NULL; + int err = 0, iter = 0; + + while (1) { + ibmr = rds_iw_reuse_fmr(pool); + if (ibmr) + return ibmr; + + /* No clean MRs - now we have the choice of either + * allocating a fresh MR up to the limit imposed by the + * driver, or flush any dirty unused MRs. + * We try to avoid stalling in the send path if possible, + * so we allocate as long as we're allowed to. + * + * We're fussy with enforcing the FMR limit, though. 
If the driver + * tells us we can't use more than N fmrs, we shouldn't start + * arguing with it */ + if (atomic_inc_return(&pool->item_count) <= pool->max_items) + break; + + atomic_dec(&pool->item_count); + + if (++iter > 2) { + rds_iw_stats_inc(s_iw_rdma_mr_pool_depleted); + return ERR_PTR(-EAGAIN); + } + + /* We do have some empty MRs. Flush them out. */ + rds_iw_stats_inc(s_iw_rdma_mr_pool_wait); + rds_iw_flush_mr_pool(pool, 0); + } + + ibmr = kzalloc(sizeof(*ibmr), GFP_KERNEL); + if (!ibmr) { + err = -ENOMEM; + goto out_no_cigar; + } + + spin_lock_init(&ibmr->mapping.m_lock); + INIT_LIST_HEAD(&ibmr->mapping.m_list); + ibmr->mapping.m_mr = ibmr; + + err = rds_iw_init_fastreg(pool, ibmr); + if (err) + goto out_no_cigar; + + rds_iw_stats_inc(s_iw_rdma_mr_alloc); + return ibmr; + +out_no_cigar: + if (ibmr) { + rds_iw_destroy_fastreg(pool, ibmr); + kfree(ibmr); + } + atomic_dec(&pool->item_count); + return ERR_PTR(err); +} + +void rds_iw_sync_mr(void *trans_private, int direction) +{ + struct rds_iw_mr *ibmr = trans_private; + struct rds_iw_device *rds_iwdev = ibmr->device; + + switch (direction) { + case DMA_FROM_DEVICE: + ib_dma_sync_sg_for_cpu(rds_iwdev->dev, ibmr->mapping.m_sg.list, + ibmr->mapping.m_sg.dma_len, DMA_BIDIRECTIONAL); + break; + case DMA_TO_DEVICE: + ib_dma_sync_sg_for_device(rds_iwdev->dev, ibmr->mapping.m_sg.list, + ibmr->mapping.m_sg.dma_len, DMA_BIDIRECTIONAL); + break; + } +} + +static inline unsigned int rds_iw_flush_goal(struct rds_iw_mr_pool *pool, int free_all) +{ + unsigned int item_count; + + item_count = atomic_read(&pool->item_count); + if (free_all) + return item_count; + + return 0; +} + +/* + * Flush our pool of MRs. + * At a minimum, all currently unused MRs are unmapped. + * If the number of MRs allocated exceeds the limit, we also try + * to free as many MRs as needed to get back to this limit. 
+ */ +static int rds_iw_flush_mr_pool(struct rds_iw_mr_pool *pool, int free_all) +{ + struct rds_iw_mr *ibmr, *next; + LIST_HEAD(unmap_list); + LIST_HEAD(kill_list); + unsigned long flags; + unsigned int nfreed = 0, ncleaned = 0, free_goal; + int ret = 0; + + rds_iw_stats_inc(s_iw_rdma_mr_pool_flush); + + mutex_lock(&pool->flush_lock); + + spin_lock_irqsave(&pool->list_lock, flags); + /* Get the list of all mappings to be destroyed */ + list_splice_init(&pool->dirty_list, &unmap_list); + if (free_all) + list_splice_init(&pool->clean_list, &kill_list); + spin_unlock_irqrestore(&pool->list_lock, flags); + + free_goal = rds_iw_flush_goal(pool, free_all); + + /* Batched invalidate of dirty MRs. + * For FMR based MRs, the mappings on the unmap list are + * actually members of an ibmr (ibmr->mapping). They either + * migrate to the kill_list, or have been cleaned and should be + * moved to the clean_list. + * For fastregs, they will be dynamically allocated, and + * will be destroyed by the unmap function. + */ + if (!list_empty(&unmap_list)) { + ncleaned = rds_iw_unmap_fastreg_list(pool, &unmap_list, &kill_list); + /* If we've been asked to destroy all MRs, move those + * that were simply cleaned to the kill list */ + if (free_all) + list_splice_init(&unmap_list, &kill_list); + } + + /* Destroy any MRs that are past their best before date */ + list_for_each_entry_safe(ibmr, next, &kill_list, mapping.m_list) { + rds_iw_stats_inc(s_iw_rdma_mr_free); + list_del(&ibmr->mapping.m_list); + rds_iw_destroy_fastreg(pool, ibmr); + kfree(ibmr); + nfreed++; + } + + /* Anything that remains are laundered ibmrs, which we can add + * back to the clean list. 
*/ + if (!list_empty(&unmap_list)) { + spin_lock_irqsave(&pool->list_lock, flags); + list_splice(&unmap_list, &pool->clean_list); + spin_unlock_irqrestore(&pool->list_lock, flags); + } + + atomic_sub(ncleaned, &pool->dirty_count); + atomic_sub(nfreed, &pool->item_count); + + mutex_unlock(&pool->flush_lock); + return ret; +} + +static void rds_iw_mr_pool_flush_worker(struct work_struct *work) +{ + struct rds_iw_mr_pool *pool = container_of(work, struct rds_iw_mr_pool, flush_worker); + + rds_iw_flush_mr_pool(pool, 0); +} + +void rds_iw_free_mr(void *trans_private, int invalidate) +{ + struct rds_iw_mr *ibmr = trans_private; + struct rds_iw_mr_pool *pool = ibmr->device->mr_pool; + + rdsdebug("RDS/IW: free_mr nents %u\n", ibmr->mapping.m_sg.len); + if (!pool) + return; + + /* Return it to the pool's free list */ + rds_iw_free_fastreg(pool, ibmr); + + /* If we've pinned too many pages, request a flush */ + if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned + || atomic_read(&pool->dirty_count) >= pool->max_items / 10) + queue_work(rds_wq, &pool->flush_worker); + + if (invalidate) { + if (likely(!in_interrupt())) { + rds_iw_flush_mr_pool(pool, 0); + } else { + /* We get here if the user created a MR marked + * as use_once and invalidate at the same time. 
+ */ + queue_work(rds_wq, &pool->flush_worker); + } + } +} + +void rds_iw_flush_mrs(void) +{ + struct rds_iw_device *rds_iwdev; + + list_for_each_entry(rds_iwdev, &rds_iw_devices, list) { + struct rds_iw_mr_pool *pool = rds_iwdev->mr_pool; + + if (pool) + rds_iw_flush_mr_pool(pool, 0); + } +} + +void *rds_iw_get_mr(struct scatterlist *sg, unsigned long nents, + struct rds_sock *rs, u32 *key_ret) +{ + struct rds_iw_device *rds_iwdev; + struct rds_iw_mr *ibmr = NULL; + struct rdma_cm_id *cm_id; + int ret; + + ret = rds_iw_get_device(rs, &rds_iwdev, &cm_id); + if (ret || !cm_id) { + ret = -ENODEV; + goto out; + } + + if (!rds_iwdev->mr_pool) { + ret = -ENODEV; + goto out; + } + + ibmr = rds_iw_alloc_mr(rds_iwdev); + if (IS_ERR(ibmr)) + return ibmr; + + ibmr->cm_id = cm_id; + ibmr->device = rds_iwdev; + + ret = rds_iw_map_fastreg(rds_iwdev->mr_pool, ibmr, sg, nents); + if (ret == 0) + *key_ret = ibmr->mr->rkey; + else + printk(KERN_WARNING "RDS/IW: failed to map mr (errno=%d)\n", ret); + +out: + if (ret) { + if (ibmr) + rds_iw_free_mr(ibmr, 0); + ibmr = ERR_PTR(ret); + } + return ibmr; +} + +/* + * iWARP fastreg handling + * + * The life cycle of a fastreg registration is a bit different from + * FMRs. + * The idea behind fastreg is to have one MR, to which we bind different + * mappings over time. To avoid stalling on the expensive map and invalidate + * operations, these operations are pipelined on the same send queue on + * which we want to send the message containing the r_key. + * + * This creates a bit of a problem for us, as we do not have the destination + * IP in GET_MR, so the connection must be set up prior to the GET_MR call for + * RDMA to be correctly set up. If a fastreg request is present, rds_iw_xmit + * will try to queue a LOCAL_INV (if needed) and a FAST_REG_MR work request + * before queuing the SEND. When completions for these arrive, they are + * dispatched to the MR, which has a bit set showing that RDMA can be performed.
+ * + * There is another interesting aspect that's related to invalidation. + * The application can request that a mapping is invalidated in FREE_MR. + * The expectation there is that this invalidation step includes ALL + * PREVIOUSLY FREED MRs. + */ +static int rds_iw_init_fastreg(struct rds_iw_mr_pool *pool, + struct rds_iw_mr *ibmr) +{ + struct rds_iw_device *rds_iwdev = pool->device; + struct ib_fast_reg_page_list *page_list = NULL; + struct ib_mr *mr; + int err; + + mr = ib_alloc_fast_reg_mr(rds_iwdev->pd, pool->max_message_size); + if (IS_ERR(mr)) { + err = PTR_ERR(mr); + + printk(KERN_WARNING "RDS/IW: ib_alloc_fast_reg_mr failed (err=%d)\n", err); + return err; + } + + /* FIXME - this is overkill, but mapping->m_sg.dma_len/mapping->m_sg.dma_npages + * is not filled in. + */ + page_list = ib_alloc_fast_reg_page_list(rds_iwdev->dev, pool->max_message_size); + if (IS_ERR(page_list)) { + err = PTR_ERR(page_list); + + printk(KERN_WARNING "RDS/IW: ib_alloc_fast_reg_page_list failed (err=%d)\n", err); + ib_dereg_mr(mr); + return err; + } + + ibmr->page_list = page_list; + ibmr->mr = mr; + return 0; +} + +static int rds_iw_rdma_build_fastreg(struct rds_iw_mapping *mapping) +{ + struct rds_iw_mr *ibmr = mapping->m_mr; + struct ib_send_wr f_wr, *failed_wr; + int ret; + + /* + * Perform a WR for the fast_reg_mr. Each individual page + * in the sg list is added to the fast reg page list and placed + * inside the fast_reg_mr WR. The key used is a rolling 8bit + * counter, which should guarantee uniqueness. 
+ */ + ib_update_fast_reg_key(ibmr->mr, ibmr->remap_count++); + mapping->m_rkey = ibmr->mr->rkey; + + memset(&f_wr, 0, sizeof(f_wr)); + f_wr.wr_id = RDS_IW_FAST_REG_WR_ID; + f_wr.opcode = IB_WR_FAST_REG_MR; + f_wr.wr.fast_reg.length = mapping->m_sg.bytes; + f_wr.wr.fast_reg.rkey = mapping->m_rkey; + f_wr.wr.fast_reg.page_list = ibmr->page_list; + f_wr.wr.fast_reg.page_list_len = mapping->m_sg.dma_len; + f_wr.wr.fast_reg.page_shift = ibmr->device->page_shift; + f_wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE; + f_wr.wr.fast_reg.iova_start = 0; + f_wr.send_flags = IB_SEND_SIGNALED; + + failed_wr = &f_wr; + ret = ib_post_send(ibmr->cm_id->qp, &f_wr, &failed_wr); + BUG_ON(failed_wr != &f_wr); + if (ret && printk_ratelimit()) + printk(KERN_WARNING "RDS/IW: %s:%d ib_post_send returned %d\n", + __func__, __LINE__, ret); + return ret; +} + +static int rds_iw_rdma_fastreg_inv(struct rds_iw_mr *ibmr) +{ + struct ib_send_wr s_wr, *failed_wr; + int ret = 0; + + if (!ibmr->cm_id->qp || !ibmr->mr) + goto out; + + memset(&s_wr, 0, sizeof(s_wr)); + s_wr.wr_id = RDS_IW_LOCAL_INV_WR_ID; + s_wr.opcode = IB_WR_LOCAL_INV; + s_wr.ex.invalidate_rkey = ibmr->mr->rkey; + s_wr.send_flags = IB_SEND_SIGNALED; + + failed_wr = &s_wr; + ret = ib_post_send(ibmr->cm_id->qp, &s_wr, &failed_wr); + if (ret && printk_ratelimit()) { + printk(KERN_WARNING "RDS/IW: %s:%d ib_post_send returned %d\n", + __func__, __LINE__, ret); + goto out; + } +out: + return ret; +} + +static int rds_iw_map_fastreg(struct rds_iw_mr_pool *pool, + struct rds_iw_mr *ibmr, + struct scatterlist *sg, + unsigned int sg_len) +{ + struct rds_iw_device *rds_iwdev = pool->device; + struct rds_iw_mapping *mapping = &ibmr->mapping; + u64 *dma_pages; + int i, ret = 0; + + rds_iw_set_scatterlist(&mapping->m_sg, sg, sg_len); + + dma_pages = rds_iw_map_scatterlist(rds_iwdev, + &mapping->m_sg, + rds_iwdev->page_shift); + if (IS_ERR(dma_pages)) { + ret = PTR_ERR(dma_pages); + 
dma_pages = NULL; + goto out; + } + + if (mapping->m_sg.dma_len > pool->max_message_size) { + ret = -EMSGSIZE; + goto out; + } + + for (i = 0; i < mapping->m_sg.dma_npages; ++i) + ibmr->page_list->page_list[i] = dma_pages[i]; + + ret = rds_iw_rdma_build_fastreg(mapping); + if (ret) + goto out; + + rds_iw_stats_inc(s_iw_rdma_mr_used); + +out: + kfree(dma_pages); + + return ret; +} + +/* + * "Free" a fastreg MR. + */ +static void rds_iw_free_fastreg(struct rds_iw_mr_pool *pool, + struct rds_iw_mr *ibmr) +{ + unsigned long flags; + int ret; + + if (!ibmr->mapping.m_sg.dma_len) + return; + + ret = rds_iw_rdma_fastreg_inv(ibmr); + if (ret) + return; + + /* Try to post the LOCAL_INV WR to the queue. */ + spin_lock_irqsave(&pool->list_lock, flags); + + list_add_tail(&ibmr->mapping.m_list, &pool->dirty_list); + atomic_add(ibmr->mapping.m_sg.len, &pool->free_pinned); + atomic_inc(&pool->dirty_count); + + spin_unlock_irqrestore(&pool->list_lock, flags); +} + +static unsigned int rds_iw_unmap_fastreg_list(struct rds_iw_mr_pool *pool, + struct list_head *unmap_list, + struct list_head *kill_list) +{ + struct rds_iw_mapping *mapping, *next; + unsigned int ncleaned = 0; + LIST_HEAD(laundered); + + /* Batched invalidation of fastreg MRs. + * Why do we do it this way, even though we could pipeline unmap + * and remap? The reason is the application semantics - when the + * application requests an invalidation of MRs, it expects all + * previously released R_Keys to become invalid. + * + * If we implement MR reuse naively, we risk memory corruption + * (this has actually been observed). So the default behavior + * requires that an MR goes through an explicit unmap operation before + * we can reuse it again. + * + * We could probably improve on this a little, by allowing immediate + * reuse of an MR on the same socket (e.g. you could add a small + * cache of unused MRs to struct rds_sock - GET_MR could grab one + * of these without requiring an explicit invalidate).
+ */ + while (!list_empty(unmap_list)) { + unsigned long flags; + + spin_lock_irqsave(&pool->list_lock, flags); + list_for_each_entry_safe(mapping, next, unmap_list, m_list) { + list_move(&mapping->m_list, &laundered); + ncleaned++; + } + spin_unlock_irqrestore(&pool->list_lock, flags); + } + + /* Move all laundered mappings back to the unmap list. + * We do not kill any WRs right now - it doesn't seem the + * fastreg API has a max_remap limit. */ + list_splice_init(&laundered, unmap_list); + + return ncleaned; +} + +static void rds_iw_destroy_fastreg(struct rds_iw_mr_pool *pool, + struct rds_iw_mr *ibmr) +{ + if (ibmr->page_list) + ib_free_fast_reg_page_list(ibmr->page_list); + if (ibmr->mr) + ib_dereg_mr(ibmr->mr); +} diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c new file mode 100644 index 0000000..a1931f0 --- /dev/null +++ b/net/rds/iw_recv.c @@ -0,0 +1,869 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include + +#include "rds.h" +#include "iw.h" + +static struct kmem_cache *rds_iw_incoming_slab; +static struct kmem_cache *rds_iw_frag_slab; +static atomic_t rds_iw_allocation = ATOMIC_INIT(0); + +static void rds_iw_frag_drop_page(struct rds_page_frag *frag) +{ + rdsdebug("frag %p page %p\n", frag, frag->f_page); + __free_page(frag->f_page); + frag->f_page = NULL; +} + +static void rds_iw_frag_free(struct rds_page_frag *frag) +{ + rdsdebug("frag %p page %p\n", frag, frag->f_page); + BUG_ON(frag->f_page != NULL); + kmem_cache_free(rds_iw_frag_slab, frag); +} + +/* + * We map a page at a time. Its fragments are posted in order. This + * is called in fragment order as the fragments get send completion events. + * Only the last frag in the page performs the unmapping. + * + * It's OK for ring cleanup to call this in whatever order it likes because + * DMA is not in flight and so we can unmap while other ring entries still + * hold page references in their frags. 
+ */ +static void rds_iw_recv_unmap_page(struct rds_iw_connection *ic, + struct rds_iw_recv_work *recv) +{ + struct rds_page_frag *frag = recv->r_frag; + + rdsdebug("recv %p frag %p page %p\n", recv, frag, frag->f_page); + if (frag->f_mapped) + ib_dma_unmap_page(ic->i_cm_id->device, + frag->f_mapped, + RDS_FRAG_SIZE, DMA_FROM_DEVICE); + frag->f_mapped = 0; +} + +void rds_iw_recv_init_ring(struct rds_iw_connection *ic) +{ + struct rds_iw_recv_work *recv; + u32 i; + + for (i = 0, recv = ic->i_recvs; i < ic->i_recv_ring.w_nr; i++, recv++) { + struct ib_sge *sge; + + recv->r_iwinc = NULL; + recv->r_frag = NULL; + + recv->r_wr.next = NULL; + recv->r_wr.wr_id = i; + recv->r_wr.sg_list = recv->r_sge; + recv->r_wr.num_sge = RDS_IW_RECV_SGE; + + sge = rds_iw_data_sge(ic, recv->r_sge); + sge->addr = 0; + sge->length = RDS_FRAG_SIZE; + sge->lkey = 0; + + sge = rds_iw_header_sge(ic, recv->r_sge); + sge->addr = ic->i_recv_hdrs_dma + (i * sizeof(struct rds_header)); + sge->length = sizeof(struct rds_header); + sge->lkey = 0; + } +} + +static void rds_iw_recv_clear_one(struct rds_iw_connection *ic, + struct rds_iw_recv_work *recv) +{ + if (recv->r_iwinc) { + rds_inc_put(&recv->r_iwinc->ii_inc); + recv->r_iwinc = NULL; + } + if (recv->r_frag) { + rds_iw_recv_unmap_page(ic, recv); + if (recv->r_frag->f_page) + rds_iw_frag_drop_page(recv->r_frag); + rds_iw_frag_free(recv->r_frag); + recv->r_frag = NULL; + } +} + +void rds_iw_recv_clear_ring(struct rds_iw_connection *ic) +{ + u32 i; + + for (i = 0; i < ic->i_recv_ring.w_nr; i++) + rds_iw_recv_clear_one(ic, &ic->i_recvs[i]); + + if (ic->i_frag.f_page) + rds_iw_frag_drop_page(&ic->i_frag); +} + +static int rds_iw_recv_refill_one(struct rds_connection *conn, + struct rds_iw_recv_work *recv, + gfp_t kptr_gfp, gfp_t page_gfp) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + dma_addr_t dma_addr; + struct ib_sge *sge; + int ret = -ENOMEM; + + if (recv->r_iwinc == NULL) { + if (atomic_read(&rds_iw_allocation) >= 
rds_iw_sysctl_max_recv_allocation) { + rds_iw_stats_inc(s_iw_rx_alloc_limit); + goto out; + } + recv->r_iwinc = kmem_cache_alloc(rds_iw_incoming_slab, + kptr_gfp); + if (recv->r_iwinc == NULL) + goto out; + atomic_inc(&rds_iw_allocation); + INIT_LIST_HEAD(&recv->r_iwinc->ii_frags); + rds_inc_init(&recv->r_iwinc->ii_inc, conn, conn->c_faddr); + } + + if (recv->r_frag == NULL) { + recv->r_frag = kmem_cache_alloc(rds_iw_frag_slab, kptr_gfp); + if (recv->r_frag == NULL) + goto out; + INIT_LIST_HEAD(&recv->r_frag->f_item); + recv->r_frag->f_page = NULL; + } + + if (ic->i_frag.f_page == NULL) { + ic->i_frag.f_page = alloc_page(page_gfp); + if (ic->i_frag.f_page == NULL) + goto out; + ic->i_frag.f_offset = 0; + } + + dma_addr = ib_dma_map_page(ic->i_cm_id->device, + ic->i_frag.f_page, + ic->i_frag.f_offset, + RDS_FRAG_SIZE, + DMA_FROM_DEVICE); + if (ib_dma_mapping_error(ic->i_cm_id->device, dma_addr)) + goto out; + + /* + * Once we get the RDS_PAGE_LAST_OFF frag then rds_iw_frag_unmap() + * must be called on this recv. This happens as completions hit + * in order or on connection shutdown. + */ + recv->r_frag->f_page = ic->i_frag.f_page; + recv->r_frag->f_offset = ic->i_frag.f_offset; + recv->r_frag->f_mapped = dma_addr; + + sge = rds_iw_data_sge(ic, recv->r_sge); + sge->addr = dma_addr; + sge->length = RDS_FRAG_SIZE; + + sge = rds_iw_header_sge(ic, recv->r_sge); + sge->addr = ic->i_recv_hdrs_dma + (recv - ic->i_recvs) * sizeof(struct rds_header); + sge->length = sizeof(struct rds_header); + + get_page(recv->r_frag->f_page); + + if (ic->i_frag.f_offset < RDS_PAGE_LAST_OFF) { + ic->i_frag.f_offset += RDS_FRAG_SIZE; + } else { + put_page(ic->i_frag.f_page); + ic->i_frag.f_page = NULL; + ic->i_frag.f_offset = 0; + } + + ret = 0; +out: + return ret; +} + +/* + * This tries to allocate and post unused work requests after making sure that + * they have all the allocations they need to queue received fragments into + * sockets. 
The i_recv_mutex is held here so that ring_alloc and _unalloc + * pairs don't go unmatched. + * + * -1 is returned if posting fails due to temporary resource exhaustion. + */ +int rds_iw_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, + gfp_t page_gfp, int prefill) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + struct rds_iw_recv_work *recv; + struct ib_recv_wr *failed_wr; + unsigned int posted = 0; + int ret = 0; + u32 pos; + + while ((prefill || rds_conn_up(conn)) + && rds_iw_ring_alloc(&ic->i_recv_ring, 1, &pos)) { + if (pos >= ic->i_recv_ring.w_nr) { + printk(KERN_NOTICE "Argh - ring alloc returned pos=%u\n", + pos); + ret = -EINVAL; + break; + } + + recv = &ic->i_recvs[pos]; + ret = rds_iw_recv_refill_one(conn, recv, kptr_gfp, page_gfp); + if (ret) { + ret = -1; + break; + } + + /* XXX when can this fail? */ + ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr); + rdsdebug("recv %p iwinc %p page %p addr %lu ret %d\n", recv, + recv->r_iwinc, recv->r_frag->f_page, + (long) recv->r_frag->f_mapped, ret); + if (ret) { + rds_iw_conn_error(conn, "recv post on " + "%pI4 returned %d, disconnecting and " + "reconnecting\n", &conn->c_faddr, + ret); + ret = -1; + break; + } + + posted++; + } + + /* We're doing flow control - update the window. 
*/ + if (ic->i_flowctl && posted) + rds_iw_advertise_credits(conn, posted); + + if (ret) + rds_iw_ring_unalloc(&ic->i_recv_ring, 1); + return ret; +} + +void rds_iw_inc_purge(struct rds_incoming *inc) +{ + struct rds_iw_incoming *iwinc; + struct rds_page_frag *frag; + struct rds_page_frag *pos; + + iwinc = container_of(inc, struct rds_iw_incoming, ii_inc); + rdsdebug("purging iwinc %p inc %p\n", iwinc, inc); + + list_for_each_entry_safe(frag, pos, &iwinc->ii_frags, f_item) { + list_del_init(&frag->f_item); + rds_iw_frag_drop_page(frag); + rds_iw_frag_free(frag); + } +} + +void rds_iw_inc_free(struct rds_incoming *inc) +{ + struct rds_iw_incoming *iwinc; + + iwinc = container_of(inc, struct rds_iw_incoming, ii_inc); + + rds_iw_inc_purge(inc); + rdsdebug("freeing iwinc %p inc %p\n", iwinc, inc); + BUG_ON(!list_empty(&iwinc->ii_frags)); + kmem_cache_free(rds_iw_incoming_slab, iwinc); + atomic_dec(&rds_iw_allocation); + BUG_ON(atomic_read(&rds_iw_allocation) < 0); +} + +int rds_iw_inc_copy_to_user(struct rds_incoming *inc, struct iovec *first_iov, + size_t size) +{ + struct rds_iw_incoming *iwinc; + struct rds_page_frag *frag; + struct iovec *iov = first_iov; + unsigned long to_copy; + unsigned long frag_off = 0; + unsigned long iov_off = 0; + int copied = 0; + int ret; + u32 len; + + iwinc = container_of(inc, struct rds_iw_incoming, ii_inc); + frag = list_entry(iwinc->ii_frags.next, struct rds_page_frag, f_item); + len = be32_to_cpu(inc->i_hdr.h_len); + + while (copied < size && copied < len) { + if (frag_off == RDS_FRAG_SIZE) { + frag = list_entry(frag->f_item.next, + struct rds_page_frag, f_item); + frag_off = 0; + } + while (iov_off == iov->iov_len) { + iov_off = 0; + iov++; + } + + to_copy = min(iov->iov_len - iov_off, RDS_FRAG_SIZE - frag_off); + to_copy = min_t(size_t, to_copy, size - copied); + to_copy = min_t(unsigned long, to_copy, len - copied); + + rdsdebug("%lu bytes to user [%p, %zu] + %lu from frag " + "[%p, %lu] + %lu\n", + to_copy, iov->iov_base, 
iov->iov_len, iov_off, + frag->f_page, frag->f_offset, frag_off); + + /* XXX needs + offset for multiple recvs per page */ + ret = rds_page_copy_to_user(frag->f_page, + frag->f_offset + frag_off, + iov->iov_base + iov_off, + to_copy); + if (ret) { + copied = ret; + break; + } + + iov_off += to_copy; + frag_off += to_copy; + copied += to_copy; + } + + return copied; +} + +/* ic starts out kzalloc()ed */ +void rds_iw_recv_init_ack(struct rds_iw_connection *ic) +{ + struct ib_send_wr *wr = &ic->i_ack_wr; + struct ib_sge *sge = &ic->i_ack_sge; + + sge->addr = ic->i_ack_dma; + sge->length = sizeof(struct rds_header); + sge->lkey = rds_iw_local_dma_lkey(ic); + + wr->sg_list = sge; + wr->num_sge = 1; + wr->opcode = IB_WR_SEND; + wr->wr_id = RDS_IW_ACK_WR_ID; + wr->send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; +} + +/* + * You'd think that with reliable IB connections you wouldn't need to ack + * messages that have been received. The problem is that IB hardware generates + * an ack message before it has DMAed the message into memory. This creates a + * potential message loss if the HCA is disabled for any reason between when it + * sends the ack and before the message is DMAed and processed. This is only a + * potential issue if another HCA is available for fail-over. + * + * When the remote host receives our ack they'll free the sent message from + * their send queue. To decrease the latency of this we always send an ack + * immediately after we've received messages. + * + * For simplicity, we only have one ack in flight at a time. This puts + * pressure on senders to have deep enough send queues to absorb the latency of + * a single ack frame being in flight. This might not be good enough. + * + * This is implemented by having a long-lived send_wr and sge which point to a + * statically allocated ack frame. This ack wr does not fall under the ring + * accounting that the tx and rx wrs do. The QP attribute specifically makes + * room for it beyond the ring size.
Send completion notices its special + * wr_id and avoids working with the ring in that case. + */ +static void rds_iw_set_ack(struct rds_iw_connection *ic, u64 seq, + int ack_required) +{ + rds_iw_set_64bit(&ic->i_ack_next, seq); + if (ack_required) { + smp_mb__before_clear_bit(); + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + } +} + +static u64 rds_iw_get_ack(struct rds_iw_connection *ic) +{ + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + smp_mb__after_clear_bit(); + + return ic->i_ack_next; +} + +static void rds_iw_send_ack(struct rds_iw_connection *ic, unsigned int adv_credits) +{ + struct rds_header *hdr = ic->i_ack; + struct ib_send_wr *failed_wr; + u64 seq; + int ret; + + seq = rds_iw_get_ack(ic); + + rdsdebug("send_ack: ic %p ack %llu\n", ic, (unsigned long long) seq); + rds_message_populate_header(hdr, 0, 0, 0); + hdr->h_ack = cpu_to_be64(seq); + hdr->h_credit = adv_credits; + rds_message_make_checksum(hdr); + ic->i_ack_queued = jiffies; + + ret = ib_post_send(ic->i_cm_id->qp, &ic->i_ack_wr, &failed_wr); + if (unlikely(ret)) { + /* Failed to send. Release the WR, and + * force another ACK. + */ + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + + rds_iw_stats_inc(s_iw_ack_send_failure); + /* Need to finesse this later. */ + BUG(); + } else + rds_iw_stats_inc(s_iw_ack_sent); +} + +/* + * There are 3 ways of getting acknowledgements to the peer: + * 1. We call rds_iw_attempt_ack from the recv completion handler + * to send an ACK-only frame. + * However, there can be only one such frame in the send queue + * at any time, so we may have to postpone it. + * 2. When another (data) packet is transmitted while there's + * an ACK in the queue, we piggyback the ACK sequence number + * on the data packet. + * 3. If the ACK WR is done sending, we get called from the + * send queue completion handler, and check whether there's + * another ACK pending (postponed because the WR was on the + * queue). 
If so, we transmit it. + * + * We maintain 2 variables: + * - i_ack_flags, which keeps track of whether the ACK WR + * is currently in the send queue or not (IB_ACK_IN_FLIGHT) + * - i_ack_next, which is the last sequence number we received + * + * Potentially, send queue and receive queue handlers can run concurrently. + * + * Reconnecting complicates this picture just slightly. When we + * reconnect, we may be seeing duplicate packets. The peer + * is retransmitting them, because it hasn't seen an ACK for + * them. It is important that we ACK these. + * + * ACK mitigation adds a header flag "ACK_REQUIRED"; any packet with + * this flag set *MUST* be acknowledged immediately. + */ + +/* + * When we get here, we're called from the recv queue handler. + * Check whether we ought to transmit an ACK. + */ +void rds_iw_attempt_ack(struct rds_iw_connection *ic) +{ + unsigned int adv_credits; + + if (!test_bit(IB_ACK_REQUESTED, &ic->i_ack_flags)) + return; + + if (test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { + rds_iw_stats_inc(s_iw_ack_send_delayed); + return; + } + + /* Can we get a send credit? */ + if (!rds_iw_send_grab_credits(ic, 1, &adv_credits, 0)) { + rds_iw_stats_inc(s_iw_tx_throttle); + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + return; + } + + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + rds_iw_send_ack(ic, adv_credits); +} + +/* + * We get here from the send completion handler, when the + * adapter tells us the ACK frame was sent. + */ +void rds_iw_ack_send_complete(struct rds_iw_connection *ic) +{ + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + rds_iw_attempt_ack(ic); +} + +/* + * This is called by the regular xmit code when it wants to piggyback + * an ACK on an outgoing frame. 
+ */ +u64 rds_iw_piggyb_ack(struct rds_iw_connection *ic) +{ + if (test_and_clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags)) + rds_iw_stats_inc(s_iw_ack_send_piggybacked); + return rds_iw_get_ack(ic); +} + +/* + * It's kind of lame that we're copying from the posted receive pages into + * long-lived bitmaps. We could have posted the bitmaps and rdma written into + * them. But receiving new congestion bitmaps should be a *rare* event, so + * hopefully we won't need to invest that complexity in making it more + * efficient. By copying we can share a simpler core with TCP which has to + * copy. + */ +static void rds_iw_cong_recv(struct rds_connection *conn, + struct rds_iw_incoming *iwinc) +{ + struct rds_cong_map *map; + unsigned int map_off; + unsigned int map_page; + struct rds_page_frag *frag; + unsigned long frag_off; + unsigned long to_copy; + unsigned long copied; + uint64_t uncongested = 0; + void *addr; + + /* catch completely corrupt packets */ + if (be32_to_cpu(iwinc->ii_inc.i_hdr.h_len) != RDS_CONG_MAP_BYTES) + return; + + map = conn->c_fcong; + map_page = 0; + map_off = 0; + + frag = list_entry(iwinc->ii_frags.next, struct rds_page_frag, f_item); + frag_off = 0; + + copied = 0; + + while (copied < RDS_CONG_MAP_BYTES) { + uint64_t *src, *dst; + unsigned int k; + + to_copy = min(RDS_FRAG_SIZE - frag_off, PAGE_SIZE - map_off); + BUG_ON(to_copy & 7); /* Must be 64bit aligned. */ + + addr = kmap_atomic(frag->f_page, KM_SOFTIRQ0); + + src = addr + frag_off; + dst = (void *)map->m_page_addrs[map_page] + map_off; + for (k = 0; k < to_copy; k += 8) { + /* Record ports that became uncongested, i.e. + * bits that changed from 1 (old map) to 0 (new map).
*/ + uncongested |= ~(*src) & *dst; + *dst++ = *src++; + } + kunmap_atomic(addr, KM_SOFTIRQ0); + + copied += to_copy; + + map_off += to_copy; + if (map_off == PAGE_SIZE) { + map_off = 0; + map_page++; + } + + frag_off += to_copy; + if (frag_off == RDS_FRAG_SIZE) { + frag = list_entry(frag->f_item.next, + struct rds_page_frag, f_item); + frag_off = 0; + } + } + + /* the congestion map is in little endian order */ + uncongested = le64_to_cpu(uncongested); + + rds_cong_map_updated(map, uncongested); +} + +/* + * Rings are posted with all the allocations they'll need to queue the + * incoming message to the receiving socket so this can't fail. + * All fragments start with a header, so we can make sure we're not receiving + * garbage, and we can tell a small 8 byte fragment from an ACK frame. + */ +struct rds_iw_ack_state { + u64 ack_next; + u64 ack_recv; + unsigned int ack_required:1; + unsigned int ack_next_valid:1; + unsigned int ack_recv_valid:1; +}; + +static void rds_iw_process_recv(struct rds_connection *conn, + struct rds_iw_recv_work *recv, u32 byte_len, + struct rds_iw_ack_state *state) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + struct rds_iw_incoming *iwinc = ic->i_iwinc; + struct rds_header *ihdr, *hdr; + + /* XXX shut down the connection if port 0,0 are seen? */ + + rdsdebug("ic %p iwinc %p recv %p byte len %u\n", ic, iwinc, recv, + byte_len); + + if (byte_len < sizeof(struct rds_header)) { + rds_iw_conn_error(conn, "incoming message " + "from %pI4 didn't include a " + "header, disconnecting and " + "reconnecting\n", + &conn->c_faddr); + return; + } + byte_len -= sizeof(struct rds_header); + + ihdr = &ic->i_recv_hdrs[recv - ic->i_recvs]; + + /* Validate the checksum.
*/ + if (!rds_message_verify_checksum(ihdr)) { + rds_iw_conn_error(conn, "incoming message " + "from %pI4 has corrupted header - " + "forcing a reconnect\n", + &conn->c_faddr); + rds_stats_inc(s_recv_drop_bad_checksum); + return; + } + + /* Process the ACK sequence which comes with every packet */ + state->ack_recv = be64_to_cpu(ihdr->h_ack); + state->ack_recv_valid = 1; + + /* Process the credits update if there was one */ + if (ihdr->h_credit) + rds_iw_send_add_credits(conn, ihdr->h_credit); + + if (ihdr->h_sport == 0 && ihdr->h_dport == 0 && byte_len == 0) { + /* This is an ACK-only packet. It gets special + * treatment here because, historically, ACKs + * were rather special beasts. + */ + rds_iw_stats_inc(s_iw_ack_received); + + /* + * Usually the frags make their way onto incs and are then freed as + * the inc is freed. We don't go that route, so we have to drop the + * page ref ourselves. We can't just leave the page on the recv + * because that confuses the dma mapping of pages and each recv's use + * of a partial page. We can leave the frag, though; it will be + * reused. + * + * FIXME: Fold this into the code path below. + */ + rds_iw_frag_drop_page(recv->r_frag); + return; + } + + /* + * If we don't already have an inc on the connection then this + * fragment has a header and starts a message; copy its header + * into the inc and save the inc so we can hang upcoming fragments + * off its list.
+ */ + if (iwinc == NULL) { + iwinc = recv->r_iwinc; + recv->r_iwinc = NULL; + ic->i_iwinc = iwinc; + + hdr = &iwinc->ii_inc.i_hdr; + memcpy(hdr, ihdr, sizeof(*hdr)); + ic->i_recv_data_rem = be32_to_cpu(hdr->h_len); + + rdsdebug("ic %p iwinc %p rem %u flag 0x%x\n", ic, iwinc, + ic->i_recv_data_rem, hdr->h_flags); + } else { + hdr = &iwinc->ii_inc.i_hdr; + /* We can't just use memcmp here; fragments of a + * single message may carry different ACKs */ + if (hdr->h_sequence != ihdr->h_sequence + || hdr->h_len != ihdr->h_len + || hdr->h_sport != ihdr->h_sport + || hdr->h_dport != ihdr->h_dport) { + rds_iw_conn_error(conn, + "fragment header mismatch; forcing reconnect\n"); + return; + } + } + + list_add_tail(&recv->r_frag->f_item, &iwinc->ii_frags); + recv->r_frag = NULL; + + if (ic->i_recv_data_rem > RDS_FRAG_SIZE) + ic->i_recv_data_rem -= RDS_FRAG_SIZE; + else { + ic->i_recv_data_rem = 0; + ic->i_iwinc = NULL; + + if (iwinc->ii_inc.i_hdr.h_flags == RDS_FLAG_CONG_BITMAP) + rds_iw_cong_recv(conn, iwinc); + else { + rds_recv_incoming(conn, conn->c_faddr, conn->c_laddr, + &iwinc->ii_inc, GFP_ATOMIC, + KM_SOFTIRQ0); + state->ack_next = be64_to_cpu(hdr->h_sequence); + state->ack_next_valid = 1; + } + + /* Evaluate the ACK_REQUIRED flag *after* we received + * the complete frame, and after bumping the next_rx + * sequence. */ + if (hdr->h_flags & RDS_FLAG_ACK_REQUIRED) { + rds_stats_inc(s_recv_ack_required); + state->ack_required = 1; + } + + rds_inc_put(&iwinc->ii_inc); + } +} + +/* + * Plucking the oldest entry from the ring can be done concurrently with + * the thread refilling the ring. Each ring operation is protected by + * spinlocks and the transient state of refilling doesn't change the + * recording of which entry is oldest. + * + * This relies on IB only calling one cq comp_handler for each cq so that + * there will only be one caller of rds_recv_incoming() per RDS connection. 
+ */ +void rds_iw_recv_cq_comp_handler(struct ib_cq *cq, void *context) +{ + struct rds_connection *conn = context; + struct rds_iw_connection *ic = conn->c_transport_data; + struct ib_wc wc; + struct rds_iw_ack_state state = { 0, }; + struct rds_iw_recv_work *recv; + + rdsdebug("conn %p cq %p\n", conn, cq); + + rds_iw_stats_inc(s_iw_rx_cq_call); + + ib_req_notify_cq(cq, IB_CQ_SOLICITED); + + while (ib_poll_cq(cq, 1, &wc) > 0) { + rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", + (unsigned long long)wc.wr_id, wc.status, wc.byte_len, + be32_to_cpu(wc.ex.imm_data)); + rds_iw_stats_inc(s_iw_rx_cq_event); + + recv = &ic->i_recvs[rds_iw_ring_oldest(&ic->i_recv_ring)]; + + rds_iw_recv_unmap_page(ic, recv); + + /* + * Also process recvs in connecting state because it is possible + * to get a recv completion _before_ the rdmacm ESTABLISHED + * event is processed. + */ + if (rds_conn_up(conn) || rds_conn_connecting(conn)) { + /* We expect errors as the qp is drained during shutdown */ + if (wc.status == IB_WC_SUCCESS) { + rds_iw_process_recv(conn, recv, wc.byte_len, &state); + } else { + rds_iw_conn_error(conn, "recv completion on " + "%pI4 had status %u, disconnecting and " + "reconnecting\n", &conn->c_faddr, + wc.status); + } + } + + rds_iw_ring_free(&ic->i_recv_ring, 1); + } + + if (state.ack_next_valid) + rds_iw_set_ack(ic, state.ack_next, state.ack_required); + if (state.ack_recv_valid && state.ack_recv > ic->i_ack_recv) { + rds_send_drop_acked(conn, state.ack_recv, NULL); + ic->i_ack_recv = state.ack_recv; + } + if (rds_conn_up(conn)) + rds_iw_attempt_ack(ic); + + /* If we ever end up with a really empty receive ring, we're + * in deep trouble, as the sender will definitely see RNR + * timeouts. */ + if (rds_iw_ring_empty(&ic->i_recv_ring)) + rds_iw_stats_inc(s_iw_rx_ring_empty); + + /* + * If the ring is running low, then schedule the thread to refill. 
+ */ + if (rds_iw_ring_low(&ic->i_recv_ring)) + queue_delayed_work(rds_wq, &conn->c_recv_w, 0); +} + +int rds_iw_recv(struct rds_connection *conn) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + int ret = 0; + + rdsdebug("conn %p\n", conn); + + /* + * If we get a temporary posting failure in this context then + * we're really low and we want the caller to back off for a bit. + */ + mutex_lock(&ic->i_recv_mutex); + if (rds_iw_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 0)) + ret = -ENOMEM; + else + rds_iw_stats_inc(s_iw_rx_refill_from_thread); + mutex_unlock(&ic->i_recv_mutex); + + if (rds_conn_up(conn)) + rds_iw_attempt_ack(ic); + + return ret; +} + +int __init rds_iw_recv_init(void) +{ + struct sysinfo si; + int ret = -ENOMEM; + + /* Default to 30% of all available RAM for recv memory */ + si_meminfo(&si); + rds_iw_sysctl_max_recv_allocation = si.totalram / 3 * PAGE_SIZE / RDS_FRAG_SIZE; + + rds_iw_incoming_slab = kmem_cache_create("rds_iw_incoming", + sizeof(struct rds_iw_incoming), + 0, 0, NULL); + if (rds_iw_incoming_slab == NULL) + goto out; + + rds_iw_frag_slab = kmem_cache_create("rds_iw_frag", + sizeof(struct rds_page_frag), + 0, 0, NULL); + if (rds_iw_frag_slab == NULL) + kmem_cache_destroy(rds_iw_incoming_slab); + else + ret = 0; +out: + return ret; +} + +void rds_iw_recv_exit(void) +{ + kmem_cache_destroy(rds_iw_incoming_slab); + kmem_cache_destroy(rds_iw_frag_slab); +} diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c new file mode 100644 index 0000000..d422d4b --- /dev/null +++ b/net/rds/iw_ring.c @@ -0,0 +1,169 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include + +#include "rds.h" +#include "iw.h" + +/* + * Locking for IB rings. + * We assume that allocation is always protected by a mutex + * in the caller (this is a valid assumption for the current + * implementation). + * + * Freeing always happens in an interrupt, and hence only + * races with allocations, but not with other free()s. + * + * The interaction between allocation and freeing is that + * the alloc code has to determine the number of free entries. + * To this end, we maintain two counters; an allocation counter + * and a free counter. Both are allowed to run freely, and wrap + * around. + * The number of used entries is always (alloc_ctr - free_ctr) % NR. + * + * The current implementation makes free_ctr atomic. 
When the + * caller finds an allocation fails, it should set an "alloc fail" + * bit and retry the allocation. The "alloc fail" bit essentially tells + * the CQ completion handlers to wake it up after freeing some + * more entries. + */ + +/* + * This only happens on shutdown. + */ +DECLARE_WAIT_QUEUE_HEAD(rds_iw_ring_empty_wait); + +void rds_iw_ring_init(struct rds_iw_work_ring *ring, u32 nr) +{ + memset(ring, 0, sizeof(*ring)); + ring->w_nr = nr; + rdsdebug("ring %p nr %u\n", ring, ring->w_nr); +} + +static inline u32 __rds_iw_ring_used(struct rds_iw_work_ring *ring) +{ + u32 diff; + + /* This assumes that atomic_t has at least as many bits as u32 */ + diff = ring->w_alloc_ctr - (u32) atomic_read(&ring->w_free_ctr); + BUG_ON(diff > ring->w_nr); + + return diff; +} + +void rds_iw_ring_resize(struct rds_iw_work_ring *ring, u32 nr) +{ + /* We only ever get called from the connection setup code, + * prior to creating the QP. */ + BUG_ON(__rds_iw_ring_used(ring)); + ring->w_nr = nr; +} + +static int __rds_iw_ring_empty(struct rds_iw_work_ring *ring) +{ + return __rds_iw_ring_used(ring) == 0; +} + +u32 rds_iw_ring_alloc(struct rds_iw_work_ring *ring, u32 val, u32 *pos) +{ + u32 ret = 0, avail; + + avail = ring->w_nr - __rds_iw_ring_used(ring); + + rdsdebug("ring %p val %u next %u free %u\n", ring, val, + ring->w_alloc_ptr, avail); + + if (val && avail) { + ret = min(val, avail); + *pos = ring->w_alloc_ptr; + + ring->w_alloc_ptr = (ring->w_alloc_ptr + ret) % ring->w_nr; + ring->w_alloc_ctr += ret; + } + + return ret; +} + +void rds_iw_ring_free(struct rds_iw_work_ring *ring, u32 val) +{ + ring->w_free_ptr = (ring->w_free_ptr + val) % ring->w_nr; + atomic_add(val, &ring->w_free_ctr); + + if (__rds_iw_ring_empty(ring) && + waitqueue_active(&rds_iw_ring_empty_wait)) + wake_up(&rds_iw_ring_empty_wait); +} + +void rds_iw_ring_unalloc(struct rds_iw_work_ring *ring, u32 val) +{ + ring->w_alloc_ptr = (ring->w_alloc_ptr - val) % ring->w_nr; + ring->w_alloc_ctr -= val; +} + +int 
rds_iw_ring_empty(struct rds_iw_work_ring *ring) +{ + return __rds_iw_ring_empty(ring); +} + +int rds_iw_ring_low(struct rds_iw_work_ring *ring) +{ + return __rds_iw_ring_used(ring) <= (ring->w_nr >> 2); +} + + +/* + * returns the oldest alloced ring entry. This will be the next one + * freed. This can't be called if there are none allocated. + */ +u32 rds_iw_ring_oldest(struct rds_iw_work_ring *ring) +{ + return ring->w_free_ptr; +} + +/* + * returns the number of completed work requests. + */ + +u32 rds_iw_ring_completed(struct rds_iw_work_ring *ring, u32 wr_id, u32 oldest) +{ + u32 ret; + + if (oldest <= (unsigned long long)wr_id) + ret = (unsigned long long)wr_id - oldest + 1; + else + ret = ring->w_nr - oldest + (unsigned long long)wr_id + 1; + + rdsdebug("ring %p ret %u wr_id %u oldest %u\n", ring, ret, + wr_id, oldest); + return ret; +} diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c new file mode 100644 index 0000000..22dd38f --- /dev/null +++ b/net/rds/iw_send.c @@ -0,0 +1,975 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include + +#include "rds.h" +#include "rdma.h" +#include "iw.h" + +static void rds_iw_send_rdma_complete(struct rds_message *rm, + int wc_status) +{ + int notify_status; + + switch (wc_status) { + case IB_WC_WR_FLUSH_ERR: + return; + + case IB_WC_SUCCESS: + notify_status = RDS_RDMA_SUCCESS; + break; + + case IB_WC_REM_ACCESS_ERR: + notify_status = RDS_RDMA_REMOTE_ERROR; + break; + + default: + notify_status = RDS_RDMA_OTHER_ERROR; + break; + } + rds_rdma_send_complete(rm, notify_status); +} + +static void rds_iw_send_unmap_rdma(struct rds_iw_connection *ic, + struct rds_rdma_op *op) +{ + if (op->r_mapped) { + ib_dma_unmap_sg(ic->i_cm_id->device, + op->r_sg, op->r_nents, + op->r_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE); + op->r_mapped = 0; + } +} + +static void rds_iw_send_unmap_rm(struct rds_iw_connection *ic, + struct rds_iw_send_work *send, + int wc_status) +{ + struct rds_message *rm = send->s_rm; + + rdsdebug("ic %p send %p rm %p\n", ic, send, rm); + + ib_dma_unmap_sg(ic->i_cm_id->device, + rm->m_sg, rm->m_nents, + DMA_TO_DEVICE); + + if (rm->m_rdma_op != NULL) { + rds_iw_send_unmap_rdma(ic, rm->m_rdma_op); + + /* If the user asked for a completion notification on this + * message, we can implement three different semantics: + * 1. Notify when we received the ACK on the RDS message + * that was queued with the RDMA. This provides reliable + * notification of RDMA status at the expense of a one-way + * packet delay. + * 2. 
Notify when the IB stack gives us the completion event for + * the RDMA operation. + * 3. Notify when the IB stack gives us the completion event for + * the accompanying RDS messages. + * Here, we implement approach #3. To implement approach #2, + * call rds_rdma_send_complete from the cq_handler. To implement #1, + * don't call rds_rdma_send_complete at all, and fall back to the notify + * handling in the ACK processing code. + * + * Note: There's no need to explicitly sync any RDMA buffers using + * ib_dma_sync_sg_for_cpu - the completion for the RDMA + * operation itself unmapped the RDMA buffers, which takes care + * of synching. + */ + rds_iw_send_rdma_complete(rm, wc_status); + + if (rm->m_rdma_op->r_write) + rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes); + else + rds_stats_add(s_recv_rdma_bytes, rm->m_rdma_op->r_bytes); + } + + /* If anyone waited for this message to get flushed out, wake + * them up now */ + rds_message_unmapped(rm); + + rds_message_put(rm); + send->s_rm = NULL; +} + +void rds_iw_send_init_ring(struct rds_iw_connection *ic) +{ + struct rds_iw_send_work *send; + u32 i; + + for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) { + struct ib_sge *sge; + + send->s_rm = NULL; + send->s_op = NULL; + send->s_mapping = NULL; + + send->s_wr.next = NULL; + send->s_wr.wr_id = i; + send->s_wr.sg_list = send->s_sge; + send->s_wr.num_sge = 1; + send->s_wr.opcode = IB_WR_SEND; + send->s_wr.send_flags = 0; + send->s_wr.ex.imm_data = 0; + + sge = rds_iw_data_sge(ic, send->s_sge); + sge->lkey = 0; + + sge = rds_iw_header_sge(ic, send->s_sge); + sge->addr = ic->i_send_hdrs_dma + (i * sizeof(struct rds_header)); + sge->length = sizeof(struct rds_header); + sge->lkey = 0; + + send->s_mr = ib_alloc_fast_reg_mr(ic->i_pd, fastreg_message_size); + if (IS_ERR(send->s_mr)) { + printk(KERN_WARNING "RDS/IW: ib_alloc_fast_reg_mr failed\n"); + break; + } + + send->s_page_list = ib_alloc_fast_reg_page_list( + ic->i_cm_id->device, 
fastreg_message_size); + if (IS_ERR(send->s_page_list)) { + printk(KERN_WARNING "RDS/IW: ib_alloc_fast_reg_page_list failed\n"); + break; + } + } +} + +void rds_iw_send_clear_ring(struct rds_iw_connection *ic) +{ + struct rds_iw_send_work *send; + u32 i; + + for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) { + BUG_ON(!send->s_mr); + ib_dereg_mr(send->s_mr); + BUG_ON(!send->s_page_list); + ib_free_fast_reg_page_list(send->s_page_list); + if (send->s_wr.opcode == 0xdead) + continue; + if (send->s_rm) + rds_iw_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR); + if (send->s_op) + rds_iw_send_unmap_rdma(ic, send->s_op); + } +} + +/* + * The _oldest/_free ring operations here race cleanly with the alloc/unalloc + * operations performed in the send path. As the sender allocs and potentially + * unallocs the next free entry in the ring it doesn't alter which is + * the next to be freed, which is what this is concerned with. + */ +void rds_iw_send_cq_comp_handler(struct ib_cq *cq, void *context) +{ + struct rds_connection *conn = context; + struct rds_iw_connection *ic = conn->c_transport_data; + struct ib_wc wc; + struct rds_iw_send_work *send; + u32 completed; + u32 oldest; + u32 i; + int ret; + + rdsdebug("cq %p conn %p\n", cq, conn); + rds_iw_stats_inc(s_iw_tx_cq_call); + ret = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + if (ret) + rdsdebug("ib_req_notify_cq send failed: %d\n", ret); + + while (ib_poll_cq(cq, 1, &wc) > 0) { + rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", + (unsigned long long)wc.wr_id, wc.status, wc.byte_len, + be32_to_cpu(wc.ex.imm_data)); + rds_iw_stats_inc(s_iw_tx_cq_event); + + if (wc.status != IB_WC_SUCCESS) { + printk(KERN_ERR "WC Error: status = %d opcode = %d\n", wc.status, wc.opcode); + break; + } + + if (wc.opcode == IB_WC_LOCAL_INV && wc.wr_id == RDS_IW_LOCAL_INV_WR_ID) { + ic->i_fastreg_posted = 0; + continue; + } + + if (wc.opcode == IB_WC_FAST_REG_MR && wc.wr_id == RDS_IW_FAST_REG_WR_ID) { + 
ic->i_fastreg_posted = 1; + continue; + } + + if (wc.wr_id == RDS_IW_ACK_WR_ID) { + if (ic->i_ack_queued + HZ/2 < jiffies) + rds_iw_stats_inc(s_iw_tx_stalled); + rds_iw_ack_send_complete(ic); + continue; + } + + oldest = rds_iw_ring_oldest(&ic->i_send_ring); + + completed = rds_iw_ring_completed(&ic->i_send_ring, wc.wr_id, oldest); + + for (i = 0; i < completed; i++) { + send = &ic->i_sends[oldest]; + + /* In the error case, wc.opcode sometimes contains garbage */ + switch (send->s_wr.opcode) { + case IB_WR_SEND: + if (send->s_rm) + rds_iw_send_unmap_rm(ic, send, wc.status); + break; + case IB_WR_FAST_REG_MR: + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_READ: + case IB_WR_RDMA_READ_WITH_INV: + /* Nothing to be done - the SG list will be unmapped + * when the SEND completes. */ + break; + default: + if (printk_ratelimit()) + printk(KERN_NOTICE + "RDS/IW: %s: unexpected opcode 0x%x in WR!\n", + __func__, send->s_wr.opcode); + break; + } + + send->s_wr.opcode = 0xdead; + send->s_wr.num_sge = 1; + if (send->s_queued + HZ/2 < jiffies) + rds_iw_stats_inc(s_iw_tx_stalled); + + /* If an RDMA operation produced an error, signal this right + * away. If we don't, the subsequent SEND that goes with this + * RDMA will be canceled with ERR_WFLUSH, and the application + * will never learn that the RDMA failed.
*/ + if (unlikely(wc.status == IB_WC_REM_ACCESS_ERR && send->s_op)) { + struct rds_message *rm; + + rm = rds_send_get_message(conn, send->s_op); + if (rm) + rds_iw_send_rdma_complete(rm, wc.status); + } + + oldest = (oldest + 1) % ic->i_send_ring.w_nr; + } + + rds_iw_ring_free(&ic->i_send_ring, completed); + + if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags) + || test_bit(0, &conn->c_map_queued)) + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + + /* We expect errors as the qp is drained during shutdown */ + if (wc.status != IB_WC_SUCCESS && rds_conn_up(conn)) { + rds_iw_conn_error(conn, + "send completion on %pI4 " + "had status %u, disconnecting and reconnecting\n", + &conn->c_faddr, wc.status); + } + } +} + +/* + * This is the main function for allocating credits when sending + * messages. + * + * Conceptually, we have two counters: + * - send credits: this tells us how many WRs we're allowed + * to submit without overruning the reciever's queue. For + * each SEND WR we post, we decrement this by one. + * + * - posted credits: this tells us how many WRs we recently + * posted to the receive queue. This value is transferred + * to the peer as a "credit update" in a RDS header field. + * Every time we transmit credits to the peer, we subtract + * the amount of transferred credits from this counter. + * + * It is essential that we avoid situations where both sides have + * exhausted their send credits, and are unable to send new credits + * to the peer. We achieve this by requiring that we send at least + * one credit update to the peer before exhausting our credits. + * When new credits arrive, we subtract one credit that is withheld + * until we've posted new buffers and are ready to transmit these + * credits (see rds_iw_send_add_credits below). + * + * The RDS send code is essentially single-threaded; rds_send_xmit + * grabs c_send_lock to ensure exclusive access to the send ring. 
+ * However, the ACK sending code is independent and can race with + * message SENDs. + * + * In the send path, we need to update the counters for send credits + * and the counter of posted buffers atomically - when we use the + * last available credit, we cannot allow another thread to race us + * and grab the posted credits counter. Hence, we have to use a + * spinlock to protect the credit counter, or use atomics. + * + * Spinlocks shared between the send and the receive path are bad, + * because they create unnecessary delays. An early implementation + * using a spinlock showed a 5% degradation in throughput at some + * loads. + * + * This implementation avoids spinlocks completely, putting both + * counters into a single atomic, and updating that atomic using + * atomic_add (in the receive path, when receiving fresh credits), + * and using atomic_cmpxchg when updating the two counters. + */ +int rds_iw_send_grab_credits(struct rds_iw_connection *ic, + u32 wanted, u32 *adv_credits, int need_posted) +{ + unsigned int avail, posted, got = 0, advertise; + long oldval, newval; + + *adv_credits = 0; + if (!ic->i_flowctl) + return wanted; + +try_again: + advertise = 0; + oldval = newval = atomic_read(&ic->i_credits); + posted = IB_GET_POST_CREDITS(oldval); + avail = IB_GET_SEND_CREDITS(oldval); + + rdsdebug("rds_iw_send_grab_credits(%u): credits=%u posted=%u\n", + wanted, avail, posted); + + /* The last credit must be used to send a credit update. */ + if (avail && !posted) + avail--; + + if (avail < wanted) { + struct rds_connection *conn = ic->i_cm_id->context; + + /* Oops, there aren't that many credits left! */ + set_bit(RDS_LL_SEND_FULL, &conn->c_flags); + got = avail; + } else { + /* Sometimes you get what you want, lalala. */ + got = wanted; + } + newval -= IB_SET_SEND_CREDITS(got); + + /* + * If need_posted is non-zero, then the caller wants + * the posted regardless of whether any send credits are + * available. 
+ */ + if (posted && (got || need_posted)) { + advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT); + newval -= IB_SET_POST_CREDITS(advertise); + } + + /* Finally bill everything */ + if (atomic_cmpxchg(&ic->i_credits, oldval, newval) != oldval) + goto try_again; + + *adv_credits = advertise; + return got; +} + +void rds_iw_send_add_credits(struct rds_connection *conn, unsigned int credits) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + + if (credits == 0) + return; + + rdsdebug("rds_iw_send_add_credits(%u): current=%u%s\n", + credits, + IB_GET_SEND_CREDITS(atomic_read(&ic->i_credits)), + test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ? ", ll_send_full" : ""); + + atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits); + if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + + WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384); + + rds_iw_stats_inc(s_iw_rx_credit_updates); +} + +void rds_iw_advertise_credits(struct rds_connection *conn, unsigned int posted) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + + if (posted == 0) + return; + + atomic_add(IB_SET_POST_CREDITS(posted), &ic->i_credits); + + /* Decide whether to send an update to the peer now. + * If we would send a credit update for every single buffer we + * post, we would end up with an ACK storm (ACK arrives, + * consumes buffer, we refill the ring, send ACK to remote + * advertising the newly posted buffer... ad inf) + * + * Performance pretty much depends on how often we send + * credit updates - too frequent updates mean lots of ACKs. + * Too infrequent updates, and the peer will run out of + * credits and has to throttle. + * For the time being, 16 seems to be a good compromise. 
+ */ + if (IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)) >= 16) + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); +} + +static inline void +rds_iw_xmit_populate_wr(struct rds_iw_connection *ic, + struct rds_iw_send_work *send, unsigned int pos, + unsigned long buffer, unsigned int length, + int send_flags) +{ + struct ib_sge *sge; + + WARN_ON(pos != send - ic->i_sends); + + send->s_wr.send_flags = send_flags; + send->s_wr.opcode = IB_WR_SEND; + send->s_wr.num_sge = 2; + send->s_wr.next = NULL; + send->s_queued = jiffies; + send->s_op = NULL; + + if (length != 0) { + sge = rds_iw_data_sge(ic, send->s_sge); + sge->addr = buffer; + sge->length = length; + sge->lkey = rds_iw_local_dma_lkey(ic); + + sge = rds_iw_header_sge(ic, send->s_sge); + } else { + /* We're sending a packet with no payload. There is only + * one SGE */ + send->s_wr.num_sge = 1; + sge = &send->s_sge[0]; + } + + sge->addr = ic->i_send_hdrs_dma + (pos * sizeof(struct rds_header)); + sge->length = sizeof(struct rds_header); + sge->lkey = rds_iw_local_dma_lkey(ic); +} + +/* + * This can be called multiple times for a given message. The first time + * we see a message we map its scatterlist into the IB device so that + * we can provide that mapped address to the IB scatter gather entries + * in the IB work requests. We translate the scatterlist into a series + * of work requests that fragment the message. These work requests complete + * in order so we pass ownership of the message to the completion handler + * once we send the final fragment. + * + * The RDS core uses the c_send_lock to only enter this function once + * per connection. This makes sure that the tx ring alloc/unalloc pairs + * don't get out of sync and confuse the ring. 
+ */ +int rds_iw_xmit(struct rds_connection *conn, struct rds_message *rm, + unsigned int hdr_off, unsigned int sg, unsigned int off) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + struct ib_device *dev = ic->i_cm_id->device; + struct rds_iw_send_work *send = NULL; + struct rds_iw_send_work *first; + struct rds_iw_send_work *prev; + struct ib_send_wr *failed_wr; + struct scatterlist *scat; + u32 pos; + u32 i; + u32 work_alloc; + u32 credit_alloc; + u32 posted; + u32 adv_credits = 0; + int send_flags = 0; + int sent; + int ret; + int flow_controlled = 0; + + BUG_ON(off % RDS_FRAG_SIZE); + BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header)); + + /* Fastreg support */ + if (rds_rdma_cookie_key(rm->m_rdma_cookie) + && !ic->i_fastreg_posted) { + ret = -EAGAIN; + goto out; + } + + /* FIXME we may overallocate here */ + if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) + i = 1; + else + i = ceil(be32_to_cpu(rm->m_inc.i_hdr.h_len), RDS_FRAG_SIZE); + + work_alloc = rds_iw_ring_alloc(&ic->i_send_ring, i, &pos); + if (work_alloc == 0) { + set_bit(RDS_LL_SEND_FULL, &conn->c_flags); + rds_iw_stats_inc(s_iw_tx_ring_full); + ret = -ENOMEM; + goto out; + } + + credit_alloc = work_alloc; + if (ic->i_flowctl) { + credit_alloc = rds_iw_send_grab_credits(ic, work_alloc, &posted, 0); + adv_credits += posted; + if (credit_alloc < work_alloc) { + rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc); + work_alloc = credit_alloc; + flow_controlled++; + } + if (work_alloc == 0) { + rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc); + rds_iw_stats_inc(s_iw_tx_throttle); + ret = -ENOMEM; + goto out; + } + } + + /* map the message the first time we see it */ + if (ic->i_rm == NULL) { + /* + printk(KERN_NOTICE "rds_iw_xmit prep msg dport=%u flags=0x%x len=%d\n", + be16_to_cpu(rm->m_inc.i_hdr.h_dport), + rm->m_inc.i_hdr.h_flags, + be32_to_cpu(rm->m_inc.i_hdr.h_len)); + */ + if (rm->m_nents) { + rm->m_count = ib_dma_map_sg(dev, + rm->m_sg, rm->m_nents, 
DMA_TO_DEVICE); + rdsdebug("ic %p mapping rm %p: %d\n", ic, rm, rm->m_count); + if (rm->m_count == 0) { + rds_iw_stats_inc(s_iw_tx_sg_mapping_failure); + rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc); + ret = -ENOMEM; /* XXX ? */ + goto out; + } + } else { + rm->m_count = 0; + } + + ic->i_unsignaled_wrs = rds_iw_sysctl_max_unsig_wrs; + ic->i_unsignaled_bytes = rds_iw_sysctl_max_unsig_bytes; + rds_message_addref(rm); + ic->i_rm = rm; + + /* Finalize the header */ + if (test_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags)) + rm->m_inc.i_hdr.h_flags |= RDS_FLAG_ACK_REQUIRED; + if (test_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags)) + rm->m_inc.i_hdr.h_flags |= RDS_FLAG_RETRANSMITTED; + + /* If it has a RDMA op, tell the peer we did it. This is + * used by the peer to release use-once RDMA MRs. */ + if (rm->m_rdma_op) { + struct rds_ext_header_rdma ext_hdr; + + ext_hdr.h_rdma_rkey = cpu_to_be32(rm->m_rdma_op->r_key); + rds_message_add_extension(&rm->m_inc.i_hdr, + RDS_EXTHDR_RDMA, &ext_hdr, sizeof(ext_hdr)); + } + if (rm->m_rdma_cookie) { + rds_message_add_rdma_dest_extension(&rm->m_inc.i_hdr, + rds_rdma_cookie_key(rm->m_rdma_cookie), + rds_rdma_cookie_offset(rm->m_rdma_cookie)); + } + + /* Note - rds_iw_piggyb_ack clears the ACK_REQUIRED bit, so + * we should not do this unless we have a chance of at least + * sticking the header into the send ring. Which is why we + * should call rds_iw_ring_alloc first. */ + rm->m_inc.i_hdr.h_ack = cpu_to_be64(rds_iw_piggyb_ack(ic)); + rds_message_make_checksum(&rm->m_inc.i_hdr); + + /* + * Update adv_credits since we reset the ACK_REQUIRED bit. + */ + rds_iw_send_grab_credits(ic, 0, &posted, 1); + adv_credits += posted; + BUG_ON(adv_credits > 255); + } else if (ic->i_rm != rm) + BUG(); + + send = &ic->i_sends[pos]; + first = send; + prev = NULL; + scat = &rm->m_sg[sg]; + sent = 0; + i = 0; + + /* Sometimes you want to put a fence between an RDMA + * READ and the following SEND. 
+ * We could either do this all the time + * or when requested by the user. Right now, we let + * the application choose. + */ + if (rm->m_rdma_op && rm->m_rdma_op->r_fence) + send_flags = IB_SEND_FENCE; + + /* + * We could be copying the header into the unused tail of the page. + * That would need to be changed in the future when those pages might + * be mapped userspace pages or page cache pages. So instead we always + * use a second sge and our long-lived ring of mapped headers. We send + * the header after the data so that the data payload can be aligned on + * the receiver. + */ + + /* handle a 0-len message */ + if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) { + rds_iw_xmit_populate_wr(ic, send, pos, 0, 0, send_flags); + goto add_header; + } + + /* if there's data reference it with a chain of work reqs */ + for (; i < work_alloc && scat != &rm->m_sg[rm->m_count]; i++) { + unsigned int len; + + send = &ic->i_sends[pos]; + + len = min(RDS_FRAG_SIZE, ib_sg_dma_len(dev, scat) - off); + rds_iw_xmit_populate_wr(ic, send, pos, + ib_sg_dma_address(dev, scat) + off, len, + send_flags); + + /* + * We want to delay signaling completions just enough to get + * the batching benefits but not so much that we create dead time + * on the wire. + */ + if (ic->i_unsignaled_wrs-- == 0) { + ic->i_unsignaled_wrs = rds_iw_sysctl_max_unsig_wrs; + send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + } + + ic->i_unsignaled_bytes -= len; + if (ic->i_unsignaled_bytes <= 0) { + ic->i_unsignaled_bytes = rds_iw_sysctl_max_unsig_bytes; + send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + } + + /* + * Always signal the last one if we're stopping due to flow control. 
+ */ + if (flow_controlled && i == (work_alloc-1)) + send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + rdsdebug("send %p wr %p num_sge %u next %p\n", send, + &send->s_wr, send->s_wr.num_sge, send->s_wr.next); + + sent += len; + off += len; + if (off == ib_sg_dma_len(dev, scat)) { + scat++; + off = 0; + } + +add_header: + /* Tack on the header after the data. The header SGE should already + * have been set up to point to the right header buffer. */ + memcpy(&ic->i_send_hdrs[pos], &rm->m_inc.i_hdr, sizeof(struct rds_header)); + + if (0) { + struct rds_header *hdr = &ic->i_send_hdrs[pos]; + + printk(KERN_NOTICE "send WR dport=%u flags=0x%x len=%d\n", + be16_to_cpu(hdr->h_dport), + hdr->h_flags, + be32_to_cpu(hdr->h_len)); + } + if (adv_credits) { + struct rds_header *hdr = &ic->i_send_hdrs[pos]; + + /* add credit and redo the header checksum */ + hdr->h_credit = adv_credits; + rds_message_make_checksum(hdr); + adv_credits = 0; + rds_iw_stats_inc(s_iw_tx_credit_updates); + } + + if (prev) + prev->s_wr.next = &send->s_wr; + prev = send; + + pos = (pos + 1) % ic->i_send_ring.w_nr; + } + + /* Account the RDS header in the number of bytes we sent, but just once. + * The caller has no concept of fragmentation. */ + if (hdr_off == 0) + sent += sizeof(struct rds_header); + + /* if we finished the message then send completion owns it */ + if (scat == &rm->m_sg[rm->m_count]) { + prev->s_rm = ic->i_rm; + prev->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + ic->i_rm = NULL; + } + + if (i < work_alloc) { + rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc - i); + work_alloc = i; + } + if (ic->i_flowctl && i < credit_alloc) + rds_iw_send_add_credits(conn, credit_alloc - i); + + /* XXX need to worry about failed_wr and partial sends. 
*/ + failed_wr = &first->s_wr; + ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr); + rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic, + first, &first->s_wr, ret, failed_wr); + BUG_ON(failed_wr != &first->s_wr); + if (ret) { + printk(KERN_WARNING "RDS/IW: ib_post_send to %pI4 " + "returned %d\n", &conn->c_faddr, ret); + rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc); + if (prev->s_rm) { + ic->i_rm = prev->s_rm; + prev->s_rm = NULL; + } + goto out; + } + + ret = sent; +out: + BUG_ON(adv_credits); + return ret; +} + +static void rds_iw_build_send_fastreg(struct rds_iw_device *rds_iwdev, struct rds_iw_connection *ic, struct rds_iw_send_work *send, int nent, int len, u64 sg_addr) +{ + BUG_ON(nent > send->s_page_list->max_page_list_len); + /* + * Perform a WR for the fast_reg_mr. Each individual page + * in the sg list is added to the fast reg page list and placed + * inside the fast_reg_mr WR. + */ + send->s_wr.opcode = IB_WR_FAST_REG_MR; + send->s_wr.wr.fast_reg.length = len; + send->s_wr.wr.fast_reg.rkey = send->s_mr->rkey; + send->s_wr.wr.fast_reg.page_list = send->s_page_list; + send->s_wr.wr.fast_reg.page_list_len = nent; + send->s_wr.wr.fast_reg.page_shift = rds_iwdev->page_shift; + send->s_wr.wr.fast_reg.access_flags = IB_ACCESS_REMOTE_WRITE; + send->s_wr.wr.fast_reg.iova_start = sg_addr; + + ib_update_fast_reg_key(send->s_mr, send->s_remap_count++); +} + +int rds_iw_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + struct rds_iw_send_work *send = NULL; + struct rds_iw_send_work *first; + struct rds_iw_send_work *prev; + struct ib_send_wr *failed_wr; + struct rds_iw_device *rds_iwdev; + struct scatterlist *scat; + unsigned long len; + u64 remote_addr = op->r_remote_addr; + u32 pos, fr_pos; + u32 work_alloc; + u32 i; + u32 j; + int sent; + int ret; + int num_sge; + + rds_iwdev = ib_get_client_data(ic->i_cm_id->device, &rds_iw_client); + + /* map the message the first 
time we see it */ + if (!op->r_mapped) { + op->r_count = ib_dma_map_sg(ic->i_cm_id->device, + op->r_sg, op->r_nents, (op->r_write) ? + DMA_TO_DEVICE : DMA_FROM_DEVICE); + rdsdebug("ic %p mapping op %p: %d\n", ic, op, op->r_count); + if (op->r_count == 0) { + rds_iw_stats_inc(s_iw_tx_sg_mapping_failure); + ret = -ENOMEM; /* XXX ? */ + goto out; + } + + op->r_mapped = 1; + } + + if (!op->r_write) { + /* Alloc space on the send queue for the fastreg */ + work_alloc = rds_iw_ring_alloc(&ic->i_send_ring, 1, &fr_pos); + if (work_alloc != 1) { + rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc); + rds_iw_stats_inc(s_iw_tx_ring_full); + ret = -ENOMEM; + goto out; + } + } + + /* + * Instead of knowing how to return a partial rdma read/write we insist that there + * be enough work requests to send the entire message. + */ + i = ceil(op->r_count, rds_iwdev->max_sge); + + work_alloc = rds_iw_ring_alloc(&ic->i_send_ring, i, &pos); + if (work_alloc != i) { + rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc); + rds_iw_stats_inc(s_iw_tx_ring_full); + ret = -ENOMEM; + goto out; + } + + send = &ic->i_sends[pos]; + if (!op->r_write) { + first = prev = &ic->i_sends[fr_pos]; + } else { + first = send; + prev = NULL; + } + scat = &op->r_sg[0]; + sent = 0; + num_sge = op->r_count; + + for (i = 0; i < work_alloc && scat != &op->r_sg[op->r_count]; i++) { + send->s_wr.send_flags = 0; + send->s_queued = jiffies; + + /* + * We want to delay signaling completions just enough to get + * the batching benefits but not so much that we create dead time on the wire. + */ + if (ic->i_unsignaled_wrs-- == 0) { + ic->i_unsignaled_wrs = rds_iw_sysctl_max_unsig_wrs; + send->s_wr.send_flags = IB_SEND_SIGNALED; + } + + /* To avoid the need to have the plumbing to invalidate the fastreg_mr used + * for local access after RDS is finished with it, using + * IB_WR_RDMA_READ_WITH_INV will invalidate it after the read has completed. 
+ */ + if (op->r_write) + send->s_wr.opcode = IB_WR_RDMA_WRITE; + else + send->s_wr.opcode = IB_WR_RDMA_READ_WITH_INV; + + send->s_wr.wr.rdma.remote_addr = remote_addr; + send->s_wr.wr.rdma.rkey = op->r_key; + send->s_op = op; + + if (num_sge > rds_iwdev->max_sge) { + send->s_wr.num_sge = rds_iwdev->max_sge; + num_sge -= rds_iwdev->max_sge; + } else + send->s_wr.num_sge = num_sge; + + send->s_wr.next = NULL; + + if (prev) + prev->s_wr.next = &send->s_wr; + + for (j = 0; j < send->s_wr.num_sge && scat != &op->r_sg[op->r_count]; j++) { + len = ib_sg_dma_len(ic->i_cm_id->device, scat); + + if (send->s_wr.opcode == IB_WR_RDMA_READ_WITH_INV) + send->s_page_list->page_list[j] = ib_sg_dma_address(ic->i_cm_id->device, scat); + else { + send->s_sge[j].addr = ib_sg_dma_address(ic->i_cm_id->device, scat); + send->s_sge[j].length = len; + send->s_sge[j].lkey = rds_iw_local_dma_lkey(ic); + } + + sent += len; + rdsdebug("ic %p sent %d remote_addr %llu\n", ic, sent, remote_addr); + remote_addr += len; + + scat++; + } + + if (send->s_wr.opcode == IB_WR_RDMA_READ_WITH_INV) { + send->s_wr.num_sge = 1; + send->s_sge[0].addr = conn->c_xmit_rm->m_rs->rs_user_addr; + send->s_sge[0].length = conn->c_xmit_rm->m_rs->rs_user_bytes; + send->s_sge[0].lkey = ic->i_sends[fr_pos].s_mr->lkey; + } + + rdsdebug("send %p wr %p num_sge %u next %p\n", send, + &send->s_wr, send->s_wr.num_sge, send->s_wr.next); + + prev = send; + if (++send == &ic->i_sends[ic->i_send_ring.w_nr]) + send = ic->i_sends; + } + + /* if we finished the message then send completion owns it */ + if (scat == &op->r_sg[op->r_count]) + first->s_wr.send_flags = IB_SEND_SIGNALED; + + if (i < work_alloc) { + rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc - i); + work_alloc = i; + } + + /* On iWARP, local memory access by a remote system (ie, RDMA Read) is not + * recommended. Putting the lkey on the wire is a security hole, as it can + * allow for memory access to all of memory on the remote system. 
Some + * adapters do not allow using the lkey for this at all. To bypass this use a + * fastreg_mr (or possibly a dma_mr) + */ + if (!op->r_write) { + rds_iw_build_send_fastreg(rds_iwdev, ic, &ic->i_sends[fr_pos], + op->r_count, sent, conn->c_xmit_rm->m_rs->rs_user_addr); + work_alloc++; + } + + failed_wr = &first->s_wr; + ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr); + rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic, + first, &first->s_wr, ret, failed_wr); + BUG_ON(failed_wr != &first->s_wr); + if (ret) { + printk(KERN_WARNING "RDS/IW: rdma ib_post_send to %pI4 " + "returned %d\n", &conn->c_faddr, ret); + rds_iw_ring_unalloc(&ic->i_send_ring, work_alloc); + goto out; + } + +out: + return ret; +} + +void rds_iw_xmit_complete(struct rds_connection *conn) +{ + struct rds_iw_connection *ic = conn->c_transport_data; + + /* We may have a pending ACK or window update we were unable + * to send previously (due to flow control). Try again. */ + rds_iw_attempt_ack(ic); +} diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c new file mode 100644 index 0000000..ccc7e8f --- /dev/null +++ b/net/rds/iw_stats.c @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include + +#include "rds.h" +#include "iw.h" + +DEFINE_PER_CPU(struct rds_iw_statistics, rds_iw_stats) ____cacheline_aligned; + +static char *rds_iw_stat_names[] = { + "iw_connect_raced", + "iw_listen_closed_stale", + "iw_tx_cq_call", + "iw_tx_cq_event", + "iw_tx_ring_full", + "iw_tx_throttle", + "iw_tx_sg_mapping_failure", + "iw_tx_stalled", + "iw_tx_credit_updates", + "iw_rx_cq_call", + "iw_rx_cq_event", + "iw_rx_ring_empty", + "iw_rx_refill_from_cq", + "iw_rx_refill_from_thread", + "iw_rx_alloc_limit", + "iw_rx_credit_updates", + "iw_ack_sent", + "iw_ack_send_failure", + "iw_ack_send_delayed", + "iw_ack_send_piggybacked", + "iw_ack_received", + "iw_rdma_mr_alloc", + "iw_rdma_mr_free", + "iw_rdma_mr_used", + "iw_rdma_mr_pool_flush", + "iw_rdma_mr_pool_wait", + "iw_rdma_mr_pool_depleted", +}; + +unsigned int rds_iw_stats_info_copy(struct rds_info_iterator *iter, + unsigned int avail) +{ + struct rds_iw_statistics stats = {0, }; + uint64_t *src; + uint64_t *sum; + size_t i; + int cpu; + + if (avail < ARRAY_SIZE(rds_iw_stat_names)) + goto out; + + for_each_online_cpu(cpu) { + src = (uint64_t *)&(per_cpu(rds_iw_stats, cpu)); + sum = (uint64_t *)&stats; + for (i = 0; i < sizeof(stats) / sizeof(uint64_t); i++) + *(sum++) += 
*(src++); + } + + rds_stats_info_copy(iter, (uint64_t *)&stats, rds_iw_stat_names, + ARRAY_SIZE(rds_iw_stat_names)); +out: + return ARRAY_SIZE(rds_iw_stat_names); +} diff --git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c new file mode 100644 index 0000000..9590678 --- /dev/null +++ b/net/rds/iw_sysctl.c @@ -0,0 +1,137 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "iw.h" + +static struct ctl_table_header *rds_iw_sysctl_hdr; + +unsigned long rds_iw_sysctl_max_send_wr = RDS_IW_DEFAULT_SEND_WR; +unsigned long rds_iw_sysctl_max_recv_wr = RDS_IW_DEFAULT_RECV_WR; +unsigned long rds_iw_sysctl_max_recv_allocation = (128 * 1024 * 1024) / RDS_FRAG_SIZE; +static unsigned long rds_iw_sysctl_max_wr_min = 1; +/* hardware will fail CQ creation long before this */ +static unsigned long rds_iw_sysctl_max_wr_max = (u32)~0; + +unsigned long rds_iw_sysctl_max_unsig_wrs = 16; +static unsigned long rds_iw_sysctl_max_unsig_wr_min = 1; +static unsigned long rds_iw_sysctl_max_unsig_wr_max = 64; + +unsigned long rds_iw_sysctl_max_unsig_bytes = (16 << 20); +static unsigned long rds_iw_sysctl_max_unsig_bytes_min = 1; +static unsigned long rds_iw_sysctl_max_unsig_bytes_max = ~0UL; + +unsigned int rds_iw_sysctl_flow_control = 1; + +ctl_table rds_iw_sysctl_table[] = { + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_send_wr", + .data = &rds_iw_sysctl_max_send_wr, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + .extra1 = &rds_iw_sysctl_max_wr_min, + .extra2 = &rds_iw_sysctl_max_wr_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_recv_wr", + .data = &rds_iw_sysctl_max_recv_wr, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + .extra1 = &rds_iw_sysctl_max_wr_min, + .extra2 = &rds_iw_sysctl_max_wr_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_unsignaled_wr", + .data = &rds_iw_sysctl_max_unsig_wrs, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + .extra1 = &rds_iw_sysctl_max_unsig_wr_min, + .extra2 = &rds_iw_sysctl_max_unsig_wr_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_unsignaled_bytes", + .data = &rds_iw_sysctl_max_unsig_bytes, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = 
&proc_doulongvec_minmax, + .extra1 = &rds_iw_sysctl_max_unsig_bytes_min, + .extra2 = &rds_iw_sysctl_max_unsig_bytes_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_recv_allocation", + .data = &rds_iw_sysctl_max_recv_allocation, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "flow_control", + .data = &rds_iw_sysctl_flow_control, + .maxlen = sizeof(rds_iw_sysctl_flow_control), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { .ctl_name = 0} +}; + +static struct ctl_path rds_iw_sysctl_path[] = { + { .procname = "net", .ctl_name = CTL_NET, }, + { .procname = "rds", .ctl_name = CTL_UNNUMBERED, }, + { .procname = "iw", .ctl_name = CTL_UNNUMBERED, }, + { } +}; + +void rds_iw_sysctl_exit(void) +{ + if (rds_iw_sysctl_hdr) + unregister_sysctl_table(rds_iw_sysctl_hdr); +} + +int __init rds_iw_sysctl_init(void) +{ + rds_iw_sysctl_hdr = register_sysctl_paths(rds_iw_sysctl_path, rds_iw_sysctl_table); + if (rds_iw_sysctl_hdr == NULL) + return -ENOMEM; + return 0; +} -- 1.5.6.3 From andy.grover at oracle.com Tue Feb 24 17:30:43 2009 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 24 Feb 2009 17:30:43 -0800 Subject: [ofa-general] [PATCH 26/26] RDS: Add RDS to AF key strings In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <1235525443-9007-27-git-send-email-andy.grover@oracle.com> Signed-off-by: Andy Grover --- net/core/sock.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/net/core/sock.c b/net/core/sock.c index 8ee734e..7c6d089 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -155,7 +155,7 @@ static const char *af_family_key_strings[AF_MAX+1] = { "sk_lock-27" , "sk_lock-28" , "sk_lock-AF_CAN" , "sk_lock-AF_TIPC" , "sk_lock-AF_BLUETOOTH", "sk_lock-IUCV" , "sk_lock-AF_RXRPC" , "sk_lock-AF_ISDN" , "sk_lock-AF_PHONET" , - 
"sk_lock-AF_MAX" + "sk_lock-AF_RDS" , "sk_lock-AF_MAX" }; static const char *af_family_slock_key_strings[AF_MAX+1] = { "slock-AF_UNSPEC", "slock-AF_UNIX" , "slock-AF_INET" , @@ -170,7 +170,7 @@ static const char *af_family_slock_key_strings[AF_MAX+1] = { "slock-27" , "slock-28" , "slock-AF_CAN" , "slock-AF_TIPC" , "slock-AF_BLUETOOTH", "slock-AF_IUCV" , "slock-AF_RXRPC" , "slock-AF_ISDN" , "slock-AF_PHONET" , - "slock-AF_MAX" + "slock-AF_RDS" , "slock-AF_MAX" }; static const char *af_family_clock_key_strings[AF_MAX+1] = { "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" , @@ -185,7 +185,7 @@ static const char *af_family_clock_key_strings[AF_MAX+1] = { "clock-27" , "clock-28" , "clock-AF_CAN" , "clock-AF_TIPC" , "clock-AF_BLUETOOTH", "clock-AF_IUCV" , "clock-AF_RXRPC" , "clock-AF_ISDN" , "clock-AF_PHONET" , - "clock-AF_MAX" + "clock-AF_RDS" , "clock-AF_MAX" }; /* -- 1.5.6.3 From phillipwils at gmail.com Tue Feb 24 21:51:31 2009 From: phillipwils at gmail.com (Phillip Wilson) Date: Tue, 24 Feb 2009 21:51:31 -0800 Subject: [ofa-general] ***SPAM*** Mellanox ibv_reg_mr (memory region) function call fails under load when using the mlx4 driver Message-ID: <6e4f44220902242151j4aed43d4va31525490c0cdd86@mail.gmail.com> The “ibv_reg_mr()” function call fails with HCA (DID=0x634A) that uses the mlx4_0 driver when the system is under load (memory and cpu). The system usually has over 500MB of system memory when “ibv_reg_mr()” call fails. If I only run one HCA with (DID=0x6278) that uses the mthca0 driver with the other tools to generate stress the “ibv_reg_mr()” call always passes. If I only run the HCA with (DID=0x634A) with the other tools to generate stress the “ibv_reg_mr()” call will always fails; it usually takes less than 30 minutes for the failure to occur. The maximum number of memory regions requested at one time is up to 8 (32MB) with two HCA dual port cards and the maximum size for a memory region is 1 MB. (i.e. 
ctx->mr = ibv_reg_mr(ctx->pd, buffer, /*malloc 4MB buffer per process*/ size, /*2 Bytes to 1MB */ IBV_ACCESS_LOCAL_WRITE); ) I modified the ibv_rc_pingpong test to use the parent-child paradigm instead of the current client/server approach for my environment. The code forks a parent process and a child process per port which serves the same purpose as the current client/server approach. The code also forks a process to run on a HCA. Basically, the same code is executed on each HCA except for the user libraries (libmlx4.so, libmthca.so), mlx4.ko, mthca.ko and firmware on each HCA. Since the code in the user libraries is very similar to each other, I suspect the issue is in the kernel code or HCA firmware. Does any one know what kernel patch fixes this issue starting from kernel 2.6.24 through 2.6.28? Has anyone else seen this issue? System Information: The system has 4GB of memory. uname -a Linux (none) 2.6.24.02.02.08 #21 SMP Thu Feb 19 11:04:35 PST 2009 ia64 unknown OFED 1.2.5 lspci -d 15b3: 0000:10:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20) 0000:c3:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0) lspci -d 15b3: -n 0000:10:00.0 0c06: 15b3:6278 (rev 20) 0000:c3:00.0 0c06: 15b3:634a (rev a0) ibv_devinfo -v hca_id: mlx4_0 fw_ver: 2.5.000 hca_id: mthca0 fw_ver: 4.8.930
From davem at davemloft.net Tue Feb 24 23:26:48 2009 From: davem at davemloft.net (David Miller) Date: Tue, 24 Feb 2009 23:26:48 -0800 (PST) Subject: [ofa-general] Re: [PATCH 23/26] RDS: Add AF and PF #defines for RDS sockets In-Reply-To: <1235525443-9007-24-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <1235525443-9007-24-git-send-email-andy.grover@oracle.com> Message-ID: <20090224.232648.84203571.davem@davemloft.net> From: Andy Grover Date: Tue, 24 Feb 2009 17:30:40 -0800 > @@ -191,7 +191,8 @@ struct ucred { > #define AF_RXRPC 33 /* RxRPC sockets */ > #define AF_ISDN 34 /* mISDN sockets */ > #define AF_PHONET 35 /* Phonet sockets */ > -#define AF_MAX 36 /* For now.. */ > +#define AF_RDS 36 /* RDS sockets */ > +#define AF_MAX 37 /* For now.. */ > > /* Protocol families, same as address families. */ > #define PF_UNSPEC AF_UNSPEC Pick an unused number, you don't have to increment AF_MAX to allocate a value. And I don't want to hear any whining about how you've used this value of 36 internally for a long time or anything like that. From davem at davemloft.net Tue Feb 24 23:28:14 2009 From: davem at davemloft.net (David Miller) Date: Tue, 24 Feb 2009 23:28:14 -0800 (PST) Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <20090224.232814.227017310.davem@davemloft.net> From: Andy Grover Date: Tue, 24 Feb 2009 17:30:17 -0800 > This patchset against net-next adds support for RDS sockets. RDS is an > Oracle-originated protocol used to send IPC datagrams (up to 1MB) > reliably, and is used currently in Oracle RAC and Exadata products. > > I've addressed all the issues from comments on take 1. (thanks!)
This patchset > squashes the changes into the original changeset, but I've also included > a tree where the un-squashed changes since last time may be reviewed: > git://git.openfabrics.org/~agrover/ofed_1_4/linux-2.6.git > rds-broken-out-fixes > > Major changes since last time include moving to net/rds, and the > additional inclusion of iwarp transport support. This makes RDMA too much of a first-class citizen in the networking stack. That's a blocker for me. Furthermore the port you've choosen for the protocol is arbitrary, not properly allocated with the appropriate standards committee, and therefore could conflict with something other people are using. I'm rejecting these patches, sorry. From dotanba at gmail.com Tue Feb 24 23:50:54 2009 From: dotanba at gmail.com (Dotan Barak) Date: Wed, 25 Feb 2009 09:50:54 +0200 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Mellanox ibv_reg_mr (memory region) function call fails under load when using the mlx4 driver In-Reply-To: <6e4f44220902242151j4aed43d4va31525490c0cdd86@mail.gmail.com> References: <6e4f44220902242151j4aed43d4va31525490c0cdd86@mail.gmail.com> Message-ID: <2f3bf9a60902242350x7cad3b6u8bf8d86027a9795@mail.gmail.com> Do you execute your program under the root user or under any other user? (maybe you fail because of the ulimit value of memory which can be pinned) Dotan On Wed, Feb 25, 2009 at 7:51 AM, Phillip Wilson wrote: > The “ibv_reg_mr()” function call fails with HCA (DID=0x634A) that uses the > mlx4_0 driver when the system is under load (memory and cpu).  The system > usually has over 500MB of system memory when “ibv_reg_mr()” call fails. > > > > If I only run one HCA with (DID=0x6278) that uses the mthca0 driver with the > other tools to generate stress the “ibv_reg_mr()” call always passes.  If I > only run the HCA with (DID=0x634A) with the other tools to generate stress > the “ibv_reg_mr()” call will always fails; it usually takes less than 30 > minutes for the failure to occur. 
> > > > > > The maximum number of memory regions requested at one time is up to 8 (32MB) > with two HCA dual port cards and the maximum size for a memory region is 1 > MB. > > > > (i.e. ctx->mr = ibv_reg_mr(ctx->pd, > >                                              buffer,  /*malloc 4MB buffer > per process*/ > >                                              size,      /*2 Bytes to 1MB */ > >                                              IBV_ACCESS_LOCAL_WRITE); > > ) > > > > I modified the ibv_rc_pingpong test to use the parent-child paradigm instead > of the current client/server approach for my environment.  The code forks a > parent process and a child process per port which serves the same purpose as > the current client/server approach.  The code also forks a process to run on > a HCA.  Basically, the same code is executed on each HCA except for the user > libraries (libmlx4.so, libmthca.so), mlx4.ko, mthca.ko and firmware on each > HCA. > > > > Since the code in the user libraries is very similar to each other, I > suspect the issue is in the kernel code or HCA firmware. > > > > Does any one know what kernel patch fixes this issue starting from kernel > 2.6.24 through 2.6.28?  Has anyone else seen this issue? > > > > System Information: > > > > The system has 4GB of memory. 
> > > > uname -a > > Linux (none) 2.6.24.02.02.08 #21 SMP Thu Feb 19 11:04:35 PST 2009 ia64 > unknown > > > > OFED 1.2.5 > > > > lspci -d 15b3: > > > > 0000:10:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex > (Tavor compatibility mode) (rev 20) > > 0000:c3:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0) > > > > lspci -d 15b3: -n > > 0000:10:00.0 0c06: 15b3:6278 (rev 20) > > 0000:c3:00.0 0c06: 15b3:634a (rev a0) > > > > ibv_devinfo -v > > hca_id: mlx4_0 > >         fw_ver:                         2.5.000 > > > > hca_id: mthca0 > >         fw_ver:                         4.8.930 > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From ogerlitz at voltaire.com Wed Feb 25 00:04:09 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 25 Feb 2009 10:04:09 +0200 Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <20090224.232814.227017310.davem@davemloft.net> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <20090224.232814.227017310.davem@davemloft.net> Message-ID: <49A4FB79.1090809@voltaire.com> David Miller wrote: >> Major changes since last time include moving to net/rds, and the additional inclusion of iwarp transport support. >> > This makes RDMA too much of a first-class citizen in the networking stack. That's a blocker for me. > Hi Dave Can you elaborate a bit further, I wasn't sure to follow if your comment was related to the inclusion of iwarp or to something else. Or. 
From davem at davemloft.net Wed Feb 25 00:06:28 2009 From: davem at davemloft.net (David Miller) Date: Wed, 25 Feb 2009 00:06:28 -0800 (PST) Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <49A4FB79.1090809@voltaire.com> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <20090224.232814.227017310.davem@davemloft.net> <49A4FB79.1090809@voltaire.com> Message-ID: <20090225.000628.108688119.davem@davemloft.net> From: Or Gerlitz Date: Wed, 25 Feb 2009 10:04:09 +0200 > Can you elaborate a bit further, I wasn't sure to follow if your > comment was related to the inclusion of iwarp or to something else. It's making real sockets, using the real networking stack, using up real IP port/address pairs recognized by the rest of the real networking stack, and doing RDMA over that connection. That's not allowed. We always said that if these RDMA things are in the tree, they should use their own IP addresses and that are not visible to the real Linux networking stack. From phillipwils at gmail.com Wed Feb 25 00:29:34 2009 From: phillipwils at gmail.com (Phillip Wilson) Date: Wed, 25 Feb 2009 00:29:34 -0800 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Mellanox ibv_reg_mr (memory region) function call fails under load when using the mlx4 driver In-Reply-To: <2f3bf9a60902242350x7cad3b6u8bf8d86027a9795@mail.gmail.com> References: <6e4f44220902242151j4aed43d4va31525490c0cdd86@mail.gmail.com> <2f3bf9a60902242350x7cad3b6u8bf8d86027a9795@mail.gmail.com> Message-ID: <6e4f44220902250029r65ba36d8me002e916f638d443@mail.gmail.com> All programs are executed as the root user. 
ulimit -a time(seconds) unlimited file(blocks) unlimited data(kbytes) unlimited stack(kbytes) unlimited coredump(blocks) 0 memory(kbytes) unlimited locked memory(kbytes) unlimited process 8063 nofiles 1048576 vmemory(kbytes) unlimited locks unlimited On Tue, Feb 24, 2009 at 11:50 PM, Dotan Barak wrote: > Do you execute your program under the root user or under any other user? > (maybe you fail because of the ulimit value of memory which can be pinned) > > > Dotan > > On Wed, Feb 25, 2009 at 7:51 AM, Phillip Wilson > wrote: > > The “ibv_reg_mr()” function call fails with HCA (DID=0x634A) that uses > the > > mlx4_0 driver when the system is under load (memory and cpu). The system > > usually has over 500MB of system memory when “ibv_reg_mr()” call fails. > > > > > > > > If I only run one HCA with (DID=0x6278) that uses the mthca0 driver with > the > > other tools to generate stress the “ibv_reg_mr()” call always passes. If > I > > only run the HCA with (DID=0x634A) with the other tools to generate > stress > > the “ibv_reg_mr()” call will always fails; it usually takes less than 30 > > minutes for the failure to occur. > > > > > > > > > > > > The maximum number of memory regions requested at one time is up to 8 > (32MB) > > with two HCA dual port cards and the maximum size for a memory region is > 1 > > MB. > > > > > > > > (i.e. ctx->mr = ibv_reg_mr(ctx->pd, > > > > buffer, /*malloc 4MB buffer > > per process*/ > > > > size, /*2 Bytes to 1MB > */ > > > > IBV_ACCESS_LOCAL_WRITE); > > > > ) > > > > > > > > I modified the ibv_rc_pingpong test to use the parent-child paradigm > instead > > of the current client/server approach for my environment. The code forks > a > > parent process and a child process per port which serves the same purpose > as > > the current client/server approach. The code also forks a process to run > on > > a HCA. 
Basically, the same code is executed on each HCA except for the > user > > libraries (libmlx4.so, libmthca.so), mlx4.ko, mthca.ko and firmware on > each > > HCA. > > > > > > > > Since the code in the user libraries is very similar to each other, I > > suspect the issue is in the kernel code or HCA firmware. > > > > > > > > Does any one know what kernel patch fixes this issue starting from kernel > > 2.6.24 through 2.6.28? Has anyone else seen this issue? > > > > > > > > System Information: > > > > > > > > The system has 4GB of memory. > > > > > > > > uname -a > > > > Linux (none) 2.6.24.02.02.08 #21 SMP Thu Feb 19 11:04:35 PST 2009 ia64 > > unknown > > > > > > > > OFED 1.2.5 > > > > > > > > lspci -d 15b3: > > > > > > > > 0000:10:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex > > (Tavor compatibility mode) (rev 20) > > > > 0000:c3:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev > a0) > > > > > > > > lspci -d 15b3: -n > > > > 0000:10:00.0 0c06: 15b3:6278 (rev 20) > > > > 0000:c3:00.0 0c06: 15b3:634a (rev a0) > > > > > > > > ibv_devinfo -v > > > > hca_id: mlx4_0 > > > > fw_ver: 2.5.000 > > > > > > > > hca_id: mthca0 > > > > fw_ver: 4.8.930 > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > >
From ogerlitz at Voltaire.com Wed Feb 25 01:46:00 2009 From: ogerlitz at Voltaire.com (Or Gerlitz) Date: Wed, 25 Feb 2009 11:46:00 +0200 Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <20090225.000628.108688119.davem@davemloft.net> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <20090224.232814.227017310.davem@davemloft.net> <49A4FB79.1090809@voltaire.com> <20090225.000628.108688119.davem@davemloft.net> Message-ID: <49A51358.4080408@Voltaire.com> David Miller wrote: > It's making real sockets, using the real networking stack, > using up real IP port/address pairs recognized by the rest > of the real networking stack, and doing RDMA over that connection. The only usage of the network stack done by the RDMA stack (at its rdma connection manager) on behalf of protocols such as RDS is for address resolution. This practice of sending ARPs has been supported by the mainline kernel for a long time and is also common among other technologies / drivers. Or.
From vlad at lists.openfabrics.org Wed Feb 25 03:28:20 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 25 Feb 2009 03:28:20 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090225-0200 daily build status Message-ID: <20090225112820.B22F9E60FC3@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with 
linux-2.6.21.1 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From kliteyn at dev.mellanox.co.il Wed Feb 25 04:25:08 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 25 Feb 2009 14:25:08 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node In-Reply-To: <20090224143706.GO7641@sashak.voltaire.com> References: <499AB068.2020205@dev.mellanox.co.il> <20090218181955.GX5910@sashak.voltaire.com> <499C7E2D.8050301@dev.mellanox.co.il> <20090224143706.GO7641@sashak.voltaire.com> Message-ID: <49A538A4.3070008@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 23:31 Wed 18 Feb , Yevgeny Kliteynik wrote: > > [snip...] >> Good point. >> I'll repost the patch when we finish discussing it. > > Let's go this way now. Please resend the patch. Will do. > After looking closer into scenario with SwithInfo/PortInfo race I'm > thinking about two optimizations there: > > 1. Initialize all switch ports (and not only local and port 0) right on > first NodeInfo receiving (via osm_node_new()) - this makes your patch > unnecessary, but it is a bigger change which will definitely require some > heavy testing, so it is fine IMO to do it subsequently. > > 2. Request PortInfo for all switch ports right on first NodeInfo > receiving (not wait for SwitchInfo), just in parallel with SwitchInfo > request. This should simplify subnet discovery flow and speed it up. > And also this will require some heavy testing... > > What do you think about (1) and (2). Could you see any disadvantages? I don't see any. The first option looks shorter and somewhat more safe, but the second option might speed up the discovery a little bit. I'm OK with both options. 
In any case, this will have to be seriously tested. -- Yevgeny > Sasha > From cameron at harr.org Wed Feb 25 08:31:18 2009 From: cameron at harr.org (Cameron Harr) Date: Wed, 25 Feb 2009 09:31:18 -0700 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49A4812A.8050202@harr.org> References: <48E386F6.5040502@fusionio.com> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net> <49A42BE9.4030603@harr.org> <49A43439.7080405@vlnb.net> <49A4812A.8050202@harr.org> Message-ID: <49A57256.2000005@harr.org> Cameron Harr wrote: > Vladislav Bolkhovitin wrote: >>>>> I ran each test 3 times and took the averages. In order to get a >>>>> quick look at performance per run, I added a column in the summary >>>>> that sums the IOPs for each test with SRPT thread enabled and then >>>>> not enabled. Test 4 seems to give the best results. Here's a brief >>>>> summary of that summary with just SRPT thread=0: >>>>> >>>>> Baseline: 356226.39 >>>>> Test 1: 371217.6533 >>>>> Test 2: 370553.78 >>>>> Test 3: 373295.2033 >>>>> Test 4: 399385.2233 >>>>> Test 5: 393204.5833 >>>> Linux CPU scheduler does really impressive job!
>>>> >>>> Interesting, will something change with: >>>> >>>> 1. The latest SVN. It has some changes, which might make a difference. >>> Sorry for the delay. >>> This is with SVN rev 673. I don't hit the high I hit before, but at >>> a 1.8% difference (with test 4), it's statistically noise. >>> >>> Test 1: 390631.5133 >>> Test 2: 386125.4133 >>> Test 3: 356268.0267 >>> Test 4: 392237.7867 >>> Test 5: 390012.1467 > I just ran again, this time with rev 680 and am a little concerned to > see the drop in performance. I verified that debug is not on. I'll try > to start another run on 680 to see if I get similar results. > > Test 1:368342.41 > Test 2:366787.2067 > Test 3:345334.68 > Test 4:372684.58 > Test 5:372184.8333 I re-compiled and re-ran the tests and numbers are a little better but performance still seems to have gone down from 673: Test 1:373751.66 Test 2:371242.6067 Test 3:347988.1467 Test 4:378247.31 Test 5:375616.53 From sashak at voltaire.com Wed Feb 25 09:53:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 25 Feb 2009 19:53:00 +0200 Subject: [ofa-general] [PATCH] opensm/lid_mgr: fix duplicated lid assignment Message-ID: <20090225175300.GD11192@sashak.voltaire.com> When OpenSM is running with the '-r' option (reassign lids) it will clean up all internal free lid ranges and the used_lids db, but not the guid2lid db. Then, during new lid assignment for ports which are not present in the guid2lid db, LidMgr will ignore the fact that some port may already have the same lid assigned. As a result we will get a subnet with duplicated lids. The proposed fix is to reassign all lids unconditionally (ignoring the existing guid2lid db and the port's current lid value) if '-r' is specified.
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_lid_mgr.c | 9 ++++++--- 1 files changed, 6 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c index b74aba5..ec7fd86 100644 --- a/opensm/opensm/osm_lid_mgr.c +++ b/opensm/opensm/osm_lid_mgr.c @@ -773,6 +773,10 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr, !osm_switch_sp0_is_lmc_capable(p_port->p_node->sw, p_mgr->p_subn)) num_lids = 1; + if (p_mgr->p_subn->first_time_master_sweep == TRUE && + p_mgr->p_subn->opt.reassign_lids == TRUE) + goto AssignLid; + /* if the port matches the guid2lid */ if (!osm_db_guid2lid_get(p_mgr->p_g2l, guid, &min_lid, &max_lid)) { *p_min_lid = min_lid; @@ -804,9 +808,7 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr, /* we want to ignore the discovered lid if we are also on first sweep of reassign lids flow */ - if (min_lid && - !((p_mgr->p_subn->first_time_master_sweep == TRUE) && - (p_mgr->p_subn->opt.reassign_lids == TRUE))) { + if (min_lid) { /* make sure lid is valid */ if ((num_lids == 1) || ((min_lid & lmc_mask) == min_lid)) { /* is it free */ @@ -831,6 +833,7 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr, guid, min_lid, min_lid + num_lids - 1); } +AssignLid: /* first cleanup the existing discovered lid range */ __osm_lid_mgr_cleanup_discovered_port_lid_range(p_mgr, p_port); -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Wed Feb 25 10:02:26 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 25 Feb 2009 20:02:26 +0200 Subject: [ofa-general] [PATCH] opensm/lid_mgr: simplify lmc_mask initialization Message-ID: <20090225180226.GE11192@sashak.voltaire.com> Expression '~((1 << lmc) - 1)' has value 0xffff when lmc = 0, so we don't need to set it up as: if (lmc) lmc_mask = ~((1 << lmc) - 1); else lmc_mask = 0xffff; Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_lid_mgr.c | 24 ++++++------------------ 1 files changed, 6 insertions(+), 18 deletions(-) diff --git 
a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c index ec7fd86..ce02b4c 100644 --- a/opensm/opensm/osm_lid_mgr.c +++ b/opensm/opensm/osm_lid_mgr.c @@ -146,10 +146,7 @@ static void __osm_lid_mgr_validate_db(IN osm_lid_mgr_t * p_mgr) OSM_LOG_ENTER(p_mgr->p_log); - if (p_mgr->p_subn->opt.lmc) - lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1); - else - lmc_mask = 0xffff; + lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1); cl_qlist_init(&guids); @@ -327,10 +324,7 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr) OSM_LOG_ENTER(p_mgr->p_log); - if (p_mgr->p_subn->opt.lmc) - lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1); - else - lmc_mask = 0xffff; + lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1); /* if we came out of standby we need to discard any previous guid2lid info we might have. @@ -667,10 +661,7 @@ __osm_lid_mgr_find_free_lid_range(IN osm_lid_mgr_t * const p_mgr, p_mgr->p_subn->opt.lmc, num_lids); lmc_num_lids = (1 << p_mgr->p_subn->opt.lmc); - if (p_mgr->p_subn->opt.lmc) - lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1); - else - lmc_mask = 0xffff; + lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1); /* Search the list of free lid ranges for a range which is big enough @@ -760,11 +751,6 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr, OSM_LOG_ENTER(p_mgr->p_log); - if (p_mgr->p_subn->opt.lmc) - lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1); - else - lmc_mask = 0xffff; - /* get the lid from the guid2lid */ guid = cl_ntoh64(osm_port_get_guid(p_port)); @@ -777,6 +763,8 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr, p_mgr->p_subn->opt.reassign_lids == TRUE) goto AssignLid; + lmc_mask = ~(num_lids - 1); + /* if the port matches the guid2lid */ if (!osm_db_guid2lid_get(p_mgr->p_g2l, guid, &min_lid, &max_lid)) { *p_min_lid = min_lid; @@ -810,7 +798,7 @@ __osm_lid_mgr_get_port_lid(IN osm_lid_mgr_t * const p_mgr, reassign lids flow */ if (min_lid) { /* make sure lid is valid */ - if ((num_lids == 1) || ((min_lid 
& lmc_mask) == min_lid)) { + if ((min_lid & lmc_mask) == min_lid) { /* is it free */ if (__osm_lid_mgr_is_range_not_persistent (p_mgr, min_lid, num_lids)) { -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Wed Feb 25 10:02:53 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 25 Feb 2009 20:02:53 +0200 Subject: [ofa-general] [PATCH] opensm/sweep: add log message before lid assignment Message-ID: <20090225180253.GF11192@sashak.voltaire.com> Improve logging - add log message (msg box) between pkey tables and QoS parameters setup and lid manager. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_state_mgr.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 0a27044..a1efd1a 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1247,6 +1247,9 @@ _repeat_discovery: if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats)) return; + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, + "PKEY and QOS setup completed - STARTING SM LID CONFIG"); + osm_lid_mgr_process_sm(&sm->lid_mgr); if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats)) return; -- 1.6.1.2.319.gbd9e From rdreier at cisco.com Wed Feb 25 10:16:18 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Feb 2009 10:16:18 -0800 Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <20090225.000628.108688119.davem@davemloft.net> (David Miller's message of "Wed, 25 Feb 2009 00:06:28 -0800 (PST)") References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <20090224.232814.227017310.davem@davemloft.net> <49A4FB79.1090809@voltaire.com> <20090225.000628.108688119.davem@davemloft.net> Message-ID: > It's making real sockets, using the real networking stack, > using up real IP port/address pairs recognized by the rest > of the real networking stack, and doing RDMA over that > connection. > > That's not allowed. 
> > We always said that if these RDMA things are in the tree, > they should use their own IP addresses and that are not > visible to the real Linux networking stack. How is what the RDS code is doing any different than what the (upstream) NFS/RDMA and iSER code does? It uses the same rdma_xxx() interfaces for handling connections. - R. From andy.grover at gmail.com Wed Feb 25 10:43:27 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Wed, 25 Feb 2009 10:43:27 -0800 Subject: [ofa-general] ***SPAM*** Re: [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <20090224.232814.227017310.davem@davemloft.net> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <20090224.232814.227017310.davem@davemloft.net> Message-ID: On Tue, Feb 24, 2009 at 11:28 PM, David Miller wrote: > This makes RDMA too much of a first-class citizen in the networking > stack.  That's a blocker for me. RDS is not an RDMA protocol, it is a protocol that supports RDMA. RDS is not an IB protocol, it is a protocol that supports IB transport. RDS's reliable-datagram socket implementation has a modular interface to the transport (e.g. tcp, udp, or ib) and works fine over transports that do not support RDMA. (Most users also do not use RDMA.) OK so we have: 1) RDS socket code must go in net/rds, it's socket code 2) RDS core rdma support move to drivers/infiniband? 3) RDS IB/iwarp transport keep the non-RDMA support in net/rds or move to d/i? It's not RDMA it's IB 4) IB/iwarp transport's rdma support move to d/i 5) RDS TCP transport (impl. but not incl. in patchset) net/rds 6) RDS UDP/DCB transport (not impl. yet) net/rds Does this look right? Right now it sounds like you're saying 1, 5, and 6 go in net/rds, 2-4 go in drivers/infiniband. I'd personally prefer to not split it up, or to split it on the natural core/transport boundary, but I can make it work whatever you decide. 
:-) > Furthermore the port you've choosen for the protocol is arbitrary, not > properly allocated with the appropriate standards committee, and > therefore could conflict with something other people are using. I'm sure allocating the port won't be too big an issue. Regards -- Andy From purdy at sgi.com Wed Feb 25 13:09:32 2009 From: purdy at sgi.com (Dale Purdy) Date: Wed, 25 Feb 2009 15:09:32 -0600 Subject: [ofa-general] [PATCH] opensm: Implement weighted routing Message-ID: <20090225210932.GA6098@sgi.com> Implement a weighted routing scheme for fine tuning the lid matrix for routing engines that use the lid matrix. An optional file containing a switch_guid port and weighing factor combination per line can be supplied to override a default hop weight factor of 1 for each switch output port in computing the lid matrix. This allows one to alter the min hop paths for things like routes to I/O. Signed-off-by: Dale Purdy --- opensm/include/opensm/osm_port.h | 4 ++ opensm/include/opensm/osm_subnet.h | 1 + opensm/man/opensm.8.in | 7 +++ opensm/opensm/main.c | 13 +++++- opensm/opensm/osm_subnet.c | 7 +++ opensm/opensm/osm_ucast_mgr.c | 82 ++++++++++++++++++++++++++++++++++-- 6 files changed, 109 insertions(+), 5 deletions(-) diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h index 3dda541..ae54c9f 100644 --- a/opensm/include/opensm/osm_port.h +++ b/opensm/include/opensm/osm_port.h @@ -115,6 +115,7 @@ typedef struct osm_physp { osm_pkey_tbl_t pkeys; ib_vl_arb_table_t vl_arb[4]; cl_ptr_vector_t slvl_by_port; + uint8_t hop_wf; } osm_physp_t; /* * FIELDS @@ -171,6 +172,9 @@ typedef struct osm_physp { * Switches have an entry for every other input port (inc SMA=0). * On CAs only one per port. * +* hop_wf +* Hop weighting factor to be used in the routing. 
+* * SEE ALSO * Port *********/ diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 2dfccda..6353d22 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -181,6 +181,7 @@ typedef struct osm_subn_opt { char *console; uint16_t console_port; char *port_prof_ignore_file; + char *hop_weights_file; boolean_t port_profile_switch_nodes; boolean_t sweep_on_trap; char *routing_engine_names; diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in index 7690980..c77ecab 100644 --- a/opensm/man/opensm.8.in +++ b/opensm/man/opensm.8.in @@ -31,6 +31,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA) [\-console [off | local | socket | loopback]] [\-console-port ] [\-i(gnore-guids) ] +[\-w | \-\-hop_weights_file ] [\-f | \-\-log_file ] [\-L | \-\-log_limit ] [\-e(rase_log_file)] [\-P(config) ] @@ -233,6 +234,12 @@ This option provides the means to define a set of ports (by node guid and port number) that will be ignored by the link load equalization algorithm. .TP +\fB\-w\fR, \fB\-\-hop_weights_file\fR +This option provides weighting factors per port representing a hop +cost in computing the lid matrix. The file consists of lines +containing a switch GUID, output port, and weighting factor. Any port +not listed in the file defaults to a weighting factor of 1. 
+.TP \fB\-x\fR, \fB\-\-honor_guid2lid\fR This option forces OpenSM to honor the guid2lid file, when it comes out of Standby state, if such file exists diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 47fd658..f145dab 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -255,6 +255,10 @@ static void show_usage(void) " This option provides the means to define a set of ports\n" " (by guid) that will be ignored by the link load\n" " equalization algorithm.\n\n"); + printf("--hop_weights_file, -w \n" + " This option provides the means to define a weighting\n" + " factor per port for customizing the least weight\n" + " hops for the routing.\n\n"); printf("--honor_guid2lid, -x\n" " This option forces OpenSM to honor the guid2lid file,\n" " when it comes out of Standby state, if such file exists\n" @@ -524,7 +528,7 @@ int main(int argc, char *argv[]) char *conf_template = NULL, *config_file = NULL; uint32_t val; const char *const short_option = - "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:"; + "F:c:i:w:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:"; /* In the array below, the 2nd parameter specifies the number @@ -540,6 +544,7 @@ int main(int argc, char *argv[]) {"debug", 1, NULL, 'd'}, {"guid", 1, NULL, 'g'}, {"ignore_guids", 1, NULL, 'i'}, + {"hop_weights_file", 1, NULL, 'w'}, {"lmc", 1, NULL, 'l'}, {"sweep", 1, NULL, 's'}, {"timeout", 1, NULL, 't'}, @@ -664,6 +669,12 @@ int main(int argc, char *argv[]) opt.port_prof_ignore_file); break; + case 'w': + opt.hop_weights_file = optarg; + printf(" Hop Weights File = %s\n", + opt.hop_weights_file); + break; + case 'g': /* Specifies port guid with which to bind. 
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index b3100a4..26e4481 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -322,6 +322,7 @@ static const opt_rec_t opt_tbl[] = { { "polling_retry_number", OPT_OFFSET(polling_retry_number), opts_parse_uint32, NULL, 1 }, { "force_heavy_sweep", OPT_OFFSET(force_heavy_sweep), opts_parse_boolean, NULL, 1 }, { "port_prof_ignore_file", OPT_OFFSET(port_prof_ignore_file), opts_parse_charp, NULL, 0 }, + { "hop_weights_file", OPT_OFFSET(hop_weights_file), opts_parse_charp, NULL, 0 }, { "port_profile_switch_nodes", OPT_OFFSET(port_profile_switch_nodes), opts_parse_boolean, NULL, 1 }, { "sweep_on_trap", OPT_OFFSET(sweep_on_trap), opts_parse_boolean, NULL, 1 }, { "routing_engine", OPT_OFFSET(routing_engine_names), opts_parse_charp, NULL, 0 }, @@ -727,6 +728,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->qos_policy_file = strdup(OSM_DEFAULT_QOS_POLICY_FILE); p_opt->accum_log_file = TRUE; p_opt->port_prof_ignore_file = NULL; + p_opt->hop_weights_file = NULL; p_opt->port_profile_switch_nodes = FALSE; p_opt->sweep_on_trap = TRUE; p_opt->use_ucast_cache = FALSE; @@ -1359,6 +1361,11 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) p_opts->port_prof_ignore_file : null_str); fprintf(out, + "# The file holding routing weighting factors per output port\n" + "hop_weights_file %s\n\n", + p_opts->hop_weights_file ? 
p_opts->hop_weights_file : null_str); + + fprintf(out, "# Routing engine\n" "# Multiple routing engines can be specified separated by\n" "# commas so that specific ordering of routing algorithms will\n" diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index e404c91..81c3604 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -125,11 +125,11 @@ __osm_ucast_mgr_process_hop_0_1(IN cl_map_item_t * const p_map_item, if (p_remote_node && p_remote_node->sw && (p_remote_node != p_sw->p_node)) { + osm_physp_t *p = osm_node_get_physp_ptr(p_sw->p_node, i); + remote_lid = osm_node_get_base_lid(p_remote_node, 0); remote_lid = cl_ntoh16(remote_lid); - osm_switch_set_hops(p_sw, remote_lid, i, 1); - osm_switch_set_hops(p_remote_node->sw, lid, remote_port, - 1); + osm_switch_set_hops(p_sw, remote_lid, i, p->hop_wf); } } } @@ -146,6 +146,7 @@ __osm_ucast_mgr_process_neighbor(IN osm_ucast_mgr_t * const p_mgr, osm_switch_t *p_sw, *p_next_sw; uint16_t lid_ho; uint8_t hops; + osm_physp_t *p; OSM_LOG_ENTER(p_mgr->p_log); @@ -156,6 +157,8 @@ __osm_ucast_mgr_process_neighbor(IN osm_ucast_mgr_t * const p_mgr, cl_ntoh64(osm_node_get_node_guid(p_remote_sw->p_node)), port_num, remote_port_num); + p = osm_node_get_physp_ptr(p_this_sw->p_node, port_num); + p_next_sw = (osm_switch_t *) cl_qmap_head(&p_mgr->p_subn->sw_guid_tbl); while (p_next_sw != (osm_switch_t *) cl_qmap_end(&p_mgr->p_subn->sw_guid_tbl)) { @@ -166,7 +169,7 @@ __osm_ucast_mgr_process_neighbor(IN osm_ucast_mgr_t * const p_mgr, hops = osm_switch_get_least_hops(p_remote_sw, lid_ho); if (hops == OSM_NO_PATH) continue; - hops++; + hops += p->hop_wf; if (hops < osm_switch_get_hop_count(p_this_sw, lid_ho, port_num)) { if (osm_switch_set_hops @@ -573,6 +576,61 @@ __osm_ucast_mgr_process_neighbors(IN cl_map_item_t * const p_map_item, /********************************************************************** **********************************************************************/ +static 
int set_hop_wf(void *ctx, uint64_t guid, char *p) +{ + osm_ucast_mgr_t *m = ctx; + osm_node_t *node = osm_get_node_by_guid(m->p_subn, cl_hton64(guid)); + osm_physp_t *physp; + unsigned port, hop_wf; + char *e; + + if (!node || !node->sw) { + OSM_LOG(m->p_log, OSM_LOG_DEBUG, + "switch with guid 0x%016" PRIx64 " is not found\n", + guid); + return 0; + } + + if (!p || !*p || !(port = strtoul(p, &e, 0)) || (p == e) || + port >= node->sw->num_ports) { + OSM_LOG(m->p_log, OSM_LOG_DEBUG, + "bad port specified for guid 0x%016" PRIx64 "\n", guid); + return 0; + } + + p = e + 1; + + if (!*p || !(hop_wf = strtoul(p, &e, 0)) || (p == e) || + (hop_wf >= 0x100)) { + OSM_LOG(m->p_log, OSM_LOG_DEBUG, + "bad hop weight factor specified for guid 0x%016" PRIx64 "port %u\n", + guid, port); + return 0; + } + + physp = osm_node_get_physp_ptr(node, port); + if (!physp) + return 0; + + physp->hop_wf = hop_wf; + + return 0; +} + +static void set_default_hop_wf(cl_map_item_t * const p_map_item, void *ctx) +{ + osm_switch_t *sw = (osm_switch_t *)p_map_item; + int i; + + for (i = 1; i < sw->num_ports; i++) { + osm_physp_t *p = osm_node_get_physp_ptr(sw->p_node, i); + if (p) + p->hop_wf = 1; + } +} + +/********************************************************************** + **********************************************************************/ int osm_ucast_mgr_build_lid_matrices(IN osm_ucast_mgr_t * const p_mgr) { uint32_t i; @@ -585,6 +643,22 @@ int osm_ucast_mgr_build_lid_matrices(IN osm_ucast_mgr_t * const p_mgr) "Starting switches' Min Hop Table Assignment\n"); /* + Set up the weighting factors for the routing. 
+ */ + cl_qmap_apply_func(p_sw_guid_tbl, set_default_hop_wf, NULL); + if (p_mgr->p_subn->opt.hop_weights_file) { + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, + "Fetching hop weight factor file \'%s\'\n", + p_mgr->p_subn->opt.hop_weights_file); + if (parse_node_map(p_mgr->p_subn->opt.hop_weights_file, + set_hop_wf, p_mgr)) { + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : cannot " + "parse hop_weights_file \'%s\'\n", + p_mgr->p_subn->opt.hop_weights_file); + } + } + + /* Set the switch matrices for each switch's own port 0 LID(s) then set the lid matrices for the each switch's leaf nodes. */ -- 1.5.6.5 From davem at davemloft.net Wed Feb 25 13:45:12 2009 From: davem at davemloft.net (David Miller) Date: Wed, 25 Feb 2009 13:45:12 -0800 (PST) Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <20090224.232814.227017310.davem@davemloft.net> Message-ID: <20090225.134512.90879325.davem@davemloft.net> From: Andrew Grover Date: Wed, 25 Feb 2009 10:43:27 -0800 > RDS's reliable-datagram socket implementation has a modular interface > to the transport (e.g. tcp, udp, or ib) and works fine over transports > that do not support RDMA. (Most users also do not use RDMA.) Ok, let me look over the patches again. From ralph.campbell at qlogic.com Wed Feb 25 16:36:03 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 25 Feb 2009 16:36:03 -0800 Subject: [ofa-general] [PATCH] IB/core: fix null pointer dereference in local_completions() Message-ID: <1235608563.3948.199.camel@chromite.mv.qlogic.com> handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local on the mad_agent_priv->local_work work queue with local->mad_priv == NULL if device->process_mad() returns IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and (!ib_response_mad(&mad_priv->mad.mad) || !mad_agent_priv->agent.recv_handler). 
In this case, local_completions() will be called with local->mad_priv == NULL. The code does check for this case and skips calling recv_mad_agent->agent.recv_handler() but recv == 0 so kmem_cache_free() is called with a NULL pointer. Also, since recv isn't reinitialized each time through the loop, it can cause a memory leak if recv should have been zero. Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 5c54fc2..8388e5e 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2356,7 +2356,7 @@ static void local_completions(struct work_struct *work) struct ib_mad_local_private *local; struct ib_mad_agent_private *recv_mad_agent; unsigned long flags; - int recv = 0; + int recv; struct ib_wc wc; struct ib_mad_send_wc mad_send_wc; @@ -2370,14 +2370,15 @@ static void local_completions(struct work_struct *work) completion_list); list_del(&local->completion_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + recv = 1; if (local->mad_priv) { recv_mad_agent = local->recv_mad_agent; if (!recv_mad_agent) { printk(KERN_ERR PFX "No receive MAD agent for local completion\n"); + recv = 0; goto local_send_completion; } - recv = 1; /* * Defined behavior is to complete response * before request From rdreier at cisco.com Wed Feb 25 16:53:30 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Feb 2009 16:53:30 -0800 Subject: [ofa-general] Re: [PATCH] IB/core: fix null pointer dereference in local_completions() In-Reply-To: <1235608563.3948.199.camel@chromite.mv.qlogic.com> (Ralph Campbell's message of "Wed, 25 Feb 2009 16:36:03 -0800") References: <1235608563.3948.199.camel@chromite.mv.qlogic.com> Message-ID: This looks fine to me. Hal and/or Sean, any comment? 
From rdreier at cisco.com Wed Feb 25 16:53:58 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 25 Feb 2009 16:53:58 -0800 Subject: [ofa-general] Re: [PATCH] IB/core: fix null pointer dereference in local_completions() In-Reply-To: <1235608563.3948.199.camel@chromite.mv.qlogic.com> (Ralph Campbell's message of "Wed, 25 Feb 2009 16:36:03 -0800") References: <1235608563.3948.199.camel@chromite.mv.qlogic.com> Message-ID: By the way, I didn't pay close attention to the previous discussion about this. Did you and Hal reach agreement about the approach? - R. From ralph.campbell at qlogic.com Wed Feb 25 17:03:40 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 25 Feb 2009 17:03:40 -0800 Subject: [ofa-general] Re: [PATCH] IB/core: fix null pointer dereference in local_completions() In-Reply-To: References: <1235608563.3948.199.camel@chromite.mv.qlogic.com> Message-ID: <1235610220.3948.206.camel@chromite.mv.qlogic.com> On Wed, 2009-02-25 at 16:53 -0800, Roland Dreier wrote: > By the way, I didn't pay close attention to the previous discussion > about this. Did you and Hal reach agreement about the approach? > > - R. The earlier patch I posted wasn't correct. I was looking for comments about how kmem_cache_free() is called when recv_mad_agent->agent.recv_handler() is called. Hal didn't answer directly so I checked the code and I see that the receive handler is responsible for calling ib_free_recv_mad() which does the work. Hal just wanted me to test it, which I did. 
From sean.hefty at intel.com Wed Feb 25 17:22:26 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 25 Feb 2009 17:22:26 -0800 Subject: [ofa-general] [PATCH] IB/core: fix null pointer dereference in local_completions() In-Reply-To: <1235608563.3948.199.camel@chromite.mv.qlogic.com> References: <1235608563.3948.199.camel@chromite.mv.qlogic.com> Message-ID: <0C179AD5ED9C4035B35F553555FA185E@amr.corp.intel.com> >handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local >on the mad_agent_priv->local_work work queue with >local->mad_priv == NULL if device->process_mad() returns >IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and >(!ib_response_mad(&mad_priv->mad.mad) || > !mad_agent_priv->agent.recv_handler). > >In this case, local_completions() will be called with >local->mad_priv == NULL. The code does check for this >case and skips calling recv_mad_agent->agent.recv_handler() >but recv == 0 so kmem_cache_free() is called with a >NULL pointer. > >Also, since recv isn't reinitialized each time through the loop, >it can cause a memory leak if recv should have been zero. > >Signed-off-by: Ralph Campbell > >diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c >index 5c54fc2..8388e5e 100644 >--- a/drivers/infiniband/core/mad.c >+++ b/drivers/infiniband/core/mad.c >@@ -2356,7 +2356,7 @@ static void local_completions(struct work_struct *work) > struct ib_mad_local_private *local; > struct ib_mad_agent_private *recv_mad_agent; > unsigned long flags; >- int recv = 0; >+ int recv; With this change, I think it would be better to rename the 'recv' flag. The logic itself looks correct to me. 
- Sean From ralph.campbell at qlogic.com Wed Feb 25 17:43:58 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 25 Feb 2009 17:43:58 -0800 Subject: [ofa-general] [PATCH] IB/core: fix null pointer dereference in local_completions() In-Reply-To: <0C179AD5ED9C4035B35F553555FA185E@amr.corp.intel.com> References: <1235608563.3948.199.camel@chromite.mv.qlogic.com> <0C179AD5ED9C4035B35F553555FA185E@amr.corp.intel.com> Message-ID: <1235612638.3948.211.camel@chromite.mv.qlogic.com> On Wed, 2009-02-25 at 17:22 -0800, Sean Hefty wrote: > >handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local > >on the mad_agent_priv->local_work work queue with > >local->mad_priv == NULL if device->process_mad() returns > >IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and > >(!ib_response_mad(&mad_priv->mad.mad) || > > !mad_agent_priv->agent.recv_handler). > > > >In this case, local_completions() will be called with > >local->mad_priv == NULL. The code does check for this > >case and skips calling recv_mad_agent->agent.recv_handler() > >but recv == 0 so kmem_cache_free() is called with a > >NULL pointer. > > > >Also, since recv isn't reinitialized each time through the loop, > >it can cause a memory leak if recv should have been zero. > > > >Signed-off-by: Ralph Campbell > > > >diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > >index 5c54fc2..8388e5e 100644 > >--- a/drivers/infiniband/core/mad.c > >+++ b/drivers/infiniband/core/mad.c > >@@ -2356,7 +2356,7 @@ static void local_completions(struct work_struct *work) > > struct ib_mad_local_private *local; > > struct ib_mad_agent_private *recv_mad_agent; > > unsigned long flags; > >- int recv = 0; > >+ int recv; > > With this change, I think it would be better to rename the 'recv' flag. The > logic itself looks correct to me. > > - Sean OK, how about "free" or "free_mad"? 
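The heart of the weighted routing patch posted earlier in this archive is the change in __osm_ucast_mgr_process_neighbor() from `hops++` to `hops += p->hop_wf`, with every switch port defaulting to a weighting factor of 1 via set_default_hop_wf(). The following is a minimal user-space sketch of that idea under stated assumptions: a hypothetical four-switch toy fabric, an adjacency matrix of weights, and a Bellman-Ford style relaxation repeated until nothing changes. The names weighted_min_hops(), next_hop(), NSW and the ring topology are illustrative only and are not OpenSM code; OpenSM's real port selection is more involved (it also balances load across equal-cost ports).

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical toy fabric: NSW switches. wf[u][v] is the hop
 * weighting factor of switch u's output port toward switch v.
 * In this sketch 0 simply means "no link" (the patch's file parser
 * rejects a weighting factor of 0 and defaults every port to 1). */
#define NSW 4
#define NO_PATH UINT_MAX

/* Weighted least-hop count from every switch to switch dst.
 * Each pass relaxes hops[u] against hops[v] + wf[u][v] for every
 * link u->v (the toy analogue of "hops += p->hop_wf"); passes
 * repeat until nothing changes, as in Bellman-Ford. */
static void weighted_min_hops(unsigned wf[NSW][NSW], int dst,
                              unsigned hops[NSW])
{
    int changed = 1;

    for (int u = 0; u < NSW; u++)
        hops[u] = (u == dst) ? 0 : NO_PATH;

    while (changed) {
        changed = 0;
        for (int u = 0; u < NSW; u++) {
            for (int v = 0; v < NSW; v++) {
                if (wf[u][v] == 0 || hops[v] == NO_PATH)
                    continue;
                if (hops[v] + wf[u][v] < hops[u]) {
                    hops[u] = hops[v] + wf[u][v];
                    changed = 1;
                }
            }
        }
    }
}

/* Simplistic port choice: the first neighbor with minimal weighted
 * cost. Returns -1 if switch u has no path to the destination. */
static int next_hop(unsigned wf[NSW][NSW], const unsigned hops[NSW], int u)
{
    int best = -1;
    unsigned best_cost = NO_PATH;

    assert(u >= 0 && u < NSW);
    for (int v = 0; v < NSW; v++) {
        if (wf[u][v] == 0 || hops[v] == NO_PATH)
            continue;
        if (hops[v] + wf[u][v] < best_cost) {
            best_cost = hops[v] + wf[u][v];
            best = v;
        }
    }
    return best;
}
```

With all four links of a ring 0-1-2-3-0 at the default weight of 1, switch 0 is two weighted hops from switch 2 along either side and this toy rule picks the port toward switch 1; raising wf[0][1] to 3 keeps the distance at 2 but moves the path onto the link toward switch 3. Note that set_hop_wf() in the patch refuses a weighting factor of 0 (the `!(hop_wf = strtoul(p, &e, 0))` test). If zero were accepted as a real weight, two switches joined by a zero-weight link would end up with equal distances, and a naive "any minimal port" rule could have each forward to the other, which is one way routing loops can arise.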
From keshetti.mahesh at gmail.com Wed Feb 25 20:51:43 2009 From: keshetti.mahesh at gmail.com (Keshetti Mahesh) Date: Thu, 26 Feb 2009 10:21:43 +0530 Subject: [ofa-general] Re: [PATCH] opensm: Implement weighted routing Message-ID: <829ded920902252051g283b9e84vffce832452d241ac@mail.gmail.com> Hello Dale Purdy, I have a requirement where I have to set some hops' weight factors to zero. Is this supported by your patch? I have implemented something similar before, but it led to loops in the routing table. Does your patch take care of those things? -Mahesh From sashak at voltaire.com Wed Feb 25 21:10:12 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 07:10:12 +0200 Subject: Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread In-Reply-To: References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> <20090217142859.9e7a7e22.weiny2@llnl.gov> <20090218003355.GX7189@sashak.voltaire.com> Message-ID: <20090226051012.GH11192@sashak.voltaire.com> Hi Hal, On 10:20 Wed 18 Feb , Hal Rosenstock wrote: > On Tue, Feb 17, 2009 at 7:33 PM, Sasha Khapyorsky wrote: > > On 18:21 Tue 17 Feb , Hal Rosenstock wrote: > >> > > >> > For utilities which run once through I think the old functions work just > >> > fine. > >> > >> Well, sort of... Aren't mad_portid "collisions" possible when multiple > >> programs are run concurrently ? > > > > No. > > With the old API, mad_portid can be overwritten by another process or > thread. Another thread is not an expected use case but it is possible. Yes, but you asked about "collisions" between different programs (processes) run.
Sasha From sashak at voltaire.com Wed Feb 25 21:18:36 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 07:18:36 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_inform.c: Fix sense of zero GID compare in __match_inf_rec In-Reply-To: <20090218151015.GA6482@comcast.net> References: <20090218151015.GA6482@comcast.net> Message-ID: <20090226051836.GJ11192@sashak.voltaire.com> On 10:10 Wed 18 Feb , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Feb 25 21:19:12 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 07:19:12 +0200 Subject: [ofa-general] Re: [PATCH] management/libibmad.txt: Remove madrpc_lock/unlock In-Reply-To: <20090218152728.GA8489@comcast.net> References: <20090218152728.GA8489@comcast.net> Message-ID: <20090226051912.GK11192@sashak.voltaire.com> On 10:27 Wed 18 Feb , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Feb 25 21:51:24 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 07:51:24 +0200 Subject: [ofa-general] Re: [PATCH] opensm/man/opensm.8.in: Indicate ROUTER_EXP deprecated In-Reply-To: <20090218152913.GC8489@comcast.net> References: <20090218152913.GC8489@comcast.net> Message-ID: <20090226055117.GM11192@sashak.voltaire.com> On 10:29 Wed 18 Feb , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Feb 25 21:58:42 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 07:58:42 +0200 Subject: [ofa-general] Re: opensm/osm_console.c: Improve perfmgr print_counters error message In-Reply-To: <20090218153227.GF8489@comcast.net> References: <20090218153227.GF8489@comcast.net> Message-ID: <20090226055842.GN11192@sashak.voltaire.com> On 10:32 Wed 18 Feb , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Wed Feb 25 22:01:32 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 08:01:32 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags/smpdump.c: Fix usage examples In-Reply-To: <20090218155537.GA8762@comcast.net> References: <20090218155537.GA8762@comcast.net> Message-ID: <20090226060132.GO11192@sashak.voltaire.com> On 10:55 Wed 18 Feb , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Feb 25 22:03:47 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 08:03:47 +0200 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/smpdump.c: Release umad resources on exit In-Reply-To: <20090218171932.GA15139@comcast.net> References: <20090218171932.GA15139@comcast.net> Message-ID: <20090226060347.GP11192@sashak.voltaire.com> On 12:19 Wed 18 Feb , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Feb 25 22:15:51 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 08:15:51 +0200 Subject: [ofa-general] Re: [PATCH] opensm/console: Enhance perfmgr print_counters for better nodenames In-Reply-To: <20090219130653.GA29318@comcast.net> References: <20090219130653.GA29318@comcast.net> Message-ID: <20090226061551.GQ11192@sashak.voltaire.com> On 08:06 Thu 19 Feb , Hal Rosenstock wrote: > > nodenames can have spaces in them > Also, no need for next_token being inlined > > Signed-off-by: Hal Rosenstock Applied with changes noted below. Thanks. [snip...] 
> diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c > index 3babe3a..8766f93 100644 > --- a/opensm/opensm/osm_perfmgr.c > +++ b/opensm/opensm/osm_perfmgr.c > @@ -1304,9 +1304,9 @@ void > osm_perfmgr_print_counters(osm_perfmgr_t *pm, char *nodename, FILE *fp) > { > uint64_t guid = strtoull(nodename, NULL, 0); > - if (guid == 0 && errno == EINVAL) > + if (guid == 0 && errno) // name > perfmgr_db_print_by_name(pm->db, nodename, fp); > - else > + else // guid > perfmgr_db_print_by_guid(pm->db, guid, fp); Such comments are not really helpful - it is pretty clear from the code (flow itself and function names too) what is going on there, so I'm removing this. And in general I think it is better to use C-style comments - /* ... */, in C code and not C++-style // ... . Sasha From sashak at voltaire.com Wed Feb 25 22:24:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 08:24:45 +0200 Subject: [ofa-general] Re: [PATCHv2] opensm/man/opensm.8.in: Indicate ROUTER_EXP obsoleted In-Reply-To: <20090219184415.GA29943@comcast.net> References: <20090219184415.GA29943@comcast.net> Message-ID: <20090226062445.GR11192@sashak.voltaire.com> On 13:44 Thu 19 Feb , Hal Rosenstock wrote: > > Pointed out by Rolf > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Feb 25 22:27:46 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 08:27:46 +0200 Subject: [ofa-general] Re: [PATCH] libibmad/fields.c: Dump LIDs as unsigned decimal In-Reply-To: <20090220215845.GA7360@comcast.net> References: <20090220215845.GA7360@comcast.net> Message-ID: <20090226062746.GS11192@sashak.voltaire.com> On 16:58 Fri 20 Feb , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Wed Feb 25 22:28:48 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 08:28:48 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Convert more LID prints to unsigned decimal In-Reply-To: <20090220215938.GB7360@comcast.net> References: <20090220215938.GB7360@comcast.net> Message-ID: <20090226062848.GT11192@sashak.voltaire.com> On 16:59 Fri 20 Feb , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From klakshman03 at hotmail.com Wed Feb 25 22:38:42 2009 From: klakshman03 at hotmail.com (lakshmana swamy) Date: Thu, 26 Feb 2009 12:08:42 +0530 Subject: [ofa-general] Problem in IB network without Switch Message-ID: Hi All, I have been trying to enable IPoIB communication between two machines. The machines are connected back-to-back with an InfiniBand cable, since I don't have an IB switch. The drivers are installed and IP is configured on both machines, and the subnet manager (opensmd) is running on one machine. The problem is that no communication happens over IB: the "ibstatus" output shows the ports in the "Down" state on both machines. I am unable to figure out where the problem is.
Operating System: Rocks 5.0 (RHEL 5.0)
OFED: Cisco OFED roll 5.0 (OFED 1.3)
HCA cards: Mellanox SDR
Please check the following command output.
[root@mattool ~]# ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c901:08cd:13c1
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            2.5 Gb/sec (1X)

Infiniband device 'mthca0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c901:08cd:13c2
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            2.5 Gb/sec (1X)

[root@mattool ~]# /etc/init.d/openibd status

  HCA driver loaded

Configured devices:
ib0 ib1

Currently active devices:
ib0 ib1

The following OFED modules are loaded:

  rdma_ucm qlgc_vnic ib_sdp rdma_cm ib_addr ib_ipoib ib_ipath mlx4_core mlx4_ib ib_mthca ib_uverbs ib_umad ib_sa ib_cm ib_mad ib_core iw_cxgb3

[root@mattool ~]#

Thanks
Laxman
From sean.hefty at intel.com Wed Feb 25 22:43:35 2009 From: sean.hefty at intel.com (Hefty, Sean) Date: Wed, 25 Feb 2009 22:43:35 -0800 Subject: [ofa-general] [PATCH] IB/core: fix null pointer dereference in local_completions() In-Reply-To: <1235612638.3948.211.camel@chromite.mv.qlogic.com> References: <1235608563.3948.199.camel@chromite.mv.qlogic.com> <0C179AD5ED9C4035B35F553555FA185E@amr.corp.intel.com> <1235612638.3948.211.camel@chromite.mv.qlogic.com> Message-ID: >OK, how about "free" or "free_mad"? Sure - free_mad sounds good to me.
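The local_completions() fix discussed in this thread comes down to an ownership rule: when a receive MAD agent exists, its recv_handler() takes the buffer (and frees it through ib_free_recv_mad(), as Ralph notes); when local->mad_priv is non-NULL but there is no agent, the loop must free it itself; and when local->mad_priv is NULL there is nothing to free. The following is a minimal user-space model of that per-iteration flag, using the `free_mad` name agreed on in the thread (its polarity is inverted relative to the posted `recv` flag). struct local_entry, recv_handler(), process_completions() and the counters are hypothetical stand-ins, and free() stands in for kmem_cache_free(); this is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_mad_local_private. */
struct local_entry {
    void *mad_priv;      /* may be NULL: no response MAD to deliver */
    int has_recv_agent;  /* 0 models recv_mad_agent == NULL */
};

static int handler_calls; /* buffers handed to the receive handler */
static int local_frees;   /* buffers freed by the completion loop */

/* Stand-in for agent.recv_handler(): it takes ownership and frees
 * the buffer itself (ib_free_recv_mad() in the real code). */
static void recv_handler(void *mad_priv)
{
    handler_calls++;
    free(mad_priv);
}

static void process_completions(struct local_entry *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct local_entry *local = &list[i];
        int free_mad = 0; /* reset on every pass through the loop */

        if (local->mad_priv) {
            if (!local->has_recv_agent) {
                free_mad = 1; /* nobody will consume it: free here */
                goto send_completion;
            }
            recv_handler(local->mad_priv); /* ownership passes */
        }
send_completion:
        /* The send completion would be reported here. */
        if (free_mad) {
            /* The original bug could reach the free with NULL. */
            assert(local->mad_priv != NULL);
            free(local->mad_priv);
            local_frees++;
        }
    }
}
```

Declaring the flag inside the loop body makes the per-iteration reset impossible to forget: a NULL mad_priv never reaches the free, and a value left over from a previous entry cannot cause a later buffer to be freed twice or leaked.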
From sashak at voltaire.com Wed Feb 25 23:06:29 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 09:06:29 +0200 Subject: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: <20090218153016.GD8489@comcast.net> References: <20090218153016.GD8489@comcast.net> Message-ID: <20090226070629.GU11192@sashak.voltaire.com> Hi Hal, On 10:30 Wed 18 Feb , Hal Rosenstock wrote: > > Only supported in osm_vendor_ibumad.c (separate patch for other > vendor layers) > Also, update applications using this (osmtest, opensm) > > Signed-off-by: Hal Rosenstock > --- > opensm/libvendor/osm_vendor_ibumad.c | 24 +++++++++++++++++++----- > opensm/opensm/main.c | 6 ++++++ > opensm/osmtest/main.c | 11 +++++++++++ > opensm/osmtest/osmtest.c | 7 +++++++ > 4 files changed, 43 insertions(+), 5 deletions(-) > > diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c > index 734a860..861bfbe 100644 > --- a/opensm/libvendor/osm_vendor_ibumad.c > +++ b/opensm/libvendor/osm_vendor_ibumad.c > @@ -2,6 +2,7 @@ > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, > umad_ca_t ca; > ib_port_attr_t *attr = p_attr_array; > unsigned done = 0; > - int r, i, j; > + int r, i, j, k; > > OSM_LOG_ENTER(p_vend->p_log); > > CL_ASSERT(p_vend && p_num_ports); > > + r = 0; > if (!*p_num_ports) { > r = IB_INVALID_PARAMETER; > OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: " > @@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, > } > > for (i = 0; i < p_vend->ca_count && !done; i++) { > - /* > - * For each CA, retrieve the port guids > - */ > + /* For each CA, retrieve the port attributes */ > if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) { > if (ca.node_type < 1 || ca.node_type > 3) > continue; > @@ -590,6 +590,21 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, > attr->port_num = ca.ports[j]->portnum; > attr->sm_lid = ca.ports[j]->sm_lid; > attr->link_state = ca.ports[j]->state; > + attr->num_pkeys = ca.ports[j]->pkeys_size; > + if (attr->num_pkeys && attr->p_pkey_table) { > + if (attr->num_pkeys < ca.ports[j]->pkeys_size) { You are doing: attr->num_pkeys = ca.ports[j]->pkeys_size; , just two lines above, so this check will always be false. > + r = IB_INSUFFICIENT_MEMORY; > + OSM_LOG(p_vend->p_log, > + OSM_LOG_ERROR, > + "ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n", > + j, > + ca.ports[j]->pkeys_size); Also, should it be an error? Maybe it is enough to just fill the requested pkey entries?
> + goto Exit; > + } > + for (k = 0; k < attr->num_pkeys; k++) > + attr->p_pkey_table[k] = > + cl_hton16(ca.ports[j]->pkeys[k]); > + } > attr++; > if (attr - p_attr_array > *p_num_ports) { > done = 1; > @@ -601,7 +616,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, > } > > *p_num_ports = attr - p_attr_array; > - r = 0; > > Exit: > OSM_LOG_EXIT(p_vend->p_log); > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > index 73a6274..503d7fa 100644 > --- a/opensm/opensm/main.c > +++ b/opensm/opensm/main.c > @@ -2,6 +2,7 @@ > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid) > uint32_t i, choice = 0; > ib_api_status_t status; > > + for (i = 0; i < num_ports; i++) { > + attr_array[i].num_pkeys = 0; > + attr_array[i].p_pkey_table = NULL; > + } > + Here and below. Just memset(attr_array, 0, sizeof(attr_array)); would be enough. Sasha > /* Call the transport layer for a list of local port GUID values */ > status = osm_vendor_get_all_port_attr(p_osm->p_vendor, attr_array, > &num_ports); > diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c > index b360af6..83c1e13 100644 > --- a/opensm/osmtest/main.c > +++ b/opensm/osmtest/main.c > @@ -1,6 +1,7 @@ > /* > * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt) > ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; > int i; > > + for (i = 0; i < num_ports; i++) { > + attr_array[i].num_pkeys = 0; > + attr_array[i].p_pkey_table = NULL; > + } > + > /* > Call the transport layer for a list of local port > GUID values. > @@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid) > ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; > int i; > > + for (i = 0; i < num_ports; i++) { > + attr_array[i].num_pkeys = 0; > + attr_array[i].p_pkey_table = NULL; > + } > + > /* > Call the transport layer for a list of local port > GUID values. > diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c > index a7b343f..986a8d2 100644 > --- a/opensm/osmtest/osmtest.c > +++ b/opensm/osmtest/osmtest.c > @@ -2,6 +2,7 @@ > * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt, > ib_api_status_t status; > uint32_t num_ports = MAX_LOCAL_IBPORTS; > ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; > + int i; > > OSM_LOG_ENTER(&p_osmt->log); > > + for (i = 0; i < num_ports; i++) { > + attr_array[i].num_pkeys = 0; > + attr_array[i].p_pkey_table = NULL; > + } > + > /* > * Call the transport layer for a list of local port > * GUID values. 
> -- > 1.5.6.4 > From sashak at voltaire.com Wed Feb 25 23:10:59 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 09:10:59 +0200 Subject: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: <20090218153016.GD8489@comcast.net> References: <20090218153016.GD8489@comcast.net> Message-ID: <20090226071059.GV11192@sashak.voltaire.com> On 10:30 Wed 18 Feb , Hal Rosenstock wrote: > > Only supported in osm_vendor_ibumad.c (separate patch for other > vendor layers) > Also, update applications using this (osmtest, opensm) It looks like ibutils (ibis) requires the same fix (attr_array initialization) too. Sasha > > Signed-off-by: Hal Rosenstock > --- > opensm/libvendor/osm_vendor_ibumad.c | 24 +++++++++++++++++++----- > opensm/opensm/main.c | 6 ++++++ > opensm/osmtest/main.c | 11 +++++++++++ > opensm/osmtest/osmtest.c | 7 +++++++ > 4 files changed, 43 insertions(+), 5 deletions(-) > > diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c > index 734a860..861bfbe 100644 > --- a/opensm/libvendor/osm_vendor_ibumad.c > +++ b/opensm/libvendor/osm_vendor_ibumad.c > @@ -2,6 +2,7 @@ > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses.
You may choose to be licensed under the terms of the GNU > @@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, > umad_ca_t ca; > ib_port_attr_t *attr = p_attr_array; > unsigned done = 0; > - int r, i, j; > + int r, i, j, k; > > OSM_LOG_ENTER(p_vend->p_log); > > CL_ASSERT(p_vend && p_num_ports); > > + r = 0; > if (!*p_num_ports) { > r = IB_INVALID_PARAMETER; > OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: " > @@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, > } > > for (i = 0; i < p_vend->ca_count && !done; i++) { > - /* > - * For each CA, retrieve the port guids > - */ > + /* For each CA, retrieve the port attributes */ > if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) { > if (ca.node_type < 1 || ca.node_type > 3) > continue; > @@ -590,6 +590,21 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, > attr->port_num = ca.ports[j]->portnum; > attr->sm_lid = ca.ports[j]->sm_lid; > attr->link_state = ca.ports[j]->state; > + attr->num_pkeys = ca.ports[j]->pkeys_size; > + if (attr->num_pkeys && attr->p_pkey_table) { > + if (attr->num_pkeys < ca.ports[j]->pkeys_size) { > + r = IB_INSUFFICIENT_MEMORY; > + OSM_LOG(p_vend->p_log, > + OSM_LOG_ERROR, > + "ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n", > + j, > + ca.ports[j]->pkeys_size); > + goto Exit; > + } > + for (k = 0; k < attr->num_pkeys; k++) > + attr->p_pkey_table[k] = > + cl_hton16(ca.ports[j]->pkeys[k]); > + } > attr++; > if (attr - p_attr_array > *p_num_ports) { > done = 1; > @@ -601,7 +616,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, > } > > *p_num_ports = attr - p_attr_array; > - r = 0; > > Exit: > OSM_LOG_EXIT(p_vend->p_log); > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > index 73a6274..503d7fa 100644 > --- a/opensm/opensm/main.c > +++ b/opensm/opensm/main.c > @@ -2,6 +2,7 @@ > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. 
> * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid) > uint32_t i, choice = 0; > ib_api_status_t status; > > + for (i = 0; i < num_ports; i++) { > + attr_array[i].num_pkeys = 0; > + attr_array[i].p_pkey_table = NULL; > + } > + > /* Call the transport layer for a list of local port GUID values */ > status = osm_vendor_get_all_port_attr(p_osm->p_vendor, attr_array, > &num_ports); > diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c > index b360af6..83c1e13 100644 > --- a/opensm/osmtest/main.c > +++ b/opensm/osmtest/main.c > @@ -1,6 +1,7 @@ > /* > * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt) > ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; > int i; > > + for (i = 0; i < num_ports; i++) { > + attr_array[i].num_pkeys = 0; > + attr_array[i].p_pkey_table = NULL; > + } > + > /* > Call the transport layer for a list of local port > GUID values. > @@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid) > ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; > int i; > > + for (i = 0; i < num_ports; i++) { > + attr_array[i].num_pkeys = 0; > + attr_array[i].p_pkey_table = NULL; > + } > + > /* > Call the transport layer for a list of local port > GUID values. 
> diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c > index a7b343f..986a8d2 100644 > --- a/opensm/osmtest/osmtest.c > +++ b/opensm/osmtest/osmtest.c > @@ -2,6 +2,7 @@ > * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt, > ib_api_status_t status; > uint32_t num_ports = MAX_LOCAL_IBPORTS; > ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; > + int i; > > OSM_LOG_ENTER(&p_osmt->log); > > + for (i = 0; i < num_ports; i++) { > + attr_array[i].num_pkeys = 0; > + attr_array[i].p_pkey_table = NULL; > + } > + > /* > * Call the transport layer for a list of local port > * GUID values. > -- > 1.5.6.4 > From keshetti.mahesh at gmail.com Thu Feb 26 00:31:07 2009 From: keshetti.mahesh at gmail.com (Keshetti Mahesh) Date: Thu, 26 Feb 2009 14:01:07 +0530 Subject: [ofa-general] Re: Problem in IB network without Switch Message-ID: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com> Hi, > phys state: 2: Polling On both machines physical state is 'Polling' i.e. the physical connectivity of the two is not proper. Check the connectivity first. Only after it becomes phys state: 5: LinkUp you will be able to enable any IB communication on this interface. -Mahesh From ogerlitz at voltaire.com Thu Feb 26 00:57:45 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 26 Feb 2009 10:57:45 +0200 (IST) Subject: [ofa-general] [PATCH] ib/iser: remove hard setting of mtu Message-ID: Remove the hard-coded 1K IB MTU used by iser's RC queue pair. It was set to work around interoperability issues with an old iser target that is no longer in use.
Signed-off-by: Or Gerlitz Index: linus-linux-2.6/drivers/infiniband/ulp/iser/iser_verbs.c =================================================================== --- linus-linux-2.6.orig/drivers/infiniband/ulp/iser/iser_verbs.c +++ linus-linux-2.6/drivers/infiniband/ulp/iser/iser_verbs.c @@ -401,13 +401,6 @@ static void iser_route_handler(struct rd if (ret) goto failure; - iser_dbg("path.mtu is %d setting it to %d\n", - cma_id->route.path_rec->mtu, IB_MTU_1024); - - /* we must set the MTU to 1024 as this is what the target is assuming */ - if (cma_id->route.path_rec->mtu > IB_MTU_1024) - cma_id->route.path_rec->mtu = IB_MTU_1024; - memset(&conn_param, 0, sizeof conn_param); conn_param.responder_resources = 4; conn_param.initiator_depth = 1; From jackm at dev.mellanox.co.il Thu Feb 26 01:26:59 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 26 Feb 2009 11:26:59 +0200 Subject: [ofa-general] Problem in IB network without Switch In-Reply-To: References: Message-ID: <200902261126.59440.jackm@dev.mellanox.co.il> "DOWN" means that you do not have a physical link between the ports. Check your cables -- they may be bad, or badly inserted. - Jack On Thursday 26 February 2009 08:38, lakshmana swamy wrote: > > Hi All > > I have been trying to enable the IPoIB communication between two machines. The machines has been conncted with a Back-to-Back Infiniband Cable since I dont have IB switch. Installation of drivers and IP configuration has been done in both the machines. Subnet manager (opensmd) running on one machine. > > The problem is communication has not been happening through IB."ibstatus" output shows port is in "Down" State in both the machines. What could be the problem, Iam unable to figure out where is the problem. > > Operating System : Rocks 5.0 (RHEL 5.0) > OFED : Cisco OFED roll 5.0 (OFED 1.3) > HCA cards : Mellanox SDR > > Please check the following commands output.
> > [root at mattool ~]# ibstatus > Infiniband device 'mthca0' port 1 status: > default gid: fe80:0000:0000:0000:0002:c901:08cd:13c1 > base lid: 0x0 > sm lid: 0x0 > state: 1: DOWN > phys state: 2: Polling > rate: 2.5 Gb/sec (1X) > Infiniband device 'mthca0' port 2 status: > default gid: fe80:0000:0000:0000:0002:c901:08cd:13c2 > base lid: 0x0 > sm lid: 0x0 > state: 1: DOWN > phys state: 2: Polling > rate: 2.5 Gb/sec (1X) > > [root at mattool ~]# /etc/init.d/openibd status > > HCA driver loaded > Configured devices: > ib0 ib1 > Currently active devices: > ib0 > ib1 > The following OFED modules are loaded: > rdma_ucm > qlgc_vnic > ib_sdp > rdma_cm > ib_addr > ib_ipoib > ib_ipath > mlx4_core > mlx4_ib > ib_mthca > ib_uverbs > ib_umad > ib_sa > ib_cm > ib_mad > ib_core > iw_cxgb3 > [root at mattool ~]# > > > Thanks > > Laxman From sashak at voltaire.com Thu Feb 26 02:05:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 12:05:21 +0200 Subject: [ofa-general] Re: [PATCH 1/6] [ib-diag] ibnetdiscover: add support for WinOF In-Reply-To: <16F309DB95BC45BE90DE636AE675310C@amr.corp.intel.com> References: <16F309DB95BC45BE90DE636AE675310C@amr.corp.intel.com> Message-ID: <20090226100521.GA11192@sashak.voltaire.com> Hi Sean, On 17:46 Wed 18 Feb , Sean Hefty wrote: > Mainly fixing datatypes to avoid type mismatches. > > Signed-off-by: Sean Hefty > --- > Also attaching patch in case my mailer wraps the lines.
> > infiniband-diags/src/grouping.c | 28 ++++++++++++++-------------- > infiniband-diags/src/ibnetdiscover.c | 8 ++++---- > 2 files changed, 18 insertions(+), 18 deletions(-) > > diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c > index 0ea139f..0266af4 100644 > --- a/infiniband-diags/src/grouping.c > +++ b/infiniband-diags/src/grouping.c > @@ -265,20 +265,20 @@ int is_chassis_switch(Node *node) > } > > /* these structs help find Line (Anafa) slot number while using spine portnum */ > -int line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; > -int anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; > -int line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; > -int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; > +char line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; > +char anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; > +char line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; > +char anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; > > /* IPR FCR modules connectivity while using sFB4 port as reference */ > -int ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; > +char ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; > > /* these structs help find Spine (Anafa) slot number while using spine portnum */ > -int spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; > -int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 
1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; > -int spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; > -int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; > -/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; > */ > +char spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; > +char anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; > +char spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; > +char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; > +/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24 }; */ > > static void get_sfb_slot(Node *node, Port *lineport) > { > @@ -309,7 +309,7 @@ static void get_sfb_slot(Node *node, Port *lineport) > static void get_router_slot(Node *node, Port *spineport) > { > ChassisRecord *ch = node->chrecord; > - int guessnum = 0; > + uint64_t guessnum = 0; > > if (!ch) { > if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) > @@ -460,7 +460,7 @@ static void insert_line_router(Node *node, ChassisList *chassislist) > return; /* already filled slot */ > > chassislist->linenode[i] = node; > - node->chrecord->chassisnum = chassislist->chassisnum; > + node->chrecord->chassisnum = (unsigned char) chassislist->chassisnum; > } > > static void insert_spine(Node *node, ChassisList *chassislist) > @@ -471,7 +471,7 @@ static void insert_spine(Node *node, ChassisList *chassislist) > return; /* already filled slot */ > > chassislist->spinenode[i] = node; > - node->chrecord->chassisnum = chassislist->chassisnum; > + node->chrecord->chassisnum = (unsigned char) chassislist->chassisnum; Wouldn't it be better to try to fix 
data definitions and minimize such and similar casts? For instance, could a slightly modified patch like the one below compile cleanly in the WinOF environment (I cannot test it, sorry)? Sasha >From 8e8556ba011dab628723736aa32191f54cca4cb5 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Wed, 18 Feb 2009 17:46:05 -0800 Subject: [PATCH] ibnetdiscover: add support for WinOF Mainly fixing datatypes to avoid type mismatches. Signed-off-by: Sean Hefty Signed-off-by: Sasha Khapyorsky --- infiniband-diags/include/grouping.h | 6 +++--- infiniband-diags/src/grouping.c | 22 +++++++++++----------- infiniband-diags/src/ibnetdiscover.c | 8 ++++---- 3 files changed, 18 insertions(+), 18 deletions(-) diff --git a/infiniband-diags/include/grouping.h b/infiniband-diags/include/grouping.h index e54efef..811e372 100644 --- a/infiniband-diags/include/grouping.h +++ b/infiniband-diags/include/grouping.h @@ -48,9 +48,9 @@ typedef struct AllChassisList AllChassisList; struct ChassisList { ChassisList *next; uint64_t chassisguid; - int chassisnum; - int chassistype; - int nodecount; /* used for grouping by SystemImageGUID */ + unsigned char chassisnum; + unsigned char chassistype; + unsigned int nodecount; /* used for grouping by SystemImageGUID */ Node *spinenode[SPINES_MAX_NUM + 1]; Node *linenode[LINES_MAX_NUM + 1]; }; diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c index 0ea139f..048efc7 100644 --- a/infiniband-diags/src/grouping.c +++ b/infiniband-diags/src/grouping.c @@ -265,20 +265,20 @@ int is_chassis_switch(Node *node) } /* these structs help find Line (Anafa) slot number while using spine portnum */ -int line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; -int anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; -int line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; -int anafa_line_slot_2_sfb12[25] = {
0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; +char line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; +char anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; +char line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; +char anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; /* IPR FCR modules connectivity while using sFB4 port as reference */ -int ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; +char ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; /* these structs help find Spine (Anafa) slot number while using spine portnum */ -int spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ +char spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24 
}; */ static void get_sfb_slot(Node *node, Port *lineport) { @@ -309,7 +309,7 @@ static void get_sfb_slot(Node *node, Port *lineport) static void get_router_slot(Node *node, Port *spineport) { ChassisRecord *ch = node->chrecord; - int guessnum = 0; + uint64_t guessnum = 0; if (!ch) { if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 948a79d..6946fd7 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -47,7 +47,7 @@ #include #include -#include +#include #include "ibnetdiscover.h" #include "grouping.h" @@ -215,7 +215,7 @@ extend_dpath(ib_dr_path_t *path, int nextport) ++path->cnt; if (path->cnt > maxhops_discovered) maxhops_discovered = path->cnt; - path->p[path->cnt] = nextport; + path->p[path->cnt] = (uint8_t) nextport; return path->cnt; } @@ -515,7 +515,7 @@ out_ids(Node *node, int group, char *chname) } uint64_t -out_chassis(int chassisnum) +out_chassis(unsigned char chassisnum) { uint64_t guid; @@ -967,7 +967,7 @@ int main(int argc, char **argv) { "Router_list", 'R', 0, NULL, "list of connected routers" }, { "node-name-map", 1, 1, "", "node name map file" }, { "ports", 'p', 0, NULL, "obtain a ports report" }, - { } + { 0 } }; char usage_args[] = "[topology-file]"; -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Thu Feb 26 02:11:44 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 12:11:44 +0200 Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support for WinOF In-Reply-To: References: Message-ID: <20090226101144.GB11192@sashak.voltaire.com> On 17:46 Wed 18 Feb , Sean Hefty wrote: > Signed-off-by: Sean Hefty > --- > > infiniband-diags/src/ibroute.c | 6 +++--- > 1 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c > index 144d1b2..d1049ad 100644 > --- a/infiniband-diags/src/ibroute.c > +++ 
b/infiniband-diags/src/ibroute.c > @@ -45,7 +45,7 @@ > > #include > #include > -#include > +#include > > #include "ibdiag_common.h" > > @@ -327,7 +327,7 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid) > > for (;i < e; i++) { > unsigned outport = lft[i % IB_SMP_DATA_SIZE]; > - unsigned valid = (outport <= nports); > + unsigned valid = (outport <= (unsigned) nports); Similar question. Sasha >From 7127f00d9020b261819d2205557646016fdd6b36 Mon Sep 17 00:00:00 2001 From: Sean Hefty Date: Wed, 18 Feb 2009 17:46:38 -0800 Subject: [PATCH] ibroute: add support for WinOF Signed-off-by: Sean Hefty Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/ibroute.c | 9 +++++---- 1 files changed, 5 insertions(+), 4 deletions(-) diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c index 144d1b2..235d122 100644 --- a/infiniband-diags/src/ibroute.c +++ b/infiniband-diags/src/ibroute.c @@ -45,7 +45,7 @@ #include #include -#include +#include #include "ibdiag_common.h" @@ -54,7 +54,7 @@ static int brief, dump_all, multicast; /*******************************************/ char * -check_switch(ib_portid_t *portid, int *nports, uint64_t *guid, +check_switch(ib_portid_t *portid, unsigned int *nports, uint64_t *guid, uint8_t *sw, char *nd) { uint8_t ni[IB_SMP_DATA_SIZE] = {0}; @@ -289,7 +289,8 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid) uint8_t sw[IB_SMP_DATA_SIZE]; char str[200], *s; uint64_t nodeguid; - int block, i, e, nports, top; + int block, i, e, top; + unsigned nports; int n = 0, startblock, endblock; if ((s = check_switch(portid, &nports, &nodeguid, sw, nd))) @@ -370,7 +371,7 @@ int main(int argc, char **argv) { "all", 'a', 0, NULL, "show all lids, even invalid entries" }, { "no_dests", 'n', 0, NULL, "do not try to resolve destinations" }, { "Multicast", 'M', 0, NULL, "show multicast forwarding tables" }, - { } + { 0 } }; char usage_args[] = "[ [ []]]"; const char *usage_examples[] = { -- 1.6.1.2.319.gbd9e From 
jackm at dev.mellanox.co.il Thu Feb 26 02:38:26 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 26 Feb 2009 12:38:26 +0200 Subject: [ofa-general] [PATCH] mlx4_core: Add device IDs for MT25458 10GigE devices Message-ID: <200902261238.26437.jackm@dev.mellanox.co.il> Adds device IDs for Mellanox' MT25458 ConnectX+10-GBaseT 10GigE Ethernet devices. Signed-off-by: Jack Morgenstein diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 6ef2490..84db33b 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -1230,6 +1230,8 @@ static struct pci_device_id mlx4_pci_table[] = { { PCI_VDEVICE(MELLANOX, 0x673c) }, /* MT25408 "Hermon" QDR PCIe gen2 */ { PCI_VDEVICE(MELLANOX, 0x6368) }, /* MT25408 "Hermon" EN 10GigE */ { PCI_VDEVICE(MELLANOX, 0x6750) }, /* MT25408 "Hermon" EN 10GigE PCIe gen2 */ + { PCI_VDEVICE(MELLANOX, 0x6372) }, /* MT25458 ConnectX EN 10GBASE-T 10GigE */ + { PCI_VDEVICE(MELLANOX, 0x675a) }, /* MT25458 ConnectX EN 10GBASE-T+Gen2 10GigE */ { 0, } }; From klakshman03 at hotmail.com Thu Feb 26 02:59:57 2009 From: klakshman03 at hotmail.com (lakshmana swamy) Date: Thu, 26 Feb 2009 16:29:57 +0530 Subject: [ofa-general] RE: Problem in IB network without Switch In-Reply-To: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com> References: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com> Message-ID: Hi Jack and Mahesh, Thank you for your response. I have changed the HCA card as well as the IB cables, but it made no difference. How can I perform diagnostics? Please help me out. Thank you, Laxman > Date: Thu, 26 Feb 2009 14:01:07 +0530 > Subject: Re: Problem in IB network without Switch > From: keshetti.mahesh at gmail.com > To: klakshman03 at hotmail.com > CC: general at lists.openfabrics.org > > Hi, > > > phys state: 2: Polling > > On both machines physical state is 'Polling' i.e. the physical > connectivity of the two is not proper. Check the connectivity first.
> > Only after it becomes > > phys state: 5: LinkUp > > you will be able to enable any IB communication on this interface. > > -Mahesh From vlad at lists.openfabrics.org Thu Feb 26 03:18:58 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 26 Feb 2009 03:18:58 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090226-0200 daily build status Message-ID: <20090226111858.D3C13E60CB0@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with
linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From jackm at dev.mellanox.co.il Thu Feb 26 03:35:59 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 26 Feb 2009 13:35:59 +0200 Subject: [ofa-general] Re: Problem in IB network without Switch In-Reply-To: References: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com> Message-ID: <200902261335.59927.jackm@dev.mellanox.co.il> On Thursday 26 February 2009 12:59, lakshmana swamy wrote: Please send me the output of the console command ibstat. Maybe your firmware (FW) is old. - Jack > > Hi Jack and Mahesh > > ThanQ for your response. > > I have channged the HCA card as well as IB cables also..Ooooops no use. > > > How can I perform diagnostics. Please Help me out. > > ThanQ > > Laxman > > > > > Date: Thu, 26 Feb 2009 14:01:07 +0530 > > Subject: Re: Problem in IB network without Switch > > From: keshetti.mahesh at gmail.com > > To: klakshman03 at hotmail.com > > CC: general at lists.openfabrics.org > > > > Hi, > > > > > phys state: 2: Polling > > > > On both machines physical state is 'Polling' i.e. the physical > > connectivity of the two is not proper. Check the connectivity first. > > Only after it becomes > > > > phys state: 5: LinkUp > > > > you will be able to enable any IB communication on this interface.
> > > > -Mahesh From klakshman03 at hotmail.com Thu Feb 26 03:53:30 2009 From: klakshman03 at hotmail.com (lakshmana swamy) Date: Thu, 26 Feb 2009 17:23:30 +0530 Subject: [ofa-general] RE: Problem in IB network without Switch In-Reply-To: <200902261335.59927.jackm@dev.mellanox.co.il> References: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com> <200902261335.59927.jackm@dev.mellanox.co.il> Message-ID: Hi Jack, Please find the output of ibstat on both nodes below. [root at mattool ~]# /opt/ofed/extras/hca_self_test.ofed ---- Performing InfiniBand HCA Self Test ---- Number of HCAs Detected ................ 1 PCI Device Check ....................... PASS Kernel Arch ............................ x86_64 Host Driver Version .................... OFED-1.3 1.3-2.6.18_53.1.14.el5 Host Driver RPM Check .................. PASS HCA Type of HCA #0 ..................... Cougar /opt/ofed/extras/hca_self_test.ofed: line 227: [: =: unary operator expected HCA Firmware Check ..................... FAIL REASON: mismatch HCA #0 firmware detected (found v, need v3.5.917) Host Driver Initialization ............. PASS Number of HCA Ports Active ............. 0 Port State of Port #0 on HCA #0 ........ DOWN Port State of Port #1 on HCA #0 ........ DOWN Error Counter Check on HCA #0 .......... PASS Kernel Syslog Check .................... PASS Node GUID ..............................
00:02:c9:01:08:cd:13:c0 ------------------ DONE --------------------- [root at mattool ~]# ************ IBSTAT output ****************** [root at mattool ~]# ibstat CA 'mthca0' CA type: MT23108 Number of ports: 2 Firmware version: 3.1.0 Hardware version: a1 Node GUID: 0x0002c90108cd13c0 System image GUID: 0x0002c90108cd13c0 Port 1: State: Down Physical state: Polling Rate: 2 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00510a68 Port GUID: 0x0002c90108cd13c1 Port 2: State: Down Physical state: Polling Rate: 2 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00510a6a Port GUID: 0x0002c90108cd13c2 [root at mattool ~]# [root at compute-0-0 ~]# ibstat CA 'mthca0' CA type: MT23108 Number of ports: 2 Firmware version: 3.0.0 Hardware version: a1 Node GUID: 0x0002c9020000114c System image GUID: 0x0002c9020000114f Port 1: State: Down Physical state: Polling Rate: 2 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00110a68 Port GUID: 0x0002c9020000114d Port 2: State: Down Physical state: Polling Rate: 2 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00110a68 Port GUID: 0x0002c9020000114e [root at compute-0-0 ~]# Thanking you laxman > From: jackm at dev.mellanox.co.il > To: klakshman03 at hotmail.com > Subject: Re: Problem in IB network without Switch > Date: Thu, 26 Feb 2009 13:35:59 +0200 > CC: keshetti.mahesh at gmail.com; general at lists.openfabrics.org > > On Thursday 26 February 2009 12:59, lakshmana swamy wrote: > > Please send me the output of console command: ibstat > Maybe you have old FW. > > - Jack > > > > > Hi Jack and Mahesh > > > > ThanQ for your response. > > > > I have channged the HCA card as well as IB cables also..Ooooops no use. > > > > > > How can I perform diagnostics. Please Help me out. 
> > > > ThanQ > > > > Laxman > > > > > > > > > Date: Thu, 26 Feb 2009 14:01:07 +0530 > > > Subject: Re: Problem in IB network without Switch > > > From: keshetti.mahesh at gmail.com > > > To: klakshman03 at hotmail.com > > > CC: general at lists.openfabrics.org > > > > > > Hi, > > > > > > > phys state: 2: Polling > > > > > > On both machines physical state is 'Polling' i.e. the physical > > > connectivity of the two is not proper. Check the connectivity first. > > > Only after it becomes > > > > > > phys state: 5: LinkUp > > > > > > you will be able to enable any IB communication on this interface. > > > > > > -Mahesh -------------- next part -------------- An HTML attachment was scrubbed...
URL: From hal.rosenstock at gmail.com Thu Feb 26 04:03:02 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 26 Feb 2009 07:03:02 -0500 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: <20090226070629.GU11192@sashak.voltaire.com> References: <20090218153016.GD8489@comcast.net> <20090226070629.GU11192@sashak.voltaire.com> Message-ID: Sasha, On Thu, Feb 26, 2009 at 2:06 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 10:30 Wed 18 Feb     , Hal Rosenstock wrote: >> >> Only supported in osm_vendor_ibumad.c (separate patch for other >> vendor layers) >> Also, update applications using this (osmtest, opensm) >> >> Signed-off-by: Hal Rosenstock >> --- >>  opensm/libvendor/osm_vendor_ibumad.c |   24 +++++++++++++++++++----- >>  opensm/opensm/main.c                 |    6 ++++++ >>  opensm/osmtest/main.c                |   11 +++++++++++ >>  opensm/osmtest/osmtest.c             |    7 +++++++ >>  4 files changed, 43 insertions(+), 5 deletions(-) >> >> diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c >> index 734a860..861bfbe 100644 >> --- a/opensm/libvendor/osm_vendor_ibumad.c >> +++ b/opensm/libvendor/osm_vendor_ibumad.c >> @@ -2,6 +2,7 @@ >>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. >>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >>   * >>   * This software is available to you under a choice of one of two >>   * licenses.  
You may choose to be licensed under the terms of the GNU >> @@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, >>       umad_ca_t ca; >>       ib_port_attr_t *attr = p_attr_array; >>       unsigned done = 0; >> -     int r, i, j; >> +     int r, i, j, k; >> >>       OSM_LOG_ENTER(p_vend->p_log); >> >>       CL_ASSERT(p_vend && p_num_ports); >> >> +     r = 0; >>       if (!*p_num_ports) { >>               r = IB_INVALID_PARAMETER; >>               OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: " >> @@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, >>       } >> >>       for (i = 0; i < p_vend->ca_count && !done; i++) { >> -             /* >> -              * For each CA, retrieve the port guids >> -              */ >> +             /* For each CA, retrieve the port attributes */ >>               if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) { >>                       if (ca.node_type < 1 || ca.node_type > 3) >>                               continue; >> @@ -590,6 +590,21 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, >>                               attr->port_num = ca.ports[j]->portnum; >>                               attr->sm_lid = ca.ports[j]->sm_lid; >>                               attr->link_state = ca.ports[j]->state; >> +                             attr->num_pkeys = ca.ports[j]->pkeys_size; >> +                             if (attr->num_pkeys && attr->p_pkey_table) { >> +                                     if (attr->num_pkeys < ca.ports[j]->pkeys_size) { > > You are doing: > >        attr->num_pkeys = ca.ports[j]->pkeys_size; > > , just two lines above, so this check will be always false. Oops; I'll fix in the next version. 
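One way to make the capacity check meaningful is to treat the caller's num_pkeys as the size of p_pkey_table and compare it against pkeys_size before overwriting it. The sketch below assumes that in/out contract (the patch does not state it) and uses hypothetical, simplified names (struct port_attr, copy_pkeys()) in place of ib_port_attr_t and the libibumad port structure; it uses htons() where OpenSM uses cl_hton16().

```c
#include <stdint.h>
#include <stddef.h>
#include <arpa/inet.h> /* htons() */

/* Hypothetical, simplified stand-in for ib_port_attr_t (assumption). */
struct port_attr {
	unsigned num_pkeys;     /* in: capacity of p_pkey_table; out: pkeys on port */
	uint16_t *p_pkey_table; /* optional caller buffer, filled in network order */
};

/*
 * Copy a port's P_Key table into attr.
 * Returns 0 on success, -1 if the caller's table is too small.
 * The capacity check runs BEFORE num_pkeys is overwritten, so it can fire.
 */
static int copy_pkeys(struct port_attr *attr, const uint16_t *pkeys,
		      unsigned pkeys_size)
{
	unsigned k;

	if (attr->num_pkeys == 0 || attr->p_pkey_table == NULL) {
		/* No buffer supplied: only report how many pkeys exist. */
		attr->num_pkeys = pkeys_size;
		return 0;
	}

	if (attr->num_pkeys < pkeys_size) {
		/* Would truncate: report the required size so the caller knows. */
		attr->num_pkeys = pkeys_size;
		return -1;
	}

	for (k = 0; k < pkeys_size; k++)
		attr->p_pkey_table[k] = htons(pkeys[k]);
	attr->num_pkeys = pkeys_size;
	return 0;
}
```

Reporting the required size on failure is a design choice that lets a caller detect truncation and retry with a larger table instead of silently receiving a partial P_Key list.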
>> +                                             r = IB_INSUFFICIENT_MEMORY; >> +                                             OSM_LOG(p_vend->p_log, >> +                                                     OSM_LOG_ERROR, >> +                                                     "ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n", >> +                                                     j, >> +                                                     ca.ports[j]->pkeys_size); > > Also should it be an error? Maybe it is just enough to fill requested > pkey entries? I agree that being more forgiving is better but then how would it be known if the pkeys are being truncated? Also, it seems to be the style of the API (what is done for ports). Can't just request an individual port but all ports. >> +                                             goto Exit; >> +                                     } >> +                                     for (k = 0; k < attr->num_pkeys; k++) >> +                                             attr->p_pkey_table[k] = >> +                                                     cl_hton16(ca.ports[j]->pkeys[k]); >> +                             } >>                               attr++; >>                               if (attr - p_attr_array > *p_num_ports) { >>                                       done = 1; >> @@ -601,7 +616,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, >>       } >> >>       *p_num_ports = attr - p_attr_array; >> -     r = 0; >> >>  Exit: >>       OSM_LOG_EXIT(p_vend->p_log); >> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c >> index 73a6274..503d7fa 100644 >> --- a/opensm/opensm/main.c >> +++ b/opensm/opensm/main.c >> @@ -2,6 +2,7 @@ >>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. >>   * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. >>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >>   * >>   * This software is available to you under a choice of one of two >>   * licenses.  You may choose to be licensed under the terms of the GNU >> @@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid) >>       uint32_t i, choice = 0; >>       ib_api_status_t status; >> >> +     for (i = 0; i < num_ports; i++) { >> +             attr_array[i].num_pkeys = 0; >> +             attr_array[i].p_pkey_table = NULL; >> +     } >> + > > Here and below. Just > >        memset(attr_array, 0, sizeof(attr_array)); > > would be enough. Sure; next version. -- Hal > Sasha > >>       /* Call the transport layer for a list of local port GUID values */ >>       status = osm_vendor_get_all_port_attr(p_osm->p_vendor, attr_array, >>                                             &num_ports); >> diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c >> index b360af6..83c1e13 100644 >> --- a/opensm/osmtest/main.c >> +++ b/opensm/osmtest/main.c >> @@ -1,6 +1,7 @@ >>  /* >>   * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. >>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >>   * >>   * This software is available to you under a choice of one of two >>   * licenses.  You may choose to be licensed under the terms of the GNU >> @@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt) >>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; >>       int i; >> >> +     for (i = 0; i < num_ports; i++) { >> +             attr_array[i].num_pkeys = 0; >> +             attr_array[i].p_pkey_table = NULL; >> +     } >> + >>       /* >>          Call the transport layer for a list of local port >>          GUID values. 
>> @@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid) >>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; >>       int i; >> >> +     for (i = 0; i < num_ports; i++) { >> +             attr_array[i].num_pkeys = 0; >> +             attr_array[i].p_pkey_table = NULL; >> +     } >> + >>       /* >>          Call the transport layer for a list of local port >>          GUID values. >> diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c >> index a7b343f..986a8d2 100644 >> --- a/opensm/osmtest/osmtest.c >> +++ b/opensm/osmtest/osmtest.c >> @@ -2,6 +2,7 @@ >>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. >>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >>   * >>   * This software is available to you under a choice of one of two >>   * licenses.  You may choose to be licensed under the terms of the GNU >> @@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt, >>       ib_api_status_t status; >>       uint32_t num_ports = MAX_LOCAL_IBPORTS; >>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; >> +     int i; >> >>       OSM_LOG_ENTER(&p_osmt->log); >> >> +     for (i = 0; i < num_ports; i++) { >> +             attr_array[i].num_pkeys = 0; >> +             attr_array[i].p_pkey_table = NULL; >> +     } >> + >>       /* >>        * Call the transport layer for a list of local port >>        * GUID values. 
>> -- >> 1.5.6.4 >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Thu Feb 26 04:03:12 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 26 Feb 2009 07:03:12 -0500 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: <20090226071059.GV11192@sashak.voltaire.com> References: <20090218153016.GD8489@comcast.net> <20090226071059.GV11192@sashak.voltaire.com> Message-ID: On Thu, Feb 26, 2009 at 2:10 AM, Sasha Khapyorsky wrote: > On 10:30 Wed 18 Feb     , Hal Rosenstock wrote: >> >> Only supported in osm_vendor_ibumad.c (separate patch for other >> vendor layers) >> Also, update applications using this (osmtest, opensm) > > It looks that ibutils (ibis) requires same fix (attr_array > initialization) too. Yes, I'm aware but didn't want to send those until these were accepted. -- Hal > > Sasha > >> >> Signed-off-by: Hal Rosenstock >> --- >>  opensm/libvendor/osm_vendor_ibumad.c |   24 +++++++++++++++++++----- >>  opensm/opensm/main.c                 |    6 ++++++ >>  opensm/osmtest/main.c                |   11 +++++++++++ >>  opensm/osmtest/osmtest.c             |    7 +++++++ >>  4 files changed, 43 insertions(+), 5 deletions(-) >> >> diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c >> index 734a860..861bfbe 100644 >> --- a/opensm/libvendor/osm_vendor_ibumad.c >> +++ b/opensm/libvendor/osm_vendor_ibumad.c >> @@ -2,6 +2,7 @@ >>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. >>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. 
>>   * >>   * This software is available to you under a choice of one of two >>   * licenses.  You may choose to be licensed under the terms of the GNU >> @@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, >>       umad_ca_t ca; >>       ib_port_attr_t *attr = p_attr_array; >>       unsigned done = 0; >> -     int r, i, j; >> +     int r, i, j, k; >> >>       OSM_LOG_ENTER(p_vend->p_log); >> >>       CL_ASSERT(p_vend && p_num_ports); >> >> +     r = 0; >>       if (!*p_num_ports) { >>               r = IB_INVALID_PARAMETER; >>               OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: " >> @@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, >>       } >> >>       for (i = 0; i < p_vend->ca_count && !done; i++) { >> -             /* >> -              * For each CA, retrieve the port guids >> -              */ >> +             /* For each CA, retrieve the port attributes */ >>               if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) { >>                       if (ca.node_type < 1 || ca.node_type > 3) >>                               continue; >> @@ -590,6 +590,21 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, >>                               attr->port_num = ca.ports[j]->portnum; >>                               attr->sm_lid = ca.ports[j]->sm_lid; >>                               attr->link_state = ca.ports[j]->state; >> +                             attr->num_pkeys = ca.ports[j]->pkeys_size; >> +                             if (attr->num_pkeys && attr->p_pkey_table) { >> +                                     if (attr->num_pkeys < ca.ports[j]->pkeys_size) { >> +                                             r = IB_INSUFFICIENT_MEMORY; >> +                                             OSM_LOG(p_vend->p_log, >> +                                                     OSM_LOG_ERROR, >> +                                                     "ERR 5419: Insufficient memory for pkeys for port 
%d; need space for %d pkeys\n", >> +                                                     j, >> +                                                     ca.ports[j]->pkeys_size); >> +                                             goto Exit; >> +                                     } >> +                                     for (k = 0; k < attr->num_pkeys; k++) >> +                                             attr->p_pkey_table[k] = >> +                                                     cl_hton16(ca.ports[j]->pkeys[k]); >> +                             } >>                               attr++; >>                               if (attr - p_attr_array > *p_num_ports) { >>                                       done = 1; >> @@ -601,7 +616,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, >>       } >> >>       *p_num_ports = attr - p_attr_array; >> -     r = 0; >> >>  Exit: >>       OSM_LOG_EXIT(p_vend->p_log); >> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c >> index 73a6274..503d7fa 100644 >> --- a/opensm/opensm/main.c >> +++ b/opensm/opensm/main.c >> @@ -2,6 +2,7 @@ >>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. >>   * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. >>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >>   * >>   * This software is available to you under a choice of one of two >>   * licenses.  
You may choose to be licensed under the terms of the GNU >> @@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid) >>       uint32_t i, choice = 0; >>       ib_api_status_t status; >> >> +     for (i = 0; i < num_ports; i++) { >> +             attr_array[i].num_pkeys = 0; >> +             attr_array[i].p_pkey_table = NULL; >> +     } >> + >>       /* Call the transport layer for a list of local port GUID values */ >>       status = osm_vendor_get_all_port_attr(p_osm->p_vendor, attr_array, >>                                             &num_ports); >> diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c >> index b360af6..83c1e13 100644 >> --- a/opensm/osmtest/main.c >> +++ b/opensm/osmtest/main.c >> @@ -1,6 +1,7 @@ >>  /* >>   * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. >>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >>   * >>   * This software is available to you under a choice of one of two >>   * licenses.  You may choose to be licensed under the terms of the GNU >> @@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt) >>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; >>       int i; >> >> +     for (i = 0; i < num_ports; i++) { >> +             attr_array[i].num_pkeys = 0; >> +             attr_array[i].p_pkey_table = NULL; >> +     } >> + >>       /* >>          Call the transport layer for a list of local port >>          GUID values. >> @@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid) >>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; >>       int i; >> >> +     for (i = 0; i < num_ports; i++) { >> +             attr_array[i].num_pkeys = 0; >> +             attr_array[i].p_pkey_table = NULL; >> +     } >> + >>       /* >>          Call the transport layer for a list of local port >>          GUID values. 
>> diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c >> index a7b343f..986a8d2 100644 >> --- a/opensm/osmtest/osmtest.c >> +++ b/opensm/osmtest/osmtest.c >> @@ -2,6 +2,7 @@ >>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. >>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >>   * >>   * This software is available to you under a choice of one of two >>   * licenses.  You may choose to be licensed under the terms of the GNU >> @@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt, >>       ib_api_status_t status; >>       uint32_t num_ports = MAX_LOCAL_IBPORTS; >>       ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; >> +     int i; >> >>       OSM_LOG_ENTER(&p_osmt->log); >> >> +     for (i = 0; i < num_ports; i++) { >> +             attr_array[i].num_pkeys = 0; >> +             attr_array[i].p_pkey_table = NULL; >> +     } >> + >>       /* >>        * Call the transport layer for a list of local port >>        * GUID values. >> -- >> 1.5.6.4 >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Thu Feb 26 04:03:32 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 26 Feb 2009 07:03:32 -0500 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] opensm/console: Enhance perfmgr print_counters for better nodenames In-Reply-To: <20090226061551.GQ11192@sashak.voltaire.com> References: <20090219130653.GA29318@comcast.net> <20090226061551.GQ11192@sashak.voltaire.com> Message-ID: On Thu, Feb 26, 2009 at 1:15 AM, Sasha Khapyorsky wrote: [snip...] > And in general I think it is better to use C-style comments - /* ... 
*/, > in C code and not C++-style // ... . Is this going to be enforced uniformly across OpenSM ? -- Hal > Sasha From hal.rosenstock at gmail.com Thu Feb 26 04:04:39 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 26 Feb 2009 07:04:39 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] [PATCH] libibmad: remove functions which use pthread In-Reply-To: <20090226051012.GH11192@sashak.voltaire.com> References: <20081231170244.GC21950@sashak.voltaire.com> <20081231170413.GD21950@sashak.voltaire.com> <20090217091955.pjpl28xzuo4g4o8o@www-openlabnet.llnl.gov> <20090217142859.9e7a7e22.weiny2@llnl.gov> <20090218003355.GX7189@sashak.voltaire.com> <20090226051012.GH11192@sashak.voltaire.com> Message-ID: Sasha, On Thu, Feb 26, 2009 at 12:10 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 10:20 Wed 18 Feb     , Hal Rosenstock wrote: >> On Tue, Feb 17, 2009 at 7:33 PM, Sasha Khapyorsky wrote: >> > On 18:21 Tue 17 Feb     , Hal Rosenstock wrote: >> >> > >> >> > For utilities which run once through I think the old functions work just >> >> > fine. >> >> >> >> Well, sort of... Aren't mad_portid "collisions" possible when multiple >> >> programs are run concurrently ? >> > >> > No. >> >> With the old API, mad_portid can be overwritten by another process or >> thread. Another thread is not an expected use case but it is possible. > > Yes, but you asked about "collisions" between different programs > (processes) run. Another language issue. -- Hal > > Sasha > From hal.rosenstock at gmail.com Thu Feb 26 04:22:40 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 26 Feb 2009 07:22:40 -0500 Subject: [ofa-general] Re: [PATCH] IB/core: fix null pointer dereference in local_completions() In-Reply-To: References: <1235608563.3948.199.camel@chromite.mv.qlogic.com> Message-ID: On Wed, Feb 25, 2009 at 7:53 PM, Roland Dreier wrote: > This looks fine to me.  Hal and/or Sean, any comment? This looks right to me too. 
-- Hal From kliteyn at dev.mellanox.co.il Thu Feb 26 04:22:14 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 26 Feb 2009 14:22:14 +0200 Subject: [ofa-general] [PATCH v2] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node Message-ID: <49A68976.6000404@dev.mellanox.co.il> Hi Sasha, [v2: adding CL_ASSERT() and changing comments] This patch fixes bugzilla issue #1515. The bug was discovered and analyzed by Line Holen. Topology: |---------------| | SW2 | |---------------| |x |y |z |v |----| | | |----| | | | | | |----| |----| | | | | | a| b| c| d| |---------------| |---------------| | SW1 | | SW3 | |---------------| |---------------| | | | | HCA with SM HCA During the discovery: SM sends NodeInfo request to SW1 SM sends NodeInfo request to SW2 through link a->x SM discovers new node SW2: - updates DR to SW2 to go through link a->x - creates physp x SM sends NodeInfo request to SW2 through link b->y SM discovers a known node SW2 - DOES NOT create physp y - updates DR to SW2 to go through link b->y From now on, the DR to SW2 is going through port y, so OpenSM won't deal with port y any more, leaving it uninitialized (no physp object for this port). The fix is to create physp for the newly discovered port of the known switch node, same way as it is done for HCAs. I also added one log message for the case that showed the problem - when one of the link sides is uninitialized (no valid ports check). Perhaps this log message should be an error message instead?
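The discovery sequence described above can be modeled in a few lines. This is a hypothetical model, not OpenSM code: struct node, ensure_port() and link_has_valid_ports() loosely stand in for osm_node_t, osm_node_init_physp() and osm_node_link_has_valid_ports(), MAX_PORTS is an arbitrary bound, and the port numbers used for a, b, x, y are made up. It shows why creating the port object whenever a NodeInfo response arrives through a not-yet-seen port of an already known switch lets the later link check succeed.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_PORTS 37 /* arbitrary bound for this model */

/* Hypothetical stand-ins for osm_physp_t / osm_node_t. */
struct port {
	uint8_t num;
};

struct node {
	uint64_t guid;
	struct port *ports[MAX_PORTS]; /* NULL = no physp object yet */
};

/* Create the object for the port a NodeInfo response arrived on,
 * whether the node is new or already known (the essence of the fix). */
static int ensure_port(struct node *n, uint8_t port_num)
{
	struct port *p;

	if (port_num >= MAX_PORTS)
		return -1;
	if (n->ports[port_num] != NULL)
		return 0; /* already initialized */
	p = calloc(1, sizeof(*p));
	if (p == NULL)
		return -1;
	p->num = port_num;
	n->ports[port_num] = p;
	return 0;
}

/* A link may only be set if both ends have port objects. */
static int link_has_valid_ports(const struct node *a, uint8_t pa,
				const struct node *b, uint8_t pb)
{
	return pa < MAX_PORTS && pb < MAX_PORTS &&
	       a->ports[pa] != NULL && b->ports[pb] != NULL;
}
```

In this model, calling ensure_port() only for newly discovered nodes reproduces the reported behaviour: the b->y link check fails because SW2's port y never gets an object.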
Debugged-by: Line Holen Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_node_info_rcv.c | 35 ++++++++++++++++++++++++++--------- 1 files changed, 26 insertions(+), 9 deletions(-) diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index c52c0d5..4d3724c 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -154,18 +154,17 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, goto _exit; } - /* - We have seen this neighbor node before, but we might - not have seen this port on the neighbor node before. - We should not set links to an uninitialized port on the - neighbor, so check validity up front. If it's not - valid, do nothing, since we'll see this link again - when we probe the neighbor. - */ + /* When setting the link, ports on both + sides of the link should be initialized */ if (!osm_node_link_has_valid_ports(p_node, port_num, p_neighbor_node, - p_ni_context->port_num)) + p_ni_context->port_num)) { + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, + "Link at node 0x%" PRIx64 ", port %u - no valid ports\n", + cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); + CL_ASSERT(0); goto _exit; + } if (osm_node_link_exists(p_node, port_num, p_neighbor_node, p_ni_context->port_num)) { @@ -537,8 +536,26 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm, IN osm_node_t * const p_node, IN const osm_madw_t * const p_madw) { + + ib_smp_t *p_smp; + ib_node_info_t *p_ni; + uint8_t port_num; + OSM_LOG_ENTER(sm->p_log); + p_smp = osm_madw_get_smp_ptr(p_madw); + p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); + port_num = ib_node_info_get_local_port_num(p_ni); + + if (!osm_node_get_physp_ptr(p_node, port_num)) { + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, + "Creating physp for node GUID:0x%" + PRIx64 ", port %u\n", + cl_ntoh64(osm_node_get_node_guid(p_node)), + port_num); + osm_node_init_physp(p_node, p_madw); + } + /* If this switch has already been probed during this sweep, then don't bother reprobing it. 
-- 1.5.1.4 From ogerlitz at voltaire.com Thu Feb 26 04:38:09 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 26 Feb 2009 14:38:09 +0200 (IST) Subject: [ofa-general] [PATCH 1/2] libibmad: add PortXmtDataSL / PortRcvDataSL support Message-ID: libibmad implementation of PortXmtDataSL (IBA A13.6.5) / PortRcvDataSL (IBA A13.6.6) reading and resetting Signed-off-by: Or Gerlitz Index: management/libibmad/include/infiniband/mad.h =================================================================== --- management.orig/libibmad/include/infiniband/mad.h +++ management/libibmad/include/infiniband/mad.h @@ -153,7 +153,8 @@ enum GSI_ATTR_ID { IB_GSI_PORT_SAMPLES_RESULT = 0x11, IB_GSI_PORT_COUNTERS = 0x12, IB_GSI_PORT_COUNTERS_EXT = 0x1D, - + IB_GSI_PORT_XMIT_DATA_SL = 0x36, + IB_GSI_PORT_RCV_DATA_SL = 0x37, IB_GSI_ATTR_LAST }; @@ -421,6 +422,28 @@ enum MAD_FIELDS { IB_PC_XMT_WAIT_F, IB_PC_LAST_F, + IB_PC_XMT_DATA_SL_FIRST_F, + IB_PC_XMT_DATA_SL0_F = IB_PC_XMT_DATA_SL_FIRST_F, + IB_PC_XMT_DATA_SL1_F, + IB_PC_XMT_DATA_SL2_F, + IB_PC_XMT_DATA_SL3_F, + IB_PC_XMT_DATA_SL4_F, + IB_PC_XMT_DATA_SL5_F, + IB_PC_XMT_DATA_SL6_F, + IB_PC_XMT_DATA_SL7_F, + IB_PC_XMT_DATA_SL_LAST_F, + + IB_PC_RCV_DATA_SL_FIRST_F, + IB_PC_RCV_DATA_SL0_F = IB_PC_RCV_DATA_SL_FIRST_F, + IB_PC_RCV_DATA_SL1_F, + IB_PC_RCV_DATA_SL2_F, + IB_PC_RCV_DATA_SL3_F, + IB_PC_RCV_DATA_SL4_F, + IB_PC_RCV_DATA_SL5_F, + IB_PC_RCV_DATA_SL6_F, + IB_PC_RCV_DATA_SL7_F, + IB_PC_RCV_DATA_SL_LAST_F, + /* * SMInfo */ @@ -793,6 +816,16 @@ MAD_EXPORT uint8_t *port_performance_ext MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, unsigned timeout); +MAD_EXPORT uint8_t *port_performance_xmt_sl_query(void *rcvbuf, ib_portid_t * dest, + int port, unsigned timeout); +MAD_EXPORT uint8_t *port_performance_rcv_sl_query(void *rcvbuf, ib_portid_t * dest, + int port, unsigned timeout); +MAD_EXPORT uint8_t *port_performance_xmt_sl_reset(void *rcvbuf, ib_portid_t * dest, + int port,
unsigned mask, + unsigned timeout); +MAD_EXPORT uint8_t *port_performance_rcv_sl_reset(void *rcvbuf, ib_portid_t * dest, + int port, unsigned mask, + unsigned timeout); MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout); MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest, @@ -830,7 +863,8 @@ MAD_EXPORT ib_mad_dump_fn mad_dump_mtu, mad_dump_vlcap, mad_dump_opervls, mad_dump_node_type, mad_dump_sltovl, mad_dump_vlarbitration, mad_dump_nodedesc, mad_dump_nodeinfo, mad_dump_portinfo, - mad_dump_switchinfo, mad_dump_perfcounters, mad_dump_perfcounters_ext; + mad_dump_switchinfo, mad_dump_perfcounters, mad_dump_perfcounters_ext, + mad_dump_perfcounters_xmt_sl, mad_dump_perfcounters_rcv_sl; extern int ibdebug; Index: management/libibmad/src/fields.c =================================================================== --- management.orig/libibmad/src/fields.c +++ management/libibmad/src/fields.c @@ -262,6 +262,26 @@ static const ib_field_t ib_mad_f[] = { {320, 32, "XmtWait", mad_dump_uint}, {0, 0}, /* IB_PC_LAST_F */ + {32, 32, "XmtDataSL0", mad_dump_uint}, + {64, 32, "XmtDataSL1", mad_dump_uint}, + {96, 32, "XmtDataSL2", mad_dump_uint}, + {128, 32, "XmtDataSL3", mad_dump_uint}, + {160, 32, "XmtDataSL4", mad_dump_uint}, + {192, 32, "XmtDataSL5", mad_dump_uint}, + {224, 32, "XmtDataSL6", mad_dump_uint}, + {256, 32, "XmtDataSL7", mad_dump_uint}, + {0, 0}, /* IB_PC_XMT_DATA_SL_LAST_F */ + + {32, 32, "RcvDataSL0", mad_dump_uint}, + {64, 32, "RcvDataSL1", mad_dump_uint}, + {96, 32, "RcvDataSL2", mad_dump_uint}, + {128, 32, "RcvDataSL3", mad_dump_uint}, + {160, 32, "RcvDataSL4", mad_dump_uint}, + {192, 32, "RcvDataSL5", mad_dump_uint}, + {224, 32, "RcvDataSL6", mad_dump_uint}, + {256, 32, "RcvDataSL7", mad_dump_uint}, + {0, 0}, /* IB_PC_RCV_DATA_SL_LAST_F */ + /* * SMInfo */ Index: management/libibmad/src/gs.c =================================================================== ---
management.orig/libibmad/src/gs.c +++ management/libibmad/src/gs.c @@ -193,6 +193,18 @@ uint8_t *port_performance_ext_query(void return pma_query(rcvbuf, dest, port, timeout, IB_GSI_PORT_COUNTERS_EXT); } +uint8_t *port_performance_xmt_sl_query(void *rcvbuf, ib_portid_t * dest, int port, + unsigned timeout) +{ + return pma_query(rcvbuf, dest, port, timeout, IB_GSI_PORT_XMIT_DATA_SL); +} + +uint8_t *port_performance_rcv_sl_query(void *rcvbuf, ib_portid_t * dest, int port, + unsigned timeout) +{ + return pma_query(rcvbuf, dest, port, timeout, IB_GSI_PORT_RCV_DATA_SL); +} + uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned mask, unsigned timeout, const void *srcport) @@ -208,6 +220,20 @@ uint8_t *port_performance_ext_reset(void IB_GSI_PORT_COUNTERS_EXT); } +uint8_t *port_performance_xmt_sl_reset(void *rcvbuf, ib_portid_t * dest, int port, + unsigned mask, unsigned timeout) +{ + return performance_reset(rcvbuf, dest, port, mask, timeout, + IB_GSI_PORT_XMIT_DATA_SL); +} + +uint8_t *port_performance_rcv_sl_reset(void *rcvbuf, ib_portid_t * dest, int port, + unsigned mask, unsigned timeout) +{ + return performance_reset(rcvbuf, dest, port, mask, timeout, + IB_GSI_PORT_RCV_DATA_SL); +} + uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, const void *srcport) Index: management/libibmad/src/libibmad.map =================================================================== --- management.orig/libibmad/src/libibmad.map +++ management/libibmad/src/libibmad.map @@ -22,6 +22,8 @@ IBMAD_1.3 { mad_dump_opervls; mad_dump_perfcounters; mad_dump_perfcounters_ext; + mad_dump_perfcounters_xmt_sl; + mad_dump_perfcounters_rcv_sl; mad_dump_physportstate; mad_dump_portcapmask; mad_dump_portinfo; @@ -45,6 +47,10 @@ IBMAD_1.3 { port_performance_reset; port_performance_ext_query; port_performance_ext_reset; + port_performance_xmt_sl_query; + port_performance_rcv_sl_query; + port_performance_xmt_sl_reset; 
+ port_performance_rcv_sl_reset; port_samples_control_query; port_samples_result_query; mad_build_pkt; Index: management/libibmad/src/dump.c =================================================================== --- management.orig/libibmad/src/dump.c +++ management/libibmad/src/dump.c @@ -699,6 +699,16 @@ void mad_dump_perfcounters_ext(char *buf _dump_fields(buf, bufsz, val, IB_PC_EXT_FIRST_F, IB_PC_EXT_LAST_F); } +void mad_dump_perfcounters_xmt_sl(char *buf, int bufsz, void *val, int valsz) +{ + _dump_fields(buf, bufsz, val, IB_PC_XMT_DATA_SL_FIRST_F, IB_PC_XMT_DATA_SL_LAST_F); +} + +void mad_dump_perfcounters_rcv_sl(char *buf, int bufsz, void *val, int valsz) +{ + _dump_fields(buf, bufsz, val, IB_PC_RCV_DATA_SL_FIRST_F, IB_PC_RCV_DATA_SL_LAST_F); +} + void xdump(FILE * file, char *msg, void *p, int size) { #define HEX(x) ((x) < 10 ? '0' + (x) : 'a' + ((x) -10)) From ogerlitz at voltaire.com Thu Feb 26 04:39:50 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 26 Feb 2009 14:39:50 +0200 (IST) Subject: [ofa-general] [PATCH 2/2] perfquery: add PortXmtDataSL/PortRcvDataSL read and reset In-Reply-To: References: Message-ID: perfquery PortXmtDataSL/PortRcvDataSL (IBA A13.6.5/6) support Signed-off-by: Or Gerlitz Index: management/infiniband-diags/src/perfquery.c =================================================================== --- management.orig/infiniband-diags/src/perfquery.c +++ management/infiniband-diags/src/perfquery.c @@ -307,7 +307,50 @@ static void reset_counters(int extended, } } -static int reset, reset_only, all_ports, loop_ports, port, extended; +static int reset, reset_only, all_ports, loop_ports, port, extended, xmt_sl, rcv_sl; + +void xmt_sl_query(ib_portid_t *portid, int port, int mask) +{ + char buf[1024]; + + if (reset_only) { + if (!port_performance_xmt_sl_reset(pc, portid, port, mask, ibd_timeout)) + IBERROR("perfslreset"); + return; + } + + if (!port_performance_xmt_sl_query(pc, portid, port, ibd_timeout)) + IBERROR("perfslquery"); + + 
mad_dump_perfcounters_xmt_sl(buf, sizeof buf, pc, sizeof pc); + printf("# Port counters: %s port %d\n%s", portid2str(portid), port, buf); + + if(reset) + if (!port_performance_xmt_sl_reset(pc, portid, port, mask, ibd_timeout)) + IBERROR("perfslreset"); +} + +void rcv_sl_query(ib_portid_t *portid, int port, int mask) +{ + char buf[1024]; + + if (reset_only) { + if (!port_performance_rcv_sl_reset(pc, portid, port, mask, ibd_timeout)) + IBERROR("perfslreset"); + return; + } + + if (!port_performance_rcv_sl_query(pc, portid, port, ibd_timeout)) + IBERROR("perfslquery"); + + mad_dump_perfcounters_rcv_sl(buf, sizeof buf, pc, sizeof pc); + printf("# Port counters: %s port %d\n%s", portid2str(portid), port, buf); + + if(reset) + if (!port_performance_rcv_sl_reset(pc, portid, port, mask, ibd_timeout)) + IBERROR("perfslreset"); +} + static int process_opt(void *context, int ch, char *optarg) { @@ -315,6 +358,12 @@ static int process_opt(void *context, in case 'x': extended = 1; break; + case 's': + xmt_sl = 1; + break; + case 'S': + rcv_sl = 1; + break; case 'a': all_ports++; port = ALL_PORTS; @@ -349,6 +398,8 @@ int main(int argc, char **argv) const struct ibdiag_opt opts[] = { { "extended", 'x', 0, NULL, "show extended port counters" }, + { "xmtsl", 's', 0, NULL, "show Xmt SL port counters" }, + { "rcvsl", 'S', 0, NULL, "show Rcv SL port counters" }, { "all_ports", 'a', 0, NULL, "show aggregated counters" }, { "loop_ports", 'l', 0, NULL, "iterate through each port" }, { "reset_after_read", 'r', 0, NULL, "reset counters after read" }, @@ -405,6 +456,16 @@ int main(int argc, char **argv) all_ports_loop = 1; } + if (xmt_sl) { + xmt_sl_query(&portid, port, mask); + exit(0); + } + + if (rcv_sl) { + rcv_sl_query(&portid, port, mask); + exit(0); + } + if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) { if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) IBERROR("smp query nodeinfo failed"); From ogerlitz at voltaire.com Thu Feb 26 04:41:40 2009 From: 
ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 26 Feb 2009 14:41:40 +0200 (IST) Subject: [ofa-general] Re: [PATCH 2/2] perfquery: add PortXmtDataSL/PortRcvDataSL read and reset In-Reply-To: References: Message-ID: Hi Sasha, For some reason the Xmt SL help is printed twice, any idea why? Or. ./infiniband-diags/src/perfquery -h Usage: ./infiniband-diags/src/perfquery [options] [ [[port] [reset_mask]]] Options: --extended, -x show extended port counters --xmtsl, -s show Xmt SL port counters --rcvsl, -S show Rcv SL port counters --all_ports, -a show aggregated counters --loop_ports, -l iterate through each port --reset_after_read, -r reset counters after read --Reset_only, -R only reset counters --Ca, -C Ca name to use --Port, -P Ca port number to use --Lid, -L use LID address argument --Guid, -G use GUID address argument --timeout, -t timeout in ms --xmtsl, -s show Xmt SL port counters --errors, -e show send and receive errors --verbose, -v increase verbosity level --debug, -d raise debug level --usage, -u usage message --help, -h help message --version, -V show version From hal.rosenstock at gmail.com Thu Feb 26 06:06:16 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 26 Feb 2009 09:06:16 -0500 Subject: Re: [ofa-general] [PATCH 1/2] libibmad: add PortXmtDataSL / PortRcvDataSL support In-Reply-To: References: Message-ID: On Thu, Feb 26, 2009 at 7:38 AM, Or Gerlitz wrote: > libibmad implementation of PortXmtDataSL (IBA A13.6.5) / PortRcvDataSL > (IBA A13.6.6) reading and resetting > > Signed-off-by: Or Gerlitz > > Index: management/libibmad/include/infiniband/mad.h > =================================================================== > --- management.orig/libibmad/include/infiniband/mad.h > +++ management/libibmad/include/infiniband/mad.h > @@ -153,7 +153,8 @@ enum GSI_ATTR_ID { >        IB_GSI_PORT_SAMPLES_RESULT = 0x11, >        IB_GSI_PORT_COUNTERS = 0x12, >        IB_GSI_PORT_COUNTERS_EXT = 0x1D, > - > +       IB_GSI_PORT_XMIT_DATA_SL
= 0x36, > +       IB_GSI_PORT_RCV_DATA_SL  = 0x37, >        IB_GSI_ATTR_LAST >  }; > > @@ -421,6 +422,28 @@ enum MAD_FIELDS { >        IB_PC_XMT_WAIT_F, >        IB_PC_LAST_F, > > +       IB_PC_XMT_DATA_SL_FIRST_F, > +       IB_PC_XMT_DATA_SL0_F = IB_PC_XMT_DATA_SL_FIRST_F, > +       IB_PC_XMT_DATA_SL1_F, > +       IB_PC_XMT_DATA_SL2_F, > +       IB_PC_XMT_DATA_SL3_F, > +       IB_PC_XMT_DATA_SL4_F, > +       IB_PC_XMT_DATA_SL5_F, > +       IB_PC_XMT_DATA_SL6_F, > +       IB_PC_XMT_DATA_SL7_F, > +       IB_PC_XMT_DATA_SL_LAST_F, > + > +       IB_PC_RCV_DATA_SL_FIRST_F, > +       IB_PC_RCV_DATA_SL0_F = IB_PC_RCV_DATA_SL_FIRST_F, > +       IB_PC_RCV_DATA_SL1_F, > +       IB_PC_RCV_DATA_SL2_F, > +       IB_PC_RCV_DATA_SL3_F, > +       IB_PC_RCV_DATA_SL4_F, > +       IB_PC_RCV_DATA_SL5_F, > +       IB_PC_RCV_DATA_SL6_F, > +       IB_PC_RCV_DATA_SL7_F, > +       IB_PC_RCV_DATA_SL_LAST_F, > + Any reason to restrict this to SL0-7 rather than the complete SL range ? -- Hal [snip...] From purdy at sgi.com Thu Feb 26 06:31:27 2009 From: purdy at sgi.com (Dale Purdy) Date: Thu, 26 Feb 2009 08:31:27 -0600 Subject: [ofa-general] Re: [PATCH] opensm: Implement weighted routing In-Reply-To: <829ded920902252051g283b9e84vffce832452d241ac@mail.gmail.com> References: <829ded920902252051g283b9e84vffce832452d241ac@mail.gmail.com> Message-ID: <20090226143127.GA28285@sgi.com> On Thu, Feb 26, 2009 at 10:21:43AM +0530, Keshetti Mahesh wrote: > Hello Dale Purdy, > > I have a requirement where I have to set the some hop's weight > factor to zero. Is this supported by your patch ? > I have implemented something similar to it before but it lead to > loops in the routing table. Does your patch take care of those things ? > > -Mahesh No, the accepted values for the hop weight are 1 - 0xff. I suppose one could allow a value of zero though. 
Or one could raise the weight factor for the other ports on the switch to a large value so that the one you are trying to force traffic through is highly favored in comparison. Whenever you are manipulating the hop weight factors, you had better know what you are doing, since it alters the behavior of the routing engines and could then induce credit loops. In our case we are using this to separate MPI traffic from I/O traffic and at the same time eliminate credit loops. -- Dale From ogerlitz at voltaire.com Thu Feb 26 06:54:59 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 26 Feb 2009 16:54:59 +0200 Subject: [ofa-general] [PATCH 1/2] libibmad: add PortXmtDataSL / PortRcvDataSL support In-Reply-To: References: Message-ID: <49A6AD43.4000706@voltaire.com> Hal Rosenstock wrote: > Any reason to restrict this to SL0-7 rather than the complete SL range? > Not really, I can fix that. Or. From dorfman.eli at gmail.com Thu Feb 26 07:32:40 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 26 Feb 2009 17:32:40 +0200 Subject: [ofa-general] [PATCH 1/2] include/opensm/osm_opensm.h add setup function to routing engine. Message-ID: <49A6B618.1090300@gmail.com> add setup function to routing engine. call it only when we want to use this routing engine. this will save allocation for routing algorithms that are not used.
Signed-off-by: Eli Dorfman --- opensm/include/opensm/osm_opensm.h | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h index c121be4..6191530 100644 --- a/opensm/include/opensm/osm_opensm.h +++ b/opensm/include/opensm/osm_opensm.h @@ -122,6 +122,8 @@ typedef enum _osm_routing_engine_type { struct osm_routing_engine { const char *name; void *context; + int initialized; + int (*setup) (void *re, void *p_osm); int (*build_lid_matrices) (void *context); int (*ucast_build_fwd_tables) (void *context); void (*ucast_dump_tables) (void *context); -- 1.5.5 From dorfman.eli at gmail.com Thu Feb 26 07:36:11 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 26 Feb 2009 17:36:11 +0200 Subject: [ofa-general] [PATCH 2/2] opensm: setup routing engine when in use and delete when fail In-Reply-To: <49A6B618.1090300@gmail.com> References: <49A6B618.1090300@gmail.com> Message-ID: <49A6B6EB.80700@gmail.com> setup routing engine when in use and delete when fail setup routing engine before use. delete resources when routing algorithm fails this will save allocation for routing algorithms that are not used. 
Signed-off-by: Eli Dorfman --- opensm/opensm/osm_opensm.c | 20 ++++++-------------- opensm/opensm/osm_ucast_mgr.c | 34 +++++++++++++++++++++++++++++++++- 2 files changed, 39 insertions(+), 15 deletions(-) diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c index 7de2e5b..a2620d5 100644 --- a/opensm/opensm/osm_opensm.c +++ b/opensm/opensm/osm_opensm.c @@ -169,21 +169,14 @@ static void setup_routing_engine(osm_opensm_t *osm, const char *name) memset(re, 0, sizeof(struct osm_routing_engine)); re->name = m->name; - if (m->setup(re, osm)) { - OSM_LOG(&osm->log, OSM_LOG_VERBOSE, - "setup of routing" - " engine \'%s\' failed\n", name); - return; - } - OSM_LOG(&osm->log, OSM_LOG_DEBUG, - "\'%s\' routing engine set up\n", re->name); + re->setup = m->setup; append_routing_engine(osm, re); return; } } OSM_LOG(&osm->log, OSM_LOG_ERROR, - "cannot find or setup routing engine \'%s\'", name); + "cannot find or setup routing engine \'%s\'\n", name); } static void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) @@ -224,18 +217,17 @@ void osm_opensm_construct(IN osm_opensm_t * const p_osm) /********************************************************************** **********************************************************************/ -static void destroy_routing_engines(osm_opensm_t *osm) +static void destroy_routing_engines(struct osm_routing_engine **re) { struct osm_routing_engine *r, *next; - next = osm->routing_engine_list; + next = *re; while (next) { r = next; next = r->next; - if (r->delete) - r->delete(r->context); free(r); } + *re = NULL; } /********************************************************************** @@ -289,7 +281,7 @@ void osm_opensm_destroy(IN osm_opensm_t * const p_osm) /* do the destruction in reverse order as init */ destroy_plugins(p_osm); - destroy_routing_engines(p_osm); + destroy_routing_engines(&p_osm->routing_engine_list); osm_sa_destroy(&p_osm->sa); osm_sm_destroy(&p_osm->sm); #ifdef ENABLE_OSM_PERF_MGR diff --git 
a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index e404c91..7175926 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -886,7 +886,6 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl; p_osm = p_mgr->p_subn->p_osm; - p_routing_eng = p_osm->routing_engine_list; CL_PLOCK_EXCL_ACQUIRE(p_mgr->p_lock); @@ -897,10 +896,30 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) ucast_mgr_setup_all_switches(p_mgr->p_subn) < 0) goto Exit; + /* update the entry in active list */ + p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_NONE; + p_routing_eng = p_osm->routing_engine_list; while (p_routing_eng) { + if (!p_routing_eng->initialized && + p_routing_eng->setup(p_routing_eng, p_osm)) { + OSM_LOG(p_mgr->p_log, OSM_LOG_VERBOSE, + "setup of routing engine \'%s\' failed\n", + p_routing_eng->name); + p_routing_eng = p_routing_eng->next; + continue; + } + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, + "\'%s\' routing engine set up\n", p_routing_eng->name); + p_routing_eng->initialized = 1; + if (!ucast_mgr_route(p_routing_eng, p_osm)) break; + + /* delete unused routing engine */ + if (p_routing_eng->delete) + p_routing_eng->delete(p_routing_eng->context); + p_routing_eng = p_routing_eng->next; } @@ -911,6 +930,19 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; } + /* if for some reason different routing engine is used */ + /* cleanup unused routing engine */ + p_routing_eng = p_osm->routing_engine_list; + while (p_routing_eng) { + if (p_routing_eng->initialized && + p_osm->routing_engine_used != + osm_routing_engine_type(p_routing_eng->name) && + p_routing_eng->delete) + p_routing_eng->delete(p_routing_eng->context); + + p_routing_eng = p_routing_eng->next; + } + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, "%s tables configured on all switches\n", osm_routing_engine_type_str(p_osm->routing_engine_used)); 
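The lazy-setup flow that this patch adds to osm_ucast_mgr_process() (run an engine's setup hook only the first time that engine is tried, release its state with delete if it fails to route, and move on to the next engine in the list) can be sketched as a standalone C program. This is a minimal illustration under stated assumptions, not OpenSM code: struct engine, run_engines(), and the setup_ok/route_ok/route_fail/delete_ctx callbacks are hypothetical stand-ins for struct osm_routing_engine, its setup and delete hooks, and ucast_mgr_route().

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct osm_routing_engine (not the OpenSM type). */
struct engine {
	const char *name;
	void *context;
	int initialized;
	int (*setup)(struct engine *e);	/* returns 0 on success */
	int (*route)(struct engine *e);	/* returns 0 on success */
	void (*delete)(void *context);
	struct engine *next;
};

/* Example callbacks: setup allocates per-engine state lazily. */
static int setup_ok(struct engine *e)
{
	e->context = malloc(64);
	return e->context ? 0 : -1;
}

static int route_ok(struct engine *e) { (void)e; return 0; }
static int route_fail(struct engine *e) { (void)e; return -1; }
static void delete_ctx(void *context) { free(context); }

/* Try each engine in list order and return the first one that routes,
 * or NULL so the caller can fall back to a default (min-hop in OpenSM). */
static struct engine *run_engines(struct engine *list)
{
	struct engine *e;

	for (e = list; e; e = e->next) {
		/* Only engines that are actually tried allocate anything. */
		if (!e->initialized) {
			if (e->setup(e) != 0)
				continue;	/* setup failed: skip this engine */
			e->initialized = 1;
		}
		if (e->route(e) == 0)
			return e;	/* first successful engine is used */

		/* Routing failed: release what setup allocated and mark the
		 * engine uninitialized so a later pass sets it up again. */
		if (e->delete)
			e->delete(e->context);
		e->context = NULL;
		e->initialized = 0;
	}
	return NULL;
}
```

For example, with a list where the first engine routes unsuccessfully and the second succeeds, run_engines() frees the first engine's state and returns the second; on the next call the second engine is reused without running setup again. One design choice in the sketch: it clears initialized after calling delete, so a later pass runs setup again instead of reusing released state.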
-- 1.5.5 From dorfman.eli at gmail.com Thu Feb 26 07:43:31 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 26 Feb 2009 17:43:31 +0200 Subject: [ofa-general] [PATCH 1/2] include/opensm/osm_opensm.h support routing engine update Message-ID: <49A6B8A3.2020703@gmail.com> support routing engine update. add prev routing engine list. save active routing engine list as prev routing engine list. this is used to cleanup used routing engine allocation if needed and only after new routing engine was configured. Signed-off-by: Eli Dorfman --- opensm/include/opensm/osm_opensm.h | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h index 6191530..c8b91a0 100644 --- a/opensm/include/opensm/osm_opensm.h +++ b/opensm/include/opensm/osm_opensm.h @@ -185,6 +185,7 @@ typedef struct osm_opensm { cl_dispatcher_t disp; cl_plock_t lock; struct osm_routing_engine *routing_engine_list; + struct osm_routing_engine *prev_routing_engine_list; osm_routing_engine_type_t routing_engine_used; osm_stats_t stats; osm_console_t console; @@ -525,5 +526,7 @@ extern volatile unsigned int osm_exit_flag; * Set to one to cause all threads to leave *********/ +void update_routing_engines(osm_opensm_t *osm, const char *engine_names); + END_C_DECLS #endif /* _OSM_OPENSM_H_ */ -- 1.5.5 From dorfman.eli at gmail.com Thu Feb 26 07:49:02 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 26 Feb 2009 17:49:02 +0200 Subject: [ofa-general] [PATCH 2/2] opensm routing engine update In-Reply-To: <49A6B8A3.2020703@gmail.com> References: <49A6B8A3.2020703@gmail.com> Message-ID: <49A6B9EE.7000008@gmail.com> support routing engine update. save active routing engine list as prev routing engine list. this is used to cleanup used routing engine allocation if needed and only after new routing engine was configured.
Signed-off-by: Eli Dorfman --- opensm/opensm/osm_opensm.c | 9 +++++++++ opensm/opensm/osm_subnet.c | 10 +++++++++- opensm/opensm/osm_ucast_mgr.c | 22 +++++++++++++++++++++- 3 files changed, 39 insertions(+), 2 deletions(-) diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c index a2620d5..6ab28be 100644 --- a/opensm/opensm/osm_opensm.c +++ b/opensm/opensm/osm_opensm.c @@ -230,6 +230,15 @@ static void destroy_routing_engines(struct osm_routing_engine **re) *re = NULL; } +void update_routing_engines(osm_opensm_t *osm, const char *engine_names) +{ + /* cleanup prev routing engine list and replace with current list */ + destroy_routing_engines(&osm->prev_routing_engine_list); + osm->prev_routing_engine_list = osm->routing_engine_list; + osm->routing_engine_list = NULL; + setup_routing_engines(osm, engine_names); +} + /********************************************************************** **********************************************************************/ static void destroy_plugins(osm_opensm_t *osm) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index b3100a4..1ba5c91 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -151,6 +151,14 @@ static void opts_setup_sm_priority(osm_subn_t *p_subn, void *p_val) osm_set_sm_priority(p_sm, sm_priority); } +static void opts_setup_routing_engine(osm_subn_t *p_subn, void *p_val) +{ + osm_opensm_t *p_osm = p_subn->p_osm; + char *engines = (char *) p_val; + + update_routing_engines(p_osm, engines); +} + static void opts_parse_net64(IN osm_subn_t *p_subn, IN char *p_key, IN char *p_val_str, void *p_v1, void *p_v2, void (*pfn)(osm_subn_t *, void *)) @@ -324,7 +332,7 @@ static const opt_rec_t opt_tbl[] = { { "port_prof_ignore_file", OPT_OFFSET(port_prof_ignore_file), opts_parse_charp, NULL, 0 }, { "port_profile_switch_nodes", OPT_OFFSET(port_profile_switch_nodes), opts_parse_boolean, NULL, 1 }, { "sweep_on_trap", OPT_OFFSET(sweep_on_trap), opts_parse_boolean, NULL, 1 }, 
- { "routing_engine", OPT_OFFSET(routing_engine_names), opts_parse_charp, NULL, 0 }, + { "routing_engine", OPT_OFFSET(routing_engine_names), opts_parse_charp, opts_setup_routing_engine, 1 }, { "connect_roots", OPT_OFFSET(connect_roots), opts_parse_boolean, NULL, 1 }, { "use_ucast_cache", OPT_OFFSET(use_ucast_cache), opts_parse_boolean, NULL, 1 }, { "log_file", OPT_OFFSET(log_file), opts_parse_charp, NULL, 0 }, diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 7175926..cda9f34 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -879,7 +879,7 @@ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t *osm) int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) { osm_opensm_t *p_osm; - struct osm_routing_engine *p_routing_eng; + struct osm_routing_engine *p_routing_eng, *r; cl_qmap_t *p_sw_guid_tbl; OSM_LOG_ENTER(p_mgr->p_log); @@ -896,6 +896,26 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) ucast_mgr_setup_all_switches(p_mgr->p_subn) < 0) goto Exit; + /* find used routing engine in previous list */ + r = p_osm->prev_routing_engine_list; + while (r) { + if (p_osm->routing_engine_used == + osm_routing_engine_type(r->name)) + { + p_routing_eng = p_osm->routing_engine_list; + while (p_routing_eng) { + if (p_osm->routing_engine_used == + osm_routing_engine_type(p_routing_eng->name)) { + memcpy(p_routing_eng, r, sizeof(*p_routing_eng)); + break; + } + p_routing_eng = p_routing_eng->next; + } + break; + } + r = r->next; + } + /* update the entry in active list */ p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_NONE; -- 1.5.5 From ramachandra.kuchimanchi at qlogic.com Thu Feb 26 08:38:38 2009 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 26 Feb 2009 22:08:38 +0530 (IST) Subject: [ofa-general] [PATCH] ib_mad: Fix RMPP header RRespTime manipulation Message-ID: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095> Fix 
ib_set_rmpp_flags() to use the correct bit mask for RRespTime. In the 8-bit field of the RMPP header, the first 5 bits are RRespTime and next 3 bits are RMPPFlags. Hence to retain the first 5 bits, the mask should be 0xF8 instead of 0xF1. Signed-off-by: Ramachandra K --- diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h index 5f6c40f..1a0f409 100644 --- a/include/rdma/ib_mad.h +++ b/include/rdma/ib_mad.h @@ -290,7 +290,7 @@ static inline void ib_set_rmpp_resptime(struct ib_rmpp_hdr *rmpp_hdr, u8 rtime) */ static inline void ib_set_rmpp_flags(struct ib_rmpp_hdr *rmpp_hdr, u8 flags) { - rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF1) | + rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF8) | (flags & 0x7); } From jackm at dev.mellanox.co.il Thu Feb 26 08:49:40 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 26 Feb 2009 18:49:40 +0200 Subject: [ofa-general] Re: Problem in IB network without Switch In-Reply-To: References: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com> <200902261335.59927.jackm@dev.mellanox.co.il> Message-ID: <200902261849.40448.jackm@dev.mellanox.co.il> You are running VERY old firmware (from 2004), and moreover, on one host you have 3.0.0, and on the other 3.1.0. You need to upgrade your firmware. Contact your Mellanox FAE (support engineer) for instructions. - Jack > Hi Jack, > > Please find the output of ibstat on both the nodes. > > [root at mattool ~]# /opt/ofed/extras/hca_self_test.ofed > HCA Firmware Check ..................... FAIL > REASON: mismatch HCA #0 firmware detected (found v, need v3.5.917) > Host Driver Initialization .............
PASS > > [root at mattool ~]# > > ************ IBSTAT output ****************** > > > [root at mattool ~]# ibstat > CA 'mthca0' > CA type: MT23108 > Number of ports: 2 > Firmware version: 3.1.0 > [root at compute-0-0 ~]# ibstat > CA 'mthca0' > CA type: MT23108 > Number of ports: 2 > Firmware version: 3.0.0 From cameron at harr.org Thu Feb 26 09:19:39 2009 From: cameron at harr.org (Cameron Harr) Date: Thu, 26 Feb 2009 10:19:39 -0700 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49A57256.2000005@harr.org> References: <48E386F6.5040502@fusionio.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net> <49A42BE9.4030603@harr.org> <49A43439.7080405@vlnb.net> <49A4812A.8050202@harr.org> <49A57256.2000005@harr.org> Message-ID: <49A6CF2B.4010002@harr.org> Cameron Harr wrote: > Cameron Harr wrote: > I re-compiled and re-ran the tests and numbers are a little better but > performance still seems to have gone down from 673: > Test 1:373751.66 > Test 2:371242.6067 > Test 3:347988.1467 > Test 4:378247.31 > Test 5:375616.53 I was curious and did a regression test with 673 and those numbers are now even worse, so I'll presume there is an issue on my system and
not the SCST code: Test 1:365204.3067 Test 2:364152.2067 Test 3:340665.7633 Test 4:369916.8133 Test 5:369093.5833 From hal.rosenstock at gmail.com Thu Feb 26 10:02:53 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 26 Feb 2009 13:02:53 -0500 Subject: [ofa-general] Re: [ewg] [PATCH] ib_mad: Fix RMPP header RRespTime manipulation In-Reply-To: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095> References: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095> Message-ID: On Thu, Feb 26, 2009 at 11:38 AM, Ramachandra K wrote: > Fix ib_set_rmpp_flags() to use the correct bit mask for RRespTime. > In the 8-bit field of the RMPP header, the first 5 bits > are RRespTime and next 3 bits are RMPPFlags. Hence to retain > the first 5 bits, the mask should be 0xF8 instead of 0xF1. > > Signed-off-by: Ramachandra K > --- > > diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h > index 5f6c40f..1a0f409 100644 > --- a/include/rdma/ib_mad.h > +++ b/include/rdma/ib_mad.h > @@ -290,7 +290,7 @@ static inline void ib_set_rmpp_resptime(struct ib_rmpp_hdr *rmpp_hdr, u8 rtime) >  */ >  static inline void ib_set_rmpp_flags(struct ib_rmpp_hdr *rmpp_hdr, u8 flags) >  { > -       rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF1) | > +       rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF8) | Looks right to me. Sean ? 
-- Hal >                                     (flags & 0x7); >  } > > > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > From sean.hefty at intel.com Thu Feb 26 10:07:36 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Feb 2009 10:07:36 -0800 Subject: [ofa-general] [PATCH] ib_mad: Fix RMPP header RRespTime manipulation In-Reply-To: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095> References: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095> Message-ID: <2B352424BBF540719F498B8DE04F1019@amr.corp.intel.com> >Fix ib_set_rmpp_flags() to use the correct bit mask for RRespTime. >In the 8-bit field of the RMPP header, the first 5 bits >are RRespTime and next 3 bits are RMPPFlags. Hence to retain >the first 5 bits, the mask should be 0xF8 instead of 0xF1. > >Signed-off-by: Ramachandra K Good catch. Acked-by: Sean Hefty From sean.hefty at intel.com Thu Feb 26 10:13:00 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Feb 2009 10:13:00 -0800 Subject: [ofa-general] [PATCH 1/2] libibmad: add PortXmtDataSL / PortRcvDataSL support In-Reply-To: References: Message-ID: <3B25B2D61996446F88703F647919FC4E@amr.corp.intel.com> >+MAD_EXPORT uint8_t *port_performance_xmt_sl_query(void *rcvbuf, ib_portid_t * >dest, >+ int port, unsigned timeout); >+MAD_EXPORT uint8_t *port_performance_rcv_sl_query(void *rcvbuf, ib_portid_t * >dest, >+ int port, unsigned timeout); >+MAD_EXPORT uint8_t *port_performance_xmt_sl_reset(void *rcvbuf, ib_portid_t * >dest, >+ int port, unsigned mask, >+ unsigned timeout); >+MAD_EXPORT uint8_t *port_performance_rcv_sl_reset(void *rcvbuf, ib_portid_t * >dest, >+ int port, unsigned mask, >+ unsigned timeout); > MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * >dest, > int port, unsigned timeout); > MAD_EXPORT uint8_t 
*port_samples_result_query(void *rcvbuf, ib_portid_t * >dest, {snip} >+uint8_t *port_performance_xmt_sl_query(void *rcvbuf, ib_portid_t * dest, int >port, >+ unsigned timeout) >+{ >+ return pma_query(rcvbuf, dest, port, timeout, >IB_GSI_PORT_XMIT_DATA_SL); >+} >+ >+uint8_t *port_performance_rcv_sl_query(void *rcvbuf, ib_portid_t * dest, int >port, >+ unsigned timeout) >+{ >+ return pma_query(rcvbuf, dest, port, timeout, IB_GSI_PORT_RCV_DATA_SL); >+} >+ > uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, > unsigned timeout, const void *srcport) >@@ -208,6 +220,20 @@ uint8_t *port_performance_ext_reset(void > IB_GSI_PORT_COUNTERS_EXT); > } > >+uint8_t *port_performance_xmt_sl_reset(void *rcvbuf, ib_portid_t * dest, int >port, >+ unsigned mask, unsigned timeout) >+{ >+ return performance_reset(rcvbuf, dest, port, mask, timeout, >+ IB_GSI_PORT_XMIT_DATA_SL); >+} >+ >+uint8_t *port_performance_rcv_sl_reset(void *rcvbuf, ib_portid_t * dest, int >port, >+ unsigned mask, unsigned timeout) >+{ >+ return performance_reset(rcvbuf, dest, port, mask, timeout, >+ IB_GSI_PORT_RCV_DATA_SL); >+} >+ Rather than continue to add more and more interfaces to the library, can we just export a couple of more generic calls? - Sean From sean.hefty at intel.com Thu Feb 26 10:19:44 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Feb 2009 10:19:44 -0800 Subject: [ofa-general] RE: [PATCH 2/6] [ib-diag] ibroute: add support for WinOF In-Reply-To: <20090226101144.GB11192@sashak.voltaire.com> References: <20090226101144.GB11192@sashak.voltaire.com> Message-ID: <5294A95D031B46038380A7F7BA6F0248@amr.corp.intel.com> I'll give your 2 patches a try. Is there a way that you can enable type mismatch warnings? If so, that would probably show the same issues. 
- Sean From Bill.Boas at openfabrics.org Thu Feb 26 10:06:18 2009 From: Bill.Boas at openfabrics.org (Bill Boas) Date: Thu, 26 Feb 2009 10:06:18 -0800 Subject: [ofa-general] Sun is asking for a response on Bug No.s 1522, 1523 In-Reply-To: <897D8E97-60A7-4FCB-BD6A-45228C3B4912@Sun.COM> References: <1235592409.22158.80.camel@pc.interlinx.bc.ca> <897D8E97-60A7-4FCB-BD6A-45228C3B4912@Sun.COM> Message-ID: Dear OFA Maintainers: Sun has contacted me about these bugs and is asking for priority action to get fixes for them. See the thread below. I'm sending this email to 3 OFA lists because Sun having to contact the Alliance this way raises many questions about how a vendor like Sun or a customer like, say, a Wall St bank gets bugs fixed and where do we, the OFA, publish this information? And if we agree that an OFA maintainer is the right person to fix the bug how does that maintainer learn that and how do they respond? Thank you for responding to this email and to the Sun team using OFED in their products. Bill. Bill Boas Executive Director and Vice Chair OpenFabrics Alliance 510-375-8840 Bill.Boas at openfabrics.org www.openfabrics.org _____ From: Bryon.Neitzel at Sun.COM [mailto:Bryon.Neitzel at Sun.COM] Sent: Wednesday, February 25, 2009 12:20 PM To: Bill Boas Cc: Peter Jones; Brian J. Murrell; Makia Minich Subject: Fwd: OFED 1.4 for NHM chips Hi Bill, I found the bug numbers for the Lustre build issues against OFED 1.4.0 that I mentioned yesterday. Is there any way to bump up the priority on these? This is blocking our ability to deliver our new Vayu (Nehalem+QDR) hardware to our customers, since Mellanox says they'll only support OFED 1.4 with QDR hardware. Thanks, Bryon Begin forwarded message: From: "Brian J. Murrell" Date: February 25, 2009 1:06:49 PM MST To: Bryon Neitzel Cc: Peter Jones Subject: Re: OFED 1.4 for NHM chips On Wed, 2009-02-25 at 12:59 -0700, Bryon Neitzel wrote: Hi Brian, what is the website where these OFED bugs were filed? 
https://bugs.openfabrics.org/show_bug.cgi?id=1522 https://bugs.openfabrics.org/show_bug.cgi?id=1523 Bill Boas tried to find any OFED bugs opened by Sun yesterday, Yeah. I opened them a number of days ago. Not sure if it matters to anyone, but IMHO, 1523 is the correct future direction and fixing that would implicitly fix 1522. 1523 basically describes separating the "technology preview" that caused this breakage out into its own independent module so that it does not pollute the OFED core. As such, it might be resisted as it's a "deeper cut" type of fix, but one that removes the sick part rather than trying to continue to bandage it. b. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Thu Feb 26 11:06:39 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 21:06:39 +0200 Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support for WinOF In-Reply-To: <5294A95D031B46038380A7F7BA6F0248@amr.corp.intel.com> References: <20090226101144.GB11192@sashak.voltaire.com> <5294A95D031B46038380A7F7BA6F0248@amr.corp.intel.com> Message-ID: <20090226190639.GI14238@sashak.voltaire.com> On 10:19 Thu 26 Feb , Sean Hefty wrote: > I'll give your 2 patches a try. Is there a way that you can enable type > mismatch warnings? If so, that would probably show the same issues. Actually yes, I can use -Wsign-compare -Wconversion (although it produces many more warnings :)).
Sasha From robert.j.woodruff at intel.com Thu Feb 26 11:05:49 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 26 Feb 2009 11:05:49 -0800 Subject: [ofa-general] RE: Sun is asking for a response on Bug No.s 1522, 1523 In-Reply-To: References: <1235592409.22158.80.camel@pc.interlinx.bc.ca> <897D8E97-60A7-4FCB-BD6A-45228C3B4912@Sun.COM> Message-ID: <382A478CAD40FA4FB46605CF81FE39F41F1BFD66@orsmsx507.amr.corp.intel.com> Bill, since the bug was submitted as P3 and normal, rather than P1 as critical or blocker, it was probably overlooked as something that had to be fixed in 1.4. I changed the bug to a blocker P1 bug for OFED 1.4.1, so it should now show up on the bug tracking list that gets reviewed in the EWG. In addition to entering the bug into bugzilla, people may also want to send an email to the maintainer for blocker bugs, so that they know it needs quick attention. It is assigned to Jeff Becker, the NFS/RDMA maintainer. I know that there were some issues with backports for NFS/RDMA in 1.4 and believe he is trying to get these fixed for OFED 1.4.1. woody ________________________________ From: Bill Boas [mailto:Bill.Boas at openfabrics.org] Sent: Thursday, February 26, 2009 10:06 AM To: general at lists.openfabrics.org; ewg at lists.openfabrics.org; wwg at lists.openfabrics.org Cc: 'Peter Jones'; 'Brian J. Murrell'; 'Makia Minich'; Bryon.Neitzel at Sun.COM Subject: Sun is asking for a response on Bug No.s 1522, 1523 Dear OFA Maintainers: Sun has contacted me about these bugs and is asking for priority action to get fixes for them. See the thread below. I'm sending this email to 3 OFA lists because Sun having to contact the Alliance this way raises many questions about how a vendor like Sun or a customer like, say, a Wall St bank gets bugs fixed and where do we, the OFA, publish this information? And if we agree that an OFA maintainer is the right person to fix the bug how does that maintainer learn that and how do they respond?
Thank you for responding to this email and to the Sun team using OFED in their products. Bill. Bill Boas Executive Director and Vice Chair OpenFabrics Alliance 510-375-8840 Bill.Boas at openfabrics.org www.openfabrics.org ________________________________ From: Bryon.Neitzel at Sun.COM [mailto:Bryon.Neitzel at Sun.COM] Sent: Wednesday, February 25, 2009 12:20 PM To: Bill Boas Cc: Peter Jones; Brian J. Murrell; Makia Minich Subject: Fwd: OFED 1.4 for NHM chips Hi Bill, I found the bug numbers for the Lustre build issues against OFED 1.4.0 that I mentioned yesterday. Is there any way to bump up the priority on these? This is blocking our ability to deliver our new Vayu (Nehalem+QDR) hardware to our customers, since Mellanox says they'll only support OFED 1.4 with QDR hardware. Thanks, Bryon Begin forwarded message: From: "Brian J. Murrell" Date: February 25, 2009 1:06:49 PM MST To: Bryon Neitzel Cc: Peter Jones Subject: Re: OFED 1.4 for NHM chips On Wed, 2009-02-25 at 12:59 -0700, Bryon Neitzel wrote: Hi Brian, what is the website where these OFED bugs were filed? https://bugs.openfabrics.org/show_bug.cgi?id=1522 https://bugs.openfabrics.org/show_bug.cgi?id=1523 Bill Boas tried to find any OFED bugs opened by Sun yesterday, Yeah. I opened them a number of days ago. Not sure if it matters to anyone, but IMHO, 1523 is the correct future direction and fixing that would implicitly fix 1522. 1523 basically describes separating the "technology preview" that caused this breakage out into its own independent module so that it does not pollute the OFED core. As such, it might be resisted as it's a "deeper cut" type of fix, but one that removes the sick part rather than trying to continue to bandage it. b.
From ramachandra.kuchimanchi at qlogic.com Thu Feb 26 11:09:27 2009 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Fri, 27 Feb 2009 00:39:27 +0530 Subject: RE: [ofa-general] [PATCH] ib_mad: Fix RMPP header RRespTime manipulation In-Reply-To: <2B352424BBF540719F498B8DE04F1019@amr.corp.intel.com> References: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095> <2B352424BBF540719F498B8DE04F1019@amr.corp.intel.com> Message-ID: <71d336490902261109n583f5b26gc9bf6fbee02e092e@mail.gmail.com> On Thu, Feb 26, 2009 at 11:37 PM, Sean Hefty wrote: >>Fix ib_set_rmpp_flags() to use the correct bit mask for RRespTime. >>In the 8-bit field of the RMPP header, the first 5 bits >>are RRespTime and next 3 bits are RMPPFlags. Hence to retain >>the first 5 bits, the mask should be 0xF8 instead of 0xF1. >> >>Signed-off-by: Ramachandra K > > Good catch. > > Acked-by: Sean Hefty > Just to add some more information - drivers/infiniband/core/mad_rmpp.c:ack_recv()--->format_ack() calls ib_set_rmpp_flags() and due to the incorrect ANDing with 0xF1, RRespTime got changed incorrectly and RMPP Acks sent back always had an RRespTime of 0x1E (30), which caused the other end to consider the timeouts to be approximately 4297 seconds (i.e. on the order of 4*2^30 microseconds) instead of the usual ~4 seconds (on the order of 4*2^20 microseconds).
Regards, Ram From vst at vlnb.net Thu Feb 26 11:49:51 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Thu, 26 Feb 2009 22:49:51 +0300 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49A6CF2B.4010002@harr.org> References: <48E386F6.5040502@fusionio.com> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net> <49A42BE9.4030603@harr.org> <49A43439.7080405@vlnb.net> <49A4812A.8050202@harr.org> <49A57256.2000005@harr.org> <49A6CF2B.4010002@harr.org>
Message-ID: <49A6F25F.8060306@vlnb.net> Cameron Harr, on 02/26/2009 08:19 PM wrote: > Cameron Harr wrote: >> Cameron Harr wrote: >> I re-compiled and re-ran the tests and numbers are a little better but >> performance still seems to have gone down from 673: >> Test 1:373751.66 >> Test 2:371242.6067 >> Test 3:347988.1467 >> Test 4:378247.31 >> Test 5:375616.53 > I was curious and did a regression test with 673 and those numbers are > now even worse, so I'll presume there is an issue on my system and not > the SCST code: > Test 1:365204.3067 > Test 2:364152.2067 > Test 3:340665.7633 > Test 4:369916.8133 > Test 5:369093.5833 It's known that any OS, including Linux, gets "tired" under load as time since boot passes, which leads to worse performance. I guess you may be seeing such an effect. Check with r634. Revision r635 contains a change to cache locality in data structures that was intended to improve performance a bit, but it might make it worse instead. Vlad From sean.hefty at intel.com Thu Feb 26 12:07:58 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Feb 2009 12:07:58 -0800 Subject: [ofa-general] RE: [PATCH 2/6] [ib-diag] ibroute: add support for WinOF In-Reply-To: <20090226101144.GB11192@sashak.voltaire.com> References: <20090226101144.GB11192@sashak.voltaire.com> Message-ID: <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> Both of your patches (ibnetdiscover and ibroute) build on winof. I replaced my 2 patches with yours, updated to the latest codebase, and pushed everything: git://git.openfabrics.org/~shefty/ib-mgmt.git master Were there changes to the other patches that you wanted (including saquery, which wasn't part of the numbered series)?
- Sean From sashak at voltaire.com Thu Feb 26 13:02:19 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 23:02:19 +0200 Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support for WinOF In-Reply-To: <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> Message-ID: <20090226210211.GK14238@sashak.voltaire.com> On 12:07 Thu 26 Feb , Sean Hefty wrote: > Both of your patches (ibnetdiscover and ibroute) build on winof. I replaced my > 2 patches with yours, updated to the latest codebase, and pushed everything: > > git://git.openfabrics.org/~shefty/ib-mgmt.git master > > Were there changes to the other patches that you wanted (including saquery, > which wasn't part of the numbered series)? Thanks. I applied everything except ibsysstat.c and saquery.c. Wanted to clarify some things there: > From 49f28a63589be21dd7218922ed9d0b2b719a92c2 Mon Sep 17 00:00:00 2001 > From: Sean Hefty > Date: Thu, 26 Feb 2009 10:12:07 -0800 > Subject: [PATCH 1/2] [ib-diag] ibsysstat: add support for WinOF > > Signed-off-by: Sean Hefty > --- > infiniband-diags/src/ibsysstat.c | 6 +++--- > 1 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c > index cc1418d..b9f2f85 100644 > --- a/infiniband-diags/src/ibsysstat.c > +++ b/infiniband-diags/src/ibsysstat.c > @@ -183,7 +183,7 @@ static char *ibsystat_serv(void) > > DEBUG("got packet: attr 0x%x mod 0x%x", attr, mod); > > - size = mk_reply(attr, mad + IB_VENDOR_RANGE2_DATA_OFFS, > + size = mk_reply(attr, (char *) mad + IB_VENDOR_RANGE2_DATA_OFFS, What is the reason for such void * to char * casting? 
> sizeof(buf) - umad_size() - IB_VENDOR_RANGE2_DATA_OFFS); > > if (server_respond(umad, IB_VENDOR_RANGE2_DATA_OFFS + size) < 0) > @@ -210,7 +210,7 @@ static char *ibsystat(ib_portid_t *portid, int attr) > { > ib_rpc_t rpc = { 0 }; > int fd, agent, timeout, len; > - void *data = umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS; > + void *data = (char *) umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS; Ditto. > > DEBUG("Sysstat ping.."); > > @@ -318,7 +318,7 @@ int main(int argc, char **argv) > const struct ibdiag_opt opts[] = { > { "oui", 'o', 1, NULL, "use specified OUI number" }, > { "Server", 'S', 0, NULL, "start in server mode" }, > - { } > + { 0 } > }; > char usage_args[] = " []"; > > -- > 1.6.1.2.319.gbd9e > > > From 1b9685769339891670df6d9af66e9933794be8a0 Mon Sep 17 00:00:00 2001 > From: Sean Hefty > Date: Thu, 26 Feb 2009 10:12:29 -0800 > Subject: [PATCH 2/2] [ib-diag] saquery: add support for WinOF > > Signed-off-by: Sean Hefty > --- > infiniband-diags/src/saquery.c | 80 ++++++++++++++++++++++------------------ > 1 files changed, 44 insertions(+), 36 deletions(-) > > diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c > index bcd1f61..4a5cfb8 100644 > --- a/infiniband-diags/src/saquery.c > +++ b/infiniband-diags/src/saquery.c > @@ -37,20 +37,25 @@ > * > */ > > +#if HAVE_CONFIG_H > +# include > +#endif /* HAVE_CONFIG_H */ > + > #include > #include > #include > #include > #include > #include > +#include > > #define _GNU_SOURCE > #include > > #include > #include > -#include > -#include > +#include > +#include > > #include "ibdiag_common.h" > > @@ -170,7 +175,7 @@ recv_mad: > if (ibdebug > 1) > xdump(stdout, "SA Response:\n", mad, len); > > - method = mad_get_field(mad, 0, IB_MAD_METHOD_F); > + method = (uint8_t) mad_get_field(mad, 0, IB_MAD_METHOD_F); > offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F); > result.status = mad_get_field(mad, 0, IB_MAD_STATUS_F); > result.p_result_madw = mad; > @@ -189,12 +194,12 @@ recv_mad: > static void 
*get_query_rec(void *mad, unsigned i) > { > int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F); > - return mad + IB_SA_DATA_OFFS + i * (offset << 3); > + return (char *) mad + IB_SA_DATA_OFFS + i * (offset << 3); Ditto. > } > > static unsigned valid_gid(ib_gid_t *gid) > { > - ib_gid_t zero_gid = { }; > + ib_gid_t zero_gid = { 0 }; > return memcmp(&zero_gid, gid, sizeof(*gid)); > } > > @@ -442,7 +447,7 @@ static void dump_multicast_member_record(void *data) > char gid_str2[INET6_ADDRSTRLEN]; > ib_member_rec_t *p_mcmr = data; > uint16_t mlid = cl_ntoh16(p_mcmr->mlid); > - int i = 0; > + unsigned i = 0; > char *node_name = ""; > > /* go through the node records searching for a port guid which matches > @@ -758,7 +763,7 @@ static void dump_one_mft_record(void *data) > > static void dump_results(struct query_res *r, void (*dump_func) (void *)) > { > - int i; > + unsigned i; > for (i = 0; i < r->result_cnt; i++) { > void *data = get_query_rec(r->p_result_madw, i); > dump_func(data); > @@ -768,7 +773,7 @@ static void dump_results(struct query_res *r, void (*dump_func) (void *)) > static void return_mad(void) > { > if (result.p_result_madw) { > - free(result.p_result_madw - umad_size()); > + free((char *) result.p_result_madw - umad_size()); Ditto. 
> result.p_result_madw = NULL; > } > } > @@ -839,7 +844,8 @@ get_lid_from_name(bind_handle_t h, const char *name, uint16_t* lid) > { > ib_node_record_t *node_record = NULL; > ib_node_info_t *p_ni = NULL; > - int i = 0, ret; > + unsigned i; > + int ret; > > ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0); > if (ret) > @@ -869,7 +875,7 @@ static uint16_t get_lid(bind_handle_t h, const char *name) > if (isalpha(name[0])) > assert(get_lid_from_name(h, name, &rc_lid) == IB_SUCCESS); > else > - rc_lid = atoi(name); > + rc_lid = (uint16_t) atoi(name); > if (rc_lid == 0) > fprintf(stderr, "Failed to find lid for \"%s\"\n", name); > return rc_lid; > @@ -917,8 +923,8 @@ static int parse_lid_and_ports(bind_handle_t h, > > #define cl_hton8(x) (x) > #define CHECK_AND_SET_VAL(val, size, comp_with, target, name, mask) \ > - if (val > comp_with) { \ > - target = cl_hton##size(val); \ > + if ((uint##size##_t) val > (uint##size##_t) comp_with) { \ > + target = cl_hton##size((uint##size##_t) val); \ > comp_mask |= IB_##name##_COMPMASK_##mask; \ > } > > @@ -951,7 +957,8 @@ static int get_issm_records(bind_handle_t h, ib_net32_t capability_mask) > > static int print_node_records(bind_handle_t h) > { > - int i = 0, ret; > + unsigned i; > + int ret; > > ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0); > if (ret) > @@ -1027,7 +1034,7 @@ static int query_path_records(const struct query_cmd *q, bind_handle_t h, > CHECK_AND_SET_VAL(p->dlid, 16, 0, pr.dlid, PR, DLID); > CHECK_AND_SET_VAL(p->hop_limit, 32, -1, pr.hop_flow_raw, PR, HOPLIMIT); > CHECK_AND_SET_VAL(p->flow_label, 8, 0, flow, PR, FLOWLABEL); > - pr.hop_flow_raw |= cl_hton32(flow << 8); > + pr.hop_flow_raw |= (uint8_t) cl_hton32(flow << 8); Why this casting is needed? This should be uint32_t to uint32_t assignment, no? 
> CHECK_AND_SET_VAL(p->tclass, 8, 0, pr.tclass, PR, TCLASS); > CHECK_AND_SET_VAL(p->reversible, 8, -1, reversible, PR, REVERSIBLE); > CHECK_AND_SET_VAL(p->numb_path, 8, -1, pr.num_path, PR, NUMBPATH); > @@ -1089,7 +1096,7 @@ static int print_multicast_member_records(bind_handle_t h) > > return_mc: > if (mc_group_result.p_result_madw) > - free(mc_group_result.p_result_madw - umad_size()); > + free((char *) mc_group_result.p_result_madw - umad_size()); void * to char * casting again. > > return ret; > } > @@ -1267,7 +1274,7 @@ static int query_pkey_tbl_records(const struct query_cmd *q, > memset(&pktr, 0, sizeof(pktr)); > CHECK_AND_SET_VAL(lid, 16, 0, pktr.lid, PKEY, LID); > CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT); > - CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK); > + CHECK_AND_SET_VAL(block, 16, -1, pktr.block_num, PKEY, BLOCK); This fix is unrelated to porting, right? The rest looks fine for me. Sasha > > return get_and_dump_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0, > comp_mask, &pktr, smkey, > @@ -1503,13 +1510,13 @@ static int process_opt(void *context, int ch, char *optarg) > query_type = IB_SA_ATTR_LINKRECORD; > break; > case 5: > - p->slid = strtoul(optarg, NULL, 0); > + p->slid = (uint16_t) strtoul(optarg, NULL, 0); > break; > case 6: > - p->dlid = strtoul(optarg, NULL, 0); > + p->dlid = (uint16_t) strtoul(optarg, NULL, 0); > break; > case 7: > - p->mlid = strtoul(optarg, NULL, 0); > + p->mlid = (uint16_t) strtoul(optarg, NULL, 0); > break; > case 14: > if (inet_pton(AF_INET6, optarg, &p->sgid) <= 0) > @@ -1534,7 +1541,7 @@ static int process_opt(void *context, int ch, char *optarg) > p->numb_path = strtoul(optarg, NULL, 0); > break; > case 18: > - p->pkey = strtoul(optarg, NULL, 0); > + p->pkey = (uint16_t) strtoul(optarg, NULL, 0); > break; > case 'Q': > p->qos_class = strtoul(optarg, NULL, 0); > @@ -1543,19 +1550,19 @@ static int process_opt(void *context, int ch, char *optarg) > p->sl = strtoul(optarg, NULL, 0); > 
break; > case 'M': > - p->mtu = strtoul(optarg, NULL, 0); > + p->mtu = (uint8_t) strtoul(optarg, NULL, 0); > break; > case 'R': > - p->rate = strtoul(optarg, NULL, 0); > + p->rate = (uint8_t) strtoul(optarg, NULL, 0); > break; > case 20: > - p->pkt_life = strtoul(optarg, NULL, 0); > + p->pkt_life = (uint8_t) strtoul(optarg, NULL, 0); > break; > case 'q': > p->qkey = strtoul(optarg, NULL, 0); > break; > case 'T': > - p->tclass = strtoul(optarg, NULL, 0); > + p->tclass = (uint8_t) strtoul(optarg, NULL, 0); > break; > case 'F': > p->flow_label = strtoul(optarg, NULL, 0); > @@ -1564,10 +1571,10 @@ static int process_opt(void *context, int ch, char *optarg) > p->hop_limit = strtoul(optarg, NULL, 0); > break; > case 21: > - p->scope = strtoul(optarg, NULL, 0); > + p->scope = (uint8_t) strtoul(optarg, NULL, 0); > break; > case 'J': > - p->join_state = strtoul(optarg, NULL, 0); > + p->join_state = (uint8_t) strtoul(optarg, NULL, 0); > break; > case 'X': > p->proxy_join = strtoul(optarg, NULL, 0); > @@ -1582,14 +1589,7 @@ int main(int argc, char **argv) > { > char usage_args[1024]; > bind_handle_t h; > - struct query_params params = { > - .hop_limit = -1, > - .reversible = -1, > - .numb_path = -1, > - .qos_class = -1, > - .sl = -1, > - .proxy_join = -1, > - }; > + struct query_params params; > const struct query_cmd *q; > ib_api_status_t status; > int n; > @@ -1643,9 +1643,17 @@ int main(int argc, char **argv) > { "scope", 21, 1, NULL, "Scope (MCMemberRecord)" }, > { "join_state", 'J', 1, NULL, "Join state (MCMemberRecord)" }, > { "proxy_join", 'X', 1, NULL, "Proxy join (MCMemberRecord)" }, > - {} > + { 0 } > }; > > + memset(¶ms, 0, sizeof params); > + params.hop_limit = -1; > + params.reversible = -1; > + params.numb_path = -1; > + params.qos_class = -1; > + params.sl = -1; > + params.proxy_join = -1; > + > n = sprintf(usage_args, "[query-name] [ | | ]\n" > "\nSupported query names (and aliases):\n"); > for (q = query_cmds; q->name; q++) { > @@ -1680,7 +1688,7 @@ int 
main(int argc, char **argv) > > if (argc) { > if (node_print_desc == NAME_OF_LID) { > - requested_lid = strtoul(argv[0], NULL, 0); > + requested_lid = (uint16_t) strtoul(argv[0], NULL, 0); > requested_lid_flag++; > } else if (node_print_desc == NAME_OF_GUID) { > requested_guid = strtoul(argv[0], NULL, 0); > -- > 1.6.1.2.319.gbd9e > From Jeffrey.C.Becker at nasa.gov Thu Feb 26 13:12:38 2009 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Thu, 26 Feb 2009 13:12:38 -0800 Subject: [ofa-general] Re: Sun is asking for a response on Bug No.s 1522, 1523 In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F41F1BFD66@orsmsx507.amr.corp.intel.com> References: <1235592409.22158.80.camel@pc.interlinx.bc.ca> <897D8E97-60A7-4FCB-BD6A-45228C3B4912@Sun.COM> <382A478CAD40FA4FB46605CF81FE39F41F1BFD66@orsmsx507.amr.corp.intel.com> Message-ID: <49A705C6.3090605@nasa.gov> Hi Woodruff, Robert J wrote: > Bill, since the bug was submitted as P3 and normal, > rather than P1 as critical or blocker, it was probably overlooked as > something > that had to be fixed in 1.4. I changed the bug to a blocker P1 bug for > OFED 1.4.1, > and thus should show up on the bug tracking list that gets reviewed in > the EWG. > > In addition to enterring the bug into bugzilla, people may also want > to send > an email to the maintaner for blocker bugs, so that they know it needs > quick > attention. It is asigned to Jeff Becker, the NFS/RDMA maintainer. > I know that there were some issues with backports for NFS/RDMA in 1.4 and > believe he is trying to get these fixed for OFED 1.4.1. I'm currently finishing up the SLES11 backports for OFED, and I'll work on this when I'm done. Thanks. 
-jeff > > woody > > > ------------------------------------------------------------------------ > *From:* Bill Boas [mailto:Bill.Boas at openfabrics.org] > *Sent:* Thursday, February 26, 2009 10:06 AM > *To:* general at lists.openfabrics.org; ewg at lists.openfabrics.org; > wwg at lists.openfabrics.org > *Cc:* 'Peter Jones'; 'Brian J. Murrell'; 'Makia Minich'; > Bryon.Neitzel at Sun.COM > *Subject:* Sun is asking for a response on Bug No.s 1522, 1523 > > Dear OFA Maintainers: > > > > Sun has contacted me about these bugs and is asking for priority > action to get fixes for them. See the thread below. > > > > I’m sending this email to 3 OFA lists because Sun having to contact > the Alliance this way raises many questions about how a vendor like > Sun or a customer like, say, a Wall St bank gets bugs fixed and where > do we, the OFA, publish this information? > > > > And if we agree that an OFA maintainer is the right person to fix the > bug how does that maintainer learn that and how do they respond? > > > > Thank you for responding to this email and to the Sun team using OFED > in their products. > > > > Bill. > > > > Bill Boas > > Executive Director and Vice Chair > > OpenFabrics Alliance > > 510-375-8840 > > Bill.Boas at openfabrics.org > > www.openfabrics.org > > > > ------------------------------------------------------------------------ > > *From:* Bryon.Neitzel at Sun.COM [mailto:Bryon.Neitzel at Sun.COM] > *Sent:* Wednesday, February 25, 2009 12:20 PM > *To:* Bill Boas > *Cc:* Peter Jones; Brian J. Murrell; Makia Minich > *Subject:* Fwd: OFED 1.4 for NHM chips > > > > Hi Bill, > > I found the bug numbers for the Lustre build issues against OFED 1.4.0 > that I mentioned yesterday. > > Is there any way to bump up the priority on these? This is blocking > our ability to deliver our new Vayu (Nehalem+QDR) hardware to our > customers, since Mellanox says they'll only support OFED 1.4 with QDR > hardware. 
> > > > Thanks, > > Bryon > > > > > > Begin forwarded message: > > > > *From: *"Brian J. Murrell" > > > *Date: *February 25, 2009 1:06:49 PM MST > > *To: *Bryon Neitzel > > > *Cc: *Peter Jones > > > *Subject: **Re: OFED 1.4 for NHM chips* > > > > On Wed, 2009-02-25 at 12:59 -0700, Bryon Neitzel wrote: > > Hi Brian, what is the website where these OFED bugs were filed? > > > https://bugs.openfabrics.org/show_bug.cgi?id=1522 > https://bugs.openfabrics.org/show_bug.cgi?id=1523 > > > Bill Boas tried to find any OFED bugs opened by Sun yesterday, > > > Yeah. I opened them a number of days ago. > > Not sure if it matters to anyone, but IMHO, 1523 is the correct future > direction and fixing that would implicitly fix 1522. 1523 basically > describes separating the "technology preview" that caused this breakage > out into its own independent module so that it does not pollute the > OFED core. As such, it might be resisted as it's a "deeper cut" type of > fix, but one that removes the sick part rather than trying to continue > to bandage it. > > b. > > > From sashak at voltaire.com Thu Feb 26 13:25:38 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 23:25:38 +0200 Subject: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: References: <20090218153016.GD8489@comcast.net> <20090226070629.GU11192@sashak.voltaire.com> Message-ID: <20090226212538.GL14238@sashak.voltaire.com> On 07:03 Thu 26 Feb , Hal Rosenstock wrote: > >> + r = IB_INSUFFICIENT_MEMORY; > >> + OSM_LOG(p_vend->p_log, > >> + OSM_LOG_ERROR, > >> + "ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n", > >> +
j, > >> + ca.ports[j]->pkeys_size); > > > > Also should it be an error? Maybe it is enough just to fill the requested > > pkey entries? > > I agree that being more forgiving is better but then how would it be > known if the pkeys are being truncated? You could return the real pkeys_size value with the table filled up to the provided size. Otherwise (in case of just an error) how could a user know which pkey size to provide? Sasha From sashak at voltaire.com Thu Feb 26 13:32:07 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Feb 2009 23:32:07 +0200 Subject: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: References: <20090218153016.GD8489@comcast.net> <20090226071059.GV11192@sashak.voltaire.com> Message-ID: <20090226213200.GM14238@sashak.voltaire.com> On 07:03 Thu 26 Feb , Hal Rosenstock wrote: > On Thu, Feb 26, 2009 at 2:10 AM, Sasha Khapyorsky wrote: > > On 10:30 Wed 18 Feb , Hal Rosenstock wrote: > >> > >> Only supported in osm_vendor_ibumad.c (separate patch for other > >> vendor layers) > >> Also, update applications using this (osmtest, opensm) > > > > It looks like ibutils (ibis) requires the same fix (attr_array > > initialization) too. > > Yes, I'm aware but didn't want to send those until these were accepted. attr_array initialization doesn't hurt by itself, so to avoid a broken intermediate version it would be better to apply it before the actual change.
Sasha From jgunthorpe at obsidianresearch.com Thu Feb 26 13:30:33 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 26 Feb 2009 14:30:33 -0700 Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support for WinOF In-Reply-To: <20090226210211.GK14238@sashak.voltaire.com> References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> Message-ID: <20090226213033.GG5127@obsidianresearch.com> On Thu, Feb 26, 2009 at 11:02:19PM +0200, Sasha Khapyorsky wrote: > > - size = mk_reply(attr, mad + IB_VENDOR_RANGE2_DATA_OFFS, > > + size = mk_reply(attr, (char *) mad + IB_VENDOR_RANGE2_DATA_OFFS, > > What is the reason for such void * to char * casting? Math on void* pointers is a gcc extension, I'm surprised you don't get warnings on linux - it is worth figuring out how to turn those on.. Sean: For this purpose casting to (char *) is somewhat sketchy, it should be (uint8_t *).. char should only ever be used for strings due to possible troubles with environments using 16 bit chars for wide character support. Jason From sean.hefty at intel.com Thu Feb 26 13:39:45 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Feb 2009 13:39:45 -0800 Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support for WinOF In-Reply-To: <20090226213033.GG5127@obsidianresearch.com> References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> <20090226213033.GG5127@obsidianresearch.com> Message-ID: >Sean: For this purpose casting to (char *) is somewhat sketchy, it >should be (uint8_t *).. char should only ever be used for strings due >to possible troubles with environments using 16 bit chars for wide >character support. I'm not aware of any environments that define char as anything other than a byte, but I can change this. 
From hal.rosenstock at gmail.com Thu Feb 26 13:43:08 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 26 Feb 2009 16:43:08 -0500 Subject: Re: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: <20090226212538.GL14238@sashak.voltaire.com> References: <20090218153016.GD8489@comcast.net> <20090226070629.GU11192@sashak.voltaire.com> <20090226212538.GL14238@sashak.voltaire.com> Message-ID: On Thu, Feb 26, 2009 at 4:25 PM, Sasha Khapyorsky wrote: > On 07:03 Thu 26 Feb , Hal Rosenstock wrote: >> >> + r = IB_INSUFFICIENT_MEMORY; >> >> + OSM_LOG(p_vend->p_log, >> >> + OSM_LOG_ERROR, >> >> + "ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n", >> >> + j, >> >> + ca.ports[j]->pkeys_size); >> > >> > Also should it be an error? Maybe it is enough just to fill the requested >> > pkey entries? >> >> I agree that being more forgiving is better but then how would it be >> known if the pkeys are being truncated? > > You could return the real pkeys_size value with the table filled up to the > provided size. > > Otherwise (in case of just an error) how could a user know which pkey > size to provide? The problem with that is that the user needs to remember how many he asked for originally. Not hard but just a detail that I expect will get lost.
-- Hal > Sasha > From sean.hefty at intel.com Thu Feb 26 13:45:36 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Feb 2009 13:45:36 -0800 Subject: [ofa-general] RE: [PATCH 2/6] [ib-diag] ibroute: add support for WinOF In-Reply-To: <20090226210211.GK14238@sashak.voltaire.com> References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> Message-ID: <9E758F38400F48348241220A53295572@amr.corp.intel.com> >> @@ -1027,7 +1034,7 @@ static int query_path_records(const struct query_cmd >*q, bind_handle_t h, >> CHECK_AND_SET_VAL(p->dlid, 16, 0, pr.dlid, PR, DLID); >> CHECK_AND_SET_VAL(p->hop_limit, 32, -1, pr.hop_flow_raw, PR, HOPLIMIT); >> CHECK_AND_SET_VAL(p->flow_label, 8, 0, flow, PR, FLOWLABEL); >> - pr.hop_flow_raw |= cl_hton32(flow << 8); >> + pr.hop_flow_raw |= (uint8_t) cl_hton32(flow << 8); > >Why this casting is needed? This should be uint32_t to uint32_t >assignment, no? Hmm... the cast shouldn't be needed. >> @@ -1267,7 +1274,7 @@ static int query_pkey_tbl_records(const struct >query_cmd *q, >> memset(&pktr, 0, sizeof(pktr)); >> CHECK_AND_SET_VAL(lid, 16, 0, pktr.lid, PKEY, LID); >> CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT); >> - CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK); >> + CHECK_AND_SET_VAL(block, 16, -1, pktr.block_num, PKEY, BLOCK); > >This fix is unrelated to porting, right? Somewhat - this is a real fix, but without it, there's a build error assigning a uint16 to an 8-bit port_num. I'll remove the cast above and change the (char *) casts to (uint8_t *) casts instead. 
- Sean From jgunthorpe at obsidianresearch.com Thu Feb 26 14:00:13 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 26 Feb 2009 15:00:13 -0700 Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support for WinOF In-Reply-To: References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> <20090226213033.GG5127@obsidianresearch.com> Message-ID: <20090226220013.GA16941@obsidianresearch.com> On Thu, Feb 26, 2009 at 01:39:45PM -0800, Sean Hefty wrote: > >Sean: For this purpose casting to (char *) is somewhat sketchy, it > >should be (uint8_t *).. char should only ever be used for strings due > >to possible troubles with environments using 16 bit chars for wide > >character support. > > I'm not aware of any environments that define char as anything other than a > byte, but I can change this. There are some screwy embedded compilers that do this, not the target platform for OFA, but if you are improving portability, may as well do it right, once and for all... It is good portability practice in general to never use char for non-string objects because the signedness and width is undefined by the language, and at least signedness varies by CPU and environment in the real world. 
This is why C99 introduced fixed width types and types like uint8_t and uintptr_t, because the actual language provides no other guaranteed type to use :( Jason From ralph.campbell at qlogic.com Thu Feb 26 14:39:28 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 26 Feb 2009 14:39:28 -0800 Subject: [ofa-general] [PATCH v2] IB/core: fix null pointer dereference in local_completions() Message-ID: <1235687968.3948.218.camel@chromite.mv.qlogic.com> IB/core: fix null pointer dereference in local_completions() handle_outgoing_dr_smp() can queue a struct ib_mad_local_private *local on the mad_agent_priv->local_work work queue with local->mad_priv == NULL if device->process_mad() returns IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY and (!ib_response_mad(&mad_priv->mad.mad) || !mad_agent_priv->agent.recv_handler). In this case, local_completions() will be called with local->mad_priv == NULL. The code does check for this case and skips calling recv_mad_agent->agent.recv_handler() but recv == 0 so kmem_cache_free() is called with a NULL pointer. Also, since recv isn't reinitialized each time through the loop, it can cause a memory leak if recv should have been zero. 
Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 5c54fc2..735ad4e 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2356,7 +2356,7 @@ static void local_completions(struct work_struct *work) struct ib_mad_local_private *local; struct ib_mad_agent_private *recv_mad_agent; unsigned long flags; - int recv = 0; + int free_mad; struct ib_wc wc; struct ib_mad_send_wc mad_send_wc; @@ -2370,14 +2370,15 @@ static void local_completions(struct work_struct *work) completion_list); list_del(&local->completion_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + free_mad = 0; if (local->mad_priv) { recv_mad_agent = local->recv_mad_agent; if (!recv_mad_agent) { printk(KERN_ERR PFX "No receive MAD agent for local completion\n"); + free_mad = 1; goto local_send_completion; } - recv = 1; /* * Defined behavior is to complete response * before request @@ -2422,7 +2423,7 @@ local_send_completion: spin_lock_irqsave(&mad_agent_priv->lock, flags); atomic_dec(&mad_agent_priv->refcount); - if (!recv) + if (free_mad) kmem_cache_free(ib_mad_cache, local->mad_priv); kfree(local); } From sean.hefty at intel.com Thu Feb 26 14:41:06 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Feb 2009 14:41:06 -0800 Subject: [ofa-general] [PATCH 1/2] [ib-diag] ibsysstat: add support for WinOF In-Reply-To: <20090226210211.GK14238@sashak.voltaire.com> References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> Message-ID: <45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com> Signed-off-by: Sean Hefty --- changes from v1: change (char *) casts to (uint8_t *) infiniband-diags/src/ibsysstat.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c index cc1418d..da86d8e 100644 --- a/infiniband-diags/src/ibsysstat.c +++ 
b/infiniband-diags/src/ibsysstat.c @@ -183,7 +183,7 @@ static char *ibsystat_serv(void) DEBUG("got packet: attr 0x%x mod 0x%x", attr, mod); - size = mk_reply(attr, mad + IB_VENDOR_RANGE2_DATA_OFFS, + size = mk_reply(attr, (uint8_t *) mad + IB_VENDOR_RANGE2_DATA_OFFS, sizeof(buf) - umad_size() - IB_VENDOR_RANGE2_DATA_OFFS); if (server_respond(umad, IB_VENDOR_RANGE2_DATA_OFFS + size) < 0) @@ -210,7 +210,7 @@ static char *ibsystat(ib_portid_t *portid, int attr) { ib_rpc_t rpc = { 0 }; int fd, agent, timeout, len; - void *data = umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS; + void *data = (uint8_t *) umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS; DEBUG("Sysstat ping.."); @@ -318,7 +318,7 @@ int main(int argc, char **argv) const struct ibdiag_opt opts[] = { { "oui", 'o', 1, NULL, "use specified OUI number" }, { "Server", 'S', 0, NULL, "start in server mode" }, - { } + { 0 } }; char usage_args[] = " []"; From sean.hefty at intel.com Thu Feb 26 14:44:26 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Feb 2009 14:44:26 -0800 Subject: [ofa-general] [PATCH 2/2] [ib-diags] saquery: set correct pkey table field In-Reply-To: <45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com> References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> <45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com> Message-ID: <1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com> port_num is incorrectly set instead of block_num Signed-off-by: Sean Hefty --- I will resubmit the changes for saquery to support winof. I must have done something wrong with my testing on that patch on linux, since I'm seeing build warnings now. 
infiniband-diags/src/saquery.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index bcd1f61..3f508b9 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -1267,7 +1267,7 @@ static int query_pkey_tbl_records(const struct query_cmd *q, memset(&pktr, 0, sizeof(pktr)); CHECK_AND_SET_VAL(lid, 16, 0, pktr.lid, PKEY, LID); CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT); - CHECK_AND_SET_VAL(block, 16, -1, pktr.port_num, PKEY, BLOCK); + CHECK_AND_SET_VAL(block, 16, -1, pktr.block_num, PKEY, BLOCK); return get_and_dump_any_records(h, IB_SA_ATTR_PKEYTABLERECORD, 0, comp_mask, &pktr, smkey, From sean.hefty at intel.com Thu Feb 26 15:10:04 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 26 Feb 2009 15:10:04 -0800 Subject: [ofa-general] [PATCH v2] [ib-diag] saquery: add support for WinOF In-Reply-To: <1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com> References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> <45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com> <1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com> Message-ID: Signed-off-by: Sean Hefty --- Ok - that was quicker than I thought it would be... this patch depends on saquery: set correct pkey table field. changes from v1: - use (uint8_t *) casts over (char *) casts - change initialization of zero_gid to use memset - modify CHECK_AND_SET_VAL - comparison is done as signed, but assignments are unsigned. This is kind of confusing, but that's how it appears the macro is used. It might be clearer if instead of passing -1 into the macro, that a SET_VAL macro be used instead. 
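Sean's note says that CHECK_AND_SET_VAL compares as signed but assigns as unsigned. The following is a minimal standalone sketch of why the signed cast matters for an 8-bit field whose "not set" sentinel is -1, stored as 255. SET_IF_ABOVE, set_sl() and naive_is_set() are hypothetical names and are not saquery code. The byte-order conversion that the real macro performs with cl_hton##size() is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for saquery's CHECK_AND_SET_VAL: compare at the
 * field's width as signed values, then assign as unsigned. The real macro
 * also converts to network byte order (cl_hton##size); omitted here. */
#define SET_IF_ABOVE(val, size, comp_with, target, mask, bit)              \
	do {                                                               \
		if ((int##size##_t) (val) > (int##size##_t) (comp_with)) { \
			(target) = (uint##size##_t) (val);                 \
			(mask) |= (bit);                                   \
		}                                                          \
	} while (0)

/* Returns the component-mask bit set for an 8-bit field such as an SL. */
static unsigned set_sl(uint8_t sl, uint8_t *out)
{
	unsigned mask = 0;

	SET_IF_ABOVE(sl, 8, -1, *out, mask, 0x1u);
	return mask;
}

/* Without the casts, a uint8_t holding 255 is promoted to int and
 * compares greater than -1, so the "unset" sentinel leaks through. */
static int naive_is_set(uint8_t sl)
{
	return sl > -1;
}

static void demo_sentinel(void)
{
	uint8_t sl_field = 0;

	/* On two's-complement targets (int8_t)255 == -1, so it is skipped. */
	assert(set_sl((uint8_t) -1, &sl_field) == 0 && sl_field == 0);
	assert(set_sl(3, &sl_field) == 0x1u && sl_field == 3);
	assert(naive_is_set((uint8_t) -1)); /* the naive check is fooled */
}
```

Wrapping the body in do { } while (0) is a choice made in this sketch so that the macro behaves as a single statement after an unbraced if. The original macro is a bare if block.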
infiniband-diags/src/saquery.c | 77 ++++++++++++++++++++++------------------ 1 files changed, 43 insertions(+), 34 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 3f508b9..90ad512 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -37,20 +37,25 @@ * */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + #include #include #include #include #include #include +#include #define _GNU_SOURCE #include #include #include -#include -#include +#include +#include #include "ibdiag_common.h" @@ -170,7 +175,7 @@ recv_mad: if (ibdebug > 1) xdump(stdout, "SA Response:\n", mad, len); - method = mad_get_field(mad, 0, IB_MAD_METHOD_F); + method = (uint8_t) mad_get_field(mad, 0, IB_MAD_METHOD_F); offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F); result.status = mad_get_field(mad, 0, IB_MAD_STATUS_F); result.p_result_madw = mad; @@ -189,12 +194,13 @@ recv_mad: static void *get_query_rec(void *mad, unsigned i) { int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F); - return mad + IB_SA_DATA_OFFS + i * (offset << 3); + return (uint8_t *) mad + IB_SA_DATA_OFFS + i * (offset << 3); } static unsigned valid_gid(ib_gid_t *gid) { - ib_gid_t zero_gid = { }; + ib_gid_t zero_gid; + memset(&zero_gid, 0, sizeof zero_gid); return memcmp(&zero_gid, gid, sizeof(*gid)); } @@ -442,7 +448,7 @@ static void dump_multicast_member_record(void *data) char gid_str2[INET6_ADDRSTRLEN]; ib_member_rec_t *p_mcmr = data; uint16_t mlid = cl_ntoh16(p_mcmr->mlid); - int i = 0; + unsigned i = 0; char *node_name = ""; /* go through the node records searching for a port guid which matches @@ -758,7 +764,7 @@ static void dump_one_mft_record(void *data) static void dump_results(struct query_res *r, void (*dump_func) (void *)) { - int i; + unsigned i; for (i = 0; i < r->result_cnt; i++) { void *data = get_query_rec(r->p_result_madw, i); dump_func(data); @@ -768,7 +774,7 @@ static void dump_results(struct query_res *r, void (*dump_func) 
(void *)) static void return_mad(void) { if (result.p_result_madw) { - free(result.p_result_madw - umad_size()); + free((uint8_t *) result.p_result_madw - umad_size()); result.p_result_madw = NULL; } } @@ -839,7 +845,8 @@ get_lid_from_name(bind_handle_t h, const char *name, uint16_t* lid) { ib_node_record_t *node_record = NULL; ib_node_info_t *p_ni = NULL; - int i = 0, ret; + unsigned i; + int ret; ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0); if (ret) @@ -869,7 +876,7 @@ static uint16_t get_lid(bind_handle_t h, const char *name) if (isalpha(name[0])) assert(get_lid_from_name(h, name, &rc_lid) == IB_SUCCESS); else - rc_lid = atoi(name); + rc_lid = (uint16_t) atoi(name); if (rc_lid == 0) fprintf(stderr, "Failed to find lid for \"%s\"\n", name); return rc_lid; @@ -917,8 +924,8 @@ static int parse_lid_and_ports(bind_handle_t h, #define cl_hton8(x) (x) #define CHECK_AND_SET_VAL(val, size, comp_with, target, name, mask) \ - if (val > comp_with) { \ - target = cl_hton##size(val); \ + if ((int##size##_t) val > (int##size##_t) comp_with) { \ + target = cl_hton##size((uint##size##_t) val); \ comp_mask |= IB_##name##_COMPMASK_##mask; \ } @@ -951,7 +958,8 @@ static int get_issm_records(bind_handle_t h, ib_net32_t capability_mask) static int print_node_records(bind_handle_t h) { - int i = 0, ret; + unsigned i; + int ret; ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0); if (ret) @@ -1089,7 +1097,7 @@ static int print_multicast_member_records(bind_handle_t h) return_mc: if (mc_group_result.p_result_madw) - free(mc_group_result.p_result_madw - umad_size()); + free((uint8_t *) mc_group_result.p_result_madw - umad_size()); return ret; } @@ -1503,13 +1511,13 @@ static int process_opt(void *context, int ch, char *optarg) query_type = IB_SA_ATTR_LINKRECORD; break; case 5: - p->slid = strtoul(optarg, NULL, 0); + p->slid = (uint16_t) strtoul(optarg, NULL, 0); break; case 6: - p->dlid = strtoul(optarg, NULL, 0); + p->dlid = (uint16_t) strtoul(optarg, NULL, 0); break; case 7: - 
p->mlid = strtoul(optarg, NULL, 0); + p->mlid = (uint16_t) strtoul(optarg, NULL, 0); break; case 14: if (inet_pton(AF_INET6, optarg, &p->sgid) <= 0) @@ -1534,7 +1542,7 @@ static int process_opt(void *context, int ch, char *optarg) p->numb_path = strtoul(optarg, NULL, 0); break; case 18: - p->pkey = strtoul(optarg, NULL, 0); + p->pkey = (uint16_t) strtoul(optarg, NULL, 0); break; case 'Q': p->qos_class = strtoul(optarg, NULL, 0); @@ -1543,19 +1551,19 @@ static int process_opt(void *context, int ch, char *optarg) p->sl = strtoul(optarg, NULL, 0); break; case 'M': - p->mtu = strtoul(optarg, NULL, 0); + p->mtu = (uint8_t) strtoul(optarg, NULL, 0); break; case 'R': - p->rate = strtoul(optarg, NULL, 0); + p->rate = (uint8_t) strtoul(optarg, NULL, 0); break; case 20: - p->pkt_life = strtoul(optarg, NULL, 0); + p->pkt_life = (uint8_t) strtoul(optarg, NULL, 0); break; case 'q': p->qkey = strtoul(optarg, NULL, 0); break; case 'T': - p->tclass = strtoul(optarg, NULL, 0); + p->tclass = (uint8_t) strtoul(optarg, NULL, 0); break; case 'F': p->flow_label = strtoul(optarg, NULL, 0); @@ -1564,10 +1572,10 @@ static int process_opt(void *context, int ch, char *optarg) p->hop_limit = strtoul(optarg, NULL, 0); break; case 21: - p->scope = strtoul(optarg, NULL, 0); + p->scope = (uint8_t) strtoul(optarg, NULL, 0); break; case 'J': - p->join_state = strtoul(optarg, NULL, 0); + p->join_state = (uint8_t) strtoul(optarg, NULL, 0); break; case 'X': p->proxy_join = strtoul(optarg, NULL, 0); @@ -1582,14 +1590,7 @@ int main(int argc, char **argv) { char usage_args[1024]; bind_handle_t h; - struct query_params params = { - .hop_limit = -1, - .reversible = -1, - .numb_path = -1, - .qos_class = -1, - .sl = -1, - .proxy_join = -1, - }; + struct query_params params; const struct query_cmd *q; ib_api_status_t status; int n; @@ -1643,9 +1644,17 @@ int main(int argc, char **argv) { "scope", 21, 1, NULL, "Scope (MCMemberRecord)" }, { "join_state", 'J', 1, NULL, "Join state (MCMemberRecord)" }, { 
"proxy_join", 'X', 1, NULL, "Proxy join (MCMemberRecord)" }, - {} + { 0 } }; + memset(&params, 0, sizeof params); + params.hop_limit = -1; + params.reversible = -1; + params.numb_path = -1; + params.qos_class = -1; + params.sl = -1; + params.proxy_join = -1; + n = sprintf(usage_args, "[query-name] [ | | ]\n" "\nSupported query names (and aliases):\n"); for (q = query_cmds; q->name; q++) { @@ -1680,7 +1689,7 @@ int main(int argc, char **argv) if (argc) { if (node_print_desc == NAME_OF_LID) { - requested_lid = strtoul(argv[0], NULL, 0); + requested_lid = (uint16_t) strtoul(argv[0], NULL, 0); requested_lid_flag++; } else if (node_print_desc == NAME_OF_GUID) { requested_guid = strtoul(argv[0], NULL, 0); From cameron at harr.org Thu Feb 26 16:18:44 2009 From: cameron at harr.org (Cameron Harr) Date: Thu, 26 Feb 2009 17:18:44 -0700 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49A6F25F.8060306@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net> <49A42BE9.4030603@harr.org> <49A43439.7080405@vlnb.net> <49A4812A.8050202@harr.org> <49A57256.2000005@harr.org> <49A6CF2B.4010002@harr.org>
<49A6F25F.8060306@vlnb.net> Message-ID: <49A73164.3010109@harr.org> Vladislav Bolkhovitin wrote: > Cameron Harr, on 02/26/2009 08:19 PM wrote: >> Cameron Harr wrote: >>> Cameron Harr wrote: >>> I re-compiled and re-ran the tests and numbers are a little better >>> but performance still seems to have gone down from 673: >>> Test 1:373751.66 >>> Test 2:371242.6067 >>> Test 3:347988.1467 >>> Test 4:378247.31 >>> Test 5:375616.53 >> I was curious and did a regression test with 673 and those numbers >> are now even worse, so I'll presume there is an issue on my system >> and not the SCST code: >> Test 1:365204.3067 >> Test 2:364152.2067 >> Test 3:340665.7633 >> Test 4:369916.8133 >> Test 5:369093.5833 > > It's known that any OS, including Linux, is getting "tired" under load > with time from boot, which leads to worse performance. I guess, you > can experience such effect. > > Check with r634. R635 has cache locality in data structures related > change, which intended to improve performance a bit, but might make it > worse instead. > This is with 634.
It's pretty bad: 338316.44 329698.04 307972.7133 345682.4733 344165.08 From klakshman03 at hotmail.com Thu Feb 26 20:26:38 2009 From: klakshman03 at hotmail.com (lakshmana swamy) Date: Fri, 27 Feb 2009 09:56:38 +0530 Subject: [ofa-general] ***SPAM*** RE: Problem in IB network without Switch In-Reply-To: <200902261849.40448.jackm@dev.mellanox.co.il> References: <829ded920902260031r6f8b973t9f2e536864e25c85@mail.gmail.com> <200902261335.59927.jackm@dev.mellanox.co.il> <200902261849.40448.jackm@dev.mellanox.co.il> Message-ID: ThanQ Jack I will Update the firmware and let you know the status laxman > From: jackm at dev.mellanox.co.il > To: klakshman03 at hotmail.com > Subject: Re: Problem in IB network without Switch > Date: Thu, 26 Feb 2009 18:49:40 +0200 > CC: keshetti.mahesh at gmail.com; general at lists.openfabrics.org > > You are running VERY old firmware (from 2004), and moreover, on one host > you have 3.0.0, and on the other 3.1.0. > > You need to upgrade your firmware. > Contact your Mellanox FAE (support engineer) for instructions. > > - Jack > > > Hi Jack, > > > > Please find the output of ibstat on both the nodes, . > > > > [root at mattool ~]# /opt/ofed/extras/hca_self_test.ofed > > HCA Firmware Check ..................... FAIL > > REASON: mismatch HCA #0 firmware detected (found v, need v3.5.917) > > Host Driver Initialization ............. PASS > > > > [root at mattool ~]# > > > > ************ IBSTAT output ****************** > > > > > > [root at mattool ~]# ibstat > > CA 'mthca0' > > CA type: MT23108 > > Number of ports: 2 > > Firmware version: 3.1.0 > > > [root at compute-0-0 ~]# ibstat > > CA 'mthca0' > > CA type: MT23108 > > Number of ports: 2 > > Firmware version: 3.0.0 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From Jie.Cai at cs.anu.edu.au Thu Feb 26 22:29:47 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Fri, 27 Feb 2009 17:29:47 +1100 Subject: [ofa-general] Bandwidth of performance with multirail IB In-Reply-To: <200902240941.58634.cap@nsc.liu.se> References: <20090223211155.730AFE28137@openfabrics.org> <49A378BC.5010806@cs.anu.edu.au> <200902240941.58634.cap@nsc.liu.se> Message-ID: <49A7885B.3010005@cs.anu.edu.au> Hi Peter, A question on implementation multi-rail with uDAPL connections. What I did is open 2 IAs (corresponding to the 2 ports on HCAs) on each node. Then create one EP for each IA, and connect those EPs to the corresponding EP at other node. Then data been transferred via both EP-connections. I have been notice that there's a MULTIPATH connection flag for dapl, but I did not use it. What's the use of it? Cheers, Jie -- Mr. Jie Cai Peter Kjellstrom wrote: > On Tuesday 24 February 2009, Jie Cai wrote: > >> I have implemented a uDAPL program to measure the bandwidth on IB with >> multirail connections. >> >> The HCA used in the cluster is Mellanox ConnectX HCA. Each HCA has two >> ports. >> >> The program utilize the two port on each node of cluster to build >> multirail IB connections. >> >> The peak bandwidth I can get is ~ 1.3 GB/s (not bi-directional), which >> is almost the same as single rail connections. >> > > Assuming you have a 2.5 GT/s pci-express x8 that speed is a result of the bus > not being able to keep up with the HCA. Since the bus is holding even a > single DDR IB port back you see no improvement with two ports. > > To fully drive a DDR IB port you need either 16x pci-express 2.5 GT/s or a 8x > 5 GT/s. For one QDR or two DDR you'll need even more... > > /Peter > > >> Does anyone have similar experience? >> From Jie.Cai at cs.anu.edu.au Thu Feb 26 22:49:09 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Fri, 27 Feb 2009 17:49:09 +1100 Subject: [ofa-general] configuration question: how to support multiple IB interfaces? 
Message-ID: <49A78CE5.4000506@cs.anu.edu.au> I have connectX dual port HCAs installed in my system (support pci-e 2.0), and each port shows as an individual interface on ifconfig messages (ib0 and ib1). Communications using OpenMPI and uDAPL via IB connection are fine. However, I have a question on how to utilize the dual ports? Or do I need to specific configure the system to drive dual ports? (I have set net.ipv4.conf.all.arp_ignore et al.) Can't see a better bandwidth on either Open MPI or uDAPL. Does anyone got experience with this? -- Mr. Jie Cai From davem at davemloft.net Fri Feb 27 00:01:50 2009 From: davem at davemloft.net (David Miller) Date: Fri, 27 Feb 2009 00:01:50 -0800 (PST) Subject: [ofa-general] Re: [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <20090224.232814.227017310.davem@davemloft.net> Message-ID: <20090227.000150.177646700.davem@davemloft.net> From: Andrew Grover Date: Wed, 25 Feb 2009 10:43:27 -0800 > On Tue, Feb 24, 2009 at 11:28 PM, David Miller wrote: > > Furthermore the port you've choosen for the protocol is arbitrary, not > > properly allocated with the appropriate standards committee, and > > therefore could conflict with something other people are using. > > I'm sure allocating the port won't be too big an issue. 
Ok, I added the RDS code to the net-next-2.6 tree, changing AF_RDS to be 21 From sashak at voltaire.com Fri Feb 27 00:32:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Feb 2009 10:32:00 +0200 Subject: [ofa-general] Re: [PATCH v2] [ib-diag] saquery: add support for WinOF In-Reply-To: References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> <45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com> <1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com> Message-ID: <20090227083200.GA7462@sashak.voltaire.com> On 15:10 Thu 26 Feb , Sean Hefty wrote: > Signed-off-by: Sean Hefty All applied. Thanks. > - modify CHECK_AND_SET_VAL - comparison is done as signed, but assignments > are unsigned. This is kind of confusing, but that's how it appears the > macro is used. It might be clearer if instead of passing -1 into the > macro, that a SET_VAL macro be used instead. What do you mean? Another macro? Sasha From sashak at voltaire.com Fri Feb 27 00:36:02 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Feb 2009 10:36:02 +0200 Subject: [ofa-general] Re: [PATCH 2/6] [ib-diag] ibroute: add support for WinOF In-Reply-To: <20090226213033.GG5127@obsidianresearch.com> References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> <20090226213033.GG5127@obsidianresearch.com> Message-ID: <20090227083602.GB7462@sashak.voltaire.com> On 14:30 Thu 26 Feb , Jason Gunthorpe wrote: > > Math on void* pointers is a gcc extension, Indeed. (I forgot about this a long time ago :)). > I'm surprised you don't get > warnings on linux - it is worth figuring out how to turn those on.. Gcc warns when '-pedantic' is used. 
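Arithmetic on void * relies on a GCC extension that treats sizeof(void) as 1. GCC reports it under -pedantic and also under -Wpointer-arith. The following is a minimal sketch of the portable pattern that the WinOF patches adopt: cast to uint8_t * before adding or subtracting a byte offset. payload_ptr(), buffer_start() and EXAMPLE_DATA_OFFS are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical payload offset inside a buffer, for illustration only. */
#define EXAMPLE_DATA_OFFS 32

/* Portable: do the byte arithmetic on uint8_t *, then hand back void *.
 * Writing `buf + off` with buf declared void * is a GCC extension. */
static void *payload_ptr(void *buf, size_t off)
{
	return (uint8_t *) buf + off;
}

/* The reverse direction, as in return_mad(): step back from a payload
 * pointer to the start of the enclosing allocation. */
static void *buffer_start(void *payload, size_t off)
{
	return (uint8_t *) payload - off;
}

static void demo_payload(void)
{
	uint8_t buf[64];

	memset(buf, 0, sizeof(buf));
	uint8_t *p = payload_ptr(buf, EXAMPLE_DATA_OFFS);
	p[0] = 0xAB;
	assert(buf[EXAMPLE_DATA_OFFS] == 0xAB);
	assert(buffer_start(p, EXAMPLE_DATA_OFFS) == (void *) buf);
}
```

Either uint8_t * or char * works for the arithmetic. The v2 patches chose uint8_t * over char *.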
Sasha From sashak at voltaire.com Fri Feb 27 01:08:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Feb 2009 11:08:45 +0200 Subject: [ofa-general] Re: [PATCH] ibsim/umad2sim.c: Eliminate unneeded umad2sim_dev num In-Reply-To: <20090219174413.GA29805@comcast.net> References: <20090219174413.GA29805@comcast.net> Message-ID: <20090227090838.GC7462@sashak.voltaire.com> Hi Hal, On 12:44 Thu 19 Feb , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock > --- > diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c > index e13e30a..aaa6260 100644 > --- a/umad2sim/umad2sim.c > +++ b/umad2sim/umad2sim.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This file is part of ibsim. > * > @@ -77,7 +78,6 @@ struct ib_user_mad_reg_req { > > struct umad2sim_dev { > int fd; > - unsigned num; Wouldn't it be useful when more than one CA/host ports will be supported using umad2sim? 
Sasha > char name[32]; > uint8_t port; > struct sim_client sim_client; > @@ -351,15 +351,13 @@ static int dev_sysfs_create(struct umad2sim_dev *dev) > *str = '\0'; > > /* /sys/class/infiniband_mad/umad0/ */ > - snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir, > - dev->num); > + snprintf(path, sizeof(path), "%s/umad%u", sysfs_infiniband_mad_dir, 0); > make_path(path); > file_printf(path, SYS_IB_MAD_DEV, "%s\n", dev->name); > file_printf(path, SYS_IB_MAD_PORT, "%d\n", dev->port); > > /* /sys/class/infiniband_mad/issm0/ */ > - snprintf(path, sizeof(path), "%s/issm%u", sysfs_infiniband_mad_dir, > - dev->num); > + snprintf(path, sizeof(path), "%s/issm%u", sysfs_infiniband_mad_dir, 0); > make_path(path); > file_printf(path, SYS_IB_MAD_DEV, "%s\n", dev->name); > file_printf(path, SYS_IB_MAD_PORT, "%d\n", dev->port); > @@ -546,7 +544,7 @@ static int umad2sim_ioctl(struct umad2sim_dev *dev, unsigned long request, > return -1; > } > > -static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) > +static struct umad2sim_dev *umad2sim_dev_create(const char *name) > { > struct umad2sim_dev *dev; > unsigned i; > @@ -558,7 +556,6 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) > return NULL; > memset(dev, 0, sizeof(*dev)); > > - dev->num = num; > strncpy(dev->name, name, sizeof(dev->name) - 1); > > if (sim_client_init(&dev->sim_client) < 0) > @@ -574,9 +571,9 @@ static struct umad2sim_dev *umad2sim_dev_create(unsigned num, const char *name) > dev_sysfs_create(dev); > > snprintf(dev->umad_path, sizeof(dev->umad_path), "%s/%s%u", > - umad_dev_dir, "umad", num); > + umad_dev_dir, "umad", 0); > snprintf(dev->issm_path, sizeof(dev->issm_path), "%s/%s%u", > - umad_dev_dir, "issm", num); > + umad_dev_dir, "issm", 0); > > return dev; > > @@ -646,7 +643,7 @@ static void umad2sim_init(void) > DEBUG("umad2sim_init...\n"); > snprintf(umad2sim_sysfs_prefix, sizeof(umad2sim_sysfs_prefix), > "./sys-%d", getpid()); > - 
devices[0] = umad2sim_dev_create(0, "ibsim0"); > + devices[0] = umad2sim_dev_create("ibsim0"); > if (!devices[0]) { > ERROR("cannot init umad2sim. Exit.\n"); > exit(-1); From sashak at voltaire.com Fri Feb 27 01:24:05 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Feb 2009 11:24:05 +0200 Subject: [ofa-general] Re: [PATCH] opensm/console: Enhance perfmgr print_counters for better nodenames In-Reply-To: References: <20090219130653.GA29318@comcast.net> <20090226061551.GQ11192@sashak.voltaire.com> Message-ID: <20090227092405.GD7462@sashak.voltaire.com> On 07:03 Thu 26 Feb , Hal Rosenstock wrote: > On Thu, Feb 26, 2009 at 1:15 AM, Sasha Khapyorsky wrote: > > [snip...] > > > And in general I think it is better to use C-style comments - /* ... */, > > in C code and not C++-style // ... . > > Is this going to be enforced uniformly across OpenSM ? I didn't think about it (there are no many '//' comments), but I try to not introduce a new ones. Sasha From vlad at lists.openfabrics.org Fri Feb 27 03:17:29 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 27 Feb 2009 03:17:29 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090227-0200 daily build status Message-ID: <20090227111729.38CDDE6101C@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed 
on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hal.rosenstock at gmail.com Fri Feb 27 03:33:18 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 27 Feb 2009 06:33:18 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: References: <20090218153016.GD8489@comcast.net> <20090226070629.GU11192@sashak.voltaire.com> Message-ID: Sasha, On Thu, Feb 26, 2009 at 7:03 AM, Hal Rosenstock wrote: [snip...] >>> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c >>> index 73a6274..503d7fa 100644 >>> --- a/opensm/opensm/main.c >>> +++ b/opensm/opensm/main.c >>> @@ -2,6 +2,7 @@ >>>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. 
>>>   * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. >>>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. >>> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >>>   * >>>   * This software is available to you under a choice of one of two >>>   * licenses.  You may choose to be licensed under the terms of the GNU >>> @@ -364,6 +365,11 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid) >>>       uint32_t i, choice = 0; >>>       ib_api_status_t status; >>> >>> +     for (i = 0; i < num_ports; i++) { >>> +             attr_array[i].num_pkeys = 0; >>> +             attr_array[i].p_pkey_table = NULL; >>> +     } >>> + >> >> Here and below. Just >> >>        memset(attr_array, 0, sizeof(attr_array)); >> >> would be enough. > > Sure; next version. The thought above is that it is more efficient to just initialize the needed fields rather than the entire array which is not required. -- Hal From ms at diskware.net Fri Feb 27 03:35:12 2009 From: ms at diskware.net (Martin Scholl) Date: Fri, 27 Feb 2009 12:35:12 +0100 Subject: [ofa-general] RDS: add MSG_NOSIGNAL to rds_sendmsg? Message-ID: <49A7CFF0.4090606@diskware.net> Hello all, [although I liked to discuss this at rds-devel@, I post to general@ as rds-devel@ is still broken for me for several days now.] I just noticed MSG_NOSIGNAL is not part of the allowed set of msg flags to rds_sendmsg(). Attached is a hopefully harmless and tiny fix for this. Martin -------------- next part -------------- A non-text attachment was scrubbed... 
Name: msg_nosignal.diff Type: text/x-patch Size: 550 bytes Desc: not available URL: From hal.rosenstock at gmail.com Fri Feb 27 03:40:54 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 27 Feb 2009 06:40:54 -0500 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: References: <20090218153016.GD8489@comcast.net> <20090226070629.GU11192@sashak.voltaire.com> <20090226212538.GL14238@sashak.voltaire.com> Message-ID: On Thu, Feb 26, 2009 at 4:43 PM, Hal Rosenstock wrote: > On Thu, Feb 26, 2009 at 4:25 PM, Sasha Khapyorsky wrote: >> On 07:03 Thu 26 Feb , Hal Rosenstock wrote: >>> >> + r = IB_INSUFFICIENT_MEMORY; >>> >> + OSM_LOG(p_vend->p_log, >>> >> + OSM_LOG_ERROR, >>> >> + "ERR 5419: Insufficient memory for pkeys for port %d; need space for %d pkeys\n", >>> >> + j, >>> >> + ca.ports[j]->pkeys_size); >>> > >>> > Also should it be an error? May be it is just enough to fill requested >>> > pkey entries? >>> >>> I agree that being more forgiving is better but then how would it be >>> known if the pkeys are being truncated ? >> >> You could return a real pkeys_size value with table filled up to >> provided size. >> >> Otherwise (in case of just an error) how an user could know which pkey >> size to provide? > The problem with that is that the user needs to remember how many he > asked for originally. Not hard but just a detail that I expect will > get lost.
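The alternative under discussion is to fill the table up to the caller-provided size and report the real size. It can be sketched with an in/out count parameter. copy_pkeys() is a hypothetical helper and is not the osm_get_all_port_attrs() interface. It also returns a truncation flag, which is one possible answer to the concern that the caller would otherwise have to remember the size it asked for.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy up to *num_pkeys entries from `table`
 * (which holds `table_size` entries) into `out`, then store the real
 * table size back into *num_pkeys. The return value is 1 if the copy
 * was truncated, so the caller need not remember its original capacity. */
static int copy_pkeys(const uint16_t *table, unsigned table_size,
		      uint16_t *out, unsigned *num_pkeys)
{
	unsigned n = table_size < *num_pkeys ? table_size : *num_pkeys;
	int truncated = table_size > *num_pkeys;

	memcpy(out, table, n * sizeof(*out));
	*num_pkeys = table_size; /* in: capacity, out: real size */
	return truncated;
}

static void demo_pkeys(void)
{
	const uint16_t table[4] = { 0xffff, 0x8001, 0x8002, 0x7fff };
	uint16_t out[2];
	unsigned n = 2;

	assert(copy_pkeys(table, 4, out, &n) == 1); /* only 2 of 4 copied */
	assert(n == 4 && out[0] == 0xffff && out[1] == 0x8001);
}
```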
Also, should I assume you don't care about the API inconsistency issue mentioned in that the user can't just request the first n ports but only all ports.? -- Hal > -- Hal > >> Sasha >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Fri Feb 27 03:44:42 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 27 Feb 2009 06:44:42 -0500 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] ibsim/umad2sim.c: Eliminate unneeded umad2sim_dev num In-Reply-To: <20090227090838.GC7462@sashak.voltaire.com> References: <20090219174413.GA29805@comcast.net> <20090227090838.GC7462@sashak.voltaire.com> Message-ID: Sasha, On Fri, Feb 27, 2009 at 4:08 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 12:44 Thu 19 Feb     , Hal Rosenstock wrote: >> >> Signed-off-by: Hal Rosenstock >> --- >> diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c >> index e13e30a..aaa6260 100644 >> --- a/umad2sim/umad2sim.c >> +++ b/umad2sim/umad2sim.c >> @@ -1,5 +1,6 @@ >>  /* >>   * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. >> + * Copyright (c) 2009 HNR Consulting. All rights reserved. >>   * >>   * This file is part of ibsim. >>   * >> @@ -77,7 +78,6 @@ struct ib_user_mad_reg_req { >> >>  struct umad2sim_dev { >>       int fd; >> -     unsigned num; > > Wouldn't it be useful when more than one CA/host ports will be supported > using umad2sim? Then shouldn't it be added at the time that that feature is supported rather than have the currently unneeded initialization ? 
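Keeping a per-device index in umad2sim means that paths such as umadN and issmN are derived from that index rather than from a hard-coded 0. The following is a minimal sketch of the idea. struct sim_dev_paths and build_dev_paths() are hypothetical, and "/dev/infiniband" is only an example directory.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-device state: the index is kept so every path can
 * be derived from it instead of hard-coding device 0. */
struct sim_dev_paths {
	unsigned num;
	char umad_path[256];
	char issm_path[256];
};

static int build_dev_paths(struct sim_dev_paths *d, const char *dev_dir,
			   unsigned num)
{
	int n1, n2;

	d->num = num;
	n1 = snprintf(d->umad_path, sizeof(d->umad_path), "%s/%s%u",
		      dev_dir, "umad", num);
	n2 = snprintf(d->issm_path, sizeof(d->issm_path), "%s/%s%u",
		      dev_dir, "issm", num);
	/* snprintf returns the untruncated length; treat truncation as error. */
	if (n1 < 0 || n2 < 0 || (size_t) n1 >= sizeof(d->umad_path) ||
	    (size_t) n2 >= sizeof(d->issm_path))
		return -1;
	return 0;
}

static void demo_paths(void)
{
	struct sim_dev_paths d;

	assert(build_dev_paths(&d, "/dev/infiniband", 1) == 0);
	assert(strcmp(d.umad_path, "/dev/infiniband/umad1") == 0);
	assert(strcmp(d.issm_path, "/dev/infiniband/issm1") == 0);
}
```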
-- Hal > Sasha From sashak at voltaire.com Fri Feb 27 04:20:59 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Feb 2009 14:20:59 +0200 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: References: <20090218153016.GD8489@comcast.net> <20090226070629.GU11192@sashak.voltaire.com> <20090226212538.GL14238@sashak.voltaire.com> Message-ID: <20090227122059.GE7462@sashak.voltaire.com> On 06:40 Fri 27 Feb , Hal Rosenstock wrote: > >> > >> You could return a real pkeys_size value with table filled up to > >> provided size. > >> > >> Otherwise (in case of just an error) how an user could know which pkey > >> size to provide? > > > The problem with that is that the user needs to remember how many he > > asked for originally. Not hard but just a detail that I expect will > > get lost. > > Also, should I assume you don't care about the API inconsistency issue > mentioned in that the user can't just request the first n ports but > only all ports.? Why not? num_ports pointer is in/out parameter. Could you explain what do you mean here by API inconsistency? Sasha From sashak at voltaire.com Fri Feb 27 04:49:48 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Feb 2009 14:49:48 +0200 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: References: <20090218153016.GD8489@comcast.net> <20090226070629.GU11192@sashak.voltaire.com> Message-ID: <20090227124948.GF7462@sashak.voltaire.com> On 06:33 Fri 27 Feb , Hal Rosenstock wrote: > >> > >> memset(attr_array, 0, sizeof(attr_array)); > >> > >> would be enough. > > > > Sure; next version. > > The thought above is that it is more efficient to just initialize the > needed fields rather than the entire array which is not required.
I don't know for sure about this specific example, but normally memset() is heavily optimized function so I would expect at least comparable performance here. Sasha From sashak at voltaire.com Fri Feb 27 04:50:22 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Feb 2009 14:50:22 +0200 Subject: [ofa-general] Re: [PATCH v2] opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node In-Reply-To: <49A68976.6000404@dev.mellanox.co.il> References: <49A68976.6000404@dev.mellanox.co.il> Message-ID: <20090227125022.GG7462@sashak.voltaire.com> On 14:22 Thu 26 Feb , Yevgeny Kliteynik wrote: > Hi Sasha, > > [v2: adding CL_ASSERT() and changing comments] > > This patch fixes bugzilla issue #1515. > > The bug was discovered and analyzed by Line Holen. > > Topology: > |---------------| > | SW2 | > |---------------| > |x |y |z |v > |----| | | |----| > | | | | > | |----| |----| | > | | | | > a| b| c| d| > |---------------| |---------------| > | SW1 | | SW3 | > |---------------| |---------------| > | | > | | > HCA with SM HCA > > During the discovery: > > SM sends NodeInfo request to SW1 > SM sends NodeInfo request to SW2 through link a->x > SM discovers new node SW2: > - updates DR to SW2 to go through link a->x > - creates physp x > SM sends NodeInfo request to SW2 through link b->y > SM discovers a known node SW2 > - DOES NOT create physp y > - updates DR to SW2 to go through link b->y > > From now on, the DR to SW2 is going through port y, so OpenSM won't deal with > port y any more, leaving it uninitialized (no physp object for this port). > > The fix is to create physp for the newly discovered port of the known > switch node, same way as it is done for HCAs. > I also added one log message for the case that showed the problem - when > one of the link sides is uninitialized (no valid ports check). Perhaps > this log message should be an error message instead? > > Debugged-by: Line Holen > Signed-off-by: Yevgeny Kliteynik Applied. 
Thanks. Sasha From sashak at voltaire.com Fri Feb 27 05:00:34 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Feb 2009 15:00:34 +0200 Subject: [ofa-general] Re: [PATCH] ibsim/umad2sim.c: Eliminate unneeded umad2sim_dev num In-Reply-To: References: <20090219174413.GA29805@comcast.net> <20090227090838.GC7462@sashak.voltaire.com> Message-ID: <20090227130034.GH7462@sashak.voltaire.com> On 06:44 Fri 27 Feb , Hal Rosenstock wrote: > > > Wouldn't it be useful when more than one CA/host ports will be supported > > using umad2sim? > > Then shouldn't it be added at the time that that feature is supported > rather than have the currently unneeded initialization ? It is a matter of a clean interface; I prefer to keep it clean and not use a hardcoded device number. Sasha From sashak at voltaire.com Fri Feb 27 05:01:36 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Feb 2009 15:01:36 +0200 Subject: [ofa-general] [PATCH] ibsim: fix LocalPortNum in PortInfo response Message-ID: <20090227130136.GI7462@sashak.voltaire.com> Fix LocalPortNum encoding in PortInfo responses.
Signed-off-by: Sasha Khapyorsky --- ibsim/sim_mad.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/ibsim/sim_mad.c b/ibsim/sim_mad.c index 6e08031..9253415 100644 --- a/ibsim/sim_mad.c +++ b/ibsim/sim_mad.c @@ -483,6 +483,7 @@ do_portinfo(Port * port, unsigned op, uint32_t portnum, uint8_t * data) update_portinfo(p); memcpy(data, p->portinfo, IB_SMP_DATA_SIZE); + mad_set_field(data, 0, IB_PORT_LOCAL_PORT_F, port->portnum); return 0; } -- 1.6.1.2.319.gbd9e From hal.rosenstock at gmail.com Fri Feb 27 05:38:27 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 27 Feb 2009 08:38:27 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: <20090227122059.GE7462@sashak.voltaire.com> References: <20090218153016.GD8489@comcast.net> <20090226070629.GU11192@sashak.voltaire.com> <20090226212538.GL14238@sashak.voltaire.com> <20090227122059.GE7462@sashak.voltaire.com> Message-ID: On Fri, Feb 27, 2009 at 7:20 AM, Sasha Khapyorsky wrote: > On 06:40 Fri 27 Feb     , Hal Rosenstock wrote: >> >> >> >> You could return a real pkeys_size value with table filled up to >> >> provided size. >> >> >> >> Otherwise (in case of just an error) how an user could know which pkey >> >> size to provide? >> >> > The problem with that is that the user needs to remember how many he >> > asked for originally. Not hard but just a detail that I expect will >> > get lost. >> >> Also, should I assume you don't care about the API inconsistency issue >> mentioned in that the user can't just request the first n ports but >> only all ports.? > > Why not? num_ports pointer is in/out parameter. Could you explain what > do you mean here by API inconsistency? It's implementation rather than API. Not all the vendor implementations support the semantic where num_ports is not 0 and less than the total number of ports (and return insufficient memory for this condition). 
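The in/out convention debated here ("request the first n ports") and the P_Key rule Sasha suggests in the quoted text (fill the table up to the provided size, return the real pkeys_size) can be sketched in plain C. This is a minimal, hypothetical sketch: struct sim_port, struct port_attr, get_port_attrs() and the status enum are illustrative names, not the OpenSM vendor API, and the cl_hton16() byte swap done by the real vendor code is omitted. The copy is clamped to the smaller of the caller's capacity and the port's table size, so an over-sized caller buffer never causes a read past the port's own table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the real umad_port_t / ib_port_attr_t. */
struct sim_port {
	const uint16_t *pkeys;  /* port's P_Key table */
	unsigned pkeys_size;    /* entries the port really has */
};

struct port_attr {
	unsigned num_pkeys;     /* in: capacity of p_pkey_table, out: real count */
	uint16_t *p_pkey_table; /* caller-owned buffer, may be NULL */
};

enum get_status { GET_OK = 0, GET_INVALID_PARAMETER };

static unsigned min_u(unsigned a, unsigned b)
{
	return a < b ? a : b;
}

/*
 * Fill at most *num_ports entries ("first n ports") and report how many
 * were filled. For each port, copy at most min(capacity, pkeys_size)
 * P_Keys, then report the real table size so the caller can retry with
 * a larger buffer.
 */
enum get_status get_port_attrs(const struct sim_port *ports,
			       unsigned total_ports,
			       struct port_attr *attrs, unsigned *num_ports)
{
	unsigned i, k, n, copy;

	assert(ports != NULL && attrs != NULL);
	if (num_ports == NULL || *num_ports == 0)
		return GET_INVALID_PARAMETER;

	n = min_u(*num_ports, total_ports);
	for (i = 0; i < n; i++) {
		copy = min_u(attrs[i].num_pkeys, ports[i].pkeys_size);
		if (attrs[i].p_pkey_table != NULL)
			for (k = 0; k < copy; k++)
				attrs[i].p_pkey_table[k] = ports[i].pkeys[k];
		attrs[i].num_pkeys = ports[i].pkeys_size;
	}
	*num_ports = n; /* out: number of entries actually filled */
	return GET_OK;
}
```

With this convention the caller learns both values it needs: how many entries were filled (the smaller of what it asked for and what exists) and how large the P_Key table really is, which addresses the "user needs to remember how many he asked for" concern.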
-- Hal > Sasha From hal.rosenstock at gmail.com Fri Feb 27 08:39:00 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 27 Feb 2009 11:39:00 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] Re: [PATCH] Add pkey table support to osm_get_all_port_attrs In-Reply-To: <20090227124948.GF7462@sashak.voltaire.com> References: <20090218153016.GD8489@comcast.net> <20090226070629.GU11192@sashak.voltaire.com> <20090227124948.GF7462@sashak.voltaire.com> Message-ID: Sasha, On Fri, Feb 27, 2009 at 7:49 AM, Sasha Khapyorsky wrote: > On 06:33 Fri 27 Feb , Hal Rosenstock wrote: >> >> >> >> memset(attr_array, 0, sizeof(attr_array)); >> >> >> >> would be enough. >> > >> > Sure; next version. >> >> The thought above is that it is more efficient to just initialize the >> needed fields rather than the entire array which is not required. > > I don't know for sure about this specific example, but normally memset() > is heavily optimized function so I would expect at least comparable > performance here. It's minor but memset is slower for this. -- Hal > Sasha > From andi at firstfloor.org Fri Feb 27 09:08:34 2009 From: andi at firstfloor.org (Andi Kleen) Date: Fri, 27 Feb 2009 18:08:34 +0100 Subject: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> (Andy Grover's message of "Tue, 24 Feb 2009 17:30:17 -0800") References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> Message-ID: <87myc73izx.fsf@basil.nowhere.org> Andy Grover writes: > This patchset against net-next adds support for RDS sockets. RDS is an > Oracle-originated protocol used to send IPC datagrams (up to 1MB) > reliably, and is used currently in Oracle RAC and Exadata products. Perhaps I missed it earlier, but what is the rationale for putting this as a socket type into the kernel? I assume they also work directly as implemented in user space using raw sockets or similar, don't they?
-Andi -- ak at linux.intel.com -- Speaking for myself only. From sean.hefty at intel.com Fri Feb 27 09:51:51 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 27 Feb 2009 09:51:51 -0800 Subject: [ofa-general] RE: [PATCH v2] [ib-diag] saquery: add support for WinOF In-Reply-To: <20090227083200.GA7462@sashak.voltaire.com> References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> <45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com> <1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com> <20090227083200.GA7462@sashak.voltaire.com> Message-ID: <2E1723D4400C48B5B707979E50DCE370@amr.corp.intel.com> >> - modify CHECK_AND_SET_VAL - comparison is done as signed, but assignments >> are unsigned. This is kind of confusing, but that's how it appears the >> macro is used. It might be clearer if instead of passing -1 into the >> macro, that a SET_VAL macro be used instead. > >What do you mean? Another macro? yes -- instead of passing -1 into CHECK_AND_SET_VAL as the value to compare against, call a different macro that just sets the value, unless I'm misunderstanding why -1 is passed in. Then CHECK_AND_SET_VAL would do unsigned comparisons. I can submit a patch for this, but I wasn't completely sure of the intent of using -1 as the compare value. 
- Sean From andy.grover at gmail.com Fri Feb 27 10:21:30 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Fri, 27 Feb 2009 10:21:30 -0800 Subject: [ofa-general] ***SPAM*** Re: [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <20090227.000150.177646700.davem@davemloft.net> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <20090224.232814.227017310.davem@davemloft.net> <20090227.000150.177646700.davem@davemloft.net> Message-ID: On Fri, Feb 27, 2009 at 12:01 AM, David Miller wrote: > From: Andrew Grover > Date: Wed, 25 Feb 2009 10:43:27 -0800 > >> On Tue, Feb 24, 2009 at 11:28 PM, David Miller wrote: >> > Furthermore the port you've choosen for the protocol is arbitrary, not >> > properly allocated with the appropriate standards committee, and >> > therefore could conflict with something other people are using. >> >> I'm sure allocating the port won't be too big an issue. > > Ok, I added the RDS code to the net-next-2.6 tree, changing > AF_RDS to be 21 Thanks much! -- Andy From rdreier at cisco.com Fri Feb 27 10:29:36 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 27 Feb 2009 10:29:36 -0800 Subject: [ofa-general] Re: [PATCH] mlx4_core: Add device IDs for MT25458 10GigE devices In-Reply-To: <200902261238.26437.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Thu, 26 Feb 2009 12:38:26 +0200") References: <200902261238.26437.jackm@dev.mellanox.co.il> Message-ID: thanks, applied From rdreier at cisco.com Fri Feb 27 10:31:00 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 27 Feb 2009 10:31:00 -0800 Subject: [ofa-general] Re: [PATCH] ib/iser: remove hard setting of mtu In-Reply-To: (Or Gerlitz's message of "Thu, 26 Feb 2009 10:57:45 +0200 (IST)") References: Message-ID: thanks, applied. 
From rdreier at cisco.com Fri Feb 27 10:32:40 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 27 Feb 2009 10:32:40 -0800 Subject: [ofa-general] [PATCH] ib_mad: Fix RMPP header RRespTime manipulation In-Reply-To: <71d336490902261109n583f5b26gc9bf6fbee02e092e@mail.gmail.com> (Ramachandra K.'s message of "Fri, 27 Feb 2009 00:39:27 +0530") References: <680215bff5de6924922a2564da88b7f10951235666594.95@15bff5de6924922a2564da88b7f1095> <2B352424BBF540719F498B8DE04F1019@amr.corp.intel.com> <71d336490902261109n583f5b26gc9bf6fbee02e092e@mail.gmail.com> Message-ID: thanks, applied. From rdreier at cisco.com Fri Feb 27 10:36:37 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 27 Feb 2009 10:36:37 -0800 Subject: [ofa-general] Re: [PATCH v2] IB/core: fix null pointer dereference in local_completions() In-Reply-To: <1235687968.3948.218.camel@chromite.mv.qlogic.com> (Ralph Campbell's message of "Thu, 26 Feb 2009 14:39:28 -0800") References: <1235687968.3948.218.camel@chromite.mv.qlogic.com> Message-ID: thanks, applied From hnrose at comcast.net Fri Feb 27 10:46:53 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 27 Feb 2009 13:46:53 -0500 Subject: [ofa-general] [PATCH] opensm/infiniband-diags: Changes for C rather than C++ style comments Message-ID: <20090227184653.GC15668@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c index 048efc7..0c30726 100644 --- a/infiniband-diags/src/grouping.c +++ b/infiniband-diags/src/grouping.c @@ -336,9 +336,9 @@ static void get_router_slot(Node *node, Port *spineport) ch->slotnum = line_slot_2_sfb12[spineport->portnum]; /* this is a smart guess based on nodeguids order on sFB-12 module */ guessnum = spineport->node->nodeguid % 4; - // module 1 <--> remote anafa 3 - // module 2 <--> remote anafa 2 - // module 3 <--> remote anafa 1 + /* module 1 <--> remote anafa 3 */ + /* module 2 <--> remote anafa 2 */ + /* module 3 <--> remote anafa 1 */ 
ch->anafanum = (guessnum == 3? 1 : (guessnum == 1 ? 3 : 2)); } else if (is_spine_2004(spineport->node)) { ch->chassistype = ISR2004_CT; diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 6946fd7..2c5240a 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -870,8 +870,10 @@ void dump_ports_report () Node *node; Port *port; - // If switch and LID == 0, search of other switch ports with - // valid LID and assign it to all ports of that switch + /* + * If switch and LID == 0, search of other switch ports with + * valid LID and assign it to all ports of that switch + */ for (b = 0; b <= MAXHOPS; b++) for (node = nodesdist[b]; node; node = node->dnext) if (node->type == SWITCH_NODE) { diff --git a/opensm/include/opensm/osm_console.h b/opensm/include/opensm/osm_console.h index 3ea8fa5..acb36d9 100644 --- a/opensm/include/opensm/osm_console.h +++ b/opensm/include/opensm/osm_console.h @@ -45,7 +45,7 @@ #endif /* __cplusplus */ BEGIN_C_DECLS -// TODO replace p_osm +/* TODO replace p_osm */ void osm_console(osm_opensm_t * p_osm); END_C_DECLS #endif /* _OSM_CONSOLE_H_ */ diff --git a/opensm/opensm/osm_console_io.c b/opensm/opensm/osm_console_io.c index 3d3ece4..8953ab7 100644 --- a/opensm/opensm/osm_console_io.c +++ b/opensm/opensm/osm_console_io.c @@ -59,7 +59,7 @@ static int is_local(char *str) { - // convenience - checks if just stdin/stdout + /* convenience - checks if just stdin/stdout */ if (str) return (strcmp(str, OSM_LOCAL_CONSOLE) == 0); return 0; @@ -67,7 +67,7 @@ static int is_local(char *str) static int is_loopback(char *str) { - // convenience - checks if socket based connection + /* convenience - checks if socket based connection */ if (str) return (strcmp(str, OSM_LOOPBACK_CONSOLE) == 0); return 0; @@ -75,7 +75,7 @@ static int is_loopback(char *str) static int is_remote(char *str) { - // convenience - checks if socket based connection + /* convenience - checks if socket 
based connection */ if (str) return (strcmp(str, OSM_REMOTE_CONSOLE) == 0) || is_loopback(str); @@ -84,7 +84,7 @@ static int is_remote(char *str) int is_console_enabled(osm_subn_opt_t * p_opt) { - // checks for a variety of types of consoles - default is off or 0 + /* checks for a variety of types of consoles - default is off or 0 */ if (p_opt) return (is_local(p_opt->console) || is_loopback(p_opt->console) @@ -210,14 +210,14 @@ int osm_console_init(osm_subn_opt_t * opt, osm_console_t * p_oct, osm_log_t * p_ /* clean up and release resources */ void osm_console_exit(osm_console_t * p_oct, osm_log_t * p_log) { - // clean up and release resources, currently just close the socket + /* clean up and release resources, currently just close the socket */ osm_console_close(p_oct, p_log); } #ifdef ENABLE_OSM_CONSOLE_SOCKET int cio_open(osm_console_t * p_oct, int new_fd, osm_log_t * p_log) { - // returns zero if opened fine, -1 otherwise + /* returns zero if opened fine, -1 otherwise */ char *p_line; size_t len; ssize_t n; diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 4e783bf..17611f7 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -679,7 +679,7 @@ static void free_lash_structures(lash_t * p_lash) OSM_LOG_ENTER(p_log); - // free cdg_vertex_matrix + /* free cdg_vertex_matrix */ for (i = 0; i < p_lash->vl_min; i++) { for (j = 0; j < num_switches; j++) { for (k = 0; k < num_switches; k++) @@ -695,7 +695,7 @@ static void free_lash_structures(lash_t * p_lash) if (p_lash->cdg_vertex_matrix) free(p_lash->cdg_vertex_matrix); - // free virtual_location + /* free virtual_location */ for (i = 0; i < num_switches; i++) { for (j = 0; j < num_switches; j++) { if (p_lash->virtual_location[i][j]) @@ -723,7 +723,7 @@ static int init_lash_structures(lash_t * p_lash) OSM_LOG_ENTER(p_log); - // initialise cdg_vertex_matrix[num_switches][num_switches][num_switches] + /* initialise 
cdg_vertex_matrix[num_switches][num_switches][num_switches] */ p_lash->cdg_vertex_matrix = (cdg_vertex_t ****) malloc(vl_min * sizeof(cdg_vertex_t ****)); for (i = 0; i < vl_min; i++) { @@ -749,8 +749,10 @@ static int init_lash_structures(lash_t * p_lash) } } - // initialise virtual_location[num_switches][num_switches][num_layers], - // default value = 0 + /* + * initialise virtual_location[num_switches][num_switches][num_layers], + * default value = 0 + */ p_lash->virtual_location = (int ***)malloc(num_switches * sizeof(int ***)); if (p_lash->virtual_location == NULL) @@ -775,7 +777,7 @@ static int init_lash_structures(lash_t * p_lash) } } - // initialise num_mst_in_lane[num_switches], default 0 + /* initialise num_mst_in_lane[num_switches], default 0 */ p_lash->num_mst_in_lane = (int *)malloc(num_switches * sizeof(int)); if (p_lash->num_mst_in_lane == NULL) goto Exit_Mem_Error; @@ -997,7 +999,7 @@ static void populate_fwd_tbls(lash_t * p_lash) p_next_sw = (osm_switch_t *) cl_qmap_head(&p_subn->sw_guid_tbl); - // Go through each swtich individually + /* Go through each swtich individually */ while (p_next_sw != (osm_switch_t *) cl_qmap_end(&p_subn->sw_guid_tbl)) { uint64_t current_guid; switch_t *sw; @@ -1051,7 +1053,7 @@ static void populate_fwd_tbls(lash_t * p_lash) dst_lash_switch_id, physical_egress_port); } - } // for + } /* for */ osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw); } OSM_LOG_EXIT(p_log); @@ -1069,7 +1071,7 @@ static void osm_lash_process_switch(lash_t * p_lash, osm_switch_t * p_sw) switch_a_lash_id = get_lash_id(p_sw); port_count = osm_node_get_num_physp(p_sw->p_node); - // starting at port 1, ignoring management port on switch + /* starting at port 1, ignoring management port on switch */ for (i = 1; i < port_count; i++) { p_current_physp = osm_node_get_physp_ptr(p_sw->p_node, i); @@ -1148,7 +1150,7 @@ static int discover_network_properties(lash_t * p_lash) return -1; memset(p_lash->switches, 0, p_lash->num_switches * sizeof(switch_t 
*)); - vl_min = 5; // set to a high value + vl_min = 5; /* set to a high value */ p_next_sw = (osm_switch_t *) cl_qmap_head(&p_subn->sw_guid_tbl); while (p_next_sw != (osm_switch_t *) cl_qmap_end(&p_subn->sw_guid_tbl)) { @@ -1163,7 +1165,7 @@ static int discover_network_properties(lash_t * p_lash) port_count = osm_node_get_num_physp(p_sw->p_node); - // Note, ignoring port 0. management port + /* Note, ignoring port 0. management port */ for (i = 1; i < port_count; i++) { osm_physp_t *p_current_physp = osm_node_get_physp_ptr(p_sw->p_node, i); @@ -1178,8 +1180,8 @@ static int discover_network_properties(lash_t * p_lash) if (port_vl_min && port_vl_min < vl_min) vl_min = port_vl_min; } - } // for - } // while + } /* for */ + } /* while */ vl_min = 1 << (vl_min - 1); if (vl_min > 15) @@ -1219,7 +1221,7 @@ static int lash_process(void *context) p_lash->balance_limit = 6; - // everything starts here + /* everything starts here */ lash_cleanup(p_lash); return_status = discover_network_properties(p_lash); From hnrose at comcast.net Fri Feb 27 10:45:57 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 27 Feb 2009 13:45:57 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_perfmgr.c: In osm_perfmgr_shutdown, add missing cl_disp_unregister Message-ID: <20090227184557.GB15668@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index 6d325cb..f146fac 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -849,6 +849,7 @@ void osm_perfmgr_shutdown(osm_perfmgr_t * const pm) { OSM_LOG_ENTER(pm->log); cl_timer_stop(&pm->sweep_timer); + cl_disp_unregister(pm->pc_disp_h); osm_perfmgr_mad_unbind(pm); OSM_LOG_EXIT(pm->log); } From hnrose at comcast.net Fri Feb 27 10:44:50 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 27 Feb 2009 13:44:50 -0500 Subject: [ofa-general] [PATCHv2] Add pkey table support to osm_get_all_port_attrs Message-ID: 
<20090227184450.GA15668@comcast.net> Only supported in osm_vendor_ibumad.c (separate patch for other vendor layers) Also, update applications using this (osmtest, opensm) Signed-off-by: Hal Rosenstock --- Changes from v1: Only copy number of pkeys indicated Also, don't indicate insufficient memory error if insufficient pkey space supplied and always return number of pkeys that the port supports Note: initialization prior to get_all_port_attrs call not changed since it is faster this way Other patch for other vendor layers still appropriate following this ibutils patch to come diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c index 734a860..7a578ea 100644 --- a/opensm/libvendor/osm_vendor_ibumad.c +++ b/opensm/libvendor/osm_vendor_ibumad.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -556,12 +557,13 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, umad_ca_t ca; ib_port_attr_t *attr = p_attr_array; unsigned done = 0; - int r, i, j; + int r, i, j, k; OSM_LOG_ENTER(p_vend->p_log); CL_ASSERT(p_vend && p_num_ports); + r = 0; if (!*p_num_ports) { r = IB_INVALID_PARAMETER; OSM_LOG(p_vend->p_log, OSM_LOG_ERROR, "ERR 5418: " @@ -576,9 +578,7 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, } for (i = 0; i < p_vend->ca_count && !done; i++) { - /* - * For each CA, retrieve the port guids - */ + /* For each CA, retrieve the port attributes */ if (umad_get_ca(p_vend->ca_names[i], &ca) == 0) { if (ca.node_type < 1 || ca.node_type > 3) continue; @@ -590,6 +590,12 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, attr->port_num = ca.ports[j]->portnum; attr->sm_lid = ca.ports[j]->sm_lid; attr->link_state = ca.ports[j]->state; + if (attr->num_pkeys && attr->p_pkey_table) { + for (k = 0; k < attr->num_pkeys; k++) + attr->p_pkey_table[k] = + cl_hton16(ca.ports[j]->pkeys[k]); + } + attr->num_pkeys = ca.ports[j]->pkeys_size; attr++; if (attr - p_attr_array > *p_num_ports) { done = 1; @@ -601,7 +607,6 @@ osm_vendor_get_all_port_attr(IN osm_vendor_t * const p_vend, } *p_num_ports = attr - p_attr_array; - r = 0; Exit: OSM_LOG_EXIT(p_vend->p_log); diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 47fd658..1507fff 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c index f87e33b..bc8999d 100644 --- a/opensm/osmtest/main.c +++ b/opensm/osmtest/main.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -217,6 +218,11 @@ static void print_all_guids(IN osmtest_t * p_osmt) ib_port_attr_t attr_array[GUID_ARRAY_SIZE]; int i; + for (i = 0; i < num_ports; i++) { + attr_array[i].num_pkeys = 0; + attr_array[i].p_pkey_table = NULL; + } + /* Call the transport layer for a list of local port GUID values. @@ -245,6 +251,11 @@ ib_net64_t get_port_guid(IN osmtest_t * p_osmt, uint64_t port_guid) ib_port_attr_t attr_array[GUID_ARRAY_SIZE]; int i; + for (i = 0; i < num_ports; i++) { + attr_array[i].num_pkeys = 0; + attr_array[i].p_pkey_table = NULL; + } + /* Call the transport layer for a list of local port GUID values. diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c index 32cfa01..bdfe42c 100644 --- a/opensm/osmtest/osmtest.c +++ b/opensm/osmtest/osmtest.c @@ -2,6 +2,7 @@ * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -7096,9 +7097,15 @@ osmtest_bind(IN osmtest_t * p_osmt, ib_api_status_t status; uint32_t num_ports = GUID_ARRAY_SIZE; ib_port_attr_t attr_array[GUID_ARRAY_SIZE]; + int i; OSM_LOG_ENTER(&p_osmt->log); + for (i = 0; i < num_ports; i++) { + attr_array[i].num_pkeys = 0; + attr_array[i].p_pkey_table = NULL; + } + /* * Call the transport layer for a list of local port * GUID values. From hal.rosenstock at gmail.com Fri Feb 27 10:53:19 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 27 Feb 2009 13:53:19 -0500 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] opensm/console: Enhance perfmgr print_counters for better nodenames In-Reply-To: <20090227092405.GD7462@sashak.voltaire.com> References: <20090219130653.GA29318@comcast.net> <20090226061551.GQ11192@sashak.voltaire.com> <20090227092405.GD7462@sashak.voltaire.com> Message-ID: On Fri, Feb 27, 2009 at 4:24 AM, Sasha Khapyorsky wrote: > On 07:03 Thu 26 Feb     , Hal Rosenstock wrote: >> On Thu, Feb 26, 2009 at 1:15 AM, Sasha Khapyorsky wrote: >> >> [snip...] >> >> > And in general I think it is better to use C-style comments - /* ... */, >> > in C code and not C++-style // ... . >> >> Is this going to be enforced uniformly across OpenSM ? > > I didn't think about it (there are no many '//' comments), but I try to > not introduce a new ones. Sent some patches relative to this. I'm not willing to take on ib_types.h right now. Maybe someone else will. 
-- Hal > > Sasha > From sashak at voltaire.com Fri Feb 27 10:59:05 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Feb 2009 20:59:05 +0200 Subject: [ofa-general] Re: [PATCH v2] [ib-diag] saquery: add support for WinOF In-Reply-To: <2E1723D4400C48B5B707979E50DCE370@amr.corp.intel.com> References: <20090226101144.GB11192@sashak.voltaire.com> <0F5562867E0B4DBDA634F23F40302E28@amr.corp.intel.com> <20090226210211.GK14238@sashak.voltaire.com> <45768C59A3C0455BBE24FCCC01AEF366@amr.corp.intel.com> <1791B05EBD3245398C0D6B195546FE23@amr.corp.intel.com> <20090227083200.GA7462@sashak.voltaire.com> <2E1723D4400C48B5B707979E50DCE370@amr.corp.intel.com> Message-ID: <20090227185858.GJ7462@sashak.voltaire.com> On 09:51 Fri 27 Feb , Sean Hefty wrote: > >> - modify CHECK_AND_SET_VAL - comparison is done as signed, but assignments > >> are unsigned. This is kind of confusing, but that's how it appears the > >> macro is used. It might be clearer if instead of passing -1 into the > >> macro, that a SET_VAL macro be used instead. > > > >What do you mean? Another macro? > > yes -- instead of passing -1 into CHECK_AND_SET_VAL as the value to compare > against, call a different macro that just sets the value, unless I'm > misunderstanding why -1 is passed in. Then CHECK_AND_SET_VAL would do unsigned > comparisons. > > I can submit a patch for this, but I wasn't completely sure of the intent of > using -1 as the compare value. For some parameters (such as SL) "0" is a valid value and can be specified with command-line options, so I used -1 as the initial value to mark such parameters as not requested for the query (so their comp_mask bit is not set at all).
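The sentinel scheme Sasha describes (-1 means "not given", because 0 is a valid SL) and the signed-compare/unsigned-assign mix Sean points out can be illustrated with a small sketch. Everything here is hypothetical: the real saquery CHECK_AND_SET_VAL is not quoted in this thread, and struct path_query, the COMP_* bits and build_query() are illustrative names only. Unlike the real macro, which per Sean is passed -1 as a compare value, this sketch fixes the sentinel inside the macro.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical component-mask bits; the real SA comp_mask values differ. */
#define COMP_SL   (1u << 0)
#define COMP_PKEY (1u << 1)

/* Hypothetical query record, not the real ib_path_rec_t. */
struct path_query {
	uint8_t sl;
	uint16_t pkey;
	uint32_t comp_mask;
};

/* Command-line values are kept as int; -1 means "option not given". */
#define NOT_REQUESTED (-1)

/*
 * The sentinel test is a signed int comparison; only a requested value
 * is converted to the field's unsigned type and gets its comp_mask bit.
 */
#define CHECK_AND_SET_VAL(val, field, q, bit)   \
	do {                                    \
		if ((val) != NOT_REQUESTED) {   \
			(q)->field = (val);     \
			(q)->comp_mask |= (bit); \
		}                               \
	} while (0)

void build_query(struct path_query *q, int sl, int pkey)
{
	assert(q != NULL);
	memset(q, 0, sizeof(*q));
	CHECK_AND_SET_VAL(sl, sl, q, COMP_SL);
	CHECK_AND_SET_VAL(pkey, pkey, q, COMP_PKEY);
}
```

Passing sl = 0 sets both the field and COMP_SL, while passing NOT_REQUESTED leaves the bit clear, which is exactly the distinction a plain unsigned default of 0 could not express. Sean's alternative would keep this check only where "not given" must be told apart from 0 and use an unconditional setter elsewhere.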
Sasha From neutronsharc at gmail.com Fri Feb 27 11:01:50 2009 From: neutronsharc at gmail.com (neutron) Date: Fri, 27 Feb 2009 14:01:50 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] ib_reg_phys_mr( ) results in crash In-Reply-To: <499E6826.704@sun.com> References: <7d5928b30902170650o234f586ax6e27bb82c46427b3@mail.gmail.com> <7d5928b30902191047o25c34462w4cc51d7b88b888c6@mail.gmail.com> <499E6826.704@sun.com> Message-ID: <7d5928b30902271101h589ad61cha59f626572a24802@mail.gmail.com> It might be related to the new ConnectX card (with the mlx4_ib module). I tried the same program on a machine with only an "mthca" card, and it succeeds without any problems. Thanks. I remember someone on this list also reported a similar issue: ib_reg_phys_mr() fails with the mlx4 module. On Fri, Feb 20, 2009 at 3:21 AM, Liang Zhen wrote: > Hmm, I didn't see any problem in your code. Have you installed > ofa_kernel_devel (kernel headers of OFED) after installation of > ofa_kernel_1_3_1? > > Regards > Liang > > neutron: >> >> I'm using Mellanox HCA 'mthca0' type: MT25208, kernel version: >> 2.6.18-53.1.14.el5, ofed 1.3.1. >> >> The failed function call is like: >> >> { >> >> ctx->send_buf = dma_alloc_coherent(ctx->ib_dev->dma_device, MAX_SIZE, >> &dma_addr, GFP_KERNEL); >> >> ctx->phy_buf[0].addr = dma_addr; >> ctx->phy_buf[0].size = MAX_SIZE; >> ctx->iovstart = (u64) ctx->send_buf; >> >> printk("pd=%p, phy_buf[0].addr=%p,size=%d, iovstart=%llx\n", >> ctx->pd, ctx->phy_buf[0].addr, ctx->phy_buf[0].size, ctx->iovstart >> ); >> >> send_mr = ib_reg_phys_mr( ctx->pd, &ctx->phy_buf[0], 1, >> IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ >> | IB_ACCESS_LOCAL_WRITE, &(ctx->iovstart)); >> } >> >> The phy_buf[0] is a "ib_phys_buf" corresponding to "ctx->send_buf". >> >> Below is /var/log/messages output around the crash.
>> ---------------- >> Feb 19 12:50:22 wci30 kernel:  pd=ffff8101da3ddce0, >> phy_buf[0].addr=00000001bbe4b000,size=1024, iovstart=ffff8101bbe4b000 >> >> Feb 19 12:50:22 wci30 kernel: Unable to handle kernel NULL pointer >> dereference at 0000000000000000 >>  RIP: >> Feb 19 12:50:22 wci30 kernel:  [<0000000000000000>] >> _stext+0x7ffff000/0x1000 >> Feb 19 12:50:22 wci30 kernel: PGD 1c06d5067 PUD 1c9dcd067 PMD 0 >> Feb 19 12:50:22 wci30 kernel: Oops: 0010 [1] SMP >> Feb 19 12:50:22 wci30 kernel: last sysfs file: /module/libata/version >> Feb 19 12:50:22 wci30 kernel: CPU 0 >> Feb 19 12:54:05 wci30 syslogd 1.4.1: restart. >> Feb 19 12:54:05 wci30 kernel: klogd 1.4.1, log source = /proc/kmsg >> started. >> Feb 19 12:54:05 wci30 kernel: Linux version 2.6.18-53.1.14.el5 >> (brewbuilder at hs20-bc2-3.build.redha >> t.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Tue Feb >> 19 07:18:46 EST 2008 >> Feb 19 12:54:05 wci30 kernel: Command line: ro root=LABEL=/ rhgb quiet >> >> ==================== >> It's strange that the kernel doesn't print out the function call stack >> before crashing. >> >> Any hints?  Thanks a lot! >> >> On Wed, Feb 18, 2009 at 7:40 PM, Roland Dreier wrote: >> >>> >>>  > Before calling ib_reg_phys_mr,  printk() shows that all its arguments >>>  > are valid.  But the system always crashes immediately after entering >>>  > the function ib_reg_phys_mr( ).    Any possible reasons ?  Thanks!! >>> >>> What do you mean by "immediately after entering ib_reg_phys_mr()"?  Do >>> you get an oops message?  If so that would be very important info for >>> debugging this. >>> >>> - R. 
>>> >>> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > > From vst at vlnb.net Fri Feb 27 11:49:03 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Fri, 27 Feb 2009 22:49:03 +0300 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49A73164.3010109@harr.org> References: <48E386F6.5040502@fusionio.com> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net> <49A42BE9.4030603@harr.org> <49A43439.7080405@vlnb.net> <49A4812A.8050202@harr.org> <49A57256.2000005@harr.org> <49A6CF2B.4010002@harr.org>
<49A6F25F.8060306@vlnb.net> <49A73164.3010109@harr.org> Message-ID: <49A843AF.3010106@vlnb.net> Cameron Harr, on 02/27/2009 03:18 AM wrote: > Vladislav Bolkhovitin wrote: >> Cameron Harr, on 02/26/2009 08:19 PM wrote: >>> Cameron Harr wrote: >>>> Cameron Harr wrote: >>>> I re-compiled and re-ran the tests and numbers are a little better >>>> but performance still seems to have gone down from 673: >>>> Test 1:373751.66 >>>> Test 2:371242.6067 >>>> Test 3:347988.1467 >>>> Test 4:378247.31 >>>> Test 5:375616.53 >>> I was curious and did a regression test with 673 and those numbers >>> are now even worse, so I'll presume there is an issue on my system >>> and not the SCST code: >>> Test 1:365204.3067 >>> Test 2:364152.2067 >>> Test 3:340665.7633 >>> Test 4:369916.8133 >>> Test 5:369093.5833 >> It's known that any OS, including Linux, is getting "tired" under load >> with time from boot, which leads to worse performance. I guess, you >> can experience such effect. >> >> Check with r634. R635 has cache locality in data structures related >> change, which intended to improve performance a bit, but might make it >> worse instead. >> > > This is with 634. It's pretty bad: > 338316.44 > 329698.04 > 307972.7133 > 345682.4733 > 344165.08 And 633 is better? Definitely, you suffer from the system "tiring" effect. So, to get comparable results you should do measurements in a predefined state of the system, for instance just after boot, and in a row, i.e. one immediately after one.
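One rough way to see the drift described here is to compare the runs test by test rather than by averaging across tests, since Test 3 is consistently the slowest workload and would skew any cross-test mean. The sketch below uses only the figures quoted in this thread, in the order they were collected; the revision of the first run is not stated in this excerpt, and the helper names are hypothetical. If every test drops by a similar percentage between runs, that points to a system-wide effect rather than a code change hitting one workload, though it is not proof.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_TESTS 5

/* Figures quoted in the thread, in collection order. */
static const double run_first[NUM_TESTS] = {   /* first run after recompiling */
    373751.66, 371242.6067, 347988.1467, 378247.31, 375616.53 };
static const double run_r673[NUM_TESTS] = {    /* later regression run of 673 */
    365204.3067, 364152.2067, 340665.7633, 369916.8133, 369093.5833 };
static const double run_r634[NUM_TESTS] = {    /* latest run, r634 */
    338316.44, 329698.04, 307972.7133, 345682.4733, 344165.08 };

/* Percentage change of 'later' relative to 'base' for one test. */
static double pct_change(double base, double later)
{
    assert(base != 0.0);
    return 100.0 * (later - base) / base;
}

/* Print the per-test change of 'later' against 'base' and return the
 * spread (max - min) of those changes in percentage points. */
static double print_paired_changes(const char *label, const double *base,
                                   const double *later, size_t n)
{
    double lo = 0.0, hi = 0.0;

    for (size_t i = 0; i < n; i++) {
        double d = pct_change(base[i], later[i]);
        if (i == 0 || d < lo)
            lo = d;
        if (i == 0 || d > hi)
            hi = d;
        printf("%s test %zu: %+.2f%%\n", label, i + 1, d);
    }
    return hi - lo;
}
```

With these numbers every test drops by roughly 1.7 to 2.3 percent in the r673 run and by roughly 8.4 to 11.5 percent in the r634 run, which is consistent with Vlad's advice to take all measurements from the same system state, one run immediately after another.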
Vlad From cameron at harr.org Fri Feb 27 11:56:31 2009 From: cameron at harr.org (Cameron Harr) Date: Fri, 27 Feb 2009 12:56:31 -0700 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49A843AF.3010106@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net> <4980B8DE.3060806@harr.org> <4995D1EE.4000807@vlnb.net> <49A42BE9.4030603@harr.org> <49A43439.7080405@vlnb.net> <49A4812A.8050202@harr.org> <49A57256.2000005@harr.org> <49A6CF2B.4010002@harr.org> <49A6F25F.8060306@vlnb.net> <49A73164.3010109@harr.org> <49A843AF.3010106@vlnb.net> Message-ID: <49A8456F.6000908@harr.org> Vladislav Bolkhovitin wrote: >> >> This is with 634. It's pretty bad: >> 338316.44 >> 329698.04 >> 307972.7133 >> 345682.4733 >> 344165.08 > > Definitely, you suffer from the system "tiring" effect. So, to get > comparable results you should do measurements in a predefined state of > the system, for instance just after boot, and in a row, i.e. one > immediately after one. I think you're right, that the system is getting "tired." I think I am going to rest with the benchmarking for now though and just stick with the latest code in trunk, noting that my "Test 4" reliably produces the best results. Cameron From Ted.Kim at Sun.COM Fri Feb 27 11:59:18 2009 From: Ted.Kim at Sun.COM (Ted H.
Kim) Date: Fri, 27 Feb 2009 11:59:18 -0800 Subject: [ofa-general] 4K MTU for ISR-9024D-M? Message-ID: <49A84616.1000606@sun.com> Folks, Anyone know off hand if 4K MTU firmware/setting is available for an ISR-9024D-M? Please reply to me directly. Just trying to save time, before trying to navigate customer service. -ted > Subject: RE: [ofa-general] Configuring a 4 KB InfniBand link MTU > From: Boris Shpolyansky > Date:Fri, 09 Jan 2009 09:49:35 -0800 > To: James Lentini , Hal Rosenstock > CC: general at lists.openfabrics.org > > James, > > Mellanox InfiniScale III switch chip does support 4K MTU as stated in > the product brief. However it requires special FW settings that might > or might not be available from/supported by particular switch system > vendor. > > Boris Shpolyansky > Sr. Member of Technical Staff, Applications > > Mellanox Technologies Inc. > 350 Oakmead Parkway > Sunnyvale, CA 94085 > Tel.: (408) 916 0014 > Fax: (408) 970 3403 > Cell: (408) 834 9365 > www.mellanox.com -- Ted H. Kim Sun Microsystems, Inc. 222 North Sepulveda Blvd., 10th Floor El Segundo, CA 90245 ted.kim at sun.com (310) 341-1116 (310) 341-1120 FAX From jgunthorpe at obsidianresearch.com Fri Feb 27 13:10:33 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Fri, 27 Feb 2009 14:10:33 -0700 Subject: Re: [ofa-general] Re: [PATCH] opensm/console: Enhance perfmgr print_counters for better nodenames In-Reply-To: References: <20090219130653.GA29318@comcast.net> <20090226061551.GQ11192@sashak.voltaire.com> <20090227092405.GD7462@sashak.voltaire.com> Message-ID: <20090227211033.GC16941@obsidianresearch.com> On Fri, Feb 27, 2009 at 01:53:19PM -0500, Hal Rosenstock wrote: > On Fri, Feb 27, 2009 at 4:24 AM, Sasha Khapyorsky wrote: > > On 07:03 Thu 26 Feb, Hal Rosenstock wrote: > >> On Thu, Feb 26, 2009 at 1:15 AM, Sasha Khapyorsky wrote: > >> > >> [snip...] > >> > >> > And in general I think it is better to use C-style comments - /* ...
*/, > >> > in C code and not C++-style // ... . > >> > >> Is this going to be enforced uniformly across OpenSM ? > > > > I didn't think about it (there are no many '//' comments), but I try to > > not introduce a new ones. > > Sent some patches relative to this. I'm not willing to take on > ib_types.h right now. Maybe someone else will. Is this really worth doing? // is included in C99 and many other C99isms are already used in the source (well, until Sean removes them.. :) Jason From hnrose at comcast.net Fri Feb 27 13:18:51 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 27 Feb 2009 16:18:51 -0500 Subject: [ofa-general] [PATCH] opensm/osm_perfmgr.c: Improve assert in osm_pc_rcv_process Message-ID: <20090227211851.GA25061@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index 6d325cb..a74d35e 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -1106,6 +1106,9 @@ static void osm_pc_rcv_process(void *context, void *data) "Processing received MAD status 0x%x context 0x%" PRIx64 " port %u\n", p_mad->status, node_guid, port); + CL_ASSERT(p_mad->attr_id == IB_MAD_ATTR_PORT_CNTRS || + p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO); + /* Response could also be redirection (IBM eHCA PMA does this) */ if (p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO) { char gid_str[INET6_ADDRSTRLEN]; @@ -1165,8 +1168,6 @@ static void osm_pc_rcv_process(void *context, void *data) goto Exit; } - CL_ASSERT(p_mad->attr_id == IB_MAD_ATTR_PORT_CNTRS); - perfmgr_db_fill_err_read(wire_read, &err_reading); /* FIXME separate query for extended counters if they are supported * on the port.
From ralph.campbell at qlogic.com Fri Feb 27 13:38:11 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 27 Feb 2009 13:38:11 -0800 Subject: [ofa-general] [PATCH] IB/core: initialize mad_agent_priv before putting on lists Message-ID: <1235770691.3948.229.camel@chromite.mv.qlogic.com> There is a potential race in ib_register_mad_agent() where the struct ib_mad_agent_private is not fully initialized before it is added to the list of agents per IB port. This means the ib_mad_agent_private could be seen before the refcount, spin locks, and linked lists are initialized. The fix is to initialize the structure earlier. Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 735ad4e..dbcd285 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -301,6 +301,16 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, mad_agent_priv->agent.context = context; mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; mad_agent_priv->agent.port_num = port_num; + spin_lock_init(&mad_agent_priv->lock); + INIT_LIST_HEAD(&mad_agent_priv->send_list); + INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_LIST_HEAD(&mad_agent_priv->done_list); + INIT_LIST_HEAD(&mad_agent_priv->rmpp_list); + INIT_DELAYED_WORK(&mad_agent_priv->timed_work, timeout_sends); + INIT_LIST_HEAD(&mad_agent_priv->local_list); + INIT_WORK(&mad_agent_priv->local_work, local_completions); + atomic_set(&mad_agent_priv->refcount, 1); + init_completion(&mad_agent_priv->comp); spin_lock_irqsave(&port_priv->reg_lock, flags); mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; @@ -350,17 +360,6 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); spin_unlock_irqrestore(&port_priv->reg_lock, flags); - spin_lock_init(&mad_agent_priv->lock); - INIT_LIST_HEAD(&mad_agent_priv->send_list); - INIT_LIST_HEAD(&mad_agent_priv->wait_list); 
- INIT_LIST_HEAD(&mad_agent_priv->done_list); - INIT_LIST_HEAD(&mad_agent_priv->rmpp_list); - INIT_DELAYED_WORK(&mad_agent_priv->timed_work, timeout_sends); - INIT_LIST_HEAD(&mad_agent_priv->local_list); - INIT_WORK(&mad_agent_priv->local_work, local_completions); - atomic_set(&mad_agent_priv->refcount, 1); - init_completion(&mad_agent_priv->comp); - return &mad_agent_priv->agent; error4: From ralph.campbell at qlogic.com Fri Feb 27 13:45:57 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 27 Feb 2009 13:45:57 -0800 Subject: [ofa-general] [PATCH] IB/core: ib_post_send_mad() returns zero but doesn't generate send completion Message-ID: <1235771157.3948.233.camel@chromite.mv.qlogic.com> If ib_post_send_mad() returns zero, it guarantees that there will be a callback to the send_buf->mad_agent->send_handler() so that the sender can call ib_free_send_mad(). Otherwise, the ib_mad_send_buf will be leaked and the mad_agent reference count will never go to zero and the IB device module cannot be unloaded. The above can happen without this patch if process_mad() returns (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED). If process_mad() returns IB_MAD_RESULT_SUCCESS and there is no agent registered to receive the mad being sent, handle_outgoing_dr_smp() returns zero which causes a MAD packet which is at the end of the directed route to be incorrectly sent on the wire but doesn't cause a hang since the HCA generates a send completion. 
Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index dbcd285..62a99dc 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -742,9 +742,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, break; case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: kmem_cache_free(ib_mad_cache, mad_priv); - kfree(local); - ret = 1; - goto out; + break; case IB_MAD_RESULT_SUCCESS: /* Treat like an incoming receive MAD */ port_priv = ib_get_mad_port(mad_agent_priv->agent.device, @@ -755,10 +753,12 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, &mad_priv->mad.mad); } if (!port_priv || !recv_mad_agent) { + /* + * No receiving agent so drop packet and + * generate send completion. + */ kmem_cache_free(ib_mad_cache, mad_priv); - kfree(local); - ret = 0; - goto out; + break; } local->mad_priv = mad_priv; local->recv_mad_agent = recv_mad_agent; From sean.hefty at intel.com Fri Feb 27 14:23:11 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 27 Feb 2009 14:23:11 -0800 Subject: [ofa-general] [PATCH] IB/core: initialize mad_agent_priv before putting on lists In-Reply-To: <1235770691.3948.229.camel@chromite.mv.qlogic.com> References: <1235770691.3948.229.camel@chromite.mv.qlogic.com> Message-ID: <41FB31A236F64B6685D188CF3CCDF995@amr.corp.intel.com> >There is a potential race in ib_register_mad_agent() where the struct >ib_mad_agent_private is not fully initialized before it is added >to the list of agents per IB port. This means the ib_mad_agent_private >could be seen before the refcount, spin locks, and linked lists >are initialized. The fix is to initialize the structure earlier. This looks correct and needed to me. 
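The ordering fix in Ralph's first patch is an instance of the general "initialize, then publish" rule: once an object is linked into a shared list under the list's lock, any other thread may find it, so its own lock, list heads and refcount must already be valid. A minimal user-space sketch, assuming POSIX threads and C11 atomics as stand-ins for the kernel's spinlocks and atomic_t; struct agent, struct registry, register_agent() and find_agent() are hypothetical names, not part of the MAD API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_mad_agent_private. */
struct agent {
    pthread_mutex_t lock;   /* plays the role of mad_agent_priv->lock */
    atomic_int refcount;    /* plays the role of mad_agent_priv->refcount */
    int id;
    struct agent *next;     /* link in the registry list */
};

/* Hypothetical stand-in for the per-port agent list and reg_lock. */
struct registry {
    pthread_mutex_t reg_lock;
    struct agent *head;
};

/* Allocate and register an agent. Every field is initialized BEFORE the
 * agent becomes reachable through reg->head, mirroring the patch. */
struct agent *register_agent(struct registry *reg, int id)
{
    assert(reg != NULL);
    struct agent *a = malloc(sizeof(*a));
    if (!a)
        return NULL;

    /* 1. Fully initialize the private state first ... */
    pthread_mutex_init(&a->lock, NULL);
    atomic_init(&a->refcount, 1);
    a->id = id;

    /* 2. ... then publish it under the registry lock. */
    pthread_mutex_lock(&reg->reg_lock);
    a->next = reg->head;
    reg->head = a;
    pthread_mutex_unlock(&reg->reg_lock);
    return a;
}

/* Concurrent lookup: taking a reference is safe only because an agent is
 * never on the list before its refcount has been initialized. */
struct agent *find_agent(struct registry *reg, int id)
{
    struct agent *a;

    pthread_mutex_lock(&reg->reg_lock);
    for (a = reg->head; a; a = a->next) {
        if (a->id == id) {
            atomic_fetch_add(&a->refcount, 1);
            break;
        }
    }
    pthread_mutex_unlock(&reg->reg_lock);
    return a;
}
```

In the pre-patch ordering, the equivalent of step 2 ran before step 1, so a lookup racing with registration could increment an uninitialized refcount or take an uninitialized lock.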
From rdreier at cisco.com Fri Feb 27 14:44:45 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 27 Feb 2009 14:44:45 -0800 Subject: [ofa-general] [PATCH] IB/core: initialize mad_agent_priv before putting on lists In-Reply-To: <1235770691.3948.229.camel@chromite.mv.qlogic.com> (Ralph Campbell's message of "Fri, 27 Feb 2009 13:38:11 -0800") References: <1235770691.3948.229.camel@chromite.mv.qlogic.com> Message-ID: thanks, applied From sumeet.lahorani at oracle.com Fri Feb 27 15:36:14 2009 From: sumeet.lahorani at oracle.com (Sumeet Lahorani) Date: Fri, 27 Feb 2009 15:36:14 -0800 Subject: [ofa-general] Measuring SDP throughput Message-ID: <49A878EE.70908@oracle.com> Hi, Is there a tool to observe the SDP throughput while a workload is in progress? I'm not looking for a tool such as qperf which generates its own workload. We have Voltaire switches which give us overall throughput numbers through the PortCounters.csv file but these are not limited to just the SDP traffic. - Sumeet From andy.grover at gmail.com Fri Feb 27 17:53:19 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Fri, 27 Feb 2009 17:53:19 -0800 Subject: Re: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <87myc73izx.fsf@basil.nowhere.org> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <87myc73izx.fsf@basil.nowhere.org> Message-ID: On Fri, Feb 27, 2009 at 9:08 AM, Andi Kleen wrote: >> This patchset against net-next adds support for RDS sockets. RDS is an >> Oracle-originated protocol used to send IPC datagrams (up to 1MB) >> reliably, and is used currently in Oracle RAC and Exadata products. > > Perhaps I missed it earlier, but what is the rationale for putting > this as a socket type into the kernel? I assume they also work > directly as implemented in user space using raw sockets or similar, > don't they? You want me to implement my fancy protocol in userspace??? Do I even get to write it in C or do I need to use Ruby?
Regards -- Andy From andi at firstfloor.org Fri Feb 27 21:56:08 2009 From: andi at firstfloor.org (Andi Kleen) Date: Sat, 28 Feb 2009 06:56:08 +0100 Subject: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <87myc73izx.fsf@basil.nowhere.org> Message-ID: <20090228055608.GB26292@one.firstfloor.org> On Fri, Feb 27, 2009 at 05:53:19PM -0800, Andrew Grover wrote: > On Fri, Feb 27, 2009 at 9:08 AM, Andi Kleen wrote: > >> This patchset against net-next adds support for RDS sockets. RDS is an > >> Oracle-originated protocol used to send IPC datagrams (up to 1MB) > >> reliably, and is used currently in Oracle RAC and Exadata products. > > > > Perhaps I missed it earlier, but what is the rationale for putting > > this as a socket type into the kernel? I assume they also work > > directly as implemented in user space using raw sockets or similar, > > don't they? > > You want me to implement my fancy protocol in userspace??? I just asked why you're putting it in kernel space. > Do I even get to write it in C or do I need to use Ruby? Well normally people who add new subsystems to the kernel explain why they do that. Perhaps it's obvious to you, but at least to me it isn't. -Andi -- ak at linux.intel.com -- Speaking for myself only. 
From vlad at lists.openfabrics.org Sat Feb 28 03:17:10 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 28 Feb 2009 03:17:10 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090228-0200 daily build status Message-ID: <20090228111711.2FFCDE60FB5@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 
with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From sashak at voltaire.com Sat Feb 28 09:13:44 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Feb 2009 19:13:44 +0200 Subject: [ofa-general] [ANNOUNCE] management tarballs release Message-ID: <20090228171344.GK7462@sashak.voltaire.com> Hi, There is a new release of the management (OpenSM and infiniband diagnostics) tarballs available in: http://www.openfabrics.org/downloads/management/ md5sum: 97b2609f5eaaf4320b39f44a50500b70 libibumad-1.3.1.tar.gz e60b1c787d7cd2768967ca4766238210 libibmad-1.3.1.tar.gz 8c8c153f21d9f6cee51fc3d501c54fe7 opensm-3.3.1.tar.gz 6b6c87ed01291a2a3322b0ff696c5a11 infiniband-diags-1.5.1.tar.gz All component versions are from recent master branch. Full change log is below. Sasha Arlin Davis (3): libibmad: add os dependent definitions. libibmad: remove c99 definitions within the ib_mad_f structure libibmad: minor changes to source to allow portability to WinOF. David McMillen (1): infiniband-diags/src/ibnetdiscover.c missing LID information on --ports Eli Dorfman (1): opensm/osm_inform.c report IB traps to plugin Eli Dorfman (Voltaire) (10): opensm/osm_subnet.c Fix memory leak for QOS string parameters. libibmad add PortXmitWait and CounterSelect2 to fields. 
opensm: Add new partition keyword for all hca, switches and routers docs update documenatation about new partition keywords infiniband-diags support PortXmitWait get and set opensm/osm_log.c save log_max_size in subnet opt in MB opensm/osm_subnet.c support subnet configuration rescan and update libibmad/src/dump.c fix dump functions for big endian machines opensm/osm_subnet.c enable log_max_size opt update opensm/osm_subnet.c fix parse functions for big endian machines Hal Rosenstock (23): opensm/libvendor/osm_vendor_sa_api.h: Fix commentary typo opensm/osm_inform.c: Eliminate compile warning opensm/osm_perfmgr_db.h: Remove unused typedef opensm/osm_perfmgr.c: In osm_perfmgr_init, eliminate memory leak on error libibmad/(mad.h fields.c): Add support for PerfMgt ClassPortInfo opensm/include/iba/ib_types.h: Add xmit_wait for PortCounters opensm/PerfMgr: Mainly cosmetic changes opensm/osm_node.h: Fix osm_node_get_num_physp description opensm/PerfMgr: Primarily fix enhanced switch port 0 perf manager operation opensm/doc/perf-manager-arch.txt: Fix some commentary typos opensm/PerfMgr: Add copyrights libibmad: lid print format changed to unsigned libibumad/umad.c: Change lid print format to unsigned infiniband-diags/perfquery: Change option name for extended counters opensm/osm_inform.c: Fix sense of zero GID compare in __match_inf_rec management/libibmad.txt: Remove madrpc_lock/unlock opensm/man/opensm.8.in: Indicate ROUTER_EXP obsoleted opensm/osm_console.c: Improve perfmgr print_counters error message infiniband-diags/smpdump.c: Fix usage examples infiniband-diags/smpdump.c: Release umad resources on exit opensm/console: Enhance perfmgr print_counters for better nodenames libibmad/fields.c: Dump LIDs as unsigned decimal infiniband-diags/saquery.c: Convert more LID prints to unsigned decimal Ira Weiny (3): opensm/opensm/osm_console.c: move reporting of plugins to "status" command. OpenSM: update osmeventplugin example for the new TRAP event. 
libibmad: Use enum types for function parameters Mike Heinz (1): opensm/osm_vendor_*_sa: fix incompatibility with QLogic SM Nicolas Morey Chaisemartin (4): Corrected incoherency in __osm_ftree_fabric_route_to_non_cns comments opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation opensm/osm_ucast_ftree.c Fixed bad init value for down port index opensm/osm_console.c : Added dump_portguid function to console to generate a list of port guids matching one or more regexps Ralph Campbell (2): libibumad: get_ca() can call release_ca() with uninitialized data opensm: fix structure definition for trap 257-258 Robert Pearson (10): mesh analysis - skeleton mesh analysis - mesh_t data structure mesh analysis - node and link structures mesh analysis - matrix/polynomial routines mesh analysis - local geometry mesh analysis - mesh info table mesh analysis - induce global geometry mesh analysis - reorder links mesh analysis - lash preparation mesh analysis - integrate into lash core Sasha Khapyorsky (111): opensm: remove some unused variables and funcs opensm/osm_ucast_mgr: indentation fix infiniband-diags/saquery: indentation fixes infiniband-diabs/saquery: unify SA queries processors infiniband-diags/saquery: separate queries and commands infiniband-diags/saquery: PortInfoRecord query infinabd-diags: convert type uint -> unsigned int opensm: remove unused header osm_pkey_mgr.h opensm/osm_sm.c: fix MC group creation in race condition opensm/osm_sa_mcmember_record: improve __cleanup_mgrp() opensm/multicast: remove some unused parameters. 
opensm/osm_subnet: consolidate some duplicated code opensm/event_plugin: link opensm with -rdynamic flag opensm/vendor: save some stack memory infiniband-diags/saquery: minor indentation fixes opensm/osm_subnet.c: indentation fixes opensm/man/opensm.8.in: add descrition for --do_mesh_analysis option opensm: add do_mesh_analysis configuration parameter opensm/mesh: fix memory leaks opensm/lash: fix memory leaks infiniband-diags/ibstat,smpdump: kill unused includes opensm/osm_mesh: make mesh_info static and const opensm/osm_mesh: simplify mesh node links and ports allocation opensm/lash: simplify some memory allocations opensm/opensm.spec: fix event plugin config options libibmad: remove hidden _set/_get_field*() API management: move sysfs()_* function to libibumad opensm: remove libibcommon build dependencies management: remove libibcommon dependencies libibmad: remove not needed header files inclusion libibmad: remove functions which use pthread infiniband-diags/perfquery: indentation fixes opensm: update LFTs when entering master opensm/osm_subnet.c: drop some unneeded braces opensm: invalidate routing cache when entering master state opensm/osm_subnet.c: fix warnings in subn_free_qos_options() infiniband-diags/perfquery.c: fix typo libibmad: cleanup mad.h include path libibmad: indentation fixes libibmad/fields.c: fix MAD MKey offset libibmad: use mad_set_field64() for mkey encoding infiniband-diags/Makefile.am: kill -rpath infiniband-diags/Makefile.am: merge CFLAGS infiniband-diags/Makefile.am: use common library infiniband-diags/ibdiag_common: cosmetic infiniband-diags/ibdiag_common: move get_build_version() infiniband-diags: remove duplicated ibdebug prototype infiniband-diags/smpdump.c: use common ib definitions infiniband-diags/ibdiag_common: cleanup argv0 prototype infiniband-diags/dump_lfts.sh: fix -D format parsing infiniband-diags/dump_mfts.sh: fix -D format parsing libibcommon: remove from the management tree infiniband-diags: command line option 
processing framework infiniband-diags: using common command line option processing infiniband-diags: remove argv0 global variable infiniband-diags: make get_build_version() static infiniband-diags: remove unneeded includes infiniband-diags/smpquery: usage improvement infiniband-diags/saquery: add lid parameter to NodeRecord query infiniband-diags/ibsysstat: use RMPP for client/server communication infiniband-diags/ibsysstat: backward compatibility fixes infiniband-diags/saquery: fix backward compatibility bug infiniband-diags/smpdump: fix SL value encoding infiniband-diags/saquery: fix encoding of SA queries infiniband-diags/saquery: cosmetic infiniband-diags/saquery: CHECK_AND_SET_VAL() macro infiniband-diags/saquery: adding query params infiniband-diags/saquery: more params for Path and MCMember Records infiniband-diags/saquery: merge PathRecord query functions opensm/osm_subnet.c: fix compile warnings opensm: fix port chooser opensm/main.c: indentation fixes in get_port_guid() opensm/osm_sw_info_rcv.c: cosmetic changes opensm/osm_perfmgr.c: kill some redundant tests infiniband-diags/common: use enum MAD_DEST as ibd_dest_type type opensm: rescan config file even in standby opensm/ib_types.h: cosmetic opensm/osm_subnet.c: indentation fixes opensm/osm_subnet.c: clean_val() remove trailing quotation opensm/osm_subnet.c: break matching when config parameter already found opensm/osm_ucast_ftree.c: cosmetic improvements opensm: avoid memory leaks on config parameters reloading opensm/qos_config: no invalid option message on default values opensm: sort port order for routing by switch loads opensm/ftree: cleanup ftree_sw_tbl_element_t use opensm/ftree: simplify root guids setup. 
opensm/ftree: make unsigned sw->down_port_groups_idx opensm/osm_helper.c: print port number as decimal libibmad/mad.h: define more SA attributed libibmad/fields.c: define SA SM_Key field details infiniband-diags/saquery: remove osm vendor layer infiniband-diags/saquery: fix types and some cleanup infiniband-diags: some code consolidation infiniabnd-diags/common: wrap debug macros with do {} while (0) opensm/console: dump_portguid command fixes opensm/console: dump_portguid - don't duplicate matched guids opensm/console/dump_portguid: minor improvements opensm: pre-scan command line for config file option opensm/osm_subnet.c: move parse and setup functions opensm: proper config file rescan opensm/osm_subnet: fix crash in qos string config parameters reloading opensm/main.c: remove enable_stack_dump() call opensm/osm_qos.c: cosmetic: remove empty line opensm/Makefile.am: remove osm_build_id.h junk file generation opensm/lid_mgr: fix duplicated lid assignment opensm/lid_mgr: simplify lmc_mask initialization opensm/sweep: add log message before lid assignment opensm/osm_lid_mgr.c: consolidate flows infiniband-diags/ibroute: fix warning opensm: OpenSM Release Notes for 3.3 management: bump all package versions Sean Hefty (17): sminfo: add support for WinOF vendstat: add support for WinOF ibaddr: add support for WinOF perfquery: add support for WinOF ibportstate: add support for WinOF ibstat: add support for WinOF smpdump: add support for WinOF ibping: add support for WinOF smpquery: add support for WinOF [ib-diag] ibnetdiscover: add support for WinOF [ib-diag] ibroute: add support for WinOF [ib-diag] ibtracert: add support for WinOF [ib-diag] ibsendtrap: add support for WinOF [ib-diag] mcm_rereg_test: add support for WinOF [ib-diag] ibsysstat: add support for WinOF [ib-diags] saquery: set correct pkey table field [ib-diag] saquery: add support for WinOF Stan Smith (1): libibmad: add MAD_EXPORT to exported calls Yevgeny Kliteynik (6): opensm/osm_ucast_ftree.c: fixing 
errors in comments opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0 is active opensm/osm_ucast_ftree.c: fix full topology dump opensm/osm_sa.c: fixing SA MAD dump opensm/osm_state_mgr.c: small bug in scanning lid table opensm/osm_node_info_rcv.c: create physp for the newly discovered port of the known node hnrose at comcast.net (5): opensm/osm_ucast_mgr.c: Add error numbers for some OSM_LOG prin opensm/osm_helper.c: Add port counters to __osm_disp_msg_str opensm/osm_console.c: Add list of SMs to status command opensm/osm_console.c: Eliminate some extraneous parentheses opensm/osm_console.c: Add missing command in help_perfmgr From sashak at voltaire.com Sat Feb 28 09:31:40 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Feb 2009 19:31:40 +0200 Subject: [ofa-general] [PATCH] opensm/osm_console.c: kill warning: defined but not used Message-ID: <20090228173140.GM7462@sashak.voltaire.com> Kill compile warning: osm_console.c:82: warning: 'name_token' defined but not used Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_console.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index e1936fb..63c5ea8 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -78,10 +78,12 @@ static char *next_token(char **p_last) return strtok_r(NULL, " \t\n\r", p_last); } +#ifdef ENABLE_OSM_PERF_MGR static char *name_token(char **p_last) { return strtok_r(NULL, "\t\n\r", p_last); } +#endif static void help_command(FILE * out, int detail) { -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Feb 28 09:32:47 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Feb 2009 19:32:47 +0200 Subject: [ofa-general] [PATCH] opensm/osm_lid_mgr: use single array for used_lids Message-ID: <20090228173247.GN7462@sashak.voltaire.com> Use single array (instead of ptr vector) for used_lids. 
Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_lid_mgr.h | 4 +- opensm/opensm/osm_lid_mgr.c | 60 +++++++++-------------------------- 2 files changed, 17 insertions(+), 47 deletions(-) diff --git a/opensm/include/opensm/osm_lid_mgr.h b/opensm/include/opensm/osm_lid_mgr.h index 714ba41..d6d1ab8 100644 --- a/opensm/include/opensm/osm_lid_mgr.h +++ b/opensm/include/opensm/osm_lid_mgr.h @@ -98,8 +98,8 @@ typedef struct osm_lid_mgr { cl_plock_t *p_lock; boolean_t send_set_reqs; osm_db_domain_t *p_g2l; - cl_ptr_vector_t used_lids; cl_qlist_t free_ranges; + uint8_t used_lids[IB_LID_UCAST_END_HO + 1]; } osm_lid_mgr_t; /* * FIELDS @@ -125,7 +125,7 @@ typedef struct osm_lid_mgr { * Pointer to the database domain storing guid to lid mapping. * * used_lids -* A vector the maps from the lid to its guid. keeps track of +* An array of used lids. keeps track of * existing and non existing mapping of guid->lid * * free_ranges diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c index 63c3bb9..e527337 100644 --- a/opensm/opensm/osm_lid_mgr.c +++ b/opensm/opensm/osm_lid_mgr.c @@ -109,7 +109,6 @@ typedef struct osm_lid_mgr_range { void osm_lid_mgr_construct(IN osm_lid_mgr_t * const p_mgr) { memset(p_mgr, 0, sizeof(*p_mgr)); - cl_ptr_vector_construct(&p_mgr->used_lids); } /********************************************************************** @@ -120,7 +119,6 @@ void osm_lid_mgr_destroy(IN osm_lid_mgr_t * const p_mgr) OSM_LOG_ENTER(p_mgr->p_log); - cl_ptr_vector_destroy(&p_mgr->used_lids); p_item = cl_qlist_remove_head(&p_mgr->free_ranges); while (p_item != cl_qlist_end(&p_mgr->free_ranges)) { free((osm_lid_mgr_range_t *) p_item); @@ -188,11 +186,7 @@ static void __osm_lid_mgr_validate_db(IN osm_lid_mgr_t * p_mgr) } else { /* check if the lids were not previously assigned */ for (lid = min_lid; lid <= max_lid; lid++) { - if ((cl_ptr_vector_get_size - (&p_mgr->used_lids) > lid) - && - (cl_ptr_vector_get - (&p_mgr->used_lids, lid))) { + if 
(p_mgr->used_lids[lid]) { OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 0314: " "0x%04x for guid:0x%016" @@ -215,8 +209,7 @@ static void __osm_lid_mgr_validate_db(IN osm_lid_mgr_t * p_mgr) } else { /* mark it was visited */ for (lid = min_lid; lid <= max_lid; lid++) - cl_ptr_vector_set(&p_mgr->used_lids, - lid, (void *)1); + p_mgr->used_lids[lid] = 1; } } /* got a lid */ free(p_item); @@ -252,7 +245,6 @@ osm_lid_mgr_init(IN osm_lid_mgr_t * const p_mgr, IN osm_sm_t *sm) goto Exit; } - cl_ptr_vector_init(&p_mgr->used_lids, 100, 40); cl_qlist_init(&p_mgr->free_ranges); /* we use the stored guid to lid table if not forced to reassign */ @@ -303,7 +295,6 @@ static uint16_t __osm_trim_lid(IN uint16_t lid) static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr) { cl_ptr_vector_t *p_discovered_vec = &p_mgr->p_subn->port_lid_tbl; - cl_ptr_vector_t *p_persistent_vec = &p_mgr->used_lids; uint16_t max_defined_lid; uint16_t max_persistent_lid; uint16_t max_discovered_lid; @@ -335,10 +326,7 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr) OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, "Ignore guid2lid file when coming out of standby\n"); osm_db_clear(p_mgr->p_g2l); - for (lid = 0; - lid < cl_ptr_vector_get_size(&p_mgr->used_lids); - lid++) - cl_ptr_vector_set(p_persistent_vec, lid, NULL); + memset(p_mgr->used_lids, 0, sizeof(p_mgr->used_lids)); } else { OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, "Honor current guid2lid file when coming out " @@ -413,7 +401,7 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr) cl_ntoh64 (osm_port_get_guid(p_port))); for (lid = db_min_lid; lid <= db_max_lid; lid++) - cl_ptr_vector_set(p_persistent_vec, lid, NULL); + p_mgr->used_lids[lid] = 0; } } @@ -437,14 +425,11 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr) /* find the range of lids to scan */ max_discovered_lid = (uint16_t) cl_ptr_vector_get_size(p_discovered_vec); - max_persistent_lid = - (uint16_t) 
cl_ptr_vector_get_size(p_persistent_vec); + max_persistent_lid = sizeof(p_mgr->used_lids) - 1; /* but the vectors have one extra entry for lid=0 */ if (max_discovered_lid) max_discovered_lid--; - if (max_persistent_lid) - max_persistent_lid--; if (max_persistent_lid > max_discovered_lid) max_defined_lid = max_persistent_lid; @@ -454,8 +439,7 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr) for (lid = 1; lid <= max_defined_lid; lid++) { is_free = TRUE; /* first check to see if the lid is used by a persistent assignment */ - if ((lid <= max_persistent_lid) - && cl_ptr_vector_get(p_persistent_vec, lid)) { + if (lid <= max_persistent_lid && p_mgr->used_lids[lid]) { OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, "0x%04x is not free as its mapped by the " "persistent db\n", lid); @@ -515,11 +499,9 @@ static int __osm_lid_mgr_init_sweep(IN osm_lid_mgr_t * const p_mgr) for (req_lid = disc_min_lid + 1; req_lid <= disc_max_lid; req_lid++) { - if ((req_lid <= - max_persistent_lid) && - cl_ptr_vector_get - (p_persistent_vec, - req_lid)) { + if (req_lid <= + max_persistent_lid && + p_mgr->used_lids[req_lid]) { OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, "0x%04x is free as it was discovered " @@ -604,28 +586,16 @@ __osm_lid_mgr_is_range_not_persistent(IN osm_lid_mgr_t * const p_mgr, IN const uint16_t num_lids) { uint16_t i; - cl_status_t status; - osm_port_t *p_port; const uint8_t start_lid = (uint8_t) (1 << p_mgr->p_subn->opt.lmc); - const cl_ptr_vector_t *const p_tbl = &p_mgr->used_lids; if (lid < start_lid) - return (FALSE); + return FALSE; - for (i = lid; i < lid + num_lids; i++) { - status = cl_ptr_vector_at(p_tbl, i, (void *)&p_port); - if (status == CL_SUCCESS) { - if (p_port != NULL) - return (FALSE); - } else - /* - We are out of range in the array. - Consider all further entries "free". 
*/
- return (TRUE);
- }
+ for (i = lid; i < lid + num_lids; i++)
+ if (p_mgr->used_lids[lid])
+ return FALSE;
- return (TRUE);
+ return TRUE;
 }
 /**********************************************************************
@@ -824,7 +794,7 @@ NewLidSet:
 /* update the guid2lid db and used_lids */
 osm_db_guid2lid_set(p_mgr->p_g2l, guid, *p_min_lid, *p_max_lid);
 for (lid = *p_min_lid; lid <= *p_max_lid; lid++)
- cl_ptr_vector_set(&p_mgr->used_lids, lid, (void *)1);
+ p_mgr->used_lids[lid] = 1;
 Exit:
 /* make sure the assigned lids are marked in port_lid_tbl */
--
1.6.1.2.319.gbd9e

From sashak at voltaire.com Sat Feb 28 09:34:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Feb 2009 19:34:00 +0200
Subject: [ofa-general] [PATCH] opensm: initialize all switch ports
Message-ID: <20090228173400.GO7462@sashak.voltaire.com>

Initialize all switch ports when NodeInfo is received. This addresses the issue described in 8a2d2ddee7, where a link could be left uninitialized when SwitchInfo and PortInfo responses race during discovery, and also slightly simplifies the OpenSM discovery process implementation.
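
A simplified C sketch of the port-initialization rule this patch adopts in osm_node_new(): for a switch, every physical port (management port 0 through NumPorts) is initialized as soon as NodeInfo arrives, while for a CA or router only the port that answered is initialized. The `struct node` and `struct physp` types, the helper names, and the `NODE_TYPE_*` constants (using the IB NodeType encoding 1 = CA, 2 = switch, 3 = router) are hypothetical stand-ins for the OpenSM objects.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants following the IB NodeType encoding. */
#define NODE_TYPE_CA     1
#define NODE_TYPE_SWITCH 2
#define NODE_TYPE_ROUTER 3

/* Hypothetical, minimal physical-port and node objects. */
struct physp {
	int initialized;
	uint8_t port_num;
};

struct node {
	uint8_t node_type;
	uint8_t num_ports;			/* NumPorts from NodeInfo */
	struct physp physp_table[256];		/* index 0 = switch management port */
};

static void node_init_physp(struct node *n, uint8_t port_num)
{
	assert(port_num <= n->num_ports);
	n->physp_table[port_num].initialized = 1;
	n->physp_table[port_num].port_num = port_num;
}

/* local_port_num: the port through which the NodeInfo reply arrived. */
static void node_init_ports(struct node *n, uint8_t local_port_num)
{
	if (n->node_type == NODE_TYPE_SWITCH) {
		/* unsigned loop counter so num_ports == 255 cannot wrap */
		for (unsigned i = 0; i <= n->num_ports; i++)
			node_init_physp(n, (uint8_t)i);
	} else {
		/* CAs and routers: only the responding port */
		node_init_physp(n, local_port_num);
	}
}
```

With every switch port set up at node creation, later PortInfo handling can assume the physp already exists instead of creating it on demand.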
Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_node.h | 5 ++--- opensm/opensm/osm_node.c | 32 ++++++++------------------------ opensm/opensm/osm_node_info_rcv.c | 4 ++-- 3 files changed, 12 insertions(+), 29 deletions(-) diff --git a/opensm/include/opensm/osm_node.h b/opensm/include/opensm/osm_node.h index fec24ba..c7befff 100644 --- a/opensm/include/opensm/osm_node.h +++ b/opensm/include/opensm/osm_node.h @@ -443,9 +443,8 @@ osm_node_get_lmc(IN const osm_node_t * const p_node, IN const uint32_t port_num) * * SYNOPSIS */ -void -osm_node_init_physp(IN osm_node_t * const p_node, - IN const osm_madw_t * const p_madw); +void osm_node_init_physp(IN osm_node_t * const p_node, uint8_t port_num, + IN const osm_madw_t * const p_madw); /* * PARAMETERS * p_node diff --git a/opensm/opensm/osm_node.c b/opensm/opensm/osm_node.c index 07371a2..a97477a 100644 --- a/opensm/opensm/osm_node.c +++ b/opensm/opensm/osm_node.c @@ -51,20 +51,17 @@ /********************************************************************** **********************************************************************/ -void -osm_node_init_physp(IN osm_node_t * const p_node, - IN const osm_madw_t * const p_madw) +void osm_node_init_physp(IN osm_node_t * const p_node, uint8_t port_num, + IN const osm_madw_t * const p_madw) { ib_net64_t port_guid; ib_smp_t *p_smp; ib_node_info_t *p_ni; - uint8_t port_num; p_smp = osm_madw_get_smp_ptr(p_madw); p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); port_guid = p_ni->port_guid; - port_num = ib_node_info_get_local_port_num(p_ni); CL_ASSERT(port_num < p_node->physp_tbl_size); @@ -76,23 +73,6 @@ osm_node_init_physp(IN osm_node_t * const p_node, /********************************************************************** **********************************************************************/ -static void node_init_physp0(IN osm_node_t * const p_node, - IN const osm_madw_t * const p_madw) -{ - ib_smp_t *p_smp; - ib_node_info_t *p_ni; - - p_smp = 
osm_madw_get_smp_ptr(p_madw); - p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); - - osm_physp_init(&p_node->physp_table[0], - p_ni->port_guid, 0, p_node, - osm_madw_get_bind_handle(p_madw), - p_smp->hop_count, p_smp->initial_path); -} - -/********************************************************************** - **********************************************************************/ osm_node_t *osm_node_new(IN const osm_madw_t * const p_madw) { osm_node_t *p_node; @@ -133,9 +113,13 @@ osm_node_t *osm_node_new(IN const osm_madw_t * const p_madw) for (i = 0; i < p_node->physp_tbl_size; i++) osm_physp_construct(&p_node->physp_table[i]); - osm_node_init_physp(p_node, p_madw); if (p_ni->node_type == IB_NODE_TYPE_SWITCH) - node_init_physp0(p_node, p_madw); + for (i = 0; i <= p_ni->num_ports; i++) + osm_node_init_physp(p_node, i, p_madw); + else + osm_node_init_physp(p_node, + ib_node_info_get_local_port_num(p_ni), + p_madw); p_node->print_desc = strdup(OSM_NODE_DESC_UNKNOWN); return (p_node); diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index a37630a..9de68f9 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -414,7 +414,7 @@ __osm_ni_rcv_process_existing_ca_or_router(IN osm_sm_t * sm, "Creating new port object with GUID 0x%" PRIx64 "\n", cl_ntoh64(p_ni->port_guid)); - osm_node_init_physp(p_node, p_madw); + osm_node_init_physp(p_node, port_num, p_madw); p_port = osm_port_new(p_ni, p_node); if (p_port == NULL) { @@ -545,7 +545,7 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm, PRIx64 ", port %u\n", cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); - osm_node_init_physp(p_node, p_madw); + osm_node_init_physp(p_node, port_num, p_madw); } /* -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Feb 28 09:35:09 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Feb 2009 19:35:09 +0200 Subject: [ofa-general] [PATCH] opensm: remove unneeded anymore physp 
initializations In-Reply-To: <20090228173400.GO7462@sashak.voltaire.com> References: <20090228173400.GO7462@sashak.voltaire.com> Message-ID: <20090228173509.GP7462@sashak.voltaire.com> Removed unneeded anymore physical port initializations - all should be already initialized in osm_node_new(). Also put some debug assertions (CL_ASSERT()). Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_node_info_rcv.c | 28 +++------------------------- opensm/opensm/osm_port_info_rcv.c | 32 +++++++------------------------- 2 files changed, 10 insertions(+), 50 deletions(-) diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index 9de68f9..ac86b9a 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -155,14 +155,9 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, /* When setting the link, ports on both sides of the link should be initialized */ - if (!osm_node_link_has_valid_ports(p_node, port_num, p_neighbor_node, - p_ni_context->port_num)) { - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, - "Link at node 0x%" PRIx64 ", port %u - no valid ports\n", - cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); - CL_ASSERT(0); - goto _exit; - } + CL_ASSERT(osm_node_link_has_valid_ports(p_node, port_num, + p_neighbor_node, + p_ni_context->port_num)); if (osm_node_link_exists(p_node, port_num, p_neighbor_node, p_ni_context->port_num)) { @@ -529,25 +524,8 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm, IN osm_node_t * const p_node, IN const osm_madw_t * const p_madw) { - ib_smp_t *p_smp; - ib_node_info_t *p_ni; - uint8_t port_num; - OSM_LOG_ENTER(sm->p_log); - p_smp = osm_madw_get_smp_ptr(p_madw); - p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); - port_num = ib_node_info_get_local_port_num(p_ni); - - if (!osm_node_get_physp_ptr(p_node, port_num)) { - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, - "Creating physp for node GUID:0x%" - PRIx64 ", port %u\n", - cl_ntoh64(osm_node_get_node_guid(p_node)), - port_num); - 
osm_node_init_physp(p_node, port_num, p_madw); - } - /* If this switch has already been probed during this sweep, then don't bother reprobing it. diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index 95ebdb4..654ede7 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -614,31 +614,13 @@ void osm_pi_rcv_process(IN void *context, IN void *data) p_physp = osm_node_get_physp_ptr(p_node, port_num); - /* - Determine if we encountered a new Physical Port. - If so, initialize the new Physical Port then - continue processing as normal. - */ - if (!p_physp) { - OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, - "Initializing port number %u\n", port_num); - p_physp = &p_node->physp_table[port_num]; - osm_physp_init(p_physp, - port_guid, - port_num, - p_node, - osm_madw_get_bind_handle(p_madw), - p_smp->hop_count, p_smp->initial_path); - } else { - /* - Update the directed route path to this port - in case the old path is no longer usable. - */ - p_dr_path = osm_physp_get_dr_path_ptr(p_physp); - osm_dr_path_init(p_dr_path, - osm_madw_get_bind_handle(p_madw), - p_smp->hop_count, p_smp->initial_path); - } + CL_ASSERT(p_physp); + + /* Update the directed route path to this port + in case the old path is no longer usable. */ + p_dr_path = osm_physp_get_dr_path_ptr(p_physp); + osm_dr_path_init(p_dr_path, osm_madw_get_bind_handle(p_madw), + p_smp->hop_count, p_smp->initial_path); /* if port just inited or reached INIT state (external reset) request update for port related tables */ -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Feb 28 09:36:35 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Feb 2009 19:36:35 +0200 Subject: [ofa-general] [PATCH] opensm: PortInfo requests for discovered switches Message-ID: <20090228173635.GQ7462@sashak.voltaire.com> Request PortInfo for all switch ports right on first NodeInfo receiving and don't wait for SwitchInfo request results. 
This will simplify a subnet discovery flow and speed it up. Remove switch->discovery_count which is not needed anymore. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_switch.h | 6 --- opensm/opensm/osm_node_info_rcv.c | 83 ++++++++++++++---------------------- opensm/opensm/osm_perfmgr.c | 1 - opensm/opensm/osm_state_mgr.c | 1 - opensm/opensm/osm_sw_info_rcv.c | 71 ------------------------------ 5 files changed, 32 insertions(+), 130 deletions(-) diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 6279727..3e3626b 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -103,7 +103,6 @@ typedef struct osm_switch { uint8_t *lft; uint8_t *new_lft; osm_mcast_tbl_t mcast_tbl; - uint32_t discovery_count; unsigned endport_links; unsigned need_update; void *priv; @@ -145,11 +144,6 @@ typedef struct osm_switch { * mcast_tbl * Multicast forwarding table for this switch. * -* discovery_count -* The number of times this switch has been discovered -* during the current fabric sweep. This number is reset -* to zero at the start of a sweep. -* * need_update * When set indicates that switch was probably reset, so * fwd tables and rest cached data should be flushed diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index ac86b9a..e40fc82 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -244,51 +244,43 @@ _exit: } /********************************************************************** - The plock must be held before calling this function. 
**********************************************************************/ -static void -__osm_ni_rcv_process_new_node(IN osm_sm_t * sm, - IN osm_node_t * const p_node, - IN const osm_madw_t * const p_madw) +static void ni_rcv_get_port_info(IN osm_sm_t * sm, IN osm_node_t * node, + IN const osm_madw_t * madw) { - ib_api_status_t status = IB_SUCCESS; osm_madw_context_t context; - osm_physp_t *p_physp; - ib_node_info_t *p_ni; - ib_smp_t *p_smp; - uint8_t port_num; + osm_physp_t *physp; + ib_node_info_t *ni; + unsigned port, num_ports; + ib_api_status_t status; - OSM_LOG_ENTER(sm->p_log); + ni = ib_smp_get_payload_ptr(osm_madw_get_smp_ptr(madw)); - p_smp = osm_madw_get_smp_ptr(p_madw); - p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); - port_num = ib_node_info_get_local_port_num(p_ni); + if (ni->node_type == IB_NODE_TYPE_SWITCH) { + port = 0; + num_ports = osm_node_get_num_physp(node); + } else { + port = ib_node_info_get_local_port_num(ni); + num_ports = port + 1; + } - /* - Request PortInfo & NodeDescription attributes for the port - that responded to the NodeInfo attribute. - Because this is a channel adapter or router, we are - not allowed to request PortInfo for the other ports. - Set the context union properly, so the recipient - knows which node & port are relevant. 
- */ - p_physp = osm_node_get_physp_ptr(p_node, port_num); + physp = osm_node_get_physp_ptr(node, port); - context.pi_context.node_guid = p_ni->node_guid; - context.pi_context.port_guid = p_ni->port_guid; + context.pi_context.node_guid = osm_node_get_node_guid(node); + context.pi_context.port_guid = osm_physp_get_port_guid(physp); context.pi_context.set_method = FALSE; context.pi_context.light_sweep = FALSE; context.pi_context.active_transition = FALSE; - status = osm_req_get(sm, osm_physp_get_dr_path_ptr(p_physp), - IB_MAD_ATTR_PORT_INFO, - cl_hton32(port_num), CL_DISP_MSGID_NONE, &context); - if (status != IB_SUCCESS) - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D02: " - "Failure initiating PortInfo request (%s)\n", - ib_get_err_str(status)); - - OSM_LOG_EXIT(sm->p_log); + for (; port < num_ports; port++) { + status = osm_req_get(sm, osm_physp_get_dr_path_ptr(physp), + IB_MAD_ATTR_PORT_INFO, cl_hton32(port), + CL_DISP_MSGID_NONE, &context); + if (status != IB_SUCCESS) + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR OD02: " + "Failure initiating PortInfo request (%s)\n", + ib_get_err_str(status)); + } } /********************************************************************** @@ -359,7 +351,7 @@ __osm_ni_rcv_process_new_ca_or_router(IN osm_sm_t * sm, { OSM_LOG_ENTER(sm->p_log); - __osm_ni_rcv_process_new_node(sm, p_node, p_madw); + ni_rcv_get_port_info(sm, p_node, p_madw); /* A node guid of 0 is the corner case that indicates @@ -384,10 +376,8 @@ __osm_ni_rcv_process_existing_ca_or_router(IN osm_sm_t * sm, ib_smp_t *p_smp; osm_port_t *p_port; osm_port_t *p_port_check; - osm_madw_context_t context; uint8_t port_num; osm_physp_t *p_physp; - ib_api_status_t status; osm_dr_path_t *p_dr_path; osm_bind_handle_t h_bind; @@ -461,19 +451,7 @@ __osm_ni_rcv_process_existing_ca_or_router(IN osm_sm_t * sm, p_smp->initial_path); } - context.pi_context.node_guid = p_ni->node_guid; - context.pi_context.port_guid = p_ni->port_guid; - context.pi_context.set_method = FALSE; - 
context.pi_context.light_sweep = FALSE; - - status = osm_req_get(sm, osm_physp_get_dr_path_ptr(p_physp), - IB_MAD_ATTR_PORT_INFO, - cl_hton32(port_num), CL_DISP_MSGID_NONE, &context); - - if (status != IB_SUCCESS) - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D13: " - "Failure initiating PortInfo request (%s)\n", - ib_get_err_str(status)); + ni_rcv_get_port_info(sm, p_node, p_madw); Exit: OSM_LOG_EXIT(sm->p_log); @@ -513,6 +491,9 @@ __osm_ni_rcv_process_switch(IN osm_sm_t * sm, "Failure initiating SwitchInfo request (%s)\n", ib_get_err_str(status)); + if (p_node->discovery_count == 1) + ni_rcv_get_port_info(sm, p_node, p_madw); + OSM_LOG_EXIT(sm->p_log); } @@ -536,7 +517,7 @@ __osm_ni_rcv_process_existing_switch(IN osm_sm_t * sm, */ if (p_node->discovery_count == 1) __osm_ni_rcv_process_switch(sm, p_node, p_madw); - else if (!p_node->sw || p_node->sw->discovery_count == 0) { + else if (!p_node->sw) { /* we don't have the SwitchInfo - retry to get it */ OSM_LOG(sm->p_log, OSM_LOG_DEBUG, "Retry to get SwitchInfo on node GUID:0x%" PRIx64 "\n", diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index 6d325cb..58b5dc2 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -726,7 +726,6 @@ static void reset_port_count(cl_map_item_t * const p_map_item, void *cxt) static void reset_switch_count(cl_map_item_t * const p_map_item, void *cxt) { osm_switch_t *p_sw = (osm_switch_t *) p_map_item; - p_sw->discovery_count = 0; p_sw->need_update = 0; } diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index a1efd1a..0d7cf15 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -115,7 +115,6 @@ __osm_state_mgr_reset_switch_count(IN cl_map_item_t * const p_map_item, { osm_switch_t *p_sw = (osm_switch_t *) p_map_item; - p_sw->discovery_count = 0; p_sw->need_update = 1; } diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c index 751c6f4..2f2775a 100644 --- 
a/opensm/opensm/osm_sw_info_rcv.c +++ b/opensm/opensm/osm_sw_info_rcv.c @@ -55,53 +55,6 @@ #include #include -/********************************************************************** - The plock must be held before calling this function. -**********************************************************************/ -static void si_rcv_get_port_info(IN osm_sm_t * sm, IN osm_switch_t * const p_sw) -{ - osm_madw_context_t context; - uint8_t port_num; - osm_physp_t *p_physp; - osm_node_t *p_node; - uint8_t num_ports; - ib_api_status_t status = IB_SUCCESS; - - OSM_LOG_ENTER(sm->p_log); - - CL_ASSERT(p_sw); - - p_node = p_sw->p_node; - - CL_ASSERT(osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH); - - /* - Request PortInfo attribute for each port on the switch. - */ - p_physp = osm_node_get_physp_ptr(p_node, 0); - - context.pi_context.node_guid = osm_node_get_node_guid(p_node); - context.pi_context.port_guid = osm_physp_get_port_guid(p_physp); - context.pi_context.set_method = FALSE; - context.pi_context.light_sweep = FALSE; - context.pi_context.active_transition = FALSE; - - num_ports = osm_node_get_num_physp(p_node); - - for (port_num = 0; port_num < num_ports; port_num++) { - status = osm_req_get(sm, osm_physp_get_dr_path_ptr(p_physp), - IB_MAD_ATTR_PORT_INFO, cl_hton32(port_num), - CL_DISP_MSGID_NONE, &context); - if (status != IB_SUCCESS) - /* continue the loop despite the error */ - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 3602: " - "Failure initiating PortInfo request (%s)\n", - ib_get_err_str(status)); - } - - OSM_LOG_EXIT(sm->p_log); -} - #if 0 /********************************************************************** The plock must be held before calling this function. @@ -307,12 +260,6 @@ static void si_rcv_process_new(IN osm_sm_t * sm, IN osm_node_t * const p_node, info we just received. */ osm_switch_set_switch_info(p_sw, p_si); - p_sw->discovery_count++; - - /* - Get the PortInfo attribute for every port. 
- */ - si_rcv_get_port_info(sm, p_sw); /* Don't bother retrieving the current unicast and multicast tables @@ -392,24 +339,6 @@ static boolean_t si_rcv_process_existing(IN osm_sm_t * sm, OSM_LOG_DEBUG); is_change_detected = TRUE; } - } else { - /* - This is a heavy sweep. Get information regardless - of the state change bit. - */ - p_sw->discovery_count++; - OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, - "discovery_count is:%u\n", - p_sw->discovery_count); - - /* If this is the first discovery - then get the port_info */ - if (p_sw->discovery_count == 1) - si_rcv_get_port_info(sm, p_sw); - else - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, - "Not discovering again through switch:0x%" - PRIx64 "\n", - osm_node_get_node_guid(p_sw->p_node)); } } -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Feb 28 09:37:17 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Feb 2009 19:37:17 +0200 Subject: [ofa-general] [PATCH] opensm: remove casting of ib_smp_get_payload_ptr() Message-ID: <20090228173717.GR7462@sashak.voltaire.com> ib_smp_get_payload_ptr() returns void pointer - casting is not needed. 
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_lin_fwd_rcv.c | 2 +- opensm/opensm/osm_mcast_fwd_rcv.c | 2 +- opensm/opensm/osm_node.c | 4 ++-- opensm/opensm/osm_node_desc_rcv.c | 2 +- opensm/opensm/osm_node_info_rcv.c | 10 +++++----- opensm/opensm/osm_pkey_rcv.c | 2 +- opensm/opensm/osm_port_info_rcv.c | 4 ++-- opensm/opensm/osm_slvl_map_rcv.c | 2 +- opensm/opensm/osm_sw_info_rcv.c | 6 +++--- opensm/opensm/osm_switch.c | 2 +- opensm/opensm/osm_vl_arb_rcv.c | 2 +- 11 files changed, 19 insertions(+), 19 deletions(-) diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c index c3d8633..2edb8d3 100644 --- a/opensm/opensm/osm_lin_fwd_rcv.c +++ b/opensm/opensm/osm_lin_fwd_rcv.c @@ -70,7 +70,7 @@ void osm_lft_rcv_process(IN void *context, IN void *data) CL_ASSERT(p_madw); p_smp = osm_madw_get_smp_ptr(p_madw); - p_block = (uint8_t *) ib_smp_get_payload_ptr(p_smp); + p_block = ib_smp_get_payload_ptr(p_smp); block_num = cl_ntoh32(p_smp->attr_mod); /* diff --git a/opensm/opensm/osm_mcast_fwd_rcv.c b/opensm/opensm/osm_mcast_fwd_rcv.c index 635c7da..f3d0183 100644 --- a/opensm/opensm/osm_mcast_fwd_rcv.c +++ b/opensm/opensm/osm_mcast_fwd_rcv.c @@ -77,7 +77,7 @@ void osm_mft_rcv_process(IN void *context, IN void *data) CL_ASSERT(p_madw); p_smp = osm_madw_get_smp_ptr(p_madw); - p_block = (uint16_t *) ib_smp_get_payload_ptr(p_smp); + p_block = ib_smp_get_payload_ptr(p_smp); block_num = cl_ntoh32(p_smp->attr_mod) & IB_MCAST_BLOCK_ID_MASK_HO; position = (uint8_t) ((cl_ntoh32(p_smp->attr_mod) & IB_MCAST_POSITION_MASK_HO) >> diff --git a/opensm/opensm/osm_node.c b/opensm/opensm/osm_node.c index a97477a..ee2fbed 100644 --- a/opensm/opensm/osm_node.c +++ b/opensm/opensm/osm_node.c @@ -60,7 +60,7 @@ void osm_node_init_physp(IN osm_node_t * const p_node, uint8_t port_num, p_smp = osm_madw_get_smp_ptr(p_madw); - p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); + p_ni = ib_smp_get_payload_ptr(p_smp); port_guid = p_ni->port_guid; CL_ASSERT(port_num < 
p_node->physp_tbl_size); @@ -82,7 +82,7 @@ osm_node_t *osm_node_new(IN const osm_madw_t * const p_madw) uint32_t size; p_smp = osm_madw_get_smp_ptr(p_madw); - p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); + p_ni = ib_smp_get_payload_ptr(p_smp); /* The node object already contains one physical port object. diff --git a/opensm/opensm/osm_node_desc_rcv.c b/opensm/opensm/osm_node_desc_rcv.c index f6178b9..a79fa22 100644 --- a/opensm/opensm/osm_node_desc_rcv.c +++ b/opensm/opensm/osm_node_desc_rcv.c @@ -106,7 +106,7 @@ void osm_nd_rcv_process(IN void *context, IN void *data) CL_ASSERT(p_madw); p_smp = osm_madw_get_smp_ptr(p_madw); - p_nd = (ib_node_desc_t *) ib_smp_get_payload_ptr(p_smp); + p_nd = ib_smp_get_payload_ptr(p_smp); /* Acquire the node object and add the node description. diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index e40fc82..f5a5082 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -323,7 +323,7 @@ __osm_ni_rcv_get_node_desc(IN osm_sm_t * sm, OSM_LOG_ENTER(sm->p_log); p_smp = osm_madw_get_smp_ptr(p_madw); - p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); + p_ni = ib_smp_get_payload_ptr(p_smp); port_num = ib_node_info_get_local_port_num(p_ni); /* @@ -384,7 +384,7 @@ __osm_ni_rcv_process_existing_ca_or_router(IN osm_sm_t * sm, OSM_LOG_ENTER(sm->p_log); p_smp = osm_madw_get_smp_ptr(p_madw); - p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); + p_ni = ib_smp_get_payload_ptr(p_smp); port_num = ib_node_info_get_local_port_num(p_ni); h_bind = osm_madw_get_bind_handle(p_madw); @@ -573,7 +573,7 @@ __osm_ni_rcv_process_new(IN osm_sm_t * sm, OSM_LOG_ENTER(sm->p_log); p_smp = osm_madw_get_smp_ptr(p_madw); - p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); + p_ni = ib_smp_get_payload_ptr(p_smp); p_ni_context = osm_madw_get_ni_context_ptr(p_madw); port_num = ib_node_info_get_local_port_num(p_ni); @@ -719,7 +719,7 @@ __osm_ni_rcv_process_existing(IN 
osm_sm_t * sm, OSM_LOG_ENTER(sm->p_log); p_smp = osm_madw_get_smp_ptr(p_madw); - p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); + p_ni = ib_smp_get_payload_ptr(p_smp); p_ni_context = osm_madw_get_ni_context_ptr(p_madw); port_num = ib_node_info_get_local_port_num(p_ni); @@ -776,7 +776,7 @@ void osm_ni_rcv_process(IN void *context, IN void *data) CL_ASSERT(p_madw); p_smp = osm_madw_get_smp_ptr(p_madw); - p_ni = (ib_node_info_t *) ib_smp_get_payload_ptr(p_smp); + p_ni = ib_smp_get_payload_ptr(p_smp); CL_ASSERT(p_smp->attr_id == IB_MAD_ATTR_NODE_INFO); diff --git a/opensm/opensm/osm_pkey_rcv.c b/opensm/opensm/osm_pkey_rcv.c index 7061941..84d8ce1 100644 --- a/opensm/opensm/osm_pkey_rcv.c +++ b/opensm/opensm/osm_pkey_rcv.c @@ -77,7 +77,7 @@ void osm_pkey_rcv_process(IN void *context, IN void *data) p_smp = osm_madw_get_smp_ptr(p_madw); p_context = osm_madw_get_pkey_context_ptr(p_madw); - p_pkey_tbl = (ib_pkey_table_t *) ib_smp_get_payload_ptr(p_smp); + p_pkey_tbl = ib_smp_get_payload_ptr(p_smp); port_guid = p_context->port_guid; node_guid = p_context->node_guid; diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index 654ede7..3e39dff 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -473,7 +473,7 @@ osm_pi_rcv_process_set(IN osm_sm_t * sm, IN osm_node_t * const p_node, port_guid = osm_physp_get_port_guid(p_physp); p_smp = osm_madw_get_smp_ptr(p_madw); - p_pi = (ib_port_info_t *) ib_smp_get_payload_ptr(p_smp); + p_pi = ib_smp_get_payload_ptr(p_smp); /* check for error */ if (cl_ntoh16(p_smp->status) & 0x7fff) { @@ -532,7 +532,7 @@ void osm_pi_rcv_process(IN void *context, IN void *data) p_smp = osm_madw_get_smp_ptr(p_madw); p_context = osm_madw_get_pi_context_ptr(p_madw); - p_pi = (ib_port_info_t *) ib_smp_get_payload_ptr(p_smp); + p_pi = ib_smp_get_payload_ptr(p_smp); CL_ASSERT(p_smp->attr_id == IB_MAD_ATTR_PORT_INFO); diff --git a/opensm/opensm/osm_slvl_map_rcv.c 
b/opensm/opensm/osm_slvl_map_rcv.c index e177345..b3f0a4c 100644 --- a/opensm/opensm/osm_slvl_map_rcv.c +++ b/opensm/opensm/osm_slvl_map_rcv.c @@ -82,7 +82,7 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data) p_smp = osm_madw_get_smp_ptr(p_madw); p_context = osm_madw_get_slvl_context_ptr(p_madw); - p_slvl_tbl = (ib_slvl_table_t *) ib_smp_get_payload_ptr(p_smp); + p_slvl_tbl = ib_smp_get_payload_ptr(p_smp); port_guid = p_context->port_guid; node_guid = p_context->node_guid; diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c index 2f2775a..14df1fd 100644 --- a/opensm/opensm/osm_sw_info_rcv.c +++ b/opensm/opensm/osm_sw_info_rcv.c @@ -208,7 +208,7 @@ static void si_rcv_process_new(IN osm_sm_t * sm, IN osm_node_t * const p_node, p_sw_guid_tbl = &sm->p_subn->sw_guid_tbl; p_smp = osm_madw_get_smp_ptr(p_madw); - p_si = (ib_switch_info_t *) ib_smp_get_payload_ptr(p_smp); + p_si = ib_smp_get_payload_ptr(p_smp); osm_dump_switch_info(sm->p_log, p_si, OSM_LOG_DEBUG); @@ -302,7 +302,7 @@ static boolean_t si_rcv_process_existing(IN osm_sm_t * sm, CL_ASSERT(p_madw); p_smp = osm_madw_get_smp_ptr(p_madw); - p_si = (ib_switch_info_t *) ib_smp_get_payload_ptr(p_smp); + p_si = ib_smp_get_payload_ptr(p_smp); p_si_context = osm_madw_get_si_context_ptr(p_madw); if (p_si_context->set_method) { @@ -365,7 +365,7 @@ void osm_si_rcv_process(IN void *context, IN void *data) CL_ASSERT(p_madw); p_smp = osm_madw_get_smp_ptr(p_madw); - p_si = (ib_switch_info_t *) ib_smp_get_payload_ptr(p_smp); + p_si = ib_smp_get_payload_ptr(p_smp); p_context = osm_madw_get_si_context_ptr(p_madw); node_guid = p_context->node_guid; diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index 9807791..6dde47c 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -87,7 +87,7 @@ osm_switch_init(IN osm_switch_t * const p_sw, uint32_t port_num; p_smp = osm_madw_get_smp_ptr(p_madw); - p_si = (ib_switch_info_t *) ib_smp_get_payload_ptr(p_smp); + 
p_si = ib_smp_get_payload_ptr(p_smp); num_ports = osm_node_get_num_physp(p_node); CL_ASSERT(p_smp->attr_id == IB_MAD_ATTR_SWITCH_INFO); diff --git a/opensm/opensm/osm_vl_arb_rcv.c b/opensm/opensm/osm_vl_arb_rcv.c index ec04d67..89cf7b2 100644 --- a/opensm/opensm/osm_vl_arb_rcv.c +++ b/opensm/opensm/osm_vl_arb_rcv.c @@ -83,7 +83,7 @@ void osm_vla_rcv_process(IN void *context, IN void *data) p_smp = osm_madw_get_smp_ptr(p_madw); p_context = osm_madw_get_vla_context_ptr(p_madw); - p_vla_tbl = (ib_vl_arb_table_t *) ib_smp_get_payload_ptr(p_smp); + p_vla_tbl = ib_smp_get_payload_ptr(p_smp); port_guid = p_context->port_guid; node_guid = p_context->node_guid; -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Feb 28 11:19:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Feb 2009 21:19:21 +0200 Subject: [ofa-general] Re: [PATCH 1/3 v2] opensm: Added io_guid_file and max_reverse_hops options In-Reply-To: <49953C48.3030203@ext.bull.net> References: <49953C48.3030203@ext.bull.net> Message-ID: <20090228191921.GA3936@sashak.voltaire.com> On 10:24 Fri 13 Feb , Nicolas Morey Chaisemartin wrote: > Signed-off-by: Nicolas Morey-Chaisemartin > Applied (I will push it to main stream tomorrow). Thanks. All your patches are whitespace mangled - non-diff lines are started from two spaces. I fixed it with "sed -e 's/^ / /'", but please check your mailer. Also small note below. > --- > Reposted as io_guid_file and max_reverse_hops were missing from the opt_tbl > and wouldn't be read from the cached option file. 
> > opensm/include/opensm/osm_subnet.h | 6 ++++++ > opensm/opensm/main.c | 26 +++++++++++++++++++++++++- > opensm/opensm/osm_subnet.c | 14 ++++++++++++++ > 3 files changed, 45 insertions(+), 1 deletions(-) > > diff --git a/opensm/include/opensm/osm_subnet.h > b/opensm/include/opensm/osm_subnet.h > index 8863e47..671b51f 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -190,6 +190,8 @@ typedef struct osm_subn_opt { > char *lfts_file; > char *root_guid_file; > char *cn_guid_file; > + char *io_guid_file; > + uint16_t max_reverse_hops; Why should max_reverse_hops be 16 bits long? In IB max hops value is 64. (And of course, use tab as the indentation character next time.) Sasha From sashak at voltaire.com Sat Feb 28 11:22:22 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Feb 2009 21:22:22 +0200 Subject: [ofa-general] Re: [PATCH 2/3] opensm/osm_ucast_ftree.c: Added possible reverse hops for Ftree algorithm. In-Reply-To: <4993E7CA.60103@ext.bull.net> References: <4993E7CA.60103@ext.bull.net> Message-ID: <20090228192210.GB3936@sashak.voltaire.com> On 10:11 Thu 12 Feb , Nicolas Morey Chaisemartin wrote: > This allows connectivity between nodes declared in the io_guid_file > when they had none with the regular algorithm > and it can be solved by doing fewer than max_reverse_hops in the tree. > This is meant to be used for I/O and service nodes connected to the > Top Switches of a Fat Tree, that need connectivity > but no real bandwidth. > > Signed-off-by: Nicolas Morey-Chaisemartin > Applied. Thanks. Next time, please don't mix indentation changes and functional changes in one patch (if necessary, send the indentation changes as a separate patch).
Sasha From sashak at voltaire.com Sat Feb 28 11:23:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Feb 2009 21:23:00 +0200 Subject: [ofa-general] Re: [PATCH 3/3] Added documentation for io_guid_file and max_reverse_hop feature In-Reply-To: <4993E7CE.3090908@ext.bull.net> References: <4993E7CE.3090908@ext.bull.net> Message-ID: <20090228192300.GC3936@sashak.voltaire.com> On 10:11 Thu 12 Feb , Nicolas Morey Chaisemartin wrote: > > Signed-off-by: Nicolas Morey-Chaisemartin > Applied. Thanks. Sasha From devel at morey-chaisemartin.com Sat Feb 28 12:43:12 2009 From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Sat, 28 Feb 2009 21:43:12 +0100 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH 1/3 v2] opensm: Added io_guid_file and max_reverse_hops options In-Reply-To: <20090228191921.GA3936@sashak.voltaire.com> References: <49953C48.3030203@ext.bull.net> <20090228191921.GA3936@sashak.voltaire.com> Message-ID: <49A9A1E0.4050005@morey-chaisemartin.com> Sasha Khapyorsky wrote: > On 10:24 Fri 13 Feb , Nicolas Morey Chaisemartin wrote: > >> Signed-off-by: Nicolas Morey-Chaisemartin >> >> > > Applied (I will push it to main stream tomorrow). Thanks. > > All your patches are whitespace mangled - non-diff lines are started > from two spaces. I fixed it with "sed -e 's/^ / /'", but please check > your mailer. > > Also small note below. > Thanks for applying, and sorry for the indentation. I tried to put my patches inline as Yevgeni advised me, but it seems Thunderbird messes things up even though I output git format-patch directly into a Thunderbird draft file. I guess I'll stick to attachments from now on... > >> --- >> Reposted as io_guid_file and max_reverse_hops were missing from the opt_tbl >> and wouldn't be read from the cached option file.
>> >> opensm/include/opensm/osm_subnet.h | 6 ++++++ >> opensm/opensm/main.c | 26 +++++++++++++++++++++++++- >> opensm/opensm/osm_subnet.c | 14 ++++++++++++++ >> 3 files changed, 45 insertions(+), 1 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_subnet.h >> b/opensm/include/opensm/osm_subnet.h >> index 8863e47..671b51f 100644 >> --- a/opensm/include/opensm/osm_subnet.h >> +++ b/opensm/include/opensm/osm_subnet.h >> @@ -190,6 +190,8 @@ typedef struct osm_subn_opt { >> char *lfts_file; >> char *root_guid_file; >> char *cn_guid_file; >> + char *io_guid_file; >> + uint16_t max_reverse_hops; >> > > Why should max_reverse_hops be 16 bits long? In IB max hops value is 64. > In OpenSM the Fat-tree max height is 8. So except on really irregular topologies, max_reverse_hops shouldn't need more than one byte. As a safety margin I chose 2 bytes so it should never overflow. Anyway, more than 2^16 reverse hops is a really bad idea, I guess. Nicolas From andy.grover at gmail.com Sat Feb 28 12:44:37 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Sat, 28 Feb 2009 12:44:37 -0800 Subject: ***SPAM*** Re: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <20090228055608.GB26292@one.firstfloor.org> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <87myc73izx.fsf@basil.nowhere.org> <20090228055608.GB26292@one.firstfloor.org> Message-ID: On Fri, Feb 27, 2009 at 9:56 PM, Andi Kleen wrote: > On Fri, Feb 27, 2009 at 05:53:19PM -0800, Andrew Grover wrote: >> On Fri, Feb 27, 2009 at 9:08 AM, Andi Kleen wrote: >> >> This patchset against net-next adds support for RDS sockets. RDS is an >> >> Oracle-originated protocol used to send IPC datagrams (up to 1MB) >> >> reliably, and is used currently in Oracle RAC and Exadata products. >> > >> > Perhaps I missed it earlier, but what is the rationale for putting >> > this as a socket type into the kernel?
I assume they also work >> > directly as implemented in user space using raw sockets or similar, >> > don't they? >> >> You want me to implement my fancy protocol in userspace??? > > I just asked why you're putting it in kernel space. > >> Do I even get to write it in C or do I need to use Ruby? > > Well normally people who add new subsystems to the kernel explain > why they do that. Perhaps it's obvious to you, but at least to > me it isn't. Sure thing, sorry to be flippant :-) The previous solution for IPC that Oracle was using was based on UDP, which I think could be considered very close to using raw sockets -- each process is responsible for its own acks, retransmits, everything. Doing this on a highly loaded machine resulted in a cascade where performance got worse and worse. Moving this to kernel code made a big difference. Additionally, our interconnect is primarily Infiniband. It natively implements a reliable datagram connection type so RDS leverages that. RDS multiplexes all processes' traffic between two hosts over a single IB connection. Since RDS is managing IB connections at the host level (but based on socket traffic) this is also more naturally a fit for kernel code. Regards -- Andy From sashak at voltaire.com Sat Feb 28 13:56:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Feb 2009 23:56:55 +0200 Subject: [ofa-general] Re: [PATCH 1/3 v2] opensm: Added io_guid_file and max_reverse_hops options In-Reply-To: <49A9A1E0.4050005@morey-chaisemartin.com> References: <49953C48.3030203@ext.bull.net> <20090228191921.GA3936@sashak.voltaire.com> <49A9A1E0.4050005@morey-chaisemartin.com> Message-ID: <20090228215645.GD3936@sashak.voltaire.com> On 21:43 Sat 28 Feb , Nicolas Morey-Chaisemartin wrote: > I tried to put my patches inline as Yevgeni advised me but it seems > thunderbird messes things up though I directly output git format-patch > into a thunderbird draft file. > I guess I'll stick to attachment from now on... 
Attached patches are not friendly for reviewing. Look at Thunderbird related section of: http://git.kernel.org/?p=git/git.git;a=blob_plain;f=Documentation/SubmittingPatches Sasha From andi at firstfloor.org Sat Feb 28 14:36:53 2009 From: andi at firstfloor.org (Andi Kleen) Date: Sat, 28 Feb 2009 23:36:53 +0100 Subject: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <87myc73izx.fsf@basil.nowhere.org> <20090228055608.GB26292@one.firstfloor.org> Message-ID: <20090228223653.GD26292@one.firstfloor.org> > The previous solution for IPC that Oracle was using was based on UDP, > which I think could be considered very close to using raw sockets -- > each process is responsible for its own acks, retransmits, everything. > Doing this on a highly loaded machine resulted in a cascade where > performance got worse and worse. Could you describe that cascade in more detail? The problem was that the retransmits didn't have high enough priority? > Additionally, our interconnect is primarily Infiniband. It natively > implements a reliable datagram connection type so RDS leverages that. So perhaps it would make more sense to have a thin direct interface to that IB service? Or perhaps it already exists? (I admit I don't know the IB interfaces very well) -andi -- ak at linux.intel.com -- Speaking for myself only. From sashak at voltaire.com Sat Feb 28 16:59:52 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 1 Mar 2009 02:59:52 +0200 Subject: [ofa-general] Re: [PATCH 2/2] perfquery: add PortXmtDataSL/PortRcvDataSL read and reset In-Reply-To: References: Message-ID: <20090301005952.GE3936@sashak.voltaire.com> Hi Or, On 14:41 Thu 26 Feb , Or Gerlitz wrote: > > For some reason the Xmt SL help is printed twice, any idea why? Yes. 
You added the '-s' option, but the letter 's' is already used by ibdiag_common: Usage: perfquery [options] [ [[port] [reset_mask]]] Options: --extended, -x show extended port counters --all_ports, -a show aggregated counters --loop_ports, -l iterate through each port --reset_after_read, -r reset counters after read --Reset_only, -R only reset counters --Ca, -C Ca name to use --Port, -P Ca port number to use --Lid, -L use LID address argument --Guid, -G use GUID address argument --timeout, -t timeout in ms --sm_port, -s SM port lid ^^^^^^^^^^^^^^^ ... You can mask it by passing 's' as part of the exclude string to ibdiag_process_opts(). Or just find another, "free" letter for your option. Sasha From andy.grover at gmail.com Sat Feb 28 16:58:25 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Sat, 28 Feb 2009 16:58:25 -0800 Subject: ***SPAM*** Re: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: <20090228223653.GD26292@one.firstfloor.org> References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <87myc73izx.fsf@basil.nowhere.org> <20090228055608.GB26292@one.firstfloor.org> <20090228223653.GD26292@one.firstfloor.org> Message-ID: On Sat, Feb 28, 2009 at 2:36 PM, Andi Kleen wrote: >> The previous solution for IPC that Oracle was using was based on UDP, >> which I think could be considered very close to using raw sockets -- >> each process is responsible for its own acks, retransmits, everything. >> Doing this on a highly loaded machine resulted in a cascade where >> performance got worse and worse. > > Could you describe that cascade in more detail? > The problem was that the retransmits didn't have high enough priority? I think the gist of it is: Higher load -> more time before a process runs -> rcvbuf overfills -> ACKs dropped -> timeouts -> more retransmissions -> even higher load. Things are fine until they hit a point where everything goes to hell. >> Additionally, our interconnect is primarily Infiniband.
It natively >> implements a reliable datagram connection type so RDS leverages that. > So perhaps it would make more sense to have a thin direct interface > to that IB service? Or perhaps it already exists? (I admit I don't know > the IB interfaces very well) The most direct userspace API is uDAPL -- apps can create IB connections (queue pairs) directly. This was tried but didn't work out so well. A queue pair (QP) is a TX/RX ring -- a nontrivial amount of memory. If each process needs a new QP to talk to every other process then the number of RAM-hungry QPs becomes huge. RDS is only slightly less direct -- apps don't create queue pairs, they create RDS sockets. RDS uses only one QP for all traffic to each remote node, so the number of QPs on a node is equal to the number of remote nodes, as opposed to (number of local processes * number of remote processes). Regards -- Andy From andi at firstfloor.org Sat Feb 28 17:50:20 2009 From: andi at firstfloor.org (Andi Kleen) Date: Sun, 1 Mar 2009 02:50:20 +0100 Subject: [ofa-general] [PATCH 0/26] Reliable Datagram Sockets (RDS), take 2 In-Reply-To: References: <1235525443-9007-1-git-send-email-andy.grover@oracle.com> <87myc73izx.fsf@basil.nowhere.org> <20090228055608.GB26292@one.firstfloor.org> <20090228223653.GD26292@one.firstfloor.org> Message-ID: <20090301015020.GH26292@one.firstfloor.org> > Higher load -> more time before a process runs -> rcvbuf overfills -> How can the rcvbuf overfill if the sender doesn't run? > ACKs dropped -> timeouts -> more retransmissions -> even higher load. > > Things are fine until they hit a point where everything goes to hell. 
-Andi From sashak at voltaire.com Sat Feb 28 23:00:20 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 1 Mar 2009 09:00:20 +0200 Subject: [ofa-general] Re: [PATCH 1/10] libibmad: Clean up "new" interface In-Reply-To: <20090219190525.322681b8.weiny2@llnl.gov> References: <20090219190525.322681b8.weiny2@llnl.gov> Message-ID: <20090301070013.GF3936@sashak.voltaire.com> Hi Ira, On 19:05 Thu 19 Feb , Ira Weiny wrote: > From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001 > From: Ira Weiny > Date: Wed, 18 Feb 2009 16:37:36 -0800 > Subject: [PATCH] libibmad: Clean up "new" interface Please don't put the email header into the commit message body; it breaks tools like 'git rebase' and similar. At least put '>' before the first 'From ' line. > > type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * Do you plan to expose 'struct ibmad_port' (I see later in the patches that it is being moved to a libibmad internal header file)? > Create new mad_rpc_portid(struct ibmad_port *srcport) function > which mirrors madrpc_portid(void) > Mark all "old" functions with __attribute__ ((deprecated)) This generates a lot of warnings right now (even after applying the whole patch series, there are still deprecated usages in libibmad itself), and this is not very good. I think our flow should go in the opposite direction: first convert, then mark functions deprecated. For now, as a quick workaround, I can mask the deprecation with a macro: #define DEPRECATED /* __attribute__ ((deprecated)) */ , and we will uncomment it when everything in the tree has been converted. Also, after looking over the patch series, I see that all "original" function names become deprecated and are replaced by their *_via() brothers. How do you see the next step? Will we remove the old names and have almost all API calls carry a then-useless _via suffix?
Sasha > > Signed-off-by: Ira Weiny > --- > libibmad/include/infiniband/mad.h | 139 ++++++++++++++++++++++--------------- > libibmad/src/gs.c | 19 +++--- > libibmad/src/libibmad.map | 1 + > libibmad/src/resolve.c | 10 ++- > libibmad/src/rpc.c | 29 ++++---- > libibmad/src/sa.c | 4 +- > libibmad/src/smp.c | 4 +- > 7 files changed, 118 insertions(+), 88 deletions(-) > > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > index 1aaaa1b..80e38be 100644 > --- a/libibmad/include/infiniband/mad.h > +++ b/libibmad/include/infiniband/mad.h > @@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt) > } > > /* rpc.c */ > -MAD_EXPORT int madrpc_portid(void); > -MAD_EXPORT int madrpc_set_retries(int retries); > -MAD_EXPORT int madrpc_set_timeout(int timeout); > -void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata); > -void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, > - void *data); > +MAD_EXPORT int madrpc_portid(void) __attribute__ ((deprecated)); > +void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata) > + __attribute__ ((deprecated)); > +void *madrpc_rmpp(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, void *data) > + __attribute__ ((deprecated)); > MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > - int num_classes); > -void madrpc_save_mad(void *madbuf, int len); > -MAD_EXPORT void madrpc_show_errors(int set); > + int num_classes) __attribute__ ((deprecated)); > +void madrpc_save_mad(void *madbuf, int len) __attribute__ ((deprecated)); > > -void *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > +/* New interface */ > +MAD_EXPORT void madrpc_show_errors(int set); > +MAD_EXPORT int madrpc_set_retries(int retries); > +MAD_EXPORT int madrpc_set_timeout(int timeout); > +MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > int num_classes); > -void 
mad_rpc_close_port(void *ibmad_port); > -void *mad_rpc(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, > - void *payload, void *rcvdata); > -void *mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t * rpc, ib_portid_t * dport, > - ib_rmpp_hdr_t * rmpp, void *data); > +MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport); > +MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > + void *payload, void *rcvdata); > +MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > + ib_rmpp_hdr_t * rmpp, void *data); > +MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); > > /* smp.c */ > MAD_EXPORT uint8_t *smp_query(void *buf, ib_portid_t * id, unsigned attrid, > - unsigned mod, unsigned timeout); > + unsigned mod, unsigned timeout) __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *smp_set(void *buf, ib_portid_t * id, unsigned attrid, > - unsigned mod, unsigned timeout); > + unsigned mod, unsigned timeout) __attribute__ ((deprecated)); > + > +/* smp.c new interface */ > MAD_EXPORT uint8_t *smp_query_via(void *buf, ib_portid_t * id, unsigned attrid, > - unsigned mod, unsigned timeout, const void *srcport); > -uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, > - unsigned timeout, const void *srcport); > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *smp_set_via(void *buf, ib_portid_t * id, unsigned attrid, unsigned mod, > + unsigned timeout, const struct ibmad_port *srcport); > > /* sa.c */ > uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, > - unsigned timeout); > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > + unsigned timeout) __attribute__ ((deprecated)); > +MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, > + void *buf) __attribute__ ((deprecated)); > + > +/* sa.c new interface */ > 
+MAD_EXPORT uint8_t *sa_rpc_call(const struct ibmad_port *srcport, void *rcvbuf, ib_portid_t * portid, > ib_sa_call_t * sa, unsigned timeout); > -MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); /* returns lid */ > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > +MAD_EXPORT int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf); > + /* returns lid */ > > /* resolve.c */ > -MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout); > +MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid, > - ib_portid_t * sm_id, int timeout); > + ib_portid_t * sm_id, int timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > - enum MAD_DEST dest, ib_portid_t * sm_id); > + enum MAD_DEST dest, ib_portid_t * sm_id) > + __attribute__ ((deprecated)); > MAD_EXPORT int ib_resolve_self(ib_portid_t * portid, int *portnum, > - ibmad_gid_t * gid); > - > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport); > -int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > - ib_portid_t * sm_id, int timeout, const void *srcport); > -int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > + ibmad_gid_t * gid) > + __attribute__ ((deprecated)); > + > +/* resolve.c new interface */ > +MAD_EXPORT int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport); > +MAD_EXPORT int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > + ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport); > +MAD_EXPORT int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > enum MAD_DEST dest, ib_portid_t * sm_id, > - const void *srcport); > -int 
ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > - const void *srcport); > + const struct ibmad_port *srcport); > +MAD_EXPORT int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > + const struct ibmad_port *srcport); > > /* gs.c */ > MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, > - int port, unsigned timeout); > + int port, unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, > - int port, unsigned timeout); > + int port, unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, > - unsigned timeout); > + unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, > - int port, unsigned timeout); > + int port, unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, > - unsigned timeout); > + unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, > - int port, unsigned timeout); > + int port, unsigned timeout) > + __attribute__ ((deprecated)); > MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t * dest, > - int port, unsigned timeout); > + int port, unsigned timeout) > + __attribute__ ((deprecated)); > > -uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > +/* gs.c new interface */ > +MAD_EXPORT uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > -uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > - unsigned timeout, const void *srcport); > -uint8_t *port_performance_reset_via(void *rcvbuf, 
ib_portid_t * dest, int port, > + const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > + unsigned timeout, const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > unsigned mask, unsigned timeout, > - const void *srcport); > -uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > + const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > -uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > + const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, > - unsigned timeout, const void *srcport); > -uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > + unsigned timeout, > + const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > -uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > + const struct ibmad_port *srcport); > +MAD_EXPORT uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport); > + const struct ibmad_port *srcport); > /* dump.c */ > MAD_EXPORT ib_mad_dump_fn > mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, > diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c > index d2c4574..e302caf 100644 > --- a/libibmad/src/gs.c > +++ b/libibmad/src/gs.c > @@ -47,7 +47,7 @@ > > static uint8_t *pma_query_via(void *rcvbuf, ib_portid_t * dest, int port, > unsigned timeout, unsigned id, > - const void *srcport) > + const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > int lid = dest->lid; > @@ -89,7 +89,7 @@ 
uint8_t *pma_query(void *rcvbuf, ib_portid_t * dest, int port, unsigned timeout, > > uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, CLASS_PORT_INFO, > srcport); > @@ -102,7 +102,7 @@ uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t * dest, int port, > } > > uint8_t *port_performance_query_via(void *rcvbuf, ib_portid_t * dest, int port, > - unsigned timeout, const void *srcport) > + unsigned timeout, const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_COUNTERS, srcport); > @@ -116,7 +116,7 @@ uint8_t *port_performance_query(void *rcvbuf, ib_portid_t * dest, int port, > > static uint8_t *performance_reset_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned mask, unsigned timeout, > - unsigned id, const void *srcport) > + unsigned id, const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > int lid = dest->lid; > @@ -166,7 +166,7 @@ static uint8_t *performance_reset(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_performance_reset_via(void *rcvbuf, ib_portid_t * dest, int port, > unsigned mask, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return performance_reset_via(rcvbuf, dest, port, mask, timeout, > IB_GSI_PORT_COUNTERS, srcport); > @@ -181,7 +181,7 @@ uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_performance_ext_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_COUNTERS_EXT, srcport); > @@ -195,7 +195,8 @@ uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_performance_ext_reset_via(void *rcvbuf, ib_portid_t * dest, 
> int port, unsigned mask, > - unsigned timeout, const void *srcport) > + unsigned timeout, > + const struct ibmad_port *srcport) > { > return performance_reset_via(rcvbuf, dest, port, mask, timeout, > IB_GSI_PORT_COUNTERS_EXT, srcport); > @@ -210,7 +211,7 @@ uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_SAMPLES_CONTROL, srcport); > @@ -225,7 +226,7 @@ uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t * dest, int port, > > uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t * dest, > int port, unsigned timeout, > - const void *srcport) > + const struct ibmad_port *srcport) > { > return pma_query_via(rcvbuf, dest, port, timeout, > IB_GSI_PORT_SAMPLES_RESULT, srcport); > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > index f944d86..94d7762 100644 > --- a/libibmad/src/libibmad.map > +++ b/libibmad/src/libibmad.map > @@ -69,6 +69,7 @@ IBMAD_1.3 { > mad_rpc_close_port; > mad_rpc; > mad_rpc_rmpp; > + mad_rpc_portid; > madrpc; > madrpc_def_timeout; > madrpc_init; > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c > index 553949d..3291f43 100644 > --- a/libibmad/src/resolve.c > +++ b/libibmad/src/resolve.c > @@ -45,7 +45,8 @@ > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN > > -int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, const void *srcport) > +int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > + const struct ibmad_port *srcport) > { > ib_portid_t self = { 0 }; > uint8_t portinfo[64]; > @@ -67,7 +68,8 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) > } > > int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > - ib_portid_t * sm_id, int timeout, const void *srcport) > + ib_portid_t * sm_id, int 
timeout, > + const struct ibmad_port *srcport) > { > ib_portid_t sm_portid; > char buf[IB_SA_DATA_SIZE] = { 0 }; > @@ -93,7 +95,7 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > > int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, > enum MAD_DEST dest_type, ib_portid_t * sm_id, > - const void *srcport) > + const struct ibmad_port *srcport) > { > uint64_t guid; > int lid; > @@ -150,7 +152,7 @@ int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, > } > > int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > - const void *srcport) > + const struct ibmad_port *srcport) > { > ib_portid_t self = { 0 }; > uint8_t portinfo[64]; > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > index e811526..d47873b 100644 > --- a/libibmad/src/rpc.c > +++ b/libibmad/src/rpc.c > @@ -100,6 +100,11 @@ int madrpc_portid(void) > return mad_portid; > } > > +int mad_rpc_portid(struct ibmad_port *srcport) > +{ > + return (srcport->port_id); > +} > + > static int > _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > int timeout) > @@ -164,10 +169,9 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > return -1; > } > > -void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > +void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, > void *payload, void *rcvdata) > { > - const struct ibmad_port *p = port_id; > int status, len; > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > @@ -177,8 +181,8 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) > return 0; > > - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > - p->class_agents[rpc->mgtclass], > + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > + port->class_agents[rpc->mgtclass], > len, rpc->timeout)) < 0) { > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > return 0; > 
@@ -203,10 +207,9 @@ void *mad_rpc(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > return rcvdata; > } > > -void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > +void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, > ib_rmpp_hdr_t * rmpp, void *data) > { > - const struct ibmad_port *p = port_id; > int status, len; > uint8_t sndbuf[1024], rcvbuf[1024], *mad; > > @@ -217,8 +220,8 @@ void *mad_rpc_rmpp(const void *port_id, ib_rpc_t * rpc, ib_portid_t * dport, > if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) > return 0; > > - if ((len = _do_madrpc(p->port_id, sndbuf, rcvbuf, > - p->class_agents[rpc->mgtclass], > + if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > + port->class_agents[rpc->mgtclass], > len, rpc->timeout)) < 0) { > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > return 0; > @@ -303,7 +306,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) > } > } > > -void *mad_rpc_open_port(char *dev_name, int dev_port, > +struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, > int *mgmt_classes, int num_classes) > { > struct ibmad_port *p; > @@ -360,12 +363,10 @@ void *mad_rpc_open_port(char *dev_name, int dev_port, > return p; > } > > -void mad_rpc_close_port(void *port_id) > +void mad_rpc_close_port(struct ibmad_port *port) > { > - struct ibmad_port *p = port_id; > - > - umad_close_port(p->port_id); > - free(p); > + umad_close_port(port->port_id); > + free(port); > } > > uint8_t *sa_call(void *rcvbuf, ib_portid_t * portid, ib_sa_call_t * sa, > diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c > index 7403d4f..ddeb152 100644 > --- a/libibmad/src/sa.c > +++ b/libibmad/src/sa.c > @@ -44,7 +44,7 @@ > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN > > -uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > +uint8_t *sa_rpc_call(const struct ibmad_port *ibmad_port, void *rcvbuf, ib_portid_t * 
portid, > ib_sa_call_t * sa, unsigned timeout) > { > ib_rpc_t rpc = { 0 }; > @@ -106,7 +106,7 @@ uint8_t *sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t * portid, > IB_PR_COMPMASK_SGID |\ > IB_PR_COMPMASK_NUMBPATH) > > -int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > +int ib_path_query_via(const struct ibmad_port *srcport, ibmad_gid_t srcgid, > ibmad_gid_t destgid, ib_portid_t * sm_id, void *buf) > { > int npath; > diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c > index fad263c..e5489b3 100644 > --- a/libibmad/src/smp.c > +++ b/libibmad/src/smp.c > @@ -45,7 +45,7 @@ > #define DEBUG if (ibdebug) IBWARN > > uint8_t *smp_set_via(void *data, ib_portid_t * portid, unsigned attrid, > - unsigned mod, unsigned timeout, const void *srcport) > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > > @@ -81,7 +81,7 @@ uint8_t *smp_set(void *data, ib_portid_t * portid, unsigned attrid, > } > > uint8_t *smp_query_via(void *rcvbuf, ib_portid_t * portid, unsigned attrid, > - unsigned mod, unsigned timeout, const void *srcport) > + unsigned mod, unsigned timeout, const struct ibmad_port *srcport) > { > ib_rpc_t rpc = { 0 }; > > -- > 1.5.4.5 > From kliteyn at dev.mellanox.co.il Sat Feb 28 23:16:52 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 01 Mar 2009 09:16:52 +0200 Subject: [ofa-general] Re: [PATCH 1/3 v2] opensm: Added io_guid_file and max_reverse_hops options In-Reply-To: <20090228215645.GD3936@sashak.voltaire.com> References: <49953C48.3030203@ext.bull.net> <20090228191921.GA3936@sashak.voltaire.com> <49A9A1E0.4050005@morey-chaisemartin.com> <20090228215645.GD3936@sashak.voltaire.com> Message-ID: <49AA3664.8090104@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 21:43 Sat 28 Feb , Nicolas Morey-Chaisemartin wrote: >> I tried to put my patches inline as Yevgeni advised me but it seems >> thunderbird messes things up though I directly output git format-patch >> into a 
thunderbird draft file. >> I guess I'll stick to attachment from now on... > > Attached patches are not friendly for reviewing. Look at Thunderbird > related section of: > > http://git.kernel.org/?p=git/git.git;a=blob_plain;f=Documentation/SubmittingPatches The Thunderbird section describes two options. There's also a third option - the QuickText Thunderbird extension: https://addons.mozilla.org/en-US/thunderbird/addon/640 With this extension you will get the new bar when composing mail. Go to "Other"->"Insert file as text" and insert the patch. -- Yevgeny > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Sat Feb 28 23:26:31 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 1 Mar 2009 09:26:31 +0200 Subject: [ofa-general] [PATCH 11/10] libibmad:infiniband-diags: deprecate madrpc_set_[retries|timeout] WAS: [PATCH 1/10] libibmad: Clean up "new" interface In-Reply-To: <20090220143402.c3b23b0a.weiny2@llnl.gov> References: <20090219190525.322681b8.weiny2@llnl.gov> <20090220143402.c3b23b0a.weiny2@llnl.gov> Message-ID: <20090301072622.GG3936@sashak.voltaire.com> On 14:34 Fri 20 Feb , Ira Weiny wrote: > On Fri, 20 Feb 2009 13:24:35 -0500 > Hal Rosenstock wrote: > > > On Fri, Feb 20, 2009 at 8:41 AM, Hal Rosenstock > > wrote: > > > On Thu, Feb 19, 2009 at 10:05 PM, Ira Weiny wrote: > > >> >From 2774b4ab4608e25bdc365bca3a94c7d51ee19372 Mon Sep 17 00:00:00 2001 > > >> From: Ira Weiny > > >> Date: Wed, 18 Feb 2009 16:37:36 -0800 > > >> Subject: [PATCH] libibmad: Clean up "new" interface > > >> > > >> type all "void *ibmad_port" and "void *srcport" with struct ibmad_port * > > >> Create new mad_rpc_portid(struct ibmad_port *srcport) function > > >> which mirrors madrpc_portid(void) > > >> Mark all "old" functions with 
__attribute__ ((deprecated)) > > >> > > >> Signed-off-by: Ira Weiny > > >> --- > > >> libibmad/include/infiniband/mad.h | 139 ++++++++++++++++++++++--------------- > > >> libibmad/src/gs.c | 19 +++--- > > >> libibmad/src/libibmad.map | 1 + > > >> libibmad/src/resolve.c | 10 ++- > > >> libibmad/src/rpc.c | 29 ++++---- > > >> libibmad/src/sa.c | 4 +- > > >> libibmad/src/smp.c | 4 +- > > >> 7 files changed, 118 insertions(+), 88 deletions(-) > > >> > > >> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > > >> index 1aaaa1b..80e38be 100644 > > >> --- a/libibmad/include/infiniband/mad.h > > >> +++ b/libibmad/include/infiniband/mad.h > > >> @@ -724,100 +724,125 @@ static inline int mad_is_vendor_range2(int mgmt) > > >> } > > >> > > >> /* rpc.c */ > > >> -MAD_EXPORT int madrpc_portid(void); > > >> -MAD_EXPORT int madrpc_set_retries(int retries); > > >> -MAD_EXPORT int madrpc_set_timeout(int timeout); > > > > retries and timeouts could also be made per ibmad_port struct basis > > rather than one for all clients. Those two APIs would be deprecated in > > favor of new ones (mad_rpc_set_retries/timeout). > > > > Patch below. (To be applied after the others.) 
> > > >From d12b291041bdfe0d3bddecb7a71ee769a601fd83 Mon Sep 17 00:00:00 2001 > From: Ira Weiny > Date: Fri, 20 Feb 2009 14:30:52 -0800 > Subject: [PATCH] libibmad:infiniband-diags: deprecate madrpc_set_[retries|timeout] > > replace with mad_rpc_set_[retries|timeout] which are per ibmad_port > object > Update all diags with new functions > > Signed-off-by: Ira Weiny > --- > infiniband-diags/src/ibaddr.c | 1 + > infiniband-diags/src/ibdiag_common.c | 1 - > infiniband-diags/src/ibping.c | 1 + > infiniband-diags/src/ibportstate.c | 1 + > infiniband-diags/src/ibroute.c | 1 + > infiniband-diags/src/ibsendtrap.c | 1 + > infiniband-diags/src/ibsysstat.c | 1 + > infiniband-diags/src/ibtracert.c | 1 + > infiniband-diags/src/perfquery.c | 1 + > infiniband-diags/src/saquery.c | 1 + > infiniband-diags/src/sminfo.c | 1 + > infiniband-diags/src/smpquery.c | 1 + > infiniband-diags/src/vendstat.c | 1 + > libibmad/include/infiniband/mad.h | 6 ++++-- > libibmad/src/libibmad.map | 2 ++ > libibmad/src/mad_internal.h | 2 ++ > libibmad/src/rpc.c | 29 ++++++++++++++++++++--------- > 17 files changed, 40 insertions(+), 12 deletions(-) > > diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c > index bb22be9..e782b36 100644 > --- a/infiniband-diags/src/ibaddr.c > +++ b/infiniband-diags/src/ibaddr.c > @@ -142,6 +142,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > if (argc) { > if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, > diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c > index 609df69..38d6cd3 100644 > --- a/infiniband-diags/src/ibdiag_common.c > +++ b/infiniband-diags/src/ibdiag_common.c > @@ -175,7 +175,6 @@ static int process_opt(int ch, char *optarg) > break; > case 't': > val = strtoul(optarg, 0, 0); > - 
madrpc_set_timeout(val); > ibd_timeout = val; > break; > case 's': > diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c > index 901079f..28e3a64 100644 > --- a/infiniband-diags/src/ibping.c > +++ b/infiniband-diags/src/ibping.c > @@ -213,6 +213,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > if (server) { > if (mad_register_server_via(ping_class, 0, 0, oui, srcport) < 0) > diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c > index 65c9ca1..deaad51 100644 > --- a/infiniband-diags/src/ibportstate.c > +++ b/infiniband-diags/src/ibportstate.c > @@ -228,6 +228,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, > ibd_sm_id, srcport) < 0) > diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c > index 60bfdd8..07eddc4 100644 > --- a/infiniband-diags/src/ibroute.c > +++ b/infiniband-diags/src/ibroute.c > @@ -410,6 +410,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > if (!argc) { > if (ib_resolve_self_via(&portid, 0, 0, srcport) < 0) > diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c > index 75120f0..916b537 100644 > --- a/infiniband-diags/src/ibsendtrap.c > +++ b/infiniband-diags/src/ibsendtrap.c > @@ -143,6 +143,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); > if (!srcport) > 
IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > rc = send_trap(trap_name); > mad_rpc_close_port(srcport); > diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c > index d7daa37..7e668e8 100644 > --- a/infiniband-diags/src/ibsysstat.c > +++ b/infiniband-diags/src/ibsysstat.c > @@ -339,6 +339,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > if (server) { > if (mad_register_server_via(sysstat_class, 1, 0, oui, srcport) < 0) > diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c > index 1965aa0..87b5b17 100644 > --- a/infiniband-diags/src/ibtracert.c > +++ b/infiniband-diags/src/ibtracert.c > @@ -753,6 +753,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > node_name_map = open_node_name_map(node_name_map_file); > > diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c > index 2f104b8..3d89cc7 100644 > --- a/infiniband-diags/src/perfquery.c > +++ b/infiniband-diags/src/perfquery.c > @@ -389,6 +389,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > if (argc) { > if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, > diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c > index e6cbe50..43eff85 100644 > --- a/infiniband-diags/src/saquery.c > +++ b/infiniband-diags/src/saquery.c > @@ -1323,6 +1323,7 @@ static bind_handle_t 
get_bind_handle(void) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > ib_resolve_smlid_via(&handle.dport, ibd_timeout, srcport); > if (!handle.dport.lid) > diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c > index ebf6a47..0caa3f3 100644 > --- a/infiniband-diags/src/sminfo.c > +++ b/infiniband-diags/src/sminfo.c > @@ -118,6 +118,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > if (argc) { > if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, > diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c > index 2ed1e65..dc6b685 100644 > --- a/infiniband-diags/src/smpquery.c > +++ b/infiniband-diags/src/smpquery.c > @@ -455,6 +455,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 3); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); > > node_name_map = open_node_name_map(node_name_map_file); > > diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c > index d001a01..1c1c08f 100644 > --- a/infiniband-diags/src/vendstat.c > +++ b/infiniband-diags/src/vendstat.c > @@ -157,6 +157,7 @@ int main(int argc, char **argv) > srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4); > if (!srcport) > IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); > + mad_rpc_set_timeout(ibd_timeout, srcport); Now you need to duplicate this single call over all tools. For me it looks like an overkill. Probably it would be simpler to just read global ibd_timeout variable in rpc.c? 
> > if (argc) { > if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > index 5cf135e..cbd3049 100644 > --- a/libibmad/include/infiniband/mad.h > +++ b/libibmad/include/infiniband/mad.h > @@ -693,8 +693,6 @@ MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t * rpc, ib_portid_t * dport, > > /* New interface */ > MAD_EXPORT void madrpc_show_errors(int set); > -MAD_EXPORT int madrpc_set_retries(int retries); > -MAD_EXPORT int madrpc_set_timeout(int timeout); > MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > int num_classes); > MAD_EXPORT void mad_rpc_close_port(struct ibmad_port *srcport); > @@ -703,6 +701,8 @@ MAD_EXPORT void *mad_rpc(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_po > MAD_EXPORT void *mad_rpc_rmpp(const struct ibmad_port *srcport, ib_rpc_t * rpc, ib_portid_t * dport, > ib_rmpp_hdr_t * rmpp, void *data); > MAD_EXPORT int mad_rpc_portid(struct ibmad_port *srcport); > +MAD_EXPORT int mad_rpc_set_retries(int retries, struct ibmad_port *srcport); > +MAD_EXPORT int mad_rpc_set_timeout(int timeout_ms, struct ibmad_port *srcport); > > /* register.c */ > MAD_EXPORT int mad_register_port_client(int port_id, int mgmt, > @@ -761,6 +761,8 @@ static inline int mad_is_vendor_range2(int mgmt) > } > > /* rpc.c */ > +MAD_EXPORT int madrpc_set_retries(int retries) __attribute__ ((deprecated)); > +MAD_EXPORT int madrpc_set_timeout(int timeout) __attribute__ ((deprecated)); > MAD_EXPORT int madrpc_portid(void) __attribute__ ((deprecated)); > void *madrpc(ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata) > __attribute__ ((deprecated)); > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > index 0412027..f231485 100644 > --- a/libibmad/src/libibmad.map > +++ b/libibmad/src/libibmad.map > @@ -80,6 +80,8 @@ IBMAD_1.3 { > madrpc_save_mad; > madrpc_set_retries; > madrpc_set_timeout; > + 
mad_rpc_set_retries; > + mad_rpc_set_timeout; > madrpc_show_errors; > ib_path_query; > sa_call; > diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h > index 9afe7a9..3991cc3 100644 > --- a/libibmad/src/mad_internal.h > +++ b/libibmad/src/mad_internal.h > @@ -39,6 +39,8 @@ > struct ibmad_port { > int port_id; /* file descriptor returned by umad_open() */ > int class_agents[MAX_CLASS]; /* class2agent mapper */ > + int retries; > + int timeout_ms; > }; > > #endif /* _MAD_INTERNAL_H_ */ > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > index 210f0c2..229020d 100644 > --- a/libibmad/src/rpc.c > +++ b/libibmad/src/rpc.c > @@ -49,7 +49,7 @@ int ibdebug; > > static int mad_portid = -1; > static int iberrs; > - > + int timeout; Typo? > static int madrpc_retries = MAD_DEF_RETRIES; > static int def_madrpc_timeout = MAD_DEF_TIMEOUT_MS; > static void *save_mad; > @@ -85,9 +85,17 @@ int madrpc_set_timeout(int timeout) > return 0; > } > > -int madrpc_def_timeout(void) > +int mad_rpc_set_retries(int retries, struct ibmad_port *srcport) > +{ > + if (retries > 0) > + srcport->retries = retries; > + return srcport->retries; > +} > + > +int mad_rpc_set_timeout(int timeout_ms, struct ibmad_port *srcport) > { > - return def_madrpc_timeout; > + srcport->timeout_ms = timeout_ms; > + return 0; > } > > int madrpc_portid(void) > @@ -102,14 +110,14 @@ int mad_rpc_portid(struct ibmad_port *srcport) > > static int > _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > - int timeout) > + int timeout, const struct ibmad_port *srcport) > { > uint32_t trid; /* only low 32 bits */ > - int retries; > + int retries, max_retries; > int length, status; > > if (!timeout) > - timeout = def_madrpc_timeout; > + timeout = srcport ? srcport->timeout_ms : def_madrpc_timeout; Now you have three timeouts - one in rpc struct, another is per port and default one. Isn't it too much? 
> > if (ibdebug > 1) { > IBWARN(">>> sending: len %d pktsz %zu", len, umad_size() + len); > @@ -125,7 +133,8 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > trid = > (uint32_t) mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F); > > - for (retries = 0; retries < madrpc_retries; retries++) { > + max_retries = srcport ? srcport->retries : madrpc_retries; > + for (retries = 0; retries < max_retries; retries++) { Same with retries - it is hard for me to believe that any multithreaded application will try to set up different retry values per port, for different threads, "on the fly".... (rpc.c with all its limited functionality will not be sufficient for such flexibility level anyway :)). Sasha > if (retries) { > ERRS("retry %d (timeout %d ms)", retries, timeout); > } > @@ -178,7 +187,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport > > if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > port->class_agents[rpc->mgtclass], > - len, rpc->timeout)) < 0) { > + len, rpc->timeout, port)) < 0) { > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > return 0; > } > @@ -217,7 +226,7 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * > > if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, > port->class_agents[rpc->mgtclass], > - len, rpc->timeout)) < 0) { > + len, rpc->timeout, port)) < 0) { > IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); > return 0; > } > @@ -356,6 +365,8 @@ struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, > } > > p->port_id = port_id; > + p->retries = MAD_DEF_RETRIES; > + p->timeout_ms = MAD_DEF_TIMEOUT_MS; > return p; > } > > -- > 1.5.4.5 > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ogerlitz at
Voltaire.com Sat Feb 28 23:50:42 2009 From: ogerlitz at Voltaire.com (Or Gerlitz) Date: Sun, 01 Mar 2009 09:50:42 +0200 Subject: [ofa-general] [PATCH 1/2] libibmad: add PortXmtDataSL / PortRcvDataSL support In-Reply-To: <3B25B2D61996446F88703F647919FC4E@amr.corp.intel.com> References: <3B25B2D61996446F88703F647919FC4E@amr.corp.intel.com> Message-ID: <49AA3E52.30804@Voltaire.com> Sean Hefty wrote: > Rather than continue to add more and more interfaces to the library, can we just > export a couple of more generic calls? Hi Sasha, So how would you like to get this done? Should I just expose pma_query, pma_query_via, performance_reset, etc. through mad.h? Or.