From vlad at lists.openfabrics.org Sat Aug 1 02:58:11 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 1 Aug 2009 02:58:11 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090801-0200 daily build status Message-ID: <20090801095812.32630E61B3C@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:262: error: 'struct net_device' has no member named 'stats' /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c: In function 'vnic_get_stats': /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:214: warning: control reaches end of non-void function make[4]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.18-128.el5 Log: /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:262: error: 'struct net_device' has no member named 'stats' /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c: In function 'vnic_get_stats': /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:214: warning: control reaches end of non-void function make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-128.el5_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18-128.el5' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with 
linux-2.6.18-93.el5 Log: /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:262: error: 'struct net_device' has no member named 'stats' /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c: In function 'vnic_get_stats': /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:214: warning: control reaches end of non-void function make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/qlgc_vnic] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.18-93.el5_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18-93.el5' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:375: warning: pointer targets in assignment differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c: In function 'vnic_get_stats': /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:214: warning: control reaches end of non-void function make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.o] Error 1 make[3]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:375: warning: pointer targets in assignment differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c: In function 'vnic_get_stats': /home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c:214: warning: control reaches end of non-void function make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic/vnic_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/qlgc_vnic] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090801-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From yossi.openib at gmail.com Sat Aug 1 05:01:24 2009 From: yossi.openib at gmail.com (Yossi Etigin) Date: Sat, 01 Aug 2009 15:01:24 +0300 Subject: [ofa-general] [PATCH] ipoib: refresh 
path when remote lid changes
In-Reply-To:
References: <4A6DDFCE.9060009@voltaire.com> <4A70154F.7080300@gmail.com> <4A703DA4.9080300@Voltaire.COM> <4A705B3A.7060404@Voltaire.COM> <4A731818.3060500@voltaire.com> <4A733D24.3040201@voltaire.com>
Message-ID: <4A742E94.2070002@gmail.com>

Hal Rosenstock wrote:
>
> > Yes, but AFAIK the only "bad" case is if the LID stays the same but
> > LMC changes to a lower value. In this case the path refresh will not
> > happen when it is supposed to.
>
> What's the impact of that ?
>
> Also the LID can change at the same time as the LMC.
>
> I can't tell if all the possible cases are handled properly. Are they ?
>

Let's see:

Only LID changes - handled correctly.

LMC (and possibly LID) change - either we "catch" this, or we don't.
If we do, the path and LMC will be refreshed so we will not keep refreshing the path forever (like it could have been if we didn't refresh the LMC).
If we don't - ipoib packets will not reach the neighbour, which is the same situation there is today.
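To make the cases above concrete, here is a minimal sketch in C. It assumes the refresh decision compares the source LID of an incoming packet against a cached base LID under a mask derived from the cached LMC; the patch itself is not shown in this thread, so the structure and function names below are hypothetical and are not the ipoib ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached path state; not the real ipoib structures. */
struct cached_path {
    uint16_t base_lid; /* remote base LID as last resolved */
    uint8_t  lmc;      /* remote LMC as last resolved, 0..7 */
};

/* True when slid is one of the 2^lmc LIDs covered by the cached entry. */
static bool lid_in_cached_range(const struct cached_path *p, uint16_t slid)
{
    uint16_t mask = (uint16_t)~((1u << p->lmc) - 1u);

    return (slid & mask) == (p->base_lid & mask);
}

/* A refresh is wanted when the observed source LID falls outside the
 * cached range. */
static bool path_needs_refresh(const struct cached_path *p, uint16_t slid)
{
    return !lid_in_cached_range(p, slid);
}
```

With a cached base LID of 0x10 and LMC 2, source LIDs 0x10..0x13 match and 0x14 triggers a refresh. Under the same assumption, if the remote LMC drops while the base LID stays at 0x10, the cached mask stays too wide, so a LID such as 0x12 still matches and no refresh is triggered. That would be the "don't catch" case.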
From vlad at lists.openfabrics.org Sun Aug 2 03:00:55 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 2 Aug 2009 03:00:55 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090802-0200 daily build status Message-ID: <20090802100055.BEDAAE61D78@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one': /home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class' /home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of 
function 'dev_set_name' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_recvmsg': /home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2120: error: too many arguments to function 'skb_unlink' /home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2127: error: too many arguments to function 'skb_unlink' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with 
linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_recvmsg': /home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2120: error: too many arguments to function 'skb_unlink' /home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2127: error: too many arguments to function 'skb_unlink' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090802-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From sashak at voltaire.com Sun Aug 2 03:07:50 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 13:07:50 +0300 Subject: [ofa-general] Re: [infiniband-diags] [PATCH] [4/6] fix libibnetdisc API consistency and bugs In-Reply-To: <1248714771.16723.327.camel@auk31.llnl.gov> References: <1248714771.16723.327.camel@auk31.llnl.gov> Message-ID: <20090802100750.GA5287@me> On 10:12 Mon 27 Jul , Al Chu wrote: > Make api more consistent and make struct ibnd_fabric a struct that > represents just fabric data by removing the ibmad_port and making it a > function paramete in appropriate functions. 
> > Al > > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From: Albert Chu > Date: Thu, 23 Jul 2009 14:14:57 -0700 > Subject: [PATCH] Make api more consistent and make struct ibnd_fabric a struct that represents just fabric data by removing the ibmad_port and making it a function paramete in appropriate functions. > > > Signed-off-by: Albert Chu Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 2 03:08:41 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 13:08:41 +0300 Subject: [ofa-general] Re: [infiniband-diags] [PATCH] [5/6] fix libibnetdisc API consistency and bugs In-Reply-To: <1248714779.16723.328.camel@auk31.llnl.gov> References: <1248714779.16723.328.camel@auk31.llnl.gov> Message-ID: <20090802100841.GB5287@me> On 10:12 Mon 27 Jul , Al Chu wrote: > Check input parameters to libibnetdisc functions > > Al > > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From: Albert Chu > Date: Thu, 23 Jul 2009 14:15:25 -0700 > Subject: [PATCH] Check input parameters to libibnetdisc functions > > > Signed-off-by: Albert Chu Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 2 03:08:47 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 13:08:47 +0300 Subject: [ofa-general] Re: [infiniband-diags] [PATCH] [6/6] fix libibnetdisc API consistency and bugs In-Reply-To: <1248714781.16723.329.camel@auk31.llnl.gov> References: <1248714781.16723.329.camel@auk31.llnl.gov> Message-ID: <20090802100847.GC5287@me> On 10:13 Mon 27 Jul , Al Chu wrote: > Remove timeout_ms parameter to ibnd_discover_fabric, timeout parameter > should be specified via the ibmad_port. Remove extraneous use of global > timeout_ms in library. Adjust ibnetdiscover, ibqueryerrors, iblinkinfo, > and test code appropriately for adjustment. 
>
> Al
>
> --
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
> From: Albert Chu
> Date: Thu, 23 Jul 2009 14:16:14 -0700
> Subject: [PATCH] Remove timeout_ms parameter to ibnd_discover_fabric, timeout parameter should be specified via the ibmad_port. Remove extraneous use of global timeout_ms in library. Adjust ibnetdiscover, ibqueryerrors, iblinkinfo, and test code appropriately for adjustment.
>
>
> Signed-off-by: Albert Chu

Applied. Thanks.

Sasha

From sashak at voltaire.com Sun Aug 2 03:09:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Aug 2009 13:09:01 +0300
Subject: [ofa-general] Re: [infiniband-diags] [PATCH] [0/6] fix libibnetdisc API consistency and bugs
In-Reply-To: <1248714723.16723.322.camel@auk31.llnl.gov>
References: <1248714723.16723.322.camel@auk31.llnl.gov>
Message-ID: <20090802100901.GD5287@me>

Hi Al,

On 10:12 Mon 27 Jul , Al Chu wrote:
>
> This is a redo of my previous patch series. Ira or myself will instead
> write 1 big patch later on to make a lot of the structs more public.
> These are the patches to fix bugs and/or make things more consistent for
> what's already there.

I applied this series.

Next time you send a patch series, please use different subjects for the email messages, so that each patch subject is more descriptive.

You can look at /usr/src/linux/Documentation/SubmittingPatches for more details about the desirable patch and patch series format.
Sasha From sashak at voltaire.com Sun Aug 2 03:09:40 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 13:09:40 +0300 Subject: [ofa-general] [PATCHv2] opensm/mesh/lash: Fix use after free problem in osm_mesh_node_delete In-Reply-To: <20090731135147.GA10365@comcast.net> References: <20090731135147.GA10365@comcast.net> Message-ID: <20090802100940.GE5287@me> Hi Hal, On 09:51 Fri 31 Jul , Hal Rosenstock wrote: > > When osm_mesh_node_delete is called, osm_switch_delete may already have > been called so sw->p_sw is no longer valid to be used although it was > being used to obtain num_ports. > > Fix this by performing osm_mesh_delete_switches at the end of lash_process. > > Signed-off-by: Hal Rosenstock > --- > Changes since v1: > Rather than saving num_ports in the mesh node structure on creation and using > this on deletion, mesh switches deletion should occur at end of the lash > calculation as none of this state is needed after that > Approach proposed by Sasha > > diff --git a/opensm/include/opensm/osm_mesh.h b/opensm/include/opensm/osm_mesh.h > index 173fa86..89c07e5 100644 > --- a/opensm/include/opensm/osm_mesh.h > +++ b/opensm/include/opensm/osm_mesh.h > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2088 System Fabric Works, Inc. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -70,6 +71,7 @@ typedef struct _mesh_node { > } mesh_node_t; > > void osm_mesh_node_delete(struct _lash *p_lash, struct _switch *sw); > +void osm_mesh_delete_switches(struct _lash *p_lash); > int osm_mesh_node_create(struct _lash *p_lash, struct _switch *sw); > int osm_do_mesh_analysis(struct _lash *p_lash); > > diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c > index 23fad87..b22fe6e 100644 > --- a/opensm/opensm/osm_mesh.c > +++ b/opensm/opensm/osm_mesh.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -1358,6 +1359,20 @@ void osm_mesh_node_delete(lash_t *p_lash, switch_t *sw) > } > > /* > + * osm_mesh_delete_switches - cleanup switches resources > + */ > +void osm_mesh_delete_switches(lash_t *p_lash) > +{ > + if (p_lash->switches) { > + unsigned id; > + for (id = 0; ((int)id) < p_lash->num_switches; id++) > + if (p_lash->switches[id]) > + osm_mesh_node_delete(p_lash, > + p_lash->switches[id]); > + } > +} Why should it be in osm_mesh.c? osm_mesh_node_create() and osm_mesh_node_delete() are called in osm_ucast_lash.c now. For me it looks that more appropriate place for such cleanup is lash_free_structures() func in osm_ucast_lash.c. 
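The ordering problem described in the patch can be sketched in isolation. The following is a minimal, self-contained illustration in C, not the OpenSM code: all structure and function names are hypothetical stand-ins, and the only point is that a delete routine which reads a port count through a borrowed pointer must run while that pointer is still valid, and that guarding against empty slots lets one cleanup loop be called unconditionally.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the structures named in the patch. */
struct osm_sw { unsigned num_ports; };     /* owned elsewhere (the subnet) */
struct mesh_node { int **axes; };          /* one allocation per port */
struct lash_sw { struct osm_sw *p_sw; struct mesh_node *node; };

static int mesh_node_create(struct lash_sw *sw)
{
    unsigned i, n = sw->p_sw->num_ports;
    struct mesh_node *node = calloc(1, sizeof(*node));

    if (!node)
        return -1;
    node->axes = calloc(n, sizeof(*node->axes));
    if (!node->axes) {
        free(node);
        return -1;
    }
    for (i = 0; i < n; i++)
        node->axes[i] = calloc(1, sizeof(int)); /* free(NULL) is harmless later */
    sw->node = node;
    return 0;
}

/* Reads sw->p_sw->num_ports, so sw->p_sw must still be alive here.
 * Returns the number of per-port slots released. */
static unsigned mesh_node_delete(struct lash_sw *sw)
{
    unsigned i, freed = 0;

    if (!sw->node)
        return 0; /* nothing was created for this switch */
    for (i = 0; i < sw->p_sw->num_ports; i++) {
        free(sw->node->axes[i]);
        freed++;
    }
    free(sw->node->axes);
    free(sw->node);
    sw->node = NULL;
    return freed;
}

/* Walks the whole table; a NULL table and NULL slots are skipped. */
static unsigned delete_mesh_switches(struct lash_sw **switches, unsigned count)
{
    unsigned id, freed = 0;

    if (!switches)
        return 0;
    for (id = 0; id < count; id++)
        if (switches[id])
            freed += mesh_node_delete(switches[id]);
    return freed;
}
```

In this sketch the loop is safe to call more than once and on a partially populated table, which is the property that matters when deciding where in the routing pass the cleanup should live.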
Sasha From sashak at voltaire.com Sun Aug 2 03:32:56 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 13:32:56 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to return success/failure status In-Reply-To: <20090731135316.GB10365@comcast.net> References: <20090731135316.GB10365@comcast.net> Message-ID: <20090802103256.GF5287@me> Hi Hal, On 09:53 Fri 31 Jul , Hal Rosenstock wrote: > > based on valid/invalid hop count rather than relying on debug assert When could an invalid hop count be passed to this function? And what could happen? ib_smp_init_new() is a simple structure fill-up helper (inlined and defined in header file) and I don't think that we need to check parameters there. This patch also introduces sort of inconsistency - a hop count is checked and other parameters aren't. > Handle invalid status appropriate in callers of ib_smp_init_new > > Signed-off-by: Hal Rosenstock > --- > diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h > index beb7492..6668d96 100644 > --- a/opensm/include/iba/ib_types.h > +++ b/opensm/include/iba/ib_types.h > @@ -4091,11 +4092,11 @@ static inline boolean_t OSM_API ib_smp_is_d(IN const ib_smp_t * const p_smp) > * > * TODO > * This is too big for inlining, but leave it here for now > -* since there is not yet another convient spot. > +* since there is not yet another convenient spot. 
> * > * SYNOPSIS > */ > -static inline void OSM_API > +static inline boolean_t OSM_API > ib_smp_init_new(IN ib_smp_t * const p_smp, > IN const uint8_t method, > IN const ib_net64_t trans_id, > @@ -4107,7 +4108,9 @@ ib_smp_init_new(IN ib_smp_t * const p_smp, > IN const ib_net16_t dr_slid, IN const ib_net16_t dr_dlid) > { > CL_ASSERT(p_smp); > - CL_ASSERT(hop_count < IB_SUBNET_PATH_HOPS_MAX); > + > + if (hop_count >= IB_SUBNET_PATH_HOPS_MAX) > + return FALSE; > p_smp->base_ver = 1; > p_smp->mgmt_class = IB_MCLASS_SUBN_DIR; > p_smp->class_ver = 1; > @@ -4130,6 +4133,7 @@ ib_smp_init_new(IN ib_smp_t * const p_smp, > > /* copy the path */ > memcpy(&p_smp->initial_path, path_out, sizeof(p_smp->initial_path)); > + return TRUE; > } > > /* > diff --git a/opensm/opensm/osm_req.c b/opensm/opensm/osm_req.c > index be9a92b..7934173 100644 > --- a/opensm/opensm/osm_req.c > +++ b/opensm/opensm/osm_req.c > @@ -102,14 +102,21 @@ osm_req_get(IN osm_sm_t * sm, > ib_get_sm_attr_str(attr_id), cl_ntoh16(attr_id), > cl_ntoh32(attr_mod), cl_ntoh64(tid)); > > - ib_smp_init_new(osm_madw_get_smp_ptr(p_madw), > - IB_MAD_METHOD_GET, > - tid, > - attr_id, > - attr_mod, > - p_path->hop_count, > - sm->p_subn->opt.m_key, > - p_path->path, IB_LID_PERMISSIVE, IB_LID_PERMISSIVE); > + if (!ib_smp_init_new(osm_madw_get_smp_ptr(p_madw), > + IB_MAD_METHOD_GET, > + tid, > + attr_id, > + attr_mod, > + p_path->hop_count, > + sm->p_subn->opt.m_key, > + p_path->path, > + IB_LID_PERMISSIVE, IB_LID_PERMISSIVE)) { > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 1108: " > + "ib_smp_init_new failed: hop count %d\n", > + p_path->hop_count); This is assumption on how ib_smp_init_new() is actually implemented - not perfect. 
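For readers following the trade-off, here is a small stand-alone sketch in C of a fill-up helper that rejects an out-of-range hop count instead of asserting. It is an illustration only: the structure, the function name, and the limit of 64 are assumptions standing in for ib_smp_t, ib_smp_init_new() and IB_SUBNET_PATH_HOPS_MAX, and the copy length is a choice made for this sketch rather than what the OpenSM header does.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PATH_HOPS_MAX 64 /* assumed value, stand-in for IB_SUBNET_PATH_HOPS_MAX */

/* Hypothetical, trimmed-down directed-route SMP. */
struct dr_smp {
    uint8_t hop_count;
    uint8_t initial_path[PATH_HOPS_MAX];
};

/* Fills *smp and returns true, or returns false and leaves *smp untouched
 * when hop_count would index past the end of initial_path. */
static bool dr_smp_init(struct dr_smp *smp, uint8_t hop_count,
                        const uint8_t *path_out)
{
    if (hop_count >= PATH_HOPS_MAX)
        return false;

    memset(smp, 0, sizeof(*smp));
    smp->hop_count = hop_count;
    /* entries 0..hop_count are meaningful in this sketch */
    memcpy(smp->initial_path, path_out, (size_t)hop_count + 1);
    return true;
}
```

A caller of such a helper can log a generic failure without naming the hop count as the cause, which avoids tying the message to the helper's internals.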
Sasha From hal.rosenstock at gmail.com Sun Aug 2 03:50:56 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 2 Aug 2009 06:50:56 -0400 Subject: [ofa-general] [PATCHv2] opensm/mesh/lash: Fix use after free problem in osm_mesh_node_delete In-Reply-To: <20090802100940.GE5287@me> References: <20090731135147.GA10365@comcast.net> <20090802100940.GE5287@me> Message-ID: Hi Sasha, On Sun, Aug 2, 2009 at 6:09 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 09:51 Fri 31 Jul , Hal Rosenstock wrote: > > > > When osm_mesh_node_delete is called, osm_switch_delete may already have > > been called so sw->p_sw is no longer valid to be used although it was > > being used to obtain num_ports. > > > > Fix this by performing osm_mesh_delete_switches at the end of > lash_process. > > > > Signed-off-by: Hal Rosenstock > > --- > > Changes since v1: > > Rather than saving num_ports in the mesh node structure on creation and > using > > this on deletion, mesh switches deletion should occur at end of the lash > > calculation as none of this state is needed after that > > Approach proposed by Sasha > > > > diff --git a/opensm/include/opensm/osm_mesh.h > b/opensm/include/opensm/osm_mesh.h > > index 173fa86..89c07e5 100644 > > --- a/opensm/include/opensm/osm_mesh.h > > +++ b/opensm/include/opensm/osm_mesh.h > > @@ -1,5 +1,6 @@ > > /* > > * Copyright (c) 2088 System Fabric Works, Inc. > > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > * licenses. 
You may choose to be licensed under the terms of the GNU > > @@ -70,6 +71,7 @@ typedef struct _mesh_node { > > } mesh_node_t; > > > > void osm_mesh_node_delete(struct _lash *p_lash, struct _switch *sw); > > +void osm_mesh_delete_switches(struct _lash *p_lash); > > int osm_mesh_node_create(struct _lash *p_lash, struct _switch *sw); > > int osm_do_mesh_analysis(struct _lash *p_lash); > > > > diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c > > index 23fad87..b22fe6e 100644 > > --- a/opensm/opensm/osm_mesh.c > > +++ b/opensm/opensm/osm_mesh.c > > @@ -1,5 +1,6 @@ > > /* > > * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights > reserved. > > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > * licenses. You may choose to be licensed under the terms of the GNU > > @@ -1358,6 +1359,20 @@ void osm_mesh_node_delete(lash_t *p_lash, switch_t > *sw) > > } > > > > /* > > + * osm_mesh_delete_switches - cleanup switches resources > > + */ > > +void osm_mesh_delete_switches(lash_t *p_lash) > > +{ > > + if (p_lash->switches) { > > + unsigned id; > > + for (id = 0; ((int)id) < p_lash->num_switches; id++) > > + if (p_lash->switches[id]) > > + osm_mesh_node_delete(p_lash, > > + p_lash->switches[id]); > > + } > > +} > > Why should it be in osm_mesh.c? osm_mesh_node_create() and > osm_mesh_node_delete() are called in osm_ucast_lash.c now. > > For me it looks that more appropriate place for such cleanup is > lash_free_structures() func in osm_ucast_lash.c. Not quite as it cannot be cleaned up until after discover_network_properties is called and succeeds. 
-- Hal > > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at comcast.net Sun Aug 2 03:53:31 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 06:53:31 -0400 Subject: [ofa-general] [PATCHv3] opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete Message-ID: <20090802105331.GA26002@comcast.net> When osm_mesh_node_delete is called, osm_switch_delete may already have been called so sw->p_sw is no longer valid to be used although it was being used to obtain num_ports. Fix this by performing delete_switches at the end of lash_process. Signed-off-by: Hal Rosenstock --- Changes since v2: Moved mesh switches deletion into lash Changes since v1: Rather than saving num_ports in the mesh node structure on creation and using this on deletion, mesh switches deletion should occur at end of the lash calculation as none of this state is needed after that Approach proposed by Sasha diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 1c55a90..cf8e793 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -5,6 +5,7 @@ * Copyright (c) 2007 Simula Research Laboratory. All rights reserved. * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved. * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -659,6 +660,18 @@ static void switch_delete(lash_t *p_lash, switch_t * sw) free(sw); } +static void delete_switches(lash_t *p_lash) +{ + if (p_lash->switches) { + unsigned id; + for (id = 0; ((int)id) < p_lash->num_switches; id++) + if (p_lash->switches[id]) + osm_mesh_node_delete(p_lash, + p_lash->switches[id]); + } +} + + static void free_lash_structures(lash_t * p_lash) { unsigned int i, j, k; @@ -1219,7 +1232,7 @@ static int lash_process(void *context) return_status = discover_network_properties(p_lash); if (return_status != IB_SUCCESS) - goto Exit; + goto Exit2; return_status = init_lash_structures(p_lash); if (return_status != IB_SUCCESS) @@ -1234,6 +1247,9 @@ static int lash_process(void *context) populate_fwd_tbls(p_lash); Exit: + delete_switches(p_lash); + +Exit2: if (p_lash->vl_min) free_lash_structures(p_lash); OSM_LOG_EXIT(p_log); From hal.rosenstock at gmail.com Sun Aug 2 03:59:27 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 2 Aug 2009 06:59:27 -0400 Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to return success/failure status In-Reply-To: <20090802103256.GF5287@me> References: <20090731135316.GB10365@comcast.net> <20090802103256.GF5287@me> Message-ID: HI Sasha, On Sun, Aug 2, 2009 at 6:32 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 09:53 Fri 31 Jul , Hal Rosenstock wrote: > > > > based on valid/invalid hop count rather than relying on debug assert > > When could an invalid hop count be passed to this function? Some out of tree user. > And what could happen? It writes past the end of the path array. > > ib_smp_init_new() is a simple structure fill-up helper (inlined and > defined in header file) and I don't think that we need to check > parameters there. > > This patch also introduces sort of inconsistency - a hop count is checked > and other parameters aren't. It's to protect against writing past end of array. Do any other parameters need checking ? 
I think they just result in some timeout condition resulting. -- Hal > > > > Handle invalid status appropriate in callers of ib_smp_init_new > > > > Signed-off-by: Hal Rosenstock > > --- > > diff --git a/opensm/include/iba/ib_types.h > b/opensm/include/iba/ib_types.h > > index beb7492..6668d96 100644 > > --- a/opensm/include/iba/ib_types.h > > +++ b/opensm/include/iba/ib_types.h > > @@ -4091,11 +4092,11 @@ static inline boolean_t OSM_API ib_smp_is_d(IN > const ib_smp_t * const p_smp) > > * > > * TODO > > * This is too big for inlining, but leave it here for now > > -* since there is not yet another convient spot. > > +* since there is not yet another convenient spot. > > * > > * SYNOPSIS > > */ > > -static inline void OSM_API > > +static inline boolean_t OSM_API > > ib_smp_init_new(IN ib_smp_t * const p_smp, > > IN const uint8_t method, > > IN const ib_net64_t trans_id, > > @@ -4107,7 +4108,9 @@ ib_smp_init_new(IN ib_smp_t * const p_smp, > > IN const ib_net16_t dr_slid, IN const ib_net16_t dr_dlid) > > { > > CL_ASSERT(p_smp); > > - CL_ASSERT(hop_count < IB_SUBNET_PATH_HOPS_MAX); > > + > > + if (hop_count >= IB_SUBNET_PATH_HOPS_MAX) > > + return FALSE; > > p_smp->base_ver = 1; > > p_smp->mgmt_class = IB_MCLASS_SUBN_DIR; > > p_smp->class_ver = 1; > > @@ -4130,6 +4133,7 @@ ib_smp_init_new(IN ib_smp_t * const p_smp, > > > > /* copy the path */ > > memcpy(&p_smp->initial_path, path_out, > sizeof(p_smp->initial_path)); > > + return TRUE; > > } > > > > /* > > diff --git a/opensm/opensm/osm_req.c b/opensm/opensm/osm_req.c > > index be9a92b..7934173 100644 > > --- a/opensm/opensm/osm_req.c > > +++ b/opensm/opensm/osm_req.c > > @@ -102,14 +102,21 @@ osm_req_get(IN osm_sm_t * sm, > > ib_get_sm_attr_str(attr_id), cl_ntoh16(attr_id), > > cl_ntoh32(attr_mod), cl_ntoh64(tid)); > > > > - ib_smp_init_new(osm_madw_get_smp_ptr(p_madw), > > - IB_MAD_METHOD_GET, > > - tid, > > - attr_id, > > - attr_mod, > > - p_path->hop_count, > > - sm->p_subn->opt.m_key, > > - p_path->path, 
IB_LID_PERMISSIVE, > IB_LID_PERMISSIVE); > > + if (!ib_smp_init_new(osm_madw_get_smp_ptr(p_madw), > > + IB_MAD_METHOD_GET, > > + tid, > > + attr_id, > > + attr_mod, > > + p_path->hop_count, > > + sm->p_subn->opt.m_key, > > + p_path->path, > > + IB_LID_PERMISSIVE, IB_LID_PERMISSIVE)) { > > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 1108: " > > + "ib_smp_init_new failed: hop count %d\n", > > + p_path->hop_count); > > This is assumption on how ib_smp_init_new() is actually implemented - > not perfect. > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Sun Aug 2 04:16:01 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 14:16:01 +0300 Subject: [ofa-general] Re: [PATCHv3] opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete In-Reply-To: <20090802105331.GA26002@comcast.net> References: <20090802105331.GA26002@comcast.net> Message-ID: <20090802111601.GI5287@me> On 06:53 Sun 02 Aug , Hal Rosenstock wrote: > diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c > index 1c55a90..cf8e793 100644 > --- a/opensm/opensm/osm_ucast_lash.c > +++ b/opensm/opensm/osm_ucast_lash.c > @@ -5,6 +5,7 @@ > * Copyright (c) 2007 Simula Research Laboratory. All rights reserved. > * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved. > * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -659,6 +660,18 @@ static void switch_delete(lash_t *p_lash, switch_t * sw) > free(sw); > } > > +static void delete_switches(lash_t *p_lash) Would delete_mesh_switches() (or cleanup_mesh*()) be a better name? It doesn't delete lash's switches, only mesh nodes. > +{ > + if (p_lash->switches) { > + unsigned id; > + for (id = 0; ((int)id) < p_lash->num_switches; id++) > + if (p_lash->switches[id]) > + osm_mesh_node_delete(p_lash, > + p_lash->switches[id]); > + } > +} > + > + > static void free_lash_structures(lash_t * p_lash) > { > unsigned int i, j, k; > @@ -1219,7 +1232,7 @@ static int lash_process(void *context) > > return_status = discover_network_properties(p_lash); discover_network_properties() can fail in a middle of allocations and full clean is desired anyway. It should be safe to 'goto Exit' below since mesh node deletion is protected against not yet initialized input. Sasha > if (return_status != IB_SUCCESS) > - goto Exit; > + goto Exit2; > > return_status = init_lash_structures(p_lash); > if (return_status != IB_SUCCESS) > @@ -1234,6 +1247,9 @@ static int lash_process(void *context) > populate_fwd_tbls(p_lash); > > Exit: > + delete_switches(p_lash); > + > +Exit2: > if (p_lash->vl_min) > free_lash_structures(p_lash); > OSM_LOG_EXIT(p_log); > From hal.rosenstock at gmail.com Sun Aug 2 04:17:21 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 2 Aug 2009 07:17:21 -0400 Subject: [ofa-general] Re: [PATCHv3] opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete In-Reply-To: <20090802111601.GI5287@me> References: <20090802105331.GA26002@comcast.net> <20090802111601.GI5287@me> Message-ID: On Sun, Aug 2, 2009 at 7:16 AM, Sasha Khapyorsky wrote: > On 06:53 Sun 02 Aug , Hal Rosenstock wrote: > > diff --git a/opensm/opensm/osm_ucast_lash.c > b/opensm/opensm/osm_ucast_lash.c > > index 1c55a90..cf8e793 100644 > > --- a/opensm/opensm/osm_ucast_lash.c > > +++ 
b/opensm/opensm/osm_ucast_lash.c > > @@ -5,6 +5,7 @@ > > * Copyright (c) 2007 Simula Research Laboratory. All rights > reserved. > > * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved. > > * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights > reserved. > > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > * licenses. You may choose to be licensed under the terms of the GNU > > @@ -659,6 +660,18 @@ static void switch_delete(lash_t *p_lash, switch_t * > sw) > > free(sw); > > } > > > > +static void delete_switches(lash_t *p_lash) > > Would delete_mesh_switches() (or cleanup_mesh*()) be a better name? It > doesn't delete lash's switches, only mesh nodes. Sure. > > > > +{ > > + if (p_lash->switches) { > > + unsigned id; > > + for (id = 0; ((int)id) < p_lash->num_switches; id++) > > + if (p_lash->switches[id]) > > + osm_mesh_node_delete(p_lash, > > + p_lash->switches[id]); > > + } > > +} > > + > > + > > static void free_lash_structures(lash_t * p_lash) > > { > > unsigned int i, j, k; > > @@ -1219,7 +1232,7 @@ static int lash_process(void *context) > > > > return_status = discover_network_properties(p_lash); > > discover_network_properties() can fail in a middle of allocations and > full clean is desired anyway. It should be safe to 'goto Exit' below > since mesh node deletion is protected against not yet initialized input. It's not; I had tried doing that. 
-- Hal > > > Sasha > > > if (return_status != IB_SUCCESS) > > - goto Exit; > > + goto Exit2; > > > > return_status = init_lash_structures(p_lash); > > if (return_status != IB_SUCCESS) > > @@ -1234,6 +1247,9 @@ static int lash_process(void *context) > > populate_fwd_tbls(p_lash); > > > > Exit: > > + delete_switches(p_lash); > > + > > +Exit2: > > if (p_lash->vl_min) > > free_lash_structures(p_lash); > > OSM_LOG_EXIT(p_log); > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Sun Aug 2 04:18:22 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 2 Aug 2009 07:18:22 -0400 Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to return success/failure status In-Reply-To: References: <20090731135316.GB10365@comcast.net> <20090802103256.GF5287@me> Message-ID: On Sun, Aug 2, 2009 at 6:59 AM, Hal Rosenstock wrote: > HI Sasha, > > On Sun, Aug 2, 2009 at 6:32 AM, Sasha Khapyorsky wrote: > >> Hi Hal, >> >> On 09:53 Fri 31 Jul , Hal Rosenstock wrote: >> > >> > based on valid/invalid hop count rather than relying on debug assert >> >> When could an invalid hop count be passed to this function? > > > Some out of tree user. > Also, I think opensm can also do this now with such topologies (without some other changes which are in the pipe). -- Hal > > >> And what could happen? > > > It writes past the end of the path array. > > >> >> ib_smp_init_new() is a simple structure fill-up helper (inlined and >> defined in header file) and I don't think that we need to check >> parameters there. >> >> This patch also introduces sort of inconsistency - a hop count is checked >> and other parameters aren't. 
> > > It's to protect against writing past end of array. Do any other parameters > need checking ? I think they just result in some timeout condition > resulting. > > -- Hal > > >> >> >> > Handle invalid status appropriate in callers of ib_smp_init_new >> > >> > Signed-off-by: Hal Rosenstock >> > --- >> > diff --git a/opensm/include/iba/ib_types.h >> b/opensm/include/iba/ib_types.h >> > index beb7492..6668d96 100644 >> > --- a/opensm/include/iba/ib_types.h >> > +++ b/opensm/include/iba/ib_types.h >> > @@ -4091,11 +4092,11 @@ static inline boolean_t OSM_API ib_smp_is_d(IN >> const ib_smp_t * const p_smp) >> > * >> > * TODO >> > * This is too big for inlining, but leave it here for now >> > -* since there is not yet another convient spot. >> > +* since there is not yet another convenient spot. >> > * >> > * SYNOPSIS >> > */ >> > -static inline void OSM_API >> > +static inline boolean_t OSM_API >> > ib_smp_init_new(IN ib_smp_t * const p_smp, >> > IN const uint8_t method, >> > IN const ib_net64_t trans_id, >> > @@ -4107,7 +4108,9 @@ ib_smp_init_new(IN ib_smp_t * const p_smp, >> > IN const ib_net16_t dr_slid, IN const ib_net16_t dr_dlid) >> > { >> > CL_ASSERT(p_smp); >> > - CL_ASSERT(hop_count < IB_SUBNET_PATH_HOPS_MAX); >> > + >> > + if (hop_count >= IB_SUBNET_PATH_HOPS_MAX) >> > + return FALSE; >> > p_smp->base_ver = 1; >> > p_smp->mgmt_class = IB_MCLASS_SUBN_DIR; >> > p_smp->class_ver = 1; >> > @@ -4130,6 +4133,7 @@ ib_smp_init_new(IN ib_smp_t * const p_smp, >> > >> > /* copy the path */ >> > memcpy(&p_smp->initial_path, path_out, >> sizeof(p_smp->initial_path)); >> > + return TRUE; >> > } >> > >> > /* >> > diff --git a/opensm/opensm/osm_req.c b/opensm/opensm/osm_req.c >> > index be9a92b..7934173 100644 >> > --- a/opensm/opensm/osm_req.c >> > +++ b/opensm/opensm/osm_req.c >> > @@ -102,14 +102,21 @@ osm_req_get(IN osm_sm_t * sm, >> > ib_get_sm_attr_str(attr_id), cl_ntoh16(attr_id), >> > cl_ntoh32(attr_mod), cl_ntoh64(tid)); >> > >> > - 
ib_smp_init_new(osm_madw_get_smp_ptr(p_madw), >> > - IB_MAD_METHOD_GET, >> > - tid, >> > - attr_id, >> > - attr_mod, >> > - p_path->hop_count, >> > - sm->p_subn->opt.m_key, >> > - p_path->path, IB_LID_PERMISSIVE, >> IB_LID_PERMISSIVE); >> > + if (!ib_smp_init_new(osm_madw_get_smp_ptr(p_madw), >> > + IB_MAD_METHOD_GET, >> > + tid, >> > + attr_id, >> > + attr_mod, >> > + p_path->hop_count, >> > + sm->p_subn->opt.m_key, >> > + p_path->path, >> > + IB_LID_PERMISSIVE, IB_LID_PERMISSIVE)) { >> > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 1108: " >> > + "ib_smp_init_new failed: hop count %d\n", >> > + p_path->hop_count); >> >> This is assumption on how ib_smp_init_new() is actually implemented - >> not perfect. >> >> Sasha From hal.rosenstock at gmail.com Sun Aug 2 04:26:37 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 2 Aug 2009 07:26:37 -0400 Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes In-Reply-To: <4A742E94.2070002@gmail.com> References: <4A6DDFCE.9060009@voltaire.com> <4A703DA4.9080300@Voltaire.COM> <4A705B3A.7060404@Voltaire.COM> <4A731818.3060500@voltaire.com> <4A733D24.3040201@voltaire.com> <4A742E94.2070002@gmail.com> Message-ID: On Sat, Aug 1, 2009 at 8:01 AM, Yossi Etigin wrote: > > Hal Rosenstock wrote: > > > > Yes, but AFAIK the only "bad" case is if the LID stays the same but > > LMC changes to a lower > > value. In this case the path refresh will not happen when it is > > supposed to. > > > > > > What's the impact of that ? > > > > Also the LID can change at the same time as the LMC. > > > > I can't tell if all the possible cases are handled properly. Are they ?
> > > > Let's see: > Only LID changes - handled correctly. > LMC (and possibly LID) change - either we "catch" this, or we don't. > If we do, the path and LMC will be refreshed so we will not keep refreshing > the path forever (like it could have been if we didn't refresh the LMC). > If we don't - ipoib packets will not reach the neighbour, which is the same > situation there is today. By handled correctly, you mean that the ARP request gets to the remote node, is responded to, and the response makes it back and that is treated as valid path indication, right ? If so, is the original ARP request unicast or broadcast ? If the request is unicast, couldn't it be sent using the wrong static rate as isn't it using the original path parameters ? Even if it is broadcast, if the original path parameters are still used (like rate, etc.) at the local node, doesn't this assume a homogeneous subnet ? From sashak at voltaire.com Sun Aug 2 04:49:32 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 14:49:32 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to return success/failure status In-Reply-To: References: <20090731135316.GB10365@comcast.net> <20090802103256.GF5287@me> Message-ID: <20090802114932.GK5287@me> On 06:59 Sun 02 Aug , Hal Rosenstock wrote: > > Some out of tree user. They should care to pass a valid data - ib_smp_init_new() is a simple helper. Sasha From hnrose at comcast.net Sun Aug 2 04:50:11 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 07:50:11 -0400 Subject: [ofa-general] [PATCHv4] opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete Message-ID: <20090802115011.GA9345@comcast.net> When osm_mesh_node_delete is called, osm_switch_delete may already have been called so sw->p_sw is no longer valid to be used although it was being used to obtain num_ports. Fix this by performing delete_mesh_switches at the end of lash_process. 
Signed-off-by: Hal Rosenstock --- Changes since v3: Changed name of delete_switches to delete_mesh_switches Changes since v2: Moved mesh switches deletion into lash Changes since v1: Rather than saving num_ports in the mesh node structure on creation and using this on deletion, mesh switches deletion should occur at end of the lash calculation as none of this state is needed after that Approach proposed by Sasha diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 1c55a90..841c0fd 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -5,6 +5,7 @@ * Copyright (c) 2007 Simula Research Laboratory. All rights reserved. * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved. * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -659,6 +660,18 @@ static void switch_delete(lash_t *p_lash, switch_t * sw) free(sw); } +static void delete_mesh_switches(lash_t *p_lash) +{ + if (p_lash->switches) { + unsigned id; + for (id = 0; ((int)id) < p_lash->num_switches; id++) + if (p_lash->switches[id]) + osm_mesh_node_delete(p_lash, + p_lash->switches[id]); + } +} + + static void free_lash_structures(lash_t * p_lash) { unsigned int i, j, k; @@ -1219,7 +1232,7 @@ static int lash_process(void *context) return_status = discover_network_properties(p_lash); if (return_status != IB_SUCCESS) - goto Exit; + goto Exit2; return_status = init_lash_structures(p_lash); if (return_status != IB_SUCCESS) @@ -1234,6 +1247,9 @@ static int lash_process(void *context) populate_fwd_tbls(p_lash); Exit: + delete_mesh_switches(p_lash); + +Exit2: if (p_lash->vl_min) free_lash_structures(p_lash); OSM_LOG_EXIT(p_log); From hal.rosenstock at gmail.com Sun Aug 2 04:54:01 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: 
Sun, 2 Aug 2009 07:54:01 -0400 Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to return success/failure status In-Reply-To: <20090802114932.GK5287@me> References: <20090731135316.GB10365@comcast.net> <20090802103256.GF5287@me> <20090802114932.GK5287@me> Message-ID: On Sun, Aug 2, 2009 at 7:49 AM, Sasha Khapyorsky wrote: > On 06:59 Sun 02 Aug , Hal Rosenstock wrote: > > > > Some out of tree user. > > They should care to pass a valid data - ib_smp_init_new() is a simple > helper. opensm too ? Why replicate this simple check all over the place ? -- Hal > > > Sasha > From sashak at voltaire.com Sun Aug 2 04:57:35 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 14:57:35 +0300 Subject: [ofa-general] Re: [PATCHv3] opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete In-Reply-To: References: <20090802105331.GA26002@comcast.net> <20090802111601.GI5287@me> Message-ID: <20090802115735.GL5287@me> On 07:17 Sun 02 Aug , Hal Rosenstock wrote: > > > > > +{ > > > + if (p_lash->switches) { > > > + unsigned id; > > > + for (id = 0; ((int)id) < p_lash->num_switches; id++) > > > + if (p_lash->switches[id]) > > > + osm_mesh_node_delete(p_lash, > > > + p_lash->switches[id]); > > > + } > > > +} > > > + > > > + > > > static void free_lash_structures(lash_t * p_lash) > > > { > > > unsigned int i, j, k; > > > @@ -1219,7 +1232,7 @@ static int lash_process(void *context) > > > > > > return_status = discover_network_properties(p_lash); > > > > discover_network_properties() can fail in a middle of allocations and > > full clean is desired anyway. It should be safe to 'goto Exit' below > > since mesh node deletion is protected against not yet initialized input. > > > It's not; Could you elaborate?
Sasha From hnrose at comcast.net Sun Aug 2 05:40:52 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 08:40:52 -0400 Subject: [ofa-general] [PATCHv5] opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete Message-ID: <20090802124052.GA18247@comcast.net> When osm_mesh_node_delete is called, osm_switch_delete may already have been called so sw->p_sw is no longer valid to be used although it was being used to obtain num_ports. Fix this by performing delete_mesh_switches in free_lash_structures. Signed-off-by: Hal Rosenstock --- Changes since v4: Moved call of delete_mesh_switches into free_lash_structures Changes since v3: Changed name of delete_switches to delete_mesh_switches Changes since v2: Moved mesh switches deletion into lash Changes since v1: Rather than saving num_ports in the mesh node structure on creation and using this on deletion, mesh switches deletion should occur at end of the lash calculation as none of this state is needed after that Approach proposed by Sasha diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 1c55a90..a62cb3d 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -5,6 +5,7 @@ * Copyright (c) 2007 Simula Research Laboratory. All rights reserved. * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved. * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -659,6 +660,18 @@ static void switch_delete(lash_t *p_lash, switch_t * sw) free(sw); } +static void delete_mesh_switches(lash_t *p_lash) +{ + if (p_lash->switches) { + unsigned id; + for (id = 0; ((int)id) < p_lash->num_switches; id++) + if (p_lash->switches[id]) + osm_mesh_node_delete(p_lash, + p_lash->switches[id]); + } +} + + static void free_lash_structures(lash_t * p_lash) { unsigned int i, j, k; @@ -667,6 +680,8 @@ static void free_lash_structures(lash_t * p_lash) OSM_LOG_ENTER(p_log); + delete_mesh_switches(p_lash); + /* free cdg_vertex_matrix */ for (i = 0; i < p_lash->vl_min; i++) { for (j = 0; j < num_switches; j++) { From sashak at voltaire.com Sun Aug 2 06:16:16 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 16:16:16 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Change ib_smp_init_new to return success/failure status In-Reply-To: References: <20090731135316.GB10365@comcast.net> <20090802103256.GF5287@me> <20090802114932.GK5287@me> Message-ID: <20090802131616.GM5287@me> On 07:54 Sun 02 Aug , Hal Rosenstock wrote: > > > > They should care to pass a valid data - ib_smp_init_new() is a simple > > helper. > > > opensm too ? OK, a path overflow is theoretically possible only when a path is extended (a wrong extension would by itself overflow the osm_dr_path_t path buffer). So it should be enough to check for overflow in only three places: requery_dup_node_info() in osm_node_info_rcv.c, pi_rcv_process_switch_port() in osm_port_info_rcv.c, and state_mgr_get_remote_port_info() in osm_state_mgr.c. > Why replicate this simple check all over the place ? And if you wish to make a single-point check, then I guess the function osm_dr_path_extend() is the place (it is also called less frequently than ib_smp_init_new()).
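The single-point check suggested above could look roughly like the sketch below. It is an illustration only, not the OpenSM code: dr_path_t, dr_path_extend() and PATH_HOPS_MAX are hypothetical stand-ins for osm_dr_path_t, osm_dr_path_extend() and IB_SUBNET_PATH_HOPS_MAX, and the value 64 is an assumption. Testing the bound before touching hop_count is a choice made for this sketch, so that a failed call leaves the path unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stand-in for IB_SUBNET_PATH_HOPS_MAX from ib_types.h. */
#define PATH_HOPS_MAX 64

/* Hypothetical, simplified stand-in for osm_dr_path_t. */
typedef struct dr_path {
	uint8_t hop_count;
	uint8_t path[PATH_HOPS_MAX];
} dr_path_t;

/*
 * Append port_num to the directed-route path.
 * Returns 0 on success, -1 if there is no room for another hop.
 * The bound is tested before hop_count is modified, so a failed
 * call leaves the path exactly as it was.
 */
static int dr_path_extend(dr_path_t *p_path, uint8_t port_num)
{
	assert(p_path);

	if (p_path->hop_count + 1 >= PATH_HOPS_MAX)
		return -1;

	p_path->hop_count++;
	/* Location 0 in the path array is reserved per IB spec. */
	p_path->path[p_path->hop_count] = port_num;
	return 0;
}
```

A caller would test the return value and skip the query for that port on failure instead of sending a MAD built from a truncated or overflowed path.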
Sasha From hnrose at comcast.net Sun Aug 2 08:03:18 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 11:03:18 -0400 Subject: [ofa-general] [PATCH] opensm: osm_dr_path_extend can fail due to invalid hop count Message-ID: <20090802150318.GA20037@comcast.net> Change routine to return success/failure status rather than depend on debug assert Also, fix callers of this routine to handle this return status Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_path.h b/opensm/include/opensm/osm_path.h index 8d65d2c..7ef0fc5 100644 --- a/opensm/include/opensm/osm_path.h +++ b/opensm/include/opensm/osm_path.h @@ -2,6 +2,7 @@ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -188,15 +189,18 @@ osm_dr_path_init(IN osm_dr_path_t * const p_path, * * SYNOPSIS */ -static inline void +static inline boolean_t osm_dr_path_extend(IN osm_dr_path_t * const p_path, IN const uint8_t port_num) { p_path->hop_count++; - CL_ASSERT(p_path->hop_count < IB_SUBNET_PATH_HOPS_MAX); + + if (p_path->hop_count >= IB_SUBNET_PATH_HOPS_MAX) + return FALSE; /* Location 0 in the path array is reserved per IB spec. */ p_path->path[p_path->hop_count] = port_num; + return TRUE; } /* @@ -208,7 +212,7 @@ osm_dr_path_extend(IN osm_dr_path_t * const p_path, IN const uint8_t port_num) * [in] Additional port to add to the DR path. * * RETURN VALUE -* None. +* Boolean indicating whether or not path was extended. 
* * NOTES * diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index bfa5b1f..f5ef1ac 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -85,7 +86,10 @@ static void report_duplicated_guid(IN osm_sm_t * sm, osm_physp_t * p_physp, OSM_LOG_ERROR); path = *osm_physp_get_dr_path_ptr(p_new); - osm_dr_path_extend(&path, port_num); + if (!osm_dr_path_extend(&path, port_num)) + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D05: " + "DR path with hop count %d couldn't be extended\n", + path.hop_count); osm_dump_dr_path(sm->p_log, &path, OSM_LOG_ERROR); osm_log(sm->p_log, OSM_LOG_SYS, @@ -100,7 +104,12 @@ static void requery_dup_node_info(IN osm_sm_t * sm, osm_physp_t * p_physp, cl_status_t status; path = *osm_physp_get_dr_path_ptr(p_physp->p_remote_physp); - osm_dr_path_extend(&path, p_physp->p_remote_physp->port_num); + if (!osm_dr_path_extend(&path, p_physp->p_remote_physp->port_num)) { + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D08: " + "DR path with hop count %d couldn't be extended\n", + path.hop_count); + return; + } context.ni_context.node_guid = p_physp->p_remote_physp->p_node->node_info.port_guid; diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index 7b6fb1a..a451de7 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. 
All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -246,9 +247,15 @@ static void pi_rcv_process_switch_port(IN osm_sm_t * sm, IN osm_node_t * p_node, osm_physp_get_port_num(p_physp)) { path = *osm_physp_get_dr_path_ptr(p_physp); - osm_dr_path_extend(&path, - osm_physp_get_port_num - (p_physp)); + if (!osm_dr_path_extend(&path, + osm_physp_get_port_num + (p_physp))) { + OSM_LOG(sm->p_log, OSM_LOG_ERROR, + "ERR 0F08: " + "DR path with hop count %d couldn't be extended\n", + path.hop_count); + break; + } memset(&context, 0, sizeof(context)); context.ni_context.node_guid = diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index adc39a0..44b0f6c 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -166,7 +167,13 @@ static void state_mgr_get_remote_port_info(IN osm_sm_t * sm, /* generate a dr path leaving on the physp to the remote node */ p_dr_path = osm_physp_get_dr_path_ptr(p_physp); memcpy(&rem_node_dr_path, p_dr_path, sizeof(osm_dr_path_t)); - osm_dr_path_extend(&rem_node_dr_path, osm_physp_get_port_num(p_physp)); + if (!osm_dr_path_extend(&rem_node_dr_path, osm_physp_get_port_num(p_physp))) { + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 332D: " + "DR path with hop count %d couldn't be extended " + "so skipping PortInfo query\n", + p_dr_path->hop_count); + goto Exit; + } memset(&mad_context, 0, sizeof(mad_context)); @@ -187,6 +194,7 @@ static void state_mgr_get_remote_port_info(IN osm_sm_t * sm, OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 332E: " "Request for PortInfo failed\n"); +Exit: OSM_LOG_EXIT(sm->p_log); } From sashak at voltaire.com Sun Aug 2 08:07:12 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 18:07:12 +0300 Subject: [ofa-general] Re: [PATCHv5] opensm/osm_lash: Fix use after free problem in osm_mesh_node_delete In-Reply-To: <20090802124052.GA18247@comcast.net> References: <20090802124052.GA18247@comcast.net> Message-ID: <20090802150712.GP5287@me> On 08:40 Sun 02 Aug , Hal Rosenstock wrote: > > When osm_mesh_node_delete is called, osm_switch_delete may already have > been called so sw->p_sw is no longer valid to be used although it was > being used to obtain num_ports. > > Fix this by performing delete_mesh_switches in free_lash_structures. > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
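The ordering problem described in the quoted commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the OpenSM code: sw_obj_t, mesh_node_t, mesh_node_create() and mesh_node_delete() are hypothetical stand-ins for the switch object, the mesh node and osm_mesh_node_delete(). The point it shows is that the node only borrows the switch pointer, so the node has to be torn down while the switch is still alive.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the subnet-owned switch object. */
typedef struct sw_obj {
	unsigned num_ports;
} sw_obj_t;

/* Hypothetical stand-in for a mesh node; p_sw is borrowed, not owned. */
typedef struct mesh_node {
	sw_obj_t *p_sw;
	int **links;		/* one allocation per port */
} mesh_node_t;

/* Sketch only: assumes num_ports > 0 and ignores per-port calloc failures. */
static mesh_node_t *mesh_node_create(sw_obj_t *p_sw)
{
	unsigned i;
	mesh_node_t *node = calloc(1, sizeof(*node));

	if (!node)
		return NULL;
	node->p_sw = p_sw;
	node->links = calloc(p_sw->num_ports, sizeof(*node->links));
	if (!node->links) {
		free(node);
		return NULL;
	}
	for (i = 0; i < p_sw->num_ports; i++)
		node->links[i] = calloc(1, sizeof(int));
	return node;
}

/*
 * Reads node->p_sw->num_ports, so it must run while the switch
 * object is still alive. Returns the number of per-port slots walked.
 */
static unsigned mesh_node_delete(mesh_node_t *node)
{
	unsigned i, n;

	assert(node && node->p_sw);
	n = node->p_sw->num_ports;
	for (i = 0; i < n; i++)
		free(node->links[i]);	/* free(NULL) is harmless */
	free(node->links);
	free(node);
	return n;
}
```

In this sketch the safe order is mesh_node_delete() first and free() of the switch afterwards; reversing the two calls would make mesh_node_delete() read num_ports from freed memory.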
Sasha From hnrose at comcast.net Sun Aug 2 08:16:12 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 11:16:12 -0400 Subject: [ofa-general] [PATCH] opensm/ib_types.h: Update ib_mad_is_response description Message-ID: <20090802151612.GA27074@comcast.net> Also, fix typo in ib_smp_init_new TODO Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index beb7492..fe3f051 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -3779,7 +3779,8 @@ ib_mad_init_response(IN const ib_mad_t * const p_req_mad, * ib_mad_is_response * * DESCRIPTION -* Returns TRUE if the MAD is a response ('R' bit set), +* Returns TRUE if the MAD is a response ('R' bit set) +* or if the MAD is a TRAP REPRESS, * FALSE otherwise. * * SYNOPSIS @@ -4091,7 +4092,7 @@ static inline boolean_t OSM_API ib_smp_is_d(IN const ib_smp_t * const p_smp) * * TODO * This is too big for inlining, but leave it here for now -* since there is not yet another convient spot. +* since there is not yet another convenient spot. * * SYNOPSIS */ From sashak at voltaire.com Sun Aug 2 08:20:06 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 18:20:06 +0300 Subject: [ofa-general] Re: [PATCH] opensm: osm_dr_path_extend can fail due to invalid hop count In-Reply-To: <20090802150318.GA20037@comcast.net> References: <20090802150318.GA20037@comcast.net> Message-ID: <20090802152006.GR5287@me> On 11:03 Sun 02 Aug , Hal Rosenstock wrote: > > Change routine to return success/failure status rather than > depend on debug assert > Also, fix callers of this routine to handle this return status > > Signed-off-by: Hal Rosenstock > --- > diff --git a/opensm/include/opensm/osm_path.h b/opensm/include/opensm/osm_path.h > index 8d65d2c..7ef0fc5 100644 > --- a/opensm/include/opensm/osm_path.h > +++ b/opensm/include/opensm/osm_path.h > @@ -2,6 +2,7 @@ > * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. 
> * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -188,15 +189,18 @@ osm_dr_path_init(IN osm_dr_path_t * const p_path, > * > * SYNOPSIS > */ > -static inline void > +static inline boolean_t But why boolean? It is not logical operation, what is wrong with just int? Sasha From sashak at voltaire.com Sun Aug 2 08:21:26 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 18:21:26 +0300 Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: Update ib_mad_is_response description In-Reply-To: <20090802151612.GA27074@comcast.net> References: <20090802151612.GA27074@comcast.net> Message-ID: <20090802152126.GS5287@me> On 11:16 Sun 02 Aug , Hal Rosenstock wrote: > > Also, fix typo in ib_smp_init_new TODO > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From hnrose at comcast.net Sun Aug 2 08:22:04 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 11:22:04 -0400 Subject: [ofa-general] [PATCH] opensm/osm_path.h: In osm_dr_path_init, only copy needed part of path Message-ID: <20090802152204.GA27199@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_path.h b/opensm/include/opensm/osm_path.h index 8d65d2c..7ef0fc5 100644 --- a/opensm/include/opensm/osm_path.h +++ b/opensm/include/opensm/osm_path.h @@ -155,7 +156,7 @@ osm_dr_path_init(IN osm_dr_path_t * const p_path, CL_ASSERT(hop_count < IB_SUBNET_PATH_HOPS_MAX); p_path->h_bind = h_bind; p_path->hop_count = hop_count; - memcpy(p_path->path, path, IB_SUBNET_PATH_HOPS_MAX); + memcpy(p_path->path, path, hop_count + 1); } /* From hnrose at comcast.net Sun Aug 2 08:31:33 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 11:31:33 -0400 Subject: [ofa-general] [PATCHv2] opensm: osm_dr_path_extend can fail due to invalid hop count Message-ID: <20090802153133.GA29647@comcast.net> Change routine to return success/failure status rather than depend on debug assert Also, fix callers of this routine to handle this return status Signed-off-by: Hal Rosenstock --- Changes since v1: Make osm_dr_path_extend return int rather than boolean diff --git a/opensm/include/opensm/osm_path.h b/opensm/include/opensm/osm_path.h index 8d65d2c..d02576b 100644 --- a/opensm/include/opensm/osm_path.h +++ b/opensm/include/opensm/osm_path.h @@ -2,6 +2,7 @@ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -188,15 +189,18 @@ osm_dr_path_init(IN osm_dr_path_t * const p_path, * * SYNOPSIS */ -static inline void +static inline int osm_dr_path_extend(IN osm_dr_path_t * const p_path, IN const uint8_t port_num) { p_path->hop_count++; - CL_ASSERT(p_path->hop_count < IB_SUBNET_PATH_HOPS_MAX); + + if (p_path->hop_count >= IB_SUBNET_PATH_HOPS_MAX) + return -1; /* Location 0 in the path array is reserved per IB spec. */ p_path->path[p_path->hop_count] = port_num; + return 0; } /* @@ -208,7 +212,7 @@ osm_dr_path_extend(IN osm_dr_path_t * const p_path, IN const uint8_t port_num) * [in] Additional port to add to the DR path. * * RETURN VALUE -* None. +* Boolean indicating whether or not path was extended. * * NOTES * diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index bfa5b1f..c454d02 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -85,7 +86,10 @@ static void report_duplicated_guid(IN osm_sm_t * sm, osm_physp_t * p_physp, OSM_LOG_ERROR); path = *osm_physp_get_dr_path_ptr(p_new); - osm_dr_path_extend(&path, port_num); + if (osm_dr_path_extend(&path, port_num)) + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D05: " + "DR path with hop count %d couldn't be extended\n", + path.hop_count); osm_dump_dr_path(sm->p_log, &path, OSM_LOG_ERROR); osm_log(sm->p_log, OSM_LOG_SYS, @@ -100,7 +104,12 @@ static void requery_dup_node_info(IN osm_sm_t * sm, osm_physp_t * p_physp, cl_status_t status; path = *osm_physp_get_dr_path_ptr(p_physp->p_remote_physp); - osm_dr_path_extend(&path, p_physp->p_remote_physp->port_num); + if (osm_dr_path_extend(&path, p_physp->p_remote_physp->port_num)) { + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0D08: " + "DR path with hop count %d couldn't be extended\n", + path.hop_count); + return; + } context.ni_context.node_guid = p_physp->p_remote_physp->p_node->node_info.port_guid; diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index 7b6fb1a..57cc494 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -246,9 +247,15 @@ static void pi_rcv_process_switch_port(IN osm_sm_t * sm, IN osm_node_t * p_node, osm_physp_get_port_num(p_physp)) { path = *osm_physp_get_dr_path_ptr(p_physp); - osm_dr_path_extend(&path, - osm_physp_get_port_num - (p_physp)); + if (osm_dr_path_extend(&path, + osm_physp_get_port_num + (p_physp))) { + OSM_LOG(sm->p_log, OSM_LOG_ERROR, + "ERR 0F08: " + "DR path with hop count %d couldn't be extended\n", + path.hop_count); + break; + } memset(&context, 0, sizeof(context)); context.ni_context.node_guid = diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index adc39a0..90bef87 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -166,7 +167,13 @@ static void state_mgr_get_remote_port_info(IN osm_sm_t * sm, /* generate a dr path leaving on the physp to the remote node */ p_dr_path = osm_physp_get_dr_path_ptr(p_physp); memcpy(&rem_node_dr_path, p_dr_path, sizeof(osm_dr_path_t)); - osm_dr_path_extend(&rem_node_dr_path, osm_physp_get_port_num(p_physp)); + if (osm_dr_path_extend(&rem_node_dr_path, osm_physp_get_port_num(p_physp))) { + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 332D: " + "DR path with hop count %d couldn't be extended " + "so skipping PortInfo query\n", + p_dr_path->hop_count); + goto Exit; + } memset(&mad_context, 0, sizeof(mad_context)); @@ -187,6 +194,7 @@ static void state_mgr_get_remote_port_info(IN osm_sm_t * sm, OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 332E: " "Request for PortInfo failed\n"); +Exit: OSM_LOG_EXIT(sm->p_log); } From sashak at voltaire.com Sun Aug 2 08:51:30 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 18:51:30 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm: osm_dr_path_extend can fail due to invalid hop count In-Reply-To: <20090802153133.GA29647@comcast.net> References: <20090802153133.GA29647@comcast.net> Message-ID: <20090802155130.GT5287@me> On 11:31 Sun 02 Aug , Hal Rosenstock wrote: > > Change routine to return success/failure status rather than > depend on debug assert > Also, fix callers of this routine to handle this return status > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 2 08:54:10 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 18:54:10 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_path.h: In osm_dr_path_init, only copy needed part of path In-Reply-To: <20090802152204.GA27199@comcast.net> References: <20090802152204.GA27199@comcast.net> Message-ID: <20090802155410.GU5287@me> On 11:22 Sun 02 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From bugzilla-daemon at bugzilla.kernel.org Sun Aug 2 10:57:42 2009 From: bugzilla-daemon at bugzilla.kernel.org (bugzilla-daemon at bugzilla.kernel.org) Date: Sun, 2 Aug 2009 17:57:42 GMT Subject: [ofa-general] [Bug 13893] New: NULL pointer dereference by SRP initiator after restarting SRP target followed by SCSI reset of initiator Message-ID: http://bugzilla.kernel.org/show_bug.cgi?id=13893 Summary: NULL pointer dereference by SRP initiator after restarting SRP target followed by SCSI reset of initiator Product: Drivers Version: 2.5 Kernel Version: 2.6.30.3 Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: Infiniband/RDMA AssignedTo: drivers_infiniband-rdma at kernel-bugs.osdl.org ReportedBy: bart.vanassche at gmail.com Regression: No Setup of the target system: - SCST revision 1000. - Contents of /etc/scst.conf on the target: [HANDLER vdisk] DEVICE disk01,/dev/exported-block,NV_CACHE,512 [HANDLER vcdrom] [GROUP Default] [ASSIGNMENT Default] DEVICE disk01,0 [TARGETS enable] [TARGETS disable] - After having installed SCST, start it as follows: dd if=/dev/zero of=/dev/exported-block bs=1M count=1000 /etc/init.d/scst restart Setup of the initiator system: - Vanilla 2.6.30.3 kernel. 
- Once the target has been set up, import the SRP target as follows: rmmod ib_srp; modprobe ib_srp; ibsrpdm -c | while read target_info; do echo "${target_info}"; echo "${target_info}" > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target; done How to reproduce the NULL pointer dereference: - Run the following command on the target: /etc/init.d/scst restart - Run the following command on the initiator: sg_reset -d /dev/sdb Result: scsi host7: SRP reset_device called BUG: unable to handle kernel NULL pointer dereference at 0000000000000074 IP: [] srp_send_tsk_mgmt+0xb4/0x130 [ib_srp] PGD 51e7067 PUD 48543067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map CPU 0 Modules linked in: ib_srp iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_filter ip_tables x_tables vboxnetflt(N) vboxdrv(N) snd_pcm_oss snd_mixer_oss binfmt_misc snd_seq snd_seq_device rdma_ucm scsi_transport_srp scsi_tgt ib_ipoib ib_uverbs ib_umad ib_iser rdma_cm ib_cm iw_cm mlx4_ib ib_sa ipv6 ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq fuse loop dm_mod coretemp(N) snd_hda_intel snd_pcm snd_timer snd_page_alloc snd_hwdep ohci1394 i2c_i801 snd rtc_cmos mlx4_core sr_mod serio_raw pcspkr ieee1394 i2c_core intel_agp pata_marvell rtc_core skge soundcore button rtc_lib sky2 cdrom sg floppy uhci_hcd ehci_hcd sd_mod crc_t10dif usbcore edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix thermal processor thermal_sys hwmon pata_jmicron ahci libata scsi_mod dock [last unloaded: ib_srp] Supported: No Pid: 17736, comm: sg_reset Tainted: G 2.6.27.25-0.1-default #1 RIP: 0010:[] [] srp_send_tsk_mgmt+0xb4/0x130 [ib_srp] RSP: 0018:ffff88005e4ddbc8 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff8800623d8620 RCX: 0000000000000000 RDX: ffff8800778d2000 RSI: ffff88006f088d80 RDI: ffff8800623d8620 RBP: ffff8800623d8b40 R08: ffffffff806e2c70 R09:
0000000100000000 R10: 0000000000000046 R11: 0000000000000000 R12: ffff88006f088d80 R13: 0000000000000008 R14: ffff8800623d8000 R15: ffff88007e7d3c00 FS: 00007f3cab09f6f0(0000) GS:ffffffff80a43080(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000074 CR3: 00000000069b6000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process sg_reset (pid: 17736, threadinfo ffff88005e4dc000, task ffff8800095ca0c0) Stack: ffff8800623d82a8 0000000000000000 ffff8800623d8620 ffff8800623d8000 ffff8800381fd380 ffffffffa03f2ea5 ffff88005e4ddc38 ffff8800381fd380 ffff8800623d8000 0000000000000000 00007fff39b51144 ffffffffa0008351 Call Trace: [] srp_reset_device+0x77/0x101 [ib_srp] [] scsi_reset_provider+0xc8/0x18d [scsi_mod] [] scsi_nonblockable_ioctl+0x90/0xb5 [scsi_mod] [] sd_ioctl+0x61/0xc6 [sd_mod] [] blkdev_driver_ioctl+0x5d/0x72 [] blkdev_ioctl+0x1f5/0x217 [] block_ioctl+0x1b/0x20 [] vfs_ioctl+0x21/0x6c [] do_vfs_ioctl+0x222/0x231 [] sys_ioctl+0x51/0x73 [] system_call_fastpath+0x16/0x1b [<00007f3caac19b77>] 0x7f3caac19b77 Code: 00 4d 85 e4 0f 84 85 00 00 00 49 8b 54 24 08 31 c0 b9 0c 00 00 00 4c 89 e6 48 89 d7 f3 ab c6 02 01 48 89 df 48 8b 45 10 48 8b 00 <8b> 40 74 48 c1 e0 30 48 0f c8 48 89 42 14 8b 45 50 44 88 6a 1e RIP [] srp_send_tsk_mgmt+0xb4/0x130 [ib_srp] RSP CR2: 0000000000000074 ---[ end trace 4cec2e39421a0374 ]--- -- Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. 
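Editorial sketch, not part of any posted patch: the two osm_path.h changes in this thread (copy only hop_count + 1 path entries in osm_dr_path_init, and return 0 or non-zero from osm_dr_path_extend instead of relying on a debug assert) can be illustrated with a reduced stand-in. The names dr_path_t, dr_path_init, dr_path_extend and the constant PATH_HOPS_MAX are hypothetical, and the value 64 is an assumption made for illustration; the real definitions live in the OpenSM and IB headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed limit, for illustration only (stands in for IB_SUBNET_PATH_HOPS_MAX). */
#define PATH_HOPS_MAX 64

/* Hypothetical stand-in for osm_dr_path_t, reduced to the fields used here. */
typedef struct dr_path {
	uint8_t hop_count;
	uint8_t path[PATH_HOPS_MAX];
} dr_path_t;

/* Copy only the meaningful entries: index 0 is reserved, 1..hop_count are hops. */
static inline void dr_path_init(dr_path_t *p, uint8_t hop_count,
				const uint8_t *path)
{
	assert(hop_count < PATH_HOPS_MAX);
	p->hop_count = hop_count;
	memcpy(p->path, path, (size_t)hop_count + 1);
}

/*
 * Returns 0 if the path was extended, non-zero otherwise.
 * As in the posted patch, hop_count is incremented before the check,
 * so it stays incremented when the call fails.
 */
static inline int dr_path_extend(dr_path_t *p, uint8_t port_num)
{
	p->hop_count++;
	if (p->hop_count >= PATH_HOPS_MAX)
		return -1;
	p->path[p->hop_count] = port_num;
	return 0;
}
```

A caller would test the return value and skip the follow-up query on failure, as the callers in osm_node_info_rcv.c, osm_port_info_rcv.c and osm_state_mgr.c do in the v2 patch.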
From hnrose at comcast.net Sun Aug 2 11:48:05 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 14:48:05 -0400 Subject: [ofa-general] [PATCH] opensm/osm_path.h: Fix osm_dr_path_extend return values comment Message-ID: <20090802184805.GB15622@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_path.h b/opensm/include/opensm/osm_path.h index d02576b..da55aa8 100644 --- a/opensm/include/opensm/osm_path.h +++ b/opensm/include/opensm/osm_path.h @@ -211,8 +211,9 @@ osm_dr_path_extend(IN osm_dr_path_t * const p_path, IN const uint8_t port_num) * port_num * [in] Additional port to add to the DR path. * -* RETURN VALUE -* Boolean indicating whether or not path was extended. +* RETURN VALUES +* 0 indicates path was extended. +* Other than 0 indicates path was not extended. * * NOTES * From hnrose at comcast.net Sun Aug 2 11:47:16 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 14:47:16 -0400 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash.c: Remove osm_mesh_node_delete call from switch_delete Message-ID: <20090802184716.GA15622@comcast.net> osm_mesh_node_delete now called from free_lash_structures Mistakenly omitted from commit 46e56687e629cbd21cbca453bb088c90c20a38aa Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index a62cb3d..2715fe7 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -651,8 +651,6 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw static void switch_delete(lash_t *p_lash, switch_t * sw) { - osm_mesh_node_delete(p_lash, sw); - if (sw->dij_channels) free(sw->dij_channels); if (sw->p_sw) @@ -671,7 +669,6 @@ static void delete_mesh_switches(lash_t *p_lash) } } - static void free_lash_structures(lash_t * p_lash) { unsigned int i, j, k; From hnrose at comcast.net Sun Aug 2 11:48:49 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 
2009 14:48:49 -0400 Subject: [ofa-general] [PATCH] opensm/osm_helper.h: Fix some commentary typos Message-ID: <20090802184849.GC15622@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_helper.h b/opensm/include/opensm/osm_helper.h index 91c9f84..d76af8d 100644 --- a/opensm/include/opensm/osm_helper.h +++ b/opensm/include/opensm/osm_helper.h @@ -470,7 +470,7 @@ const char *osm_get_disp_msg_str(IN cl_disp_msgid_t msg); * [in] Dispatcher message ID value. * * RETURN VALUES -* Pointer to the message discription string. +* Pointer to the message description string. * * NOTES * @@ -509,7 +509,7 @@ const char *osm_get_sm_signal_str(IN osm_signal_t signal); * [in] Signal value * * RETURN VALUES -* Pointer to the signal discription string. +* Pointer to the signal description string. * * NOTES * @@ -548,7 +548,7 @@ const char *osm_get_sm_mgr_signal_str(IN osm_sm_signal_t signal); * [in] SM manager signal * * RETURN VALUES -* Pointer to the signal discription string. +* Pointer to the signal description string. * * NOTES * @@ -571,7 +571,7 @@ const char *osm_get_sm_mgr_state_str(IN uint16_t state); * [in] SM manager state * * RETURN VALUES -* Pointer to the state discription string. +* Pointer to the state description string. * * NOTES * From sashak at voltaire.com Sun Aug 2 12:18:53 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 22:18:53 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_lash.c: Remove osm_mesh_node_delete call from switch_delete In-Reply-To: <20090802184716.GA15622@comcast.net> References: <20090802184716.GA15622@comcast.net> Message-ID: <20090802191853.GV5287@me> On 14:47 Sun 02 Aug , Hal Rosenstock wrote: > > osm_mesh_node_delete now called from free_lash_structures > Mistakenly omitted from commit 46e56687e629cbd21cbca453bb088c90c20a38aa > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Sun Aug 2 12:19:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 22:19:27 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_path.h: Fix osm_dr_path_extend return values comment In-Reply-To: <20090802184805.GB15622@comcast.net> References: <20090802184805.GB15622@comcast.net> Message-ID: <20090802191927.GW5287@me> On 14:48 Sun 02 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 2 12:20:05 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 2 Aug 2009 22:20:05 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.h: Fix some commentary typos In-Reply-To: <20090802184849.GC15622@comcast.net> References: <20090802184849.GC15622@comcast.net> Message-ID: <20090802192005.GX5287@me> On 14:48 Sun 02 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From todd.rimmer at qlogic.com Sun Aug 2 14:45:21 2009 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Sun, 2 Aug 2009 16:45:21 -0500 Subject: [ofa-general] umad SLID and LMC In-Reply-To: References: <356B6978-3308-4EE9-8C00-00199558BDEA@redhat.com> <200907231121.00140.jackm@dev.mellanox.co.il> Message-ID: <5AEC2602AE03EB46BFC16C6B9B200DA81653EF696B@MNEXMB2.qlogic.org> What is the proper way to control the SLID used for outgoing umad sends? For example, when using LMC>0, the PathRecord returned from the SM for talking to a given remote node may have a SLID which is not the BaseLid for the sender. How does the sender ensure the correct SLID is used for the outgoing mad? In reviewing the API it seems like the only way to do this is: void *umad = umad_alloc(...); // call various umad calls to initialize address and contents umad_get_mad_addr(umad)->path_bits = lower LMC bits of SLID; umad_send(..., umad, ...); Was path_bits an intentional omission in the API?
It seems that a function which updates the ib_mad_addr in a umad from a given path record would be appropriate. Todd Rimmer Chief Architect QLogic Network Systems Group Voice: 610-233-4852 Fax: 610-233-4777 Todd.Rimmer at QLogic.com www.QLogic.com From arlin.r.davis at intel.com Sun Aug 2 16:26:52 2009 From: arlin.r.davis at intel.com (Arlin Davis) Date: Sun, 2 Aug 2009 16:26:52 -0700 Subject: [ofa-general] [PATCH] uDAPL v2: CNO support broken in both CMA and SCM providers. Message-ID: <7939FE16A13F4A7EA126666873A88141@amr.corp.intel.com> CQ thread/callback mechanism was removed by mistake. Still need indirect DTO callbacks when CNO is attached to EVD's. Add CQ event channel to cma provider's thread and add to select for rdma_cm and async channels. For scm provider there is no easy way to add this channel to the select across sockets on Windows. So, for portability reasons a 2nd thread is started to process the ASYNC and CQ channels for events. Must also disable EVD (evd_enabled=FALSE) during destroy to prevent EVD events firing for CNOs and re-arming CQ while CQ is being destroyed. Slight modification to dtest to check EVD after CNO timeout. Signed-off-by: Arlin Davis --- dapl/common/dapl_evd_util.c | 1 + dapl/openib_cma/dapl_ib_util.h | 5 +- dapl/openib_cma/device.c | 154 ++++--------- dapl/openib_common/cq.c | 192 +++++---------- dapl/openib_common/dapl_ib_common.h | 2 + dapl/openib_common/util.c | 98 ++++++++ dapl/openib_scm/dapl_ib_util.h | 5 + dapl/openib_scm/device.c | 458 ++++++++++++++++++++++++++++++----- test/dtest/dtest.c | 54 +++-- 9 files changed, 649 insertions(+), 320 deletions(-) diff --git a/dapl/common/dapl_evd_util.c b/dapl/common/dapl_evd_util.c index 88c3f8f..02909e9 100644 --- a/dapl/common/dapl_evd_util.c +++ b/dapl/common/dapl_evd_util.c @@ -469,6 +469,7 @@ DAT_RETURN dapls_evd_dealloc(IN DAPL_EVD * evd_ptr) * Destroy the CQ first, to keep any more callbacks from coming * up from it.
*/ + evd_ptr->evd_enabled = DAT_FALSE; if (evd_ptr->ib_cq_handle != IB_INVALID_HANDLE) { ia_ptr = evd_ptr->header.owner_ia; diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h index f466c06..c9ab4d6 100755 --- a/dapl/openib_cma/dapl_ib_util.h +++ b/dapl/openib_cma/dapl_ib_util.h @@ -84,7 +84,6 @@ typedef struct _ib_hca_transport { struct dapl_llist_entry entry; int destroy; - struct dapl_hca *d_hca; struct rdma_cm_id *cm_id; struct ibv_comp_channel *ib_cq; ib_cq_handle_t ib_cq_empty; @@ -99,6 +98,7 @@ typedef struct _ib_hca_transport /* device attributes */ int rd_atom_in; int rd_atom_out; + struct ibv_context *ib_ctx; struct ibv_device *ib_dev; /* dapls_modify_qp_state */ uint16_t lid; @@ -119,7 +119,8 @@ void dapli_thread(void *arg); DAT_RETURN dapli_ib_thread_init(void); void dapli_ib_thread_destroy(void); void dapli_cma_event_cb(void); -void dapli_async_event_cb(struct _ib_hca_transport *hca); +void dapli_async_event_cb(struct _ib_hca_transport *tp); +void dapli_cq_event_cb(struct _ib_hca_transport *tp); dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep); void dapls_ib_cm_free(dp_ib_cm_handle_t cm, DAPL_EP *ep); DAT_RETURN dapls_modify_qp_state(IN ib_qp_handle_t qp_handle, diff --git a/dapl/openib_cma/device.c b/dapl/openib_cma/device.c index 81203bf..743e8fa 100644 --- a/dapl/openib_cma/device.c +++ b/dapl/openib_cma/device.c @@ -123,6 +123,12 @@ static int dapls_config_verbs(struct ibv_context *verbs) return 0; } +static int dapls_config_comp_channel(struct ibv_comp_channel *channel) +{ + channel->comp_channel.Milliseconds = 0; + return 0; +} + static int dapls_thread_signal(void) { CompManagerCancel(windata.comp_mgr); @@ -205,6 +211,11 @@ static int dapls_config_verbs(struct ibv_context *verbs) return dapls_config_fd(verbs->async_fd); } +static int dapls_config_comp_channel(struct ibv_comp_channel *channel) +{ + return dapls_config_fd(channel->fd); +} + static int dapls_thread_signal(void) { return write(g_ib_pipe[1], "w", sizeof 
"w"); @@ -334,10 +345,6 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " open_hca: RDMA channel created (%p)\n", g_cm_events); - dat_status = dapli_ib_thread_init(); - if (dat_status != DAT_SUCCESS) - return dat_status; - /* HCA name will be hostname or IP address */ if (getipaddr((char *)hca_name, (char *)&hca_ptr->hca_address, @@ -357,6 +364,7 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) dapl_log(DAPL_DBG_TYPE_ERR, " open_hca: rdma_bind ERR %s." " Is %s configured?\n", strerror(errno), hca_name); + rdma_destroy_id(cm_id); return DAT_INVALID_ADDRESS; } @@ -366,6 +374,7 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) dapls_config_verbs(cm_id->verbs); hca_ptr->port_num = cm_id->port_num; hca_ptr->ib_trans.ib_dev = cm_id->verbs->device; + hca_ptr->ib_trans.ib_ctx = cm_id->verbs; gid = &cm_id->route.addr.addr.ibaddr.sgid; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, @@ -374,6 +383,21 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) (unsigned long long)ntohll(gid->global.subnet_prefix), (unsigned long long)ntohll(gid->global.interface_id)); + /* support for EVD's with CNO's: one channel via thread */ + hca_ptr->ib_trans.ib_cq = + ibv_create_comp_channel(hca_ptr->ib_hca_handle); + if (hca_ptr->ib_trans.ib_cq == NULL) { + dapl_log(DAPL_DBG_TYPE_ERR, + " open_hca: ibv_create_comp_channel ERR %s\n", + strerror(errno)); + rdma_destroy_id(cm_id); + return DAT_INTERNAL_ERROR; + } + if (dapls_config_comp_channel(hca_ptr->ib_trans.ib_cq)) { + rdma_destroy_id(cm_id); + return DAT_INTERNAL_ERROR; + } + /* set inline max with env or default, get local lid and gid 0 */ if (hca_ptr->ib_hca_handle->device->transport_type == IBV_TRANSPORT_IWARP) @@ -395,14 +419,17 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) /* set default IB MTU */ hca_ptr->ib_trans.mtu = dapl_ib_mtu(2048); + dat_status = 
dapli_ib_thread_init(); + if (dat_status != DAT_SUCCESS) + return dat_status; /* * Put new hca_transport on list for async and CQ event processing * Wakeup work thread to add to polling list */ - dapl_llist_init_entry((DAPL_LLIST_ENTRY *) & hca_ptr->ib_trans.entry); + dapl_llist_init_entry((DAPL_LLIST_ENTRY *) &hca_ptr->ib_trans.entry); dapl_os_lock(&g_hca_lock); dapl_llist_add_tail(&g_hca_list, - (DAPL_LLIST_ENTRY *) & hca_ptr->ib_trans.entry, + (DAPL_LLIST_ENTRY *) &hca_ptr->ib_trans.entry, &hca_ptr->ib_trans.entry); if (dapls_thread_signal() == -1) dapl_log(DAPL_DBG_TYPE_UTIL, @@ -425,7 +452,6 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) &hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff, hca_ptr->ib_trans.max_inline_send); - hca_ptr->ib_trans.d_hca = hca_ptr; return DAT_SUCCESS; } @@ -574,105 +600,6 @@ bail: " ib_thread_destroy(%d) exit\n", dapl_os_getpid()); } -void dapli_async_event_cb(struct _ib_hca_transport *hca) -{ - struct ibv_async_event event; - - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " async_event(%p)\n", hca); - - if (hca->destroy) - return; - - if (!ibv_get_async_event(hca->cm_id->verbs, &event)) { - - switch (event.event_type) { - case IBV_EVENT_CQ_ERR: - { - struct dapl_ep *evd_ptr = - event.element.cq->cq_context; - - dapl_log(DAPL_DBG_TYPE_ERR, - "dapl async_event CQ (%p) ERR %d\n", - evd_ptr, event.event_type); - - /* report up if async callback still setup */ - if (hca->async_cq_error) - hca->async_cq_error(hca->cm_id->verbs, - event.element.cq, - &event, - (void *)evd_ptr); - break; - } - case IBV_EVENT_COMM_EST: - { - /* Received msgs on connected QP before RTU */ - dapl_log(DAPL_DBG_TYPE_UTIL, - " async_event COMM_EST(%p) rdata beat RTU\n", - event.element.qp); - - break; - } - case IBV_EVENT_QP_FATAL: - case IBV_EVENT_QP_REQ_ERR: - case IBV_EVENT_QP_ACCESS_ERR: - case IBV_EVENT_QP_LAST_WQE_REACHED: - case IBV_EVENT_SRQ_ERR: - case IBV_EVENT_SRQ_LIMIT_REACHED: - case IBV_EVENT_SQ_DRAINED: - { - struct dapl_ep 
*ep_ptr = - event.element.qp->qp_context; - - dapl_log(DAPL_DBG_TYPE_ERR, - "dapl async_event QP (%p) ERR %d\n", - ep_ptr, event.event_type); - - /* report up if async callback still setup */ - if (hca->async_qp_error) - hca->async_qp_error(hca->cm_id->verbs, - ep_ptr->qp_handle, - &event, - (void *)ep_ptr); - break; - } - case IBV_EVENT_PATH_MIG: - case IBV_EVENT_PATH_MIG_ERR: - case IBV_EVENT_DEVICE_FATAL: - case IBV_EVENT_PORT_ACTIVE: - case IBV_EVENT_PORT_ERR: - case IBV_EVENT_LID_CHANGE: - case IBV_EVENT_PKEY_CHANGE: - case IBV_EVENT_SM_CHANGE: - { - dapl_log(DAPL_DBG_TYPE_WARN, - "dapl async_event: DEV ERR %d\n", - event.event_type); - - /* report up if async callback still setup */ - if (hca->async_unafiliated) - hca->async_unafiliated(hca->cm_id-> - verbs, &event, - hca-> - async_un_ctx); - break; - } - case IBV_EVENT_CLIENT_REREGISTER: - /* no need to report this event this time */ - dapl_log(DAPL_DBG_TYPE_UTIL, - " async_event: IBV_CLIENT_REREGISTER\n"); - break; - - default: - dapl_log(DAPL_DBG_TYPE_WARN, - "dapl async_event: %d UNKNOWN\n", - event.event_type); - break; - - } - ibv_ack_async_event(&event); - } -} - #if defined(_WIN64) || defined(_WIN32) /* work thread for uAT, uCM, CQ, and async events */ void dapli_thread(void *arg) @@ -721,6 +648,7 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); uhca[idx]->destroy = 2; } else { + dapli_cq_event_cb(uhca[idx]); dapli_async_event_cb(uhca[idx]); } } @@ -732,6 +660,7 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); } #else // _WIN64 || WIN32 + /* work thread for uAT, uCM, CQ, and async events */ void dapli_thread(void *arg) { @@ -771,7 +700,13 @@ void dapli_thread(void *arg) while (hca) { /* uASYNC events */ - ufds[++idx].fd = hca->cm_id->verbs->async_fd; + ufds[++idx].fd = hca->ib_ctx->async_fd; + ufds[idx].events = POLLIN; + ufds[idx].revents = 0; + uhca[idx] = hca; + + /* CQ events are non-direct with CNO's */ + ufds[++idx].fd = hca->ib_cq->fd; ufds[idx].events = POLLIN; 
ufds[idx].revents = 0; uhca[idx] = hca; @@ -809,9 +744,10 @@ void dapli_thread(void *arg) if (ufds[1].revents == POLLIN) dapli_cma_event_cb(); - /* check and process ASYNC events, per device */ + /* check and process CQ and ASYNC events, per device */ for (idx = 2; idx < fds; idx++) { if (ufds[idx].revents == POLLIN) { + dapli_cq_event_cb(uhca[idx]); dapli_async_event_cb(uhca[idx]); } } @@ -824,7 +760,7 @@ void dapli_thread(void *arg) strerror(errno)); /* cleanup any device on list marked for destroy */ - for (idx = 2; idx < fds; idx++) { + for (idx = 3; idx < fds; idx++) { if (uhca[idx] && uhca[idx]->destroy == 1) { dapl_os_lock(&g_hca_lock); dapl_llist_remove_entry( diff --git a/dapl/openib_common/cq.c b/dapl/openib_common/cq.c index 096167c..16d4f18 100644 --- a/dapl/openib_common/cq.c +++ b/dapl/openib_common/cq.c @@ -171,36 +171,32 @@ DAT_RETURN dapls_ib_get_async_event(IN ib_error_record_t * err_record, * DAT_INSUFFICIENT_RESOURCES * */ -#if defined(_WIN32) - DAT_RETURN dapls_ib_cq_alloc(IN DAPL_IA * ia_ptr, IN DAPL_EVD * evd_ptr, IN DAT_COUNT * cqlen) { - OVERLAPPED *overlap; + struct ibv_comp_channel *channel; DAT_RETURN ret; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, "dapls_ib_cq_alloc: evd %p cqlen=%d \n", evd_ptr, *cqlen); - evd_ptr->ib_cq_handle = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, - *cqlen, evd_ptr, NULL, 0); + if (!evd_ptr->cno_ptr) + channel = ibv_create_comp_channel(ia_ptr->hca_ptr->ib_hca_handle); + else + channel = ia_ptr->hca_ptr->ib_trans.ib_cq; - if (evd_ptr->ib_cq_handle == IB_INVALID_HANDLE) + if (!channel) return DAT_INSUFFICIENT_RESOURCES; - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " cq_object_create: (%p)\n", evd_ptr); + evd_ptr->ib_cq_handle = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, + *cqlen, evd_ptr, channel, 0); - overlap = &evd_ptr->ib_cq_handle->comp_entry.Overlap; - overlap->hEvent = CreateEvent(NULL, FALSE, FALSE, NULL); - if (!overlap->hEvent) { + if (evd_ptr->ib_cq_handle == IB_INVALID_HANDLE) { ret = DAT_INSUFFICIENT_RESOURCES; 
goto err; } - overlap->hEvent = (HANDLE) ((ULONG_PTR) overlap->hEvent | 1); - /* arm cq for events */ dapls_set_cq_notify(ia_ptr, evd_ptr); @@ -214,7 +210,8 @@ dapls_ib_cq_alloc(IN DAPL_IA * ia_ptr, return DAT_SUCCESS; err: - ibv_destroy_cq(evd_ptr->ib_cq_handle); + if (!evd_ptr->cno_ptr) + ibv_destroy_comp_channel(channel); return ret; } @@ -239,18 +236,18 @@ DAT_RETURN dapls_ib_cq_free(IN DAPL_IA * ia_ptr, IN DAPL_EVD * evd_ptr) { DAT_EVENT event; ib_work_completion_t wc; - HANDLE hevent; + struct ibv_comp_channel *channel; if (evd_ptr->ib_cq_handle != IB_INVALID_HANDLE) { /* pull off CQ and EVD entries and toss */ while (ibv_poll_cq(evd_ptr->ib_cq_handle, 1, &wc) == 1) ; while (dapl_evd_dequeue(evd_ptr, &event) == DAT_SUCCESS) ; - hevent = evd_ptr->ib_cq_handle->comp_entry.Overlap.hEvent; + channel = evd_ptr->ib_cq_handle->channel; if (ibv_destroy_cq(evd_ptr->ib_cq_handle)) return (dapl_convert_errno(errno, "ibv_destroy_cq")); - - CloseHandle(hevent); + if (!evd_ptr->cno_ptr) + ibv_destroy_comp_channel(channel); evd_ptr->ib_cq_handle = IB_INVALID_HANDLE; } return DAT_SUCCESS; @@ -262,105 +259,42 @@ dapls_evd_dto_wakeup(IN DAPL_EVD * evd_ptr) dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cq_object_wakeup: evd=%p\n", evd_ptr); - if (!SetEvent(evd_ptr->ib_cq_handle->comp_entry.Overlap.hEvent)) - return DAT_INTERNAL_ERROR; - + /* no wake up mechanism */ return DAT_SUCCESS; } -DAT_RETURN -dapls_evd_dto_wait(IN DAPL_EVD * evd_ptr, IN uint32_t timeout) +#if defined(_WIN32) +static int +dapls_wait_comp_channel(IN struct ibv_comp_channel *channel, IN uint32_t timeout) { - int status; - - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " cq_object_wait: EVD %p time %d\n", - evd_ptr, timeout); - - status = WaitForSingleObject(evd_ptr->ib_cq_handle-> - comp_entry.Overlap.hEvent, - timeout / 1000); - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " cq_object_wait: EVD %p status 0x%x\n", - evd_ptr, status); - if (status) - return DAT_TIMEOUT_EXPIRED; - - 
InterlockedExchange(&evd_ptr->ib_cq_handle->comp_entry.Busy, 0); - return DAT_SUCCESS; + channel->comp_channel.Milliseconds = + (timeout == DAT_TIMEOUT_INFINITE) ? INFINITE : timeout / 1000; + return 0; } #else // WIN32 -DAT_RETURN -dapls_ib_cq_alloc(IN DAPL_IA * ia_ptr, - IN DAPL_EVD * evd_ptr, IN DAT_COUNT * cqlen) -{ - struct ibv_comp_channel *channel; - DAT_RETURN ret; - - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - "dapls_ib_cq_alloc: evd %p cqlen=%d \n", evd_ptr, *cqlen); - - channel = ibv_create_comp_channel(ia_ptr->hca_ptr->ib_hca_handle); - if (!channel) - return DAT_INSUFFICIENT_RESOURCES; - - evd_ptr->ib_cq_handle = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, - *cqlen, evd_ptr, channel, 0); - - if (evd_ptr->ib_cq_handle == IB_INVALID_HANDLE) { - ret = DAT_INSUFFICIENT_RESOURCES; - goto err; - } - - /* arm cq for events */ - dapls_set_cq_notify(ia_ptr, evd_ptr); - - /* update with returned cq entry size */ - *cqlen = evd_ptr->ib_cq_handle->cqe; - - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - "dapls_ib_cq_alloc: new_cq %p cqlen=%d \n", - evd_ptr->ib_cq_handle, *cqlen); - - return DAT_SUCCESS; - -err: - ibv_destroy_comp_channel(channel); - return ret; -} - -DAT_RETURN dapls_ib_cq_free(IN DAPL_IA * ia_ptr, IN DAPL_EVD * evd_ptr) +static int +dapls_wait_comp_channel(IN struct ibv_comp_channel *channel, IN uint32_t timeout) { - DAT_EVENT event; - ib_work_completion_t wc; - struct ibv_comp_channel *channel; - - if (evd_ptr->ib_cq_handle != IB_INVALID_HANDLE) { - /* pull off CQ and EVD entries and toss */ - while (ibv_poll_cq(evd_ptr->ib_cq_handle, 1, &wc) == 1) ; - while (dapl_evd_dequeue(evd_ptr, &event) == DAT_SUCCESS) ; - - channel = evd_ptr->ib_cq_handle->channel; - if (ibv_destroy_cq(evd_ptr->ib_cq_handle)) - return (dapl_convert_errno(errno, "ibv_destroy_cq")); - - ibv_destroy_comp_channel(channel); - evd_ptr->ib_cq_handle = IB_INVALID_HANDLE; - } - return DAT_SUCCESS; -} - -DAT_RETURN -dapls_evd_dto_wakeup(IN DAPL_EVD * evd_ptr) -{ - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " 
cq_object_wakeup: evd=%p\n", evd_ptr); + int status, timeout_ms; + struct pollfd cq_fd = { + .fd = channel->fd, + .events = POLLIN, + .revents = 0 + }; - /* no wake up mechanism */ - return DAT_SUCCESS; + /* uDAPL timeout values in usecs */ + timeout_ms = (timeout == DAT_TIMEOUT_INFINITE) ? -1 : timeout / 1000; + status = poll(&cq_fd, 1, timeout_ms); + if (status > 0) + return 0; + else if (status == 0) + return ETIMEDOUT; + else + return status; } +#endif DAT_RETURN dapls_evd_dto_wait(IN DAPL_EVD * evd_ptr, IN uint32_t timeout) @@ -368,43 +302,45 @@ dapls_evd_dto_wait(IN DAPL_EVD * evd_ptr, IN uint32_t timeout) struct ibv_comp_channel *channel = evd_ptr->ib_cq_handle->channel; struct ibv_cq *ibv_cq = NULL; void *context; - int status = 0; - int timeout_ms = -1; - struct pollfd cq_fd = { - .fd = channel->fd, - .events = POLLIN, - .revents = 0 - }; + int status; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cq_object_wait: EVD %p time %d\n", evd_ptr, timeout); - /* uDAPL timeout values in usecs */ - if (timeout != DAT_TIMEOUT_INFINITE) - timeout_ms = timeout / 1000; - - status = poll(&cq_fd, 1, timeout_ms); - - /* returned event */ - if (status > 0) { + status = dapls_wait_comp_channel(channel, timeout); + if (!status) { if (!ibv_get_cq_event(channel, &ibv_cq, &context)) { ibv_ack_cq_events(ibv_cq, 1); } - status = 0; - - /* timeout */ - } else if (status == 0) - status = ETIMEDOUT; + } dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cq_object_wait: RET evd %p ibv_cq %p %s\n", evd_ptr, ibv_cq, strerror(errno)); - return (dapl_convert_errno(status, "cq_wait_object_wait")); + return dapl_convert_errno(status, "cq_wait_object_wait"); +} +void dapli_cq_event_cb(struct _ib_hca_transport *tp) +{ + /* check all comp events on this device */ + struct dapl_evd *evd = NULL; + struct ibv_cq *ibv_cq = NULL; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," dapli_cq_event_cb(%p)\n", tp); + + while (!ibv_get_cq_event(tp->ib_cq, &ibv_cq, (void*)&evd)) { + + if (!DAPL_BAD_HANDLE(evd, DAPL_MAGIC_EVD)) { + /* Both EVD 
or EVD->CNO event via callback */ + dapl_evd_dto_callback(tp->ib_ctx, + evd->ib_cq_handle, (void*)evd); + } + + ibv_ack_cq_events(ibv_cq, 1); + } } -#endif /* * dapl_ib_cq_resize diff --git a/dapl/openib_common/dapl_ib_common.h b/dapl/openib_common/dapl_ib_common.h index 0b417b8..2195767 100644 --- a/dapl/openib_common/dapl_ib_common.h +++ b/dapl/openib_common/dapl_ib_common.h @@ -208,6 +208,8 @@ typedef uint32_t ib_shm_transport_t; /* prototypes */ int32_t dapls_ib_init(void); int32_t dapls_ib_release(void); + +/* util.c */ enum ibv_mtu dapl_ib_mtu(int mtu); char *dapl_ib_mtu_str(enum ibv_mtu mtu); DAT_RETURN getlocalipaddr(DAT_SOCK_ADDR *addr, int addr_len); diff --git a/dapl/openib_common/util.c b/dapl/openib_common/util.c index da913c5..3963e1f 100644 --- a/dapl/openib_common/util.c +++ b/dapl/openib_common/util.c @@ -320,6 +320,104 @@ DAT_RETURN dapls_ib_setup_async_callback(IN DAPL_IA * ia_ptr, return DAT_SUCCESS; } +void dapli_async_event_cb(struct _ib_hca_transport *hca) +{ + struct ibv_async_event event; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " async_event(%p)\n", hca); + + if (hca->destroy) + return; + + if (!ibv_get_async_event(hca->ib_ctx, &event)) { + + switch (event.event_type) { + case IBV_EVENT_CQ_ERR: + { + struct dapl_ep *evd_ptr = + event.element.cq->cq_context; + + dapl_log(DAPL_DBG_TYPE_ERR, + "dapl async_event CQ (%p) ERR %d\n", + evd_ptr, event.event_type); + + /* report up if async callback still setup */ + if (hca->async_cq_error) + hca->async_cq_error(hca->ib_ctx, + event.element.cq, + &event, + (void *)evd_ptr); + break; + } + case IBV_EVENT_COMM_EST: + { + /* Received msgs on connected QP before RTU */ + dapl_log(DAPL_DBG_TYPE_UTIL, + " async_event COMM_EST(%p) rdata beat RTU\n", + event.element.qp); + + break; + } + case IBV_EVENT_QP_FATAL: + case IBV_EVENT_QP_REQ_ERR: + case IBV_EVENT_QP_ACCESS_ERR: + case IBV_EVENT_QP_LAST_WQE_REACHED: + case IBV_EVENT_SRQ_ERR: + case IBV_EVENT_SRQ_LIMIT_REACHED: + case IBV_EVENT_SQ_DRAINED: + { + 
struct dapl_ep *ep_ptr = + event.element.qp->qp_context; + + dapl_log(DAPL_DBG_TYPE_ERR, + "dapl async_event QP (%p) ERR %d\n", + ep_ptr, event.event_type); + + /* report up if async callback still setup */ + if (hca->async_qp_error) + hca->async_qp_error(hca->ib_ctx, + ep_ptr->qp_handle, + &event, + (void *)ep_ptr); + break; + } + case IBV_EVENT_PATH_MIG: + case IBV_EVENT_PATH_MIG_ERR: + case IBV_EVENT_DEVICE_FATAL: + case IBV_EVENT_PORT_ACTIVE: + case IBV_EVENT_PORT_ERR: + case IBV_EVENT_LID_CHANGE: + case IBV_EVENT_PKEY_CHANGE: + case IBV_EVENT_SM_CHANGE: + { + dapl_log(DAPL_DBG_TYPE_WARN, + "dapl async_event: DEV ERR %d\n", + event.event_type); + + /* report up if async callback still setup */ + if (hca->async_unafiliated) + hca->async_unafiliated(hca->ib_ctx, + &event, + hca->async_un_ctx); + break; + } + case IBV_EVENT_CLIENT_REREGISTER: + /* no need to report this event this time */ + dapl_log(DAPL_DBG_TYPE_UTIL, + " async_event: IBV_CLIENT_REREGISTER\n"); + break; + + default: + dapl_log(DAPL_DBG_TYPE_WARN, + "dapl async_event: %d UNKNOWN\n", + event.event_type); + break; + + } + ibv_ack_async_event(&event); + } +} + /* * dapls_set_provider_specific_attr * diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h index a5e734e..933364c 100644 --- a/dapl/openib_scm/dapl_ib_util.h +++ b/dapl/openib_scm/dapl_ib_util.h @@ -78,8 +78,11 @@ typedef dp_ib_cm_handle_t ib_cm_srvc_handle_t; /* ib_hca_transport_t, specific to this implementation */ typedef struct _ib_hca_transport { + struct dapl_llist_entry entry; + int destroy; union ibv_gid gid; struct ibv_device *ib_dev; + struct ibv_context *ib_ctx; ib_cq_handle_t ib_cq_empty; DAPL_OS_LOCK cq_lock; int max_inline_send; @@ -114,6 +117,8 @@ typedef struct _ib_hca_transport void cr_thread(void *arg); int dapli_cq_thread_init(struct dapl_hca *hca_ptr); void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr); +void dapli_async_event_cb(struct _ib_hca_transport *tp); +void dapli_cq_event_cb(struct 
_ib_hca_transport *tp); DAT_RETURN dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr); void dapls_print_cm_list(IN DAPL_IA *ia_ptr); dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep); diff --git a/dapl/openib_scm/device.c b/dapl/openib_scm/device.c index d5089aa..9c91b78 100644 --- a/dapl/openib_scm/device.c +++ b/dapl/openib_scm/device.c @@ -57,6 +57,96 @@ static const char rcsid[] = "$Id: $"; #include +ib_thread_state_t g_ib_thread_state = 0; +DAPL_OS_THREAD g_ib_thread; +DAPL_OS_LOCK g_hca_lock; +struct dapl_llist_entry *g_hca_list; + +void dapli_thread(void *arg); +DAT_RETURN dapli_ib_thread_init(void); +void dapli_ib_thread_destroy(void); + +#if defined(_WIN64) || defined(_WIN32) +#include "..\..\..\..\..\etc\user\comp_channel.cpp" +#include + +struct ibvw_windata windata; + +static int dapls_os_init(void) +{ + return ibvw_get_windata(&windata, IBVW_WINDATA_VERSION); +} + +static void dapls_os_release(void) +{ + if (windata.comp_mgr) + ibvw_release_windata(&windata, IBVW_WINDATA_VERSION); + windata.comp_mgr = NULL; +} + +static int dapls_config_verbs(struct ibv_context *verbs) +{ + verbs->channel.Milliseconds = 0; + return 0; +} + +static int dapls_config_comp_channel(struct ibv_comp_channel *channel) +{ + channel->comp_channel.Milliseconds = 0; + return 0; +} + +static int dapls_thread_signal(void) +{ + CompManagerCancel(windata.comp_mgr); + return 0; +} +#else // _WIN64 || WIN32 +int g_ib_pipe[2]; + +static int dapls_os_init(void) +{ + /* create pipe for waking up work thread */ + return pipe(g_ib_pipe); +} + +static void dapls_os_release(void) +{ + /* close pipe? 
*/ +} + +static int dapls_config_fd(int fd) +{ + int opts; + + opts = fcntl(fd, F_GETFL); + if (opts < 0 || fcntl(fd, F_SETFL, opts | O_NONBLOCK) < 0) { + dapl_log(DAPL_DBG_TYPE_ERR, + " dapls_config_fd: fcntl on fd %d ERR %d %s\n", + fd, opts, strerror(errno)); + return errno; + } + + return 0; +} + +static int dapls_config_verbs(struct ibv_context *verbs) +{ + return dapls_config_fd(verbs->async_fd); +} + +static int dapls_config_comp_channel(struct ibv_comp_channel *channel) +{ + return dapls_config_fd(channel->fd); +} + +static int dapls_thread_signal(void) +{ + return write(g_ib_pipe[1], "w", sizeof "w"); +} +#endif + + static int32_t create_cr_pipe(IN DAPL_HCA * hca_ptr) { DAPL_SOCKET listen_socket; @@ -130,35 +220,22 @@ static void destroy_cr_pipe(IN DAPL_HCA * hca_ptr) */ int32_t dapls_ib_init(void) { - return 0; -} + /* initialize hca_list */ + dapl_os_lock_init(&g_hca_lock); + dapl_llist_init_head(&g_hca_list); -int32_t dapls_ib_release(void) -{ - return 0; -} + if (dapls_os_init()) + return 1; -#if defined(_WIN64) || defined(_WIN32) -int dapls_config_comp_channel(struct ibv_comp_channel *channel) -{ return 0; } -#else // _WIN64 || WIN32 -int dapls_config_comp_channel(struct ibv_comp_channel *channel) -{ - int opts; - - opts = fcntl(channel->fd, F_GETFL); /* uCQ */ - if (opts < 0 || fcntl(channel->fd, F_SETFL, opts | O_NONBLOCK) < 0) { - dapl_log(DAPL_DBG_TYPE_ERR, - " dapls_create_comp_channel: fcntl on ib_cq->fd %d ERR %d %s\n", - channel->fd, opts, strerror(errno)); - return errno; - } +int32_t dapls_ib_release(void) +{ + dapli_ib_thread_destroy(); + dapls_os_release(); return 0; } -#endif /* * dapls_ib_open_hca @@ -213,7 +290,7 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) " open_hca: device %s not found\n", hca_name); goto err; - found: +found: dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " open_hca: Found dev %s %016llx\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev), (unsigned long long) @@ -227,6 +304,8 @@ DAT_RETURN 
dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) strerror(errno)); goto err; } + hca_ptr->ib_trans.ib_ctx = hca_ptr->ib_hca_handle; + dapls_config_verbs(hca_ptr->ib_hca_handle); /* get lid for this hca-port, network order */ if (ibv_query_port(hca_ptr->ib_hca_handle, @@ -271,15 +350,8 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) hca_ptr->ib_trans.mtu = dapl_ib_mtu(dapl_os_get_env_val("DAPL_IB_MTU", SCM_IB_MTU)); -#ifndef CQ_WAIT_OBJECT - /* initialize cq_lock */ - dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.cq_lock); - if (dat_status != DAT_SUCCESS) { - dapl_log(DAPL_DBG_TYPE_ERR, - " open_hca: failed to init cq_lock\n"); - goto bail; - } - /* EVD events without direct CQ channels, non-blocking */ + + /* EVD events without direct CQ channels, CNO support */ hca_ptr->ib_trans.ib_cq = ibv_create_comp_channel(hca_ptr->ib_hca_handle); if (hca_ptr->ib_trans.ib_cq == NULL) { @@ -288,18 +360,28 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) strerror(errno)); goto bail; } - - if (dapls_config_comp_channel(hca_ptr->ib_trans.ib_cq)) { - goto bail; - } - - if (dapli_cq_thread_init(hca_ptr)) { + dapls_config_comp_channel(hca_ptr->ib_trans.ib_cq); + + dat_status = dapli_ib_thread_init(); + if (dat_status != DAT_SUCCESS) { dapl_log(DAPL_DBG_TYPE_ERR, - " open_hca: cq_thread_init failed for %s\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev)); + " open_hca: failed to init cq thread lock\n"); goto bail; } -#endif /* CQ_WAIT_OBJECT */ + /* + * Put new hca_transport on list for async and CQ event processing + * Wakeup work thread to add to polling list + */ + dapl_llist_init_entry((DAPL_LLIST_ENTRY *)&hca_ptr->ib_trans.entry); + dapl_os_lock(&g_hca_lock); + dapl_llist_add_tail(&g_hca_list, + (DAPL_LLIST_ENTRY *) &hca_ptr->ib_trans.entry, + &hca_ptr->ib_trans.entry); + if (dapls_thread_signal() == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " open_hca: thread wakeup error = %s\n", + 
strerror(errno)); + dapl_os_unlock(&g_hca_lock); /* initialize cr_list lock */ dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.lock); @@ -333,7 +415,7 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) /* wait for thread */ while (hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { - dapl_os_sleep_usec(2000); + dapl_os_sleep_usec(1000); } dapl_dbg_log(DAPL_DBG_TYPE_UTIL, @@ -380,33 +462,297 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA * hca_ptr) { dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: %p\n", hca_ptr); -#ifndef CQ_WAIT_OBJECT - dapli_cq_thread_destroy(hca_ptr); - dapl_os_lock_destroy(&hca_ptr->ib_trans.cq_lock); -#endif /* CQ_WAIT_OBJECT */ - if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) { if (ibv_close_device(hca_ptr->ib_hca_handle)) return (dapl_convert_errno(errno, "ib_close_device")); hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; } + dapl_os_lock(&g_hca_lock); + if (g_ib_thread_state != IB_THREAD_RUN) { + dapl_os_unlock(&g_hca_lock); + return (DAT_SUCCESS); + } + dapl_os_unlock(&g_hca_lock); + /* destroy cr_thread and lock */ hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL; - if (send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0) == -1) - dapl_log(DAPL_DBG_TYPE_UTIL, - " thread_destroy: thread wakeup err = %s\n", - strerror(errno)); + send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0); while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) { dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: waiting for cr_thread\n"); - if (send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0) == -1) - dapl_log(DAPL_DBG_TYPE_UTIL, - " thread_destroy: thread wakeup err = %s\n", - strerror(errno)); - dapl_os_sleep_usec(2000); + send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0); + dapl_os_sleep_usec(1000); } dapl_os_lock_destroy(&hca_ptr->ib_trans.lock); destroy_cr_pipe(hca_ptr); /* no longer need pipe */ + + /* + * Remove hca from async event processing list + * Wakeup work thread to remove from polling list + */ + hca_ptr->ib_trans.destroy = 1; + if 
(dapls_thread_signal() == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " destroy: thread wakeup error = %s\n", + strerror(errno)); + + /* wait for thread to remove HCA references */ + while (hca_ptr->ib_trans.destroy != 2) { + if (dapls_thread_signal() == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " destroy: thread wakeup error = %s\n", + strerror(errno)); + dapl_os_sleep_usec(1000); + } + return (DAT_SUCCESS); } + +DAT_RETURN dapli_ib_thread_init(void) +{ + DAT_RETURN dat_status; + + dapl_os_lock(&g_hca_lock); + if (g_ib_thread_state != IB_THREAD_INIT) { + dapl_os_unlock(&g_hca_lock); + return DAT_SUCCESS; + } + + g_ib_thread_state = IB_THREAD_CREATE; + dapl_os_unlock(&g_hca_lock); + + /* create thread to process inbound connect request */ + dat_status = dapl_os_thread_create(dapli_thread, NULL, &g_ib_thread); + if (dat_status != DAT_SUCCESS) + return (dapl_convert_errno(errno, + "create_thread ERR:" + " check resource limits")); + + /* wait for thread to start */ + dapl_os_lock(&g_hca_lock); + while (g_ib_thread_state != IB_THREAD_RUN) { + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_init: waiting for ib_thread\n"); + dapl_os_unlock(&g_hca_lock); + dapl_os_sleep_usec(1000); + dapl_os_lock(&g_hca_lock); + } + dapl_os_unlock(&g_hca_lock); + + return DAT_SUCCESS; +} + +void dapli_ib_thread_destroy(void) +{ + int retries = 10; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_destroy(%d)\n", dapl_os_getpid()); + /* + * wait for async thread to terminate. 
+ * pthread_join would be the correct method + * but some applications have some issues + */ + + /* destroy ib_thread, wait for termination, if not already */ + dapl_os_lock(&g_hca_lock); + if (g_ib_thread_state != IB_THREAD_RUN) + goto bail; + + g_ib_thread_state = IB_THREAD_CANCEL; + if (dapls_thread_signal() == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " destroy: thread wakeup error = %s\n", + strerror(errno)); + while ((g_ib_thread_state != IB_THREAD_EXIT) && (retries--)) { + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_destroy: waiting for ib_thread\n"); + if (dapls_thread_signal() == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " destroy: thread wakeup error = %s\n", + strerror(errno)); + dapl_os_unlock(&g_hca_lock); + dapl_os_sleep_usec(2000); + dapl_os_lock(&g_hca_lock); + } +bail: + dapl_os_unlock(&g_hca_lock); + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_destroy(%d) exit\n", dapl_os_getpid()); +} + + +#if defined(_WIN64) || defined(_WIN32) +/* work thread for uAT, uCM, CQ, and async events */ +void dapli_thread(void *arg) +{ + struct _ib_hca_transport *hca; + struct _ib_hca_transport *uhca[8]; + COMP_CHANNEL *channel; + int ret, idx, cnt; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d,0x%x): ENTER: \n", + dapl_os_getpid(), g_ib_thread); + + dapl_os_lock(&g_hca_lock); + for (g_ib_thread_state = IB_THREAD_RUN; + g_ib_thread_state == IB_THREAD_RUN; + dapl_os_lock(&g_hca_lock)) { + + idx = 0; + hca = dapl_llist_is_empty(&g_hca_list) ? 
NULL : + dapl_llist_peek_head(&g_hca_list); + + while (hca) { + uhca[idx++] = hca; + hca = dapl_llist_next_entry(&g_hca_list, + (DAPL_LLIST_ENTRY *) + &hca->entry); + } + cnt = idx; + + dapl_os_unlock(&g_hca_lock); + ret = CompManagerPoll(windata.comp_mgr, INFINITE, &channel); + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread(%d) poll_event 0x%x\n", + dapl_os_getpid(), ret); + + + /* check and process ASYNC events, per device */ + for (idx = 0; idx < cnt; idx++) { + if (uhca[idx]->destroy == 1) { + dapl_os_lock(&g_hca_lock); + dapl_llist_remove_entry(&g_hca_list, + (DAPL_LLIST_ENTRY *) + &uhca[idx]->entry); + dapl_os_unlock(&g_hca_lock); + uhca[idx]->destroy = 2; + } else { + dapli_cq_event_cb(uhca[idx]); + dapli_async_event_cb(uhca[idx]); + } + } + } + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d) EXIT\n", + dapl_os_getpid()); + g_ib_thread_state = IB_THREAD_EXIT; + dapl_os_unlock(&g_hca_lock); +} +#else // _WIN64 || WIN32 + +/* work thread for uAT, uCM, CQ, and async events */ +void dapli_thread(void *arg) +{ + struct pollfd ufds[__FD_SETSIZE]; + struct _ib_hca_transport *uhca[__FD_SETSIZE] = { NULL }; + struct _ib_hca_transport *hca; + int ret, idx, fds; + char rbuf[2]; + + dapl_dbg_log(DAPL_DBG_TYPE_THREAD, + " ib_thread(%d,0x%x): ENTER: pipe %d \n", + dapl_os_getpid(), g_ib_thread, g_ib_pipe[0]); + + /* Poll across pipe, CM, AT never changes */ + dapl_os_lock(&g_hca_lock); + g_ib_thread_state = IB_THREAD_RUN; + + ufds[0].fd = g_ib_pipe[0]; /* pipe */ + ufds[0].events = POLLIN; + + while (g_ib_thread_state == IB_THREAD_RUN) { + + /* build ufds after pipe and uCMA events */ + ufds[0].revents = 0; + idx = 0; + + /* Walk HCA list and setup async and CQ events */ + if (!dapl_llist_is_empty(&g_hca_list)) + hca = dapl_llist_peek_head(&g_hca_list); + else + hca = NULL; + + while (hca) { + + /* uASYNC events */ + ufds[++idx].fd = hca->ib_ctx->async_fd; + ufds[idx].events = POLLIN; + ufds[idx].revents = 0; + uhca[idx] = hca; + + /* CQ events are non-direct with 
CNO's */ + ufds[++idx].fd = hca->ib_cq->fd; + ufds[idx].events = POLLIN; + ufds[idx].revents = 0; + uhca[idx] = hca; + + dapl_dbg_log(DAPL_DBG_TYPE_THREAD, + " ib_thread(%d) poll_fd: hca[%d]=%p," + " async=%d pipe=%d \n", + dapl_os_getpid(), hca, ufds[idx - 1].fd, + ufds[0].fd); + + hca = dapl_llist_next_entry(&g_hca_list, + (DAPL_LLIST_ENTRY *) + &hca->entry); + } + + /* unlock, and setup poll */ + fds = idx + 1; + dapl_os_unlock(&g_hca_lock); + ret = poll(ufds, fds, -1); + if (ret <= 0) { + dapl_dbg_log(DAPL_DBG_TYPE_THREAD, + " ib_thread(%d): ERR %s poll\n", + dapl_os_getpid(), strerror(errno)); + dapl_os_lock(&g_hca_lock); + continue; + } + + dapl_dbg_log(DAPL_DBG_TYPE_THREAD, + " ib_thread(%d) poll_event: " + " async=0x%x pipe=0x%x \n", + dapl_os_getpid(), ufds[idx].revents, + ufds[0].revents); + + /* check and process CQ and ASYNC events, per device */ + for (idx = 1; idx < fds; idx++) { + if (ufds[idx].revents == POLLIN) { + dapli_cq_event_cb(uhca[idx]); + dapli_async_event_cb(uhca[idx]); + } + } + + /* check and process user events, PIPE */ + if (ufds[0].revents == POLLIN) { + if (read(g_ib_pipe[0], rbuf, 2) == -1) + dapl_log(DAPL_DBG_TYPE_THREAD, + " cr_thread: pipe rd err= %s\n", + strerror(errno)); + + /* cleanup any device on list marked for destroy */ + for (idx = 1; idx < fds; idx++) { + if (uhca[idx] && uhca[idx]->destroy == 1) { + dapl_os_lock(&g_hca_lock); + dapl_llist_remove_entry( + &g_hca_list, + (DAPL_LLIST_ENTRY*) + &uhca[idx]->entry); + dapl_os_unlock(&g_hca_lock); + uhca[idx]->destroy = 2; + } + } + } + dapl_os_lock(&g_hca_lock); + } + + dapl_dbg_log(DAPL_DBG_TYPE_THREAD, " ib_thread(%d) EXIT\n", + dapl_os_getpid()); + g_ib_thread_state = IB_THREAD_EXIT; + dapl_os_unlock(&g_hca_lock); +} +#endif diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 77d78b2..739ccca 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -689,10 +689,9 @@ send_msg(void *data, LOGPRINTF("%d cno wait return evd_handle=%p\n", getpid(), evd); if (evd != 
h_dto_req_evd) { - fprintf(stderr, - "%d Error waiting on h_dto_cno: evd != h_dto_req_evd\n", - getpid()); - return (DAT_ABORT); + /* CNO timeout, already on EVD */ + if (evd != NULL) + return (ret); } } /* use wait to dequeue */ @@ -1085,10 +1084,9 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) LOGPRINTF("%d cno wait return evd_handle=%p\n", getpid(), evd); if (evd != h_dto_rcv_evd) { - fprintf(stderr, - "%d Error waiting on h_dto_cno: evd != h_dto_rcv_evd\n", - getpid()); - return (DAT_ABORT); + /* CNO timeout, already on EVD */ + if (evd != NULL) + return (ret); } } /* use wait to dequeue */ @@ -1319,10 +1317,9 @@ DAT_RETURN do_rdma_write_with_msg(void) LOGPRINTF("%d cno wait return evd_handle=%p\n", getpid(), evd); if (evd != h_dto_rcv_evd) { - fprintf(stderr, - "%d Error waiting on h_dto_cno: " - "evd != h_dto_rcv_evd\n", getpid()); - return (ret); + /* CNO timeout, already on EVD */ + if (evd != NULL) + return (ret); } } /* use wait to dequeue */ @@ -1446,10 +1443,9 @@ DAT_RETURN do_rdma_read_with_msg(void) LOGPRINTF("%d cno wait return evd_handle=%p\n", getpid(), evd); if (evd != h_dto_req_evd) { - fprintf(stderr, - "%d Error waiting on h_dto_cno: evd != h_dto_req_evd\n", - getpid()); - return (DAT_ABORT); + /* CNO timeout, already on EVD */ + if (evd != NULL) + return (ret); } } /* use wait to dequeue */ @@ -1501,6 +1497,15 @@ DAT_RETURN do_rdma_read_with_msg(void) */ printf("%d Sending RDMA read completion message\n", getpid()); + /* give remote chance to process read completes */ + if (use_cno) { +#if defined(_WIN32) || defined(_WIN64) + Sleep(1000); +#else + sleep(1); +#endif + } + ret = send_msg(&rmr_send_msg, sizeof(DAT_RMR_TRIPLET), lmr_context_send_msg, @@ -1525,14 +1530,14 @@ DAT_RETURN do_rdma_read_with_msg(void) LOGPRINTF("%d waiting for message receive event\n", getpid()); if (use_cno) { DAT_EVD_HANDLE evd = DAT_HANDLE_NULL; - ret = dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd); + + ret = dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd); 
LOGPRINTF("%d cno wait return evd_handle=%p\n", getpid(), evd); if (evd != h_dto_rcv_evd) { - fprintf(stderr, - "%d Error waiting on h_dto_cno: evd != h_dto_rcv_evd\n", - getpid()); - return (ret); + /* CNO timeout, already on EVD */ + if (evd != NULL) + return (ret); } } /* use wait to dequeue */ @@ -1693,10 +1698,9 @@ DAT_RETURN do_ping_pong_msg() LOGPRINTF("%d cno wait return evd_handle=%p\n", getpid(), evd); if (evd != h_dto_rcv_evd) { - fprintf(stderr, - "%d Error waiting on h_dto_cno: evd != h_dto_rcv_evd\n", - getpid()); - return (ret); + /* CNO timeout, already on EVD */ + if (evd != NULL) + return (ret); } } /* use wait to dequeue */ -- 1.5.2.5 From hnrose at comcast.net Sun Aug 2 17:14:45 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 20:14:45 -0400 Subject: [ofa-general] [PATCH] opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, indicate failed attribute Message-ID: <20090803001444.GA26324@comcast.net> Display attribute name when appropriate Also, cosmetic changes in other log messages Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_sa_mad_ctrl.c b/opensm/opensm/osm_sa_mad_ctrl.c index eeec51c..135c666 100644 --- a/opensm/opensm/osm_sa_mad_ctrl.c +++ b/opensm/opensm/osm_sa_mad_ctrl.c @@ -213,7 +213,7 @@ static void sa_mad_ctrl_process(IN osm_sa_mad_ctrl_t * p_ctrl, default: OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 1A01: " - "Unsupported attribute = 0x%X\n", + "Unsupported attribute 0x%X\n", cl_ntoh16(p_sa_mad->attr_id)); osm_dump_sa_mad(p_ctrl->p_log, p_sa_mad, OSM_LOG_ERROR); } @@ -233,9 +233,10 @@ static void sa_mad_ctrl_process(IN osm_sa_mad_ctrl_t * p_ctrl, if (status != CL_SUCCESS) { OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 1A02: " - "Dispatcher post message failed (%s) for attribute = 0x%X\n", + "Dispatcher post message failed (%s) for attribute 0x%X (%s)\n", CL_STATUS_MSG(status), - cl_ntoh16(p_sa_mad->attr_id)); + cl_ntoh16(p_sa_mad->attr_id), + ib_get_sa_attr_str(p_sa_mad->attr_id)); 
osm_mad_pool_put(p_ctrl->p_mad_pool, p_madw); goto Exit; diff --git a/opensm/opensm/osm_sm_mad_ctrl.c b/opensm/opensm/osm_sm_mad_ctrl.c index f941748..791c848 100644 --- a/opensm/opensm/osm_sm_mad_ctrl.c +++ b/opensm/opensm/osm_sm_mad_ctrl.c @@ -254,7 +254,7 @@ static void sm_mad_ctrl_process_get_resp(IN osm_sm_mad_ctrl_t * p_ctrl, default: cl_atomic_inc(&p_ctrl->p_stats->qp0_mads_rcvd_unknown); OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3103: " - "Unsupported attribute = 0x%X\n", + "Unsupported attribute 0x%X\n", cl_ntoh16(p_smp->attr_id)); osm_dump_dr_smp(p_ctrl->p_log, p_smp, OSM_LOG_ERROR); goto Exit; @@ -276,8 +276,9 @@ static void sm_mad_ctrl_process_get_resp(IN osm_sm_mad_ctrl_t * p_ctrl, if (status != CL_SUCCESS) { OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3104: " - "Dispatcher post message failed (%s) for attribute = 0x%X\n", - CL_STATUS_MSG(status), cl_ntoh16(p_smp->attr_id)); + "Dispatcher post message failed (%s) for attribute 0x%X (%s)\n", + CL_STATUS_MSG(status), cl_ntoh16(p_smp->attr_id), + ib_get_sm_attr_str(p_smp->attr_id)); goto Exit; } @@ -316,7 +317,7 @@ static void sm_mad_ctrl_process_get(IN osm_sm_mad_ctrl_t * p_ctrl, default: cl_atomic_inc(&p_ctrl->p_stats->qp0_mads_rcvd_unknown); OSM_LOG(p_ctrl->p_log, OSM_LOG_VERBOSE, - "Ignoring SubnGet MAD - unsupported attribute = 0x%X\n", + "Ignoring SubnGet MAD - unsupported attribute 0x%X\n", cl_ntoh16(p_smp->attr_id)); break; } @@ -393,7 +394,7 @@ static void sm_mad_ctrl_process_set(IN osm_sm_mad_ctrl_t * p_ctrl, default: cl_atomic_inc(&p_ctrl->p_stats->qp0_mads_rcvd_unknown); OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3107: " - "Unsupported attribute = 0x%X\n", + "Unsupported attribute 0x%X\n", cl_ntoh16(p_smp->attr_id)); osm_dump_dr_smp(p_ctrl->p_log, p_smp, OSM_LOG_ERROR); break; @@ -480,7 +481,7 @@ static void sm_mad_ctrl_process_trap(IN osm_sm_mad_ctrl_t * p_ctrl, default: cl_atomic_inc(&p_ctrl->p_stats->qp0_mads_rcvd_unknown); OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3109: " - "Unsupported 
attribute = 0x%X\n", + "Unsupported attribute 0x%X\n", cl_ntoh16(p_smp->attr_id)); osm_dump_dr_smp(p_ctrl->p_log, p_smp, OSM_LOG_ERROR); break; @@ -555,7 +556,7 @@ static void sm_mad_ctrl_process_trap_repress(IN osm_sm_mad_ctrl_t * p_ctrl, default: cl_atomic_inc(&p_ctrl->p_stats->qp0_mads_rcvd_unknown); OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3105: " - "Unsupported attribute = 0x%X\n", + "Unsupported attribute 0x%X\n", cl_ntoh16(p_smp->attr_id)); osm_dump_dr_smp(p_ctrl->p_log, p_smp, OSM_LOG_ERROR); break; @@ -724,7 +725,9 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw) p_smp->attr_id == IB_MAD_ATTR_SWITCH_INFO || p_smp->attr_id == IB_MAD_ATTR_LIN_FWD_TBL)) { OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3119: " - "Set method failed\n"); + "Set method failed for attribute 0x%X (%s)\n", + cl_ntoh16(p_smp->attr_id), + ib_get_sm_attr_str(p_smp->attr_id)); p_ctrl->p_subn->subnet_initialization_error = TRUE; } From hnrose at comcast.net Sun Aug 2 17:15:32 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 2 Aug 2009 20:15:32 -0400 Subject: [ofa-general] [PATCH] infiniband-diags/ibnetdiscover.8: Add max hops option Message-ID: <20090803001532.GB26324@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8 index 692994b..5841d8e 100644 --- a/infiniband-diags/man/ibnetdiscover.8 +++ b/infiniband-diags/man/ibnetdiscover.8 @@ -1,11 +1,11 @@ -.TH IBNETDISCOVER 8 "January 3, 2008" "OpenIB" "OpenIB Diagnostics" +.TH IBNETDISCOVER 8 "May 13, 2009" "OpenIB" "OpenIB Diagnostics" .SH NAME ibnetdiscover \- discover InfiniBand topology .SH SYNOPSIS .B ibnetdiscover -[\-d(ebug)] [\-e(rr_show)] [\-v(erbose)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map ] [\-p(orts)] [\-h(elp)] [] +[\-d(ebug)] [\-e(rr_show)] [\-v(erbose)] [\-s(how)] 
[\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map ] [\-p(orts)] [\-m(ax_hops)] [\-h(elp)] [] .SH DESCRIPTION .PP @@ -47,6 +47,9 @@ names. See file format below. Obtain a ports report which is a list of connected ports with relevant information (like LID, portnum, GUID, width, speed, and NodeDescription). +.TP +\fB\-m\fR, \fB\-\-max_hops\fR +Report max hops discovered. .SH COMMON OPTIONS From eli at mellanox.co.il Mon Aug 3 02:25:29 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 3 Aug 2009 12:25:29 +0300 Subject: [ofa-general] [PATCH] cma: fix access to freed memory Message-ID: <20090803092528.GA25528@mtls03> rdma_join_multicast() allocates struct cma_multicast and then proceeds to join a multicast address. However, the join operation completes in another context, and the allocated struct could be released if the user either destroys the rdma_id object or decides to leave the multicast group while the join is in progress. This patch uses reference counting to avoid such a situation. It also protects removal from id_priv->mc_list in cma_leave_mc_groups().
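For readers following the description above, here is a minimal userspace sketch of the reference-counting idea, not the kernel code from the patch. It assumes C11 atomics in place of the kernel's atomic_t, and the names mc_entry, mc_alloc, mc_ref, mc_deref and mc_demo are hypothetical, chosen only for illustration. The entry starts with one reference for the owner's list, takes a second one while an asynchronous join is in flight, and is freed only when the last holder drops its reference.

```c
/* Userspace sketch only: C11 atomics stand in for the kernel's atomic_t. */
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cma_multicast. */
struct mc_entry {
	atomic_int refcount;
	int *freed_flag;	/* lets the demo observe when the entry is freed */
};

static struct mc_entry *mc_alloc(int *freed_flag)
{
	struct mc_entry *mc = malloc(sizeof(*mc));

	if (!mc)
		return NULL;
	atomic_init(&mc->refcount, 1);	/* reference held by the owner's list */
	mc->freed_flag = freed_flag;
	return mc;
}

static void mc_ref(struct mc_entry *mc)
{
	atomic_fetch_add(&mc->refcount, 1);
}

static void mc_deref(struct mc_entry *mc)
{
	/* atomic_fetch_sub returns the previous value; 1 means last reference */
	if (atomic_fetch_sub(&mc->refcount, 1) == 1) {
		*mc->freed_flag = 1;
		free(mc);
	}
}

/* Returns 1 if the entry outlives an early leave and is freed by the completion. */
static int mc_demo(void)
{
	int freed = 0;
	struct mc_entry *mc = mc_alloc(&freed);

	if (!mc)
		return 0;
	mc_ref(mc);		/* extra reference for the in-flight join */
	mc_deref(mc);		/* user leaves the group before the join completes */
	if (freed)
		return 0;	/* would mean the completion touches freed memory */
	mc_deref(mc);		/* completion handler drops its reference */
	return freed;
}
```

In this sketch the order of the two mc_deref() calls does not matter: whichever side runs last frees the entry, which is the property the patch description relies on.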
Signed-off-by: Eli Cohen --- drivers/infiniband/core/cma.c | 23 +++++++++++++++++++---- 1 files changed, 19 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 851de83..8fee477 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -157,6 +157,7 @@ struct cma_multicast { struct list_head list; void *context; struct sockaddr_storage addr; + atomic_t refcount; }; struct cma_work { @@ -290,6 +291,12 @@ static inline void cma_deref_dev(struct cma_device *cma_dev) complete(&cma_dev->comp); } +void cma_deref_mc(struct cma_multicast *mc) +{ + if (atomic_dec_and_test(&mc->refcount)) + kfree(mc); +} + static void cma_detach_from_dev(struct rdma_id_private *id_priv) { list_del(&id_priv->list); @@ -822,13 +829,17 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv) { struct cma_multicast *mc; + spin_lock_irq(&id_priv->lock); while (!list_empty(&id_priv->mc_list)) { mc = container_of(id_priv->mc_list.next, struct cma_multicast, list); list_del(&mc->list); + spin_unlock_irq(&id_priv->lock); ib_sa_free_multicast(mc->multicast.ib); - kfree(mc); + cma_deref_mc(mc); + spin_lock_irq(&id_priv->lock); } + spin_unlock_irq(&id_priv->lock); } void rdma_destroy_id(struct rdma_cm_id *id) @@ -2643,7 +2654,7 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast) id_priv = mc->id_priv; if (cma_disable_callback(id_priv, CMA_ADDR_BOUND) && cma_disable_callback(id_priv, CMA_ADDR_RESOLVED)) - return 0; + goto out; mutex_lock(&id_priv->qp_mutex); if (!status && id_priv->id.qp) @@ -2669,10 +2680,12 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast) cma_exch(id_priv, CMA_DESTROYING); mutex_unlock(&id_priv->handler_mutex); rdma_destroy_id(&id_priv->id); - return 0; + goto out; } mutex_unlock(&id_priv->handler_mutex); +out: + cma_deref_mc(mc); return 0; } @@ -2759,11 +2772,13 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, 
memcpy(&mc->addr, addr, ip_addr_size(addr)); mc->context = context; mc->id_priv = id_priv; + atomic_set(&mc->refcount, 1); spin_lock(&id_priv->lock); list_add(&mc->list, &id_priv->mc_list); spin_unlock(&id_priv->lock); + atomic_inc(&mc->refcount); switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: ret = cma_join_ib_multicast(id_priv, mc); @@ -2800,7 +2815,7 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr) &mc->multicast.ib->rec.mgid, mc->multicast.ib->rec.mlid); ib_sa_free_multicast(mc->multicast.ib); - kfree(mc); + cma_deref_mc(mc); return; } } -- 1.6.3.3 From vlad at lists.openfabrics.org Mon Aug 3 03:23:45 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 3 Aug 2009 03:23:45 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090803-0200 daily build status Message-ID: <20090803102346.185DC102020F@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 
Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one': /home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class' /home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_recvmsg': /home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2120: error: too many arguments to function 'skb_unlink' /home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2127: error: too many arguments to function 'skb_unlink' make[4]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_recvmsg': /home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2120: error: too many arguments to function 'skb_unlink' /home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:2127: error: too many arguments to function 'skb_unlink' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090803-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From sashak at voltaire.com Mon Aug 3 05:04:24 2009 From: sashak at voltaire.com (Sasha 
Khapyorsky)
Date: Mon, 3 Aug 2009 15:04:24 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, indicate failed attribute
In-Reply-To: <20090803001444.GA26324@comcast.net>
References: <20090803001444.GA26324@comcast.net>
Message-ID: <20090803120424.GY5287@me>

On 20:14 Sun 02 Aug     , Hal Rosenstock wrote:
>
> Display attribute name when appropriate
> Also, cosmetic changes in other log messages
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From sashak at voltaire.com Mon Aug 3 05:05:35 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 3 Aug 2009 15:05:35 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibnetdiscover.8: Add max hops option
In-Reply-To: <20090803001532.GB26324@comcast.net>
References: <20090803001532.GB26324@comcast.net>
Message-ID: <20090803120535.GZ5287@me>

On 20:15 Sun 02 Aug     , Hal Rosenstock wrote:
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From kliteyn at dev.mellanox.co.il Mon Aug 3 06:04:29 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 03 Aug 2009 16:04:29 +0300
Subject: [ofa-general] [PATCH] opensm: do not configure MFTs when mcast support is disabled
Message-ID: <4A76E05D.9070705@dev.mellanox.co.il>

Hi Sasha,

I noticed that when MCast support in OSM is disabled (command line
option '-d3'), MFTs on the switches are still getting configured.
It turns out that MFT configuration was disabled only in the heavy sweep,
but it was still running at idle time. The following patch fixes it.

Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_state_mgr.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 90bef87..185c700 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1377,8 +1377,10 @@ static void do_process_mgrp_queue(osm_sm_t * sm) { if (sm->p_subn->sm_state != IB_SMINFO_STATE_MASTER) return; - osm_mcast_mgr_process_mgroups(sm); - wait_for_pending_transactions(&sm->p_subn->p_osm->stats); + if (!sm->p_subn->opt.disable_multicast) { + osm_mcast_mgr_process_mgroups(sm); + wait_for_pending_transactions(&sm->p_subn->p_osm->stats); + } } void osm_state_mgr_process(IN osm_sm_t * sm, IN osm_signal_t signal) -- 1.5.1.4 From bart.vanassche at gmail.com Mon Aug 3 06:21:21 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Mon, 3 Aug 2009 15:21:21 +0200 Subject: [ofa-general] [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed Message-ID: Issuing a SCSI reset command on an SRP initiator after the SRP connection has been closed triggers a NULL pointer dereference. The patch below fixes this NULL pointer dereference. See also http://bugzilla.kernel.org/show_bug.cgi?id=13893. 
Signed-off-by: Cc: Roland Dreier Cc: Sean Hefty Cc: Hal Rosenstock --- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c 2009-08-03 12:13:11.000000000 +0200 +++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c 2009-08-03 14:58:36.000000000 +0200 @@ -1330,6 +1330,8 @@ static int srp_send_tsk_mgmt(struct srp_ struct srp_iu *iu; struct srp_tsk_mgmt *tsk_mgmt; + BUG_ON(!req->scmnd->device); + spin_lock_irq(target->scsi_host->host_lock); if (target->state == SRP_TARGET_DEAD || @@ -1429,6 +1431,8 @@ static int srp_reset_device(struct scsi_ return FAILED; if (req->tsk_status) return FAILED; + if (!req->scmnd->device) + return FAILED; spin_lock_irq(target->scsi_host->host_lock); From sebastien.dugue at bull.net Mon Aug 3 06:40:01 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Mon, 3 Aug 2009 15:40:01 +0200 Subject: [ofa-general] [PATCH] libmlx4 - mmap needs some includes Message-ID: <20090803154001.32fdab08@frecb007965> Hi Roland, Add errno.h and sys/mman.h includes in buf.c to get mmap() support, otherwise we cannot build as is. Those includes were removed in your cleanup. Sorry for not noticing earlier. Signed-off-by: Sebastien Dugue --- src/buf.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/src/buf.c b/src/buf.c index bbaff12..a80bcb1 100644 --- a/src/buf.c +++ b/src/buf.c @@ -35,6 +35,8 @@ #endif /* HAVE_CONFIG_H */ #include +#include +#include #include "mlx4.h" -- 1.6.3.1 From sebastien.dugue at bull.net Mon Aug 3 07:37:36 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Mon, 3 Aug 2009 16:37:36 +0200 Subject: [ofa-general] [PATCH] libmlx4: use dynamic archive name when building an rpm Message-ID: <20090803163736.1fde4c74@frecb007965> There is a discrepancy between the tar.gz source archive name and the library version. rpmbuild then fails to find its source files. Fix this by dynamically setting the package version into the archive name . 
Signed-off-by: Sebastien Dugue --- libmlx4.spec.in | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/libmlx4.spec.in b/libmlx4.spec.in index 869032f..7c6c841 100644 --- a/libmlx4.spec.in +++ b/libmlx4.spec.in @@ -6,7 +6,7 @@ Summary: Mellanox ConnectX InfiniBand HCA Userspace Driver Group: System Environment/Libraries License: GPLv2 or BSD Url: http://openfabrics.org/ -Source: http://openfabrics.org/downloads/mlx4/libmlx4-1.0.tar.gz +Source: http://openfabrics.org/downloads/mlx4/libmlx4-%{version}.tar.gz BuildRoot: %(mktemp -ud %{_tmppath}/%{name}-%{version}-%{release}-XXXXXX) BuildRequires: libibverbs-devel >= 1.1-0.1.rc2 -- 1.6.0.4 From yosefe at voltaire.com Mon Aug 3 10:03:14 2009 From: yosefe at voltaire.com (Yossi Etigin) Date: Mon, 03 Aug 2009 20:03:14 +0300 Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes In-Reply-To: References: <4A6DDFCE.9060009@voltaire.com> <4A703DA4.9080300@Voltaire.COM> <4A705B3A.7060404@Voltaire.COM> <4A731818.3060500@voltaire.com> <4A733D24.3040201@voltaire.com> <4A742E94.2070002@gmail.com> Message-ID: <4A771852.1010606@voltaire.com> On 02/08/09 14:26, Hal Rosenstock wrote: > > By handled correctly, you mean that the ARP request gets to the remote > node, is responded to, and the response makes it back and that is > treated as valid path indication, right ? > > If so, is the original ARP request unicast or broadcast ? > > If the request is unicast, couldn't it be sent using the wrong static > rate as isn't it using the original path parameters ? > > Even if it is broadcast, if the original path parameters are still > used (like rate, etc.) at the local node, doesn't this assume a > homogeneous subnet ? By handled correctly, I mean that: - If the LID is not changed, the mechanism will not trigger path refresh. 
  (The first patch without any LMC handling does not satisfy this.)
- If the LID is changed, the mechanism will trigger a path refresh (eventually).

The ARP stuff works this way: the remote LID changes. At some point, either the
remote node will send a gratuitous ARP reply or (more likely) the local
network stack will start sending solicited unicast ARPs using the invalid path.
They will fail, so the stack will send a broadcast ARP. The remote node will
then answer, and IPoIB will see a different SLID than expected. This will
trigger a path refresh.

The broadcast ARP is sent with the AH of the broadcast group (which is joined
when the IPoIB interface goes up), not with the parameters of the path to
any specific node.

--Yossi

From yosefe at voltaire.com Mon Aug 3 10:10:07 2009
From: yosefe at voltaire.com (Yossi Etigin)
Date: Mon, 03 Aug 2009 20:10:07 +0300
Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes
In-Reply-To: <20090731194003.GV30626@obsidianresearch.com>
References: <20090727192938.GD5794@obsidianresearch.com> <4A6ECF6F.4000008@Voltaire.COM> <4A70154F.7080300@gmail.com> <4A703DA4.9080300@Voltaire.COM> <4A705B3A.7060404@Voltaire.COM> <4A731818.3060500@voltaire.com> <20090731194003.GV30626@obsidianresearch.com>
Message-ID: <4A7719EF.4080902@voltaire.com>

On 31/07/09 22:40, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2009 at 07:13:12PM +0300, Yossi Etigin wrote:
>
>> What if we query the remote port LMC once, when the path is
>> resolved, and then use it to mask the LID until the path is
>> refreshed again?
>
> What are you trying to fix here? Most SMs have a persistent LID
> stability feature, so why would the LID change very often anyhow?
>
> Jason
>

We have customers with large fabrics and different machines/operating systems,
where the LID does not always stay the same. They are experiencing loss of
IPoIB connectivity. The patch above solved that.
Besides, according to the IB spec, LIDs are not persistent and can change
(although most SMs today do try to keep them persistent).

Regarding LMC, that is less likely to change. So if we handle a constant LMC
correctly, and the behaviour when the LMC changes is the same as it was before
the patch, I think it can be OK. What do you think?

--Yossi

From hnrose at comcast.net Mon Aug 3 10:59:46 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Mon, 3 Aug 2009 13:59:46 -0400
Subject: [ofa-general] [PATCH] opensm/osm_trap_rcv.c: Use proper flag name in comment
Message-ID: <20090803175946.GA5981@comcast.net>

Change force_single_heavy_sweep to force_heavy_sweep
Other cosmetic commentary changes

Signed-off-by: Hal Rosenstock
---
diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index 4578ebc..4a6d0ff 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -564,12 +564,12 @@ trap_rcv_process_request(IN osm_sm_t * sm,
 	/* do a sweep if we received a trap */
 	if (sm->p_subn->opt.sweep_on_trap) {
-		/* if this is trap number 128 or run_heavy_sweep is TRUE - update the
-		   force_single_heavy_sweep flag of the subnet.
-		   Sweep also on traps 144/145 - these traps signal a change of a certain
-		   port capability/system image guid.
-		   TODO: In the future we can change this to just getting PortInfo on
-		   this port instead of sweeping the entire subnet. */
+		/* if this is trap number 128 or run_heavy_sweep is TRUE -
+		   update the force_heavy_sweep flag of the subnet.
+		   Sweep also on traps 144/145 - these traps signal a change of
+		   certain port capabilities/system image guid.
+		   TODO: In the future this can be changed to just getting
+		   PortInfo on this port instead of sweeping the entire subnet.
*/ if (ib_notice_is_generic(p_ntci) && (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || From rdreier at cisco.com Mon Aug 3 13:31:37 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Aug 2009 13:31:37 -0700 Subject: [ofa-general] Re: [PATCH] cma: fix access to freed memory In-Reply-To: <20090803092528.GA25528@mtls03> (Eli Cohen's message of "Mon, 3 Aug 2009 12:25:29 +0300") References: <20090803092528.GA25528@mtls03> Message-ID: > rdma_join_multicast() allocates struct cma_multicast and then proceeds to join > to a multicast address. However, the join operation completes in another > context and the allocated struct could be released if the user destroys either > the rdma_id object or decides to leave the multicast group while the join is in > progress. This patch uses reference counting to to avoid such situation. It > also protects removal from id_priv->mc_list in cma_leave_mc_groups(). Is this all in response to problems seen in practice, or just from reading over the code? > + atomic_t refcount; I think this would be clearer if you used struct kref here. > @@ -822,13 +829,17 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv) > { > struct cma_multicast *mc; > > + spin_lock_irq(&id_priv->lock); I didn't follow how this change is connected to the reference counting. What is this synchronizing against? Is it an independent change of the reference counting? - R. From rdreier at cisco.com Mon Aug 3 13:36:02 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Aug 2009 13:36:02 -0700 Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed In-Reply-To: (Bart Van Assche's message of "Mon, 3 Aug 2009 15:21:21 +0200") References: Message-ID: > Issuing a SCSI reset command on an SRP initiator after the SRP connection has > been closed triggers a NULL pointer dereference. 
The patch below fixes this > NULL pointer dereference. > > See also http://bugzilla.kernel.org/show_bug.cgi?id=13893. Thanks for debugging this... a couple of questions: > + BUG_ON(!req->scmnd->device); Why BUG_ON() here? Can we return failure or something, rather than crashing the whole system? > + if (!req->scmnd->device) > + return FAILED; How do we end up in srp_reset_device() with req->scmnd->device == NULL? Presumably req->scmnd should match scmnd if I am understanding the code properly -- and then scmnd->device == NULL?? - R. From jgunthorpe at obsidianresearch.com Mon Aug 3 13:35:59 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 3 Aug 2009 14:35:59 -0600 Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes In-Reply-To: <4A7719EF.4080902@voltaire.com> References: <4A70154F.7080300@gmail.com> <4A703DA4.9080300@Voltaire.COM> <4A705B3A.7060404@Voltaire.COM> <4A731818.3060500@voltaire.com> <20090731194003.GV30626@obsidianresearch.com> <4A7719EF.4080902@voltaire.com> Message-ID: <20090803203559.GJ24282@obsidianresearch.com> On Mon, Aug 03, 2009 at 08:10:07PM +0300, Yossi Etigin wrote: > We have customers with large fabrics and different machines/operation systems, > where the LID does not always stay the same.They are experiencing loss of > IPoIB connectivity. The patch above solved that. Besides, according to the > IB spec, LIDs are not persistent and can change (although most SM today do > try to keep them persistent). Hmm, have you considered changing the IPoIB QPN when the LID changes? This would provide a clear signal to anyone with a cached ARP entry that it is wrong. But even so, IPoIB implicitly assumes that the LID doesn't change, by design. That SA really has to try to make that true when IPoIB is used. 
Jason From hnrose at comcast.net Mon Aug 3 13:39:57 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Mon, 3 Aug 2009 16:39:57 -0400 Subject: [ofa-general] [PATCH] infiniband-diags/ibsendtrap.c: Fill in capability mask on trap 144 Message-ID: <20090803203957.GA23640@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index ac8dcf4..38305a2 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -63,6 +63,16 @@ static uint16_t get_node_type(ib_portid_t *port) return node_type; } +static uint32_t get_cap_mask(ib_portid_t *port) +{ + uint8_t data[IB_SMP_DATA_SIZE]; + uint32_t cap_mask = 0; + + if (smp_query_via(data, port, IB_ATTR_PORT_INFO, 0, 0, srcport)) + cap_mask = (uint32_t)mad_get_field(data, 0, IB_PORT_CAPMASK_F); + return cap_mask; +} + static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port) { n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO; @@ -70,6 +80,7 @@ static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port) n->g_or_v.generic.trap_num = cl_hton16(144); n->issuer_lid = cl_hton16((uint16_t) port->lid); n->data_details.ntc_144.lid = n->issuer_lid; + n->data_details.ntc_144.new_cap_mask = cl_hton32(get_cap_mask(port)); n->data_details.ntc_144.local_changes = TRAP_144_MASK_OTHER_LOCAL_CHANGES; n->data_details.ntc_144.change_flgs = From rdreier at cisco.com Mon Aug 3 13:46:22 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Aug 2009 13:46:22 -0700 Subject: [ofa-general] Re: [PATCH] libmlx4 - mmap needs some includes In-Reply-To: <20090803154001.32fdab08@frecb007965> (sebastien dugue's message of "Mon, 3 Aug 2009 15:40:01 +0200") References: <20090803154001.32fdab08@frecb007965> Message-ID: thanks ... actually they weren't removed as part of my cleanups, but as part of the incompetent way I applied the patch. same end result anyway. 
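The libmlx4 thread above is about buf.c no longer building once the errno.h and sys/mman.h includes were lost. As a minimal sketch of why code that calls mmap() needs both headers, consider the fragment below. It is only an illustration under stated assumptions: demo_alloc_buf() and demo_free_buf() are hypothetical names, not the libmlx4 functions, and the mapping flags are chosen for the example rather than taken from the library.

```c
#define _DEFAULT_SOURCE	/* for MAP_ANONYMOUS under strict -std= modes */
#include <errno.h>	/* errno */
#include <stddef.h>	/* size_t, NULL */
#include <sys/mman.h>	/* mmap(), munmap(), PROT_*, MAP_*, MAP_FAILED */

/*
 * Hypothetical helper (not from libmlx4): map an anonymous, zero-filled
 * buffer of 'size' bytes.  Returns 0 on success or the errno value set by
 * mmap() on failure.  Without <sys/mman.h> the mmap() prototype and the
 * PROT_ and MAP_ constants are missing; without <errno.h> errno is undeclared.
 */
static int demo_alloc_buf(void **buf, size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return errno;

	*buf = p;
	return 0;
}

/* Hypothetical counterpart that releases a buffer from demo_alloc_buf(). */
static void demo_free_buf(void *buf, size_t size)
{
	munmap(buf, size);
}
```

A caller would check the return value of demo_alloc_buf() before touching the buffer and pass the same size to demo_free_buf().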
From rdreier at cisco.com Mon Aug 3 13:49:35 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Aug 2009 13:49:35 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: use dynamic archive name when building an rpm
In-Reply-To: <20090803163736.1fde4c74@frecb007965> (sebastien dugue's message of "Mon, 3 Aug 2009 16:37:36 +0200")
References: <20090803163736.1fde4c74@frecb007965>
Message-ID:

 > There is a discrepancy between the tar.gz source archive name and the library
 > version. rpmbuild then fails to find its source files.
 > Fix this by dynamically setting the package version into the archive name.

Thanks, good catch.  I fixed this by just changing to 1.0.1, since
otherwise things run into trouble if the version number is 1.0.2-rc1 or
something like that (the RPM version should be different from 1.0.2-rc1 in
that case).

From abenjamin at sgi.com Mon Aug 3 19:49:23 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Mon, 03 Aug 2009 19:49:23 -0700
Subject: [ofa-general] [PATCH v3] mthca: Distinguish multiple IB cards in /proc/interrupts
Message-ID: <4A77A1B3.3020603@sgi.com>

When the mthca driver calls request_irq() to allocate interrupt resources, it
uses the fixed device name string "ib_mthca". When multiple IB cards are present
in the system, every instance of the resource is named "ib_mthca" in
/proc/interrupts. This makes it hard to work out exactly where IB interrupts
are going and why.

The mthca driver has been modified to use the PCI name of the IB card when
allocating interrupt resources.

Signed-off-by: Arputham Benjamin --- diff -rup a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h --- a/drivers/infiniband/hw/mthca/mthca_dev.h 2009-08-03 15:44:44.408580749 -0700 +++ b/drivers/infiniband/hw/mthca/mthca_dev.h 2009-08-03 15:45:25.451110249 -0700 @@ -357,6 +357,7 @@ struct mthca_dev { struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; spinlock_t sm_lock; u8 rate[MTHCA_MAX_PORTS]; + char irq_name[MTHCA_NUM_EQ][IB_DEVICE_NAME_MAX]; }; #ifdef CONFIG_INFINIBAND_MTHCA_DEBUG diff -rup a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c --- a/drivers/infiniband/hw/mthca/mthca_eq.c 2009-08-03 15:44:44.416581242 -0700 +++ b/drivers/infiniband/hw/mthca/mthca_eq.c 2009-08-03 15:45:11.098225651 -0700 @@ -835,21 +835,27 @@ int mthca_init_eq_table(struct mthca_dev }; for (i = 0; i < MTHCA_NUM_EQ; ++i) { + snprintf(dev->irq_name[i], IB_DEVICE_NAME_MAX, + "%s at pci:%s", eq_name[i], + pci_name(dev->pdev)); err = request_irq(dev->eq_table.eq[i].msi_x_vector, mthca_is_memfree(dev) ? mthca_arbel_msi_x_interrupt : mthca_tavor_msi_x_interrupt, - 0, eq_name[i], dev->eq_table.eq + i); + 0, dev->irq_name[i], + dev->eq_table.eq + i); if (err) goto err_out_cmd; dev->eq_table.eq[i].have_irq = 1; } } else { + snprintf(dev->irq_name[0], IB_DEVICE_NAME_MAX, + DRV_NAME "@pci:%s", pci_name(dev->pdev)); err = request_irq(dev->pdev->irq, mthca_is_memfree(dev) ? mthca_arbel_interrupt : mthca_tavor_interrupt, - IRQF_SHARED, DRV_NAME, dev); + IRQF_SHARED, dev->irq_name[0], dev); if (err) goto err_out_cmd; dev->eq_table.have_irq = 1; From abenjamin at sgi.com Mon Aug 3 20:00:00 2009 From: abenjamin at sgi.com (Arputham Benjamin) Date: Mon, 03 Aug 2009 20:00:00 -0700 Subject: [ofa-general] [PATCH v2] mlx4_core: Distinguish multiple IB cards in /proc/interrupts Message-ID: <4A77A430.2020106@sgi.com> When the mlx4_core driver calls request_irq() to allocate interrupt resources, it uses the fixed device name string "mlx4_core". 
When multiple IB cards are present in the system, every instance of the resource is named "mlx4_core" in /proc/interrupts. This can make it very confusing trying to work out exactly where IB interrupts are going and why. The mlx4_core driver has been modified to use the PCI name of the IB card for the purpose of allocating interrupt resources. Signed-off-by: Arputham Benjamin --- diff -rup a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c --- a/drivers/net/mlx4/eq.c 2009-08-03 19:42:18.737707766 -0700 +++ b/drivers/net/mlx4/eq.c 2009-08-03 19:42:48.175515414 -0700 @@ -615,7 +615,8 @@ int mlx4_init_eq_table(struct mlx4_dev * priv->eq_table.clr_int = priv->clr_base + (priv->eq_table.inta_pin < 32 ? 4 : 0); - priv->eq_table.irq_names = kmalloc(16 * dev->caps.num_comp_vectors, GFP_KERNEL); + priv->eq_table.irq_names = kmalloc(DEVICE_NAME_MAX * + (dev->caps.num_comp_vectors + 1), GFP_KERNEL); if (!priv->eq_table.irq_names) { err = -ENOMEM; goto err_out_bitmap; @@ -638,17 +639,25 @@ int mlx4_init_eq_table(struct mlx4_dev * goto err_out_comp; if (dev->flags & MLX4_FLAG_MSI_X) { - static const char async_eq_name[] = "mlx4-async"; const char *eq_name; for (i = 0; i < dev->caps.num_comp_vectors + 1; ++i) { if (i < dev->caps.num_comp_vectors) { - snprintf(priv->eq_table.irq_names + i * 16, 16, - "mlx4-comp-%d", i); - eq_name = priv->eq_table.irq_names + i * 16; - } else - eq_name = async_eq_name; + snprintf(priv->eq_table.irq_names + + i * DEVICE_NAME_MAX, + DEVICE_NAME_MAX, + "mlx4-comp-%d at pci:%s", i, + pci_name(dev->pdev)); + } else { + snprintf(priv->eq_table.irq_names + + i * DEVICE_NAME_MAX, + DEVICE_NAME_MAX, + "mlx4-async at pci:%s", + pci_name(dev->pdev)); + } + eq_name = priv->eq_table.irq_names + + i * DEVICE_NAME_MAX; err = request_irq(priv->eq_table.eq[i].irq, mlx4_msi_x_interrupt, 0, eq_name, priv->eq_table.eq + i); @@ -658,8 +667,12 @@ int mlx4_init_eq_table(struct mlx4_dev * priv->eq_table.eq[i].have_irq = 1; } } else { + snprintf(priv->eq_table.irq_names, + 
DEVICE_NAME_MAX,
+			 DRV_NAME "@pci:%s",
+			 pci_name(dev->pdev));
 		err = request_irq(dev->pdev->irq, mlx4_interrupt,
-				  IRQF_SHARED, DRV_NAME, dev);
+				  IRQF_SHARED, priv->eq_table.irq_names, dev);
 		if (err)
 			goto err_out_async;
diff -rup a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
--- a/drivers/net/mlx4/mlx4.h	2009-08-03 19:42:18.737707766 -0700
+++ b/drivers/net/mlx4/mlx4.h	2009-08-03 19:43:01.532335625 -0700
@@ -198,6 +198,8 @@ struct mlx4_cq_table {
 	struct mlx4_icm_table	cmpt_table;
 };

+#define DEVICE_NAME_MAX 64
+
 struct mlx4_eq_table {
 	struct mlx4_bitmap	bitmap;
 	char			*irq_names;

From eli at dev.mellanox.co.il Mon Aug 3 20:32:21 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Tue, 4 Aug 2009 06:32:21 +0300
Subject: [ofa-general] Re: [PATCH] cma: fix access to freed memory
In-Reply-To:
References: <20090803092528.GA25528@mtls03>
Message-ID: <20090804033221.GA30949@mtls03>

On Mon, Aug 03, 2009 at 01:31:37PM -0700, Roland Dreier wrote:
>
> Is this all in response to problems seen in practice, or just from
> reading over the code?

I did not see a problem in practice with the current code, but playing
around with rdma_join_multicast() and adding another case to the switch
statement revealed this problem, which I think also exists in the current
code.

>
> > +	atomic_t refcount;
>
> I think this would be clearer if you used struct kref here.
>

Certainly. I will post another patch.

> > @@ -822,13 +829,17 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
> >  {
> >  	struct cma_multicast *mc;
> >
> > +	spin_lock_irq(&id_priv->lock);
>
> I didn't follow how this change is connected to the reference counting.
> What is this synchronizing against?  Is it an independent change of the
> reference counting?
>

Maybe it's just a loose connection, but it still seems to me that
operations on id_priv->mc_list should be protected. Should I send a
different patch?
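The lifetime rule discussed in this thread (one reference held by the list, one more held on behalf of the pending join, free only when the last one is dropped) can be sketched in user space roughly as follows. This is an illustration, not the cma.c code: struct mc_entry, mc_alloc(), mc_get() and mc_put() are hypothetical names, and C11 atomics stand in for the kernel's atomic_t (or the struct kref suggested in the review).

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cma_multicast; only the refcount matters. */
struct mc_entry {
	atomic_int refcount;
	int id;
};

/* Counts frees so the behaviour can be observed from the outside. */
static int freed_count;

/* Allocate an entry holding one reference (the one owned by the list). */
static struct mc_entry *mc_alloc(int id)
{
	struct mc_entry *mc = calloc(1, sizeof(*mc));

	if (!mc)
		return NULL;
	atomic_init(&mc->refcount, 1);
	mc->id = id;
	return mc;
}

/* Take an extra reference, e.g. before starting an asynchronous join. */
static void mc_get(struct mc_entry *mc)
{
	atomic_fetch_add(&mc->refcount, 1);
}

/*
 * Drop one reference and free the entry when the last one goes away.
 * atomic_fetch_sub() returns the value before the decrement, so a result
 * of 1 means this caller held the final reference.
 * Returns 1 if the entry was freed, 0 otherwise.
 */
static int mc_put(struct mc_entry *mc)
{
	if (atomic_fetch_sub(&mc->refcount, 1) == 1) {
		free(mc);
		freed_count++;
		return 1;
	}
	return 0;
}
```

With this pattern, a leave that races with a join in flight only drops the list's reference; the memory stays valid until the completion handler calls mc_put() as well.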
From jgunthorpe at obsidianresearch.com Mon Aug 3 21:56:47 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 3 Aug 2009 22:56:47 -0600
Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes
In-Reply-To: <4A771852.1010606@voltaire.com>
References: <4A705B3A.7060404@Voltaire.COM> <4A731818.3060500@voltaire.com> <4A733D24.3040201@voltaire.com> <4A742E94.2070002@gmail.com> <4A771852.1010606@voltaire.com>
Message-ID: <20090804045647.GK24282@obsidianresearch.com>

On Mon, Aug 03, 2009 at 08:03:14PM +0300, Yossi Etigin wrote:

> The ARP stuff works this way: Remote LID changes. At some point, either the remote
> node will send an ARP reply (gratuitous), or (more likely) the local network stack
> will start sending solicited ARPs, unicast, using the invalid path. They will fail,
> so the stack will send broadcast ARP.

Erm.. Maybe a little tighter integration with the ARP/ND layer is in
order. If it knows unicast isn't working, that's a pretty damn good clue
to discard the PR.

Jason

From sebastien.dugue at bull.net Mon Aug 3 23:49:43 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Tue, 4 Aug 2009 08:49:43 +0200
Subject: [ofa-general] Re: [PATCH] libmlx4: use dynamic archive name when building an rpm
In-Reply-To:
References: <20090803163736.1fde4c74@frecb007965>
Message-ID: <20090804084943.4bcc38cd@frecb007965>

On Mon, 03 Aug 2009 13:49:35 -0700
Roland Dreier wrote:

>
> > There is a discrepancy between the tar.gz source archive name and the library
> > version. rpmbuild then fails to find its source files.
>
> > Fix this by dynamically setting the package version into the archive name.
>
> Thanks, good catch.  I fixed this by just changing to 1.0.1, since
> otherwise things run into trouble if the version number is 1.0.2-rc1 or
> something like that (RPM version should be different than 1.0.2-rc1 in
> that case)
>

Thanks, I hadn't thought of the -rc issue.

Sebastien.
From bart.vanassche at gmail.com Tue Aug 4 00:48:22 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Tue, 4 Aug 2009 09:48:22 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed
In-Reply-To:
References:
Message-ID:

On Mon, Aug 3, 2009 at 10:36 PM, Roland Dreier wrote:
>
> > Issuing a SCSI reset command on an SRP initiator after the SRP connection has
> > been closed triggers a NULL pointer dereference. The patch below fixes this
> > NULL pointer dereference.
> >
> > See also http://bugzilla.kernel.org/show_bug.cgi?id=13893.
>
> Thanks for debugging this... a couple of questions:
>
> > +    BUG_ON(!req->scmnd->device);
>
> Why BUG_ON() here?  Can we return failure or something, rather than
> crashing the whole system?

The function srp_send_tsk_mgmt() contains, among others, the following
statement: "tsk_mgmt->lun = cpu_to_be64((u64) req->scmnd->device->lun << 48);".
This is the statement that triggered the NULL pointer dereference.
Whether or not a BUG_ON() is appropriate here depends on which of the
following two alternatives is preferred: should the caller guarantee
that req->scmnd->device != NULL, or should srp_send_tsk_mgmt() handle
the condition req->scmnd->device == NULL itself?

> > +    if (!req->scmnd->device)
> > +            return FAILED;
>
> How do we end up in srp_reset_device() with req->scmnd->device == NULL?
> Presumably req->scmnd should match scmnd if I am understanding the code
> properly -- and then scmnd->device == NULL??

Good question. I have not yet analyzed why this happens. But before I
started developing a patch, I first verified that scmnd->device is
NULL at that point by inserting the statement
WARN_ON(!scmnd->device). A clue might be that, without the above patch,
the BUG message on the initiator system is triggered just after the
"SRP reset_device called" message has been logged.

Bart.
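The second alternative raised above (the callee handles a missing device itself and reports failure instead of crashing) could look roughly like the following user-space sketch. The struct and function names here are hypothetical simplifications, not the ib_srp.c definitions, and the byte-order conversion done by cpu_to_be64() in the driver is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the structures named in the thread. */
struct demo_device { unsigned int lun; };
struct demo_cmnd   { struct demo_device *device; };
struct demo_req    { struct demo_cmnd *scmnd; };

enum { DEMO_SUCCESS = 0, DEMO_FAILED = -1 };

/*
 * Compute the 64-bit LUN field as in the quoted statement (LUN shifted
 * left by 48 bits), but return DEMO_FAILED instead of dereferencing a
 * NULL pointer anywhere along req->scmnd->device.
 */
static int demo_build_lun(const struct demo_req *req, uint64_t *lun_out)
{
	if (!req || !req->scmnd || !req->scmnd->device)
		return DEMO_FAILED;

	*lun_out = (uint64_t)req->scmnd->device->lun << 48;
	return DEMO_SUCCESS;
}
```

The trade-off is the one described in the message: a check like this keeps the system running but can hide the underlying question of why the device pointer was NULL in the first place.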
From kliteyn at dev.mellanox.co.il Tue Aug 4 01:29:06 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 04 Aug 2009 11:29:06 +0300 Subject: [ofa-general] [PATCH] opensm/osm_helper.c: fix printing trap 258 details Message-ID: <4A77F152.2030506@dev.mellanox.co.il> Hi Sasha, Fixing some issues with printing trap 258 details. Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_helper.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index 57de0d4..07b1e5a 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -1828,10 +1828,11 @@ void osm_dump_notice(IN osm_log_t * p_log, lid2), cl_ntoh32(p_ntci->data_details.ntc_257_258.key), cl_ntoh32(p_ntci->data_details.ntc_257_258. - qp1) >> 24, + qp1) >> 28, cl_ntoh32(p_ntci->data_details.ntc_257_258. qp1) & 0xffffff, - cl_ntoh32(p_ntci->data_details.ntc_257_258.qp2), + cl_ntoh32(p_ntci->data_details.ntc_257_258. + qp2) & 0xffffff, inet_ntop(AF_INET6, p_ntci->data_details. 
ntc_257_258.gid1.raw, gid_str, sizeof gid_str), -- 1.5.1.4 From vlad at lists.openfabrics.org Tue Aug 4 02:59:47 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 4 Aug 2009 02:59:47 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090804-0200 daily build status Message-ID: <20090804095948.2C1D3E61D1B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one': /home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class' 
/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: 
Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090804-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From Robert at saq.co.uk Tue Aug 4 02:58:24 2009 From: Robert at saq.co.uk (Robert Dunkley) Date: Tue, 4 Aug 2009 10:58:24 +0100 Subject: [ofa-general] OFED on Centos with 2.6.30.4 generic kernel Message-ID: I'm a bit of newbie to kernel building but work on my first custom kernel seems to be going well so far. The issue I have is the systems this kernel is destined for are using Mellanox infiniband cards, IPOIB (CM), RDMA and Subnet Manager (Systems are direct cabled to each other). 
I noticed support seems to be built-in to the kernel for all but the subnet manager. Should I use the built-in kernel support and install the Subnet manager separately? Or build a kernel with no infiniband support and then try to install OFED? Will OFED even likely install with this sort of setup? Thanks, Rob The SAQ Group Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ SAQ is the trading name of SEMTEC Limited. Registered in England & Wales Company Number: 06481952 http://www.saqnet.co.uk AS29219 SAQ Group Delivers high quality, honestly priced communication and I.T. services to UK Business. Broadband : Domains : Email : Hosting : CoLo : Servers : Racks : Transit : Backups : Managed Networks : Remote Support. ISPA Member From ofedrnicuser at yahoo.com Tue Aug 4 03:08:43 2009 From: ofedrnicuser at yahoo.com (Bill N) Date: Tue, 4 Aug 2009 03:08:43 -0700 (PDT) Subject: [ofa-general] perftest for Chelsio RNIC adapters Message-ID: <351317.84709.qm@web111212.mail.gq1.yahoo.com> Hi, Is performance tests of the perftest-1.2 supported for Chelsio and other RNIC adapters? Regards, Bill From jackm at dev.mellanox.co.il Tue Aug 4 03:25:05 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 4 Aug 2009 13:25:05 +0300 Subject: [ofa-general] OFED on Centos with 2.6.30.4 generic kernel In-Reply-To: References: Message-ID: <200908041325.05450.jackm@dev.mellanox.co.il> On Tuesday 04 August 2009 12:58, Robert Dunkley wrote: > I'm a bit of newbie to kernel building but work on my first custom > kernel seems to be going well so far. > > The issue I have is the systems this kernel is destined for are using > Mellanox infiniband cards, IPOIB (CM), RDMA and Subnet Manager (Systems > are direct cabled to each other). I noticed support seems to be built-in > to the kernel for all but the subnet manager. > > Should I use the built-in kernel support and install the Subnet manager > separately? 
Or build a kernel with no infiniband support and then try to > install OFED? Will OFED even likely install with this sort of setup? OFED 1.4 will not install (it supports up to kernel 2.6.27 only). OFED 1.5 is currently under development, supporting up to kernel 2.6.30. An alpha version (lightly tested only) is currently available. I do not know how much testing the built-in kernel support has undergone. -Jack > > Thanks, > > Rob > > The SAQ Group > > Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ > SAQ is the trading name of SEMTEC Limited. Registered in England & Wales > Company Number: 06481952 > > http://www.saqnet.co.uk AS29219 > > SAQ Group Delivers high quality, honestly priced communication and I.T. services to UK Business. > > Broadband : Domains : Email : Hosting : CoLo : Servers : Racks : Transit : Backups : Managed Networks : Remote Support. > > ISPA Member > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Tue Aug 4 05:32:53 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Aug 2009 15:32:53 +0300 Subject: [ofa-general] Re: [PATCH] opensm: do not configure MFTs when mcast support is disabled In-Reply-To: <4A76E05D.9070705@dev.mellanox.co.il> References: <4A76E05D.9070705@dev.mellanox.co.il> Message-ID: <20090804123253.GA7993@me> On 16:04 Mon 03 Aug , Yevgeny Kliteynik wrote: > Hi Sasha, > > I noticed that when MCast support in OSM is disabled (command line > option '-d3'), MFTs on the switches are still getting configured. > > Turns out that MFTs configuration was disabled only in heavy sweep, > but it was still working at idle time - the following patch fixes it. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. 
Sasha From sashak at voltaire.com Tue Aug 4 05:33:08 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Aug 2009 15:33:08 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c: Use proper flag name in comment In-Reply-To: <20090803175946.GA5981@comcast.net> References: <20090803175946.GA5981@comcast.net> Message-ID: <20090804123308.GB7993@me> On 13:59 Mon 03 Aug , Hal Rosenstock wrote: > > Change force_single_heavy_sweep to force_heavy_sweep > Other cosmetic commentary changes > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue Aug 4 05:35:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Aug 2009 15:35:27 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibsendtrap.c: Fill in capability mask on trap 144 In-Reply-To: <20090803203957.GA23640@comcast.net> References: <20090803203957.GA23640@comcast.net> Message-ID: <20090804123527.GC7993@me> On 16:39 Mon 03 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue Aug 4 05:38:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Aug 2009 15:38:21 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.c: fix printing trap 258 details In-Reply-To: <4A77F152.2030506@dev.mellanox.co.il> References: <4A77F152.2030506@dev.mellanox.co.il> Message-ID: <20090804123821.GD7993@me> On 11:29 Tue 04 Aug , Yevgeny Kliteynik wrote: > Hi Sasha, > > Fixing some issues with printing trap 258 details. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. 
Sasha From hnrose at comcast.net Tue Aug 4 05:47:17 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 4 Aug 2009 08:47:17 -0400 Subject: [ofa-general] [PATCH] opensm/osm_trap_rcv.c: Validate trap is 144 before checking for NodeDescription changed Message-ID: <20090804124717.GA12236@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index bf39926..925cb27 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -546,42 +547,47 @@ trap_rcv_process_request(IN osm_sm_t * sm, } } - /* Check for node description update. IB Spec v1.2.1 pg 823 */ - if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES && - p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { - OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description update\n"); - - if (p_physp) { - CL_PLOCK_ACQUIRE(sm->p_lock); - osm_req_get_node_desc(sm, p_physp); - CL_PLOCK_RELEASE(sm->p_lock); - } else { - OSM_LOG(sm->p_log, OSM_LOG_ERROR, - "ERR 3812: No physical port found for " - "trap 144: \"node description update\"\n"); + if (ib_notice_is_generic(p_ntci)) { + /* Check for node description update. 
IB Spec v1.2.1 pg 823 */ + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) { + if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES && + p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { + OSM_LOG(sm->p_log, OSM_LOG_INFO, + "Trap 144 Node description update\n"); + + if (p_physp) { + CL_PLOCK_ACQUIRE(sm->p_lock); + osm_req_get_node_desc(sm, p_physp); + CL_PLOCK_RELEASE(sm->p_lock); + } else + OSM_LOG(sm->p_log, OSM_LOG_ERROR, + "ERR 3812: No physical port found for " + "trap 144: \"node description update\"\n"); + } } - } - /* do a sweep if we received a trap */ - if (sm->p_subn->opt.sweep_on_trap) { - /* if this is trap number 128 or run_heavy_sweep is TRUE - - update the force_heavy_sweep flag of the subnet. - Sweep also on traps 144/145 - these traps signal a change of - certain port capabilities/system image guid. - TODO: In the future this can be changed to just getting - PortInfo on this port instead of sweeping the entire subnet. */ - if (ib_notice_is_generic(p_ntci) && - (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || - run_heavy_sweep)) { - OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, - "Forcing heavy sweep. Received trap:%u\n", - cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); + /* do a sweep if we received a trap */ + if (sm->p_subn->opt.sweep_on_trap) { + /* if this is trap number 128 or run_heavy_sweep is + TRUE - update the force_heavy_sweep flag of the + subnet. Also, sweep also on traps 144/145 - + these traps signal a change of certain port + capabilities/system image guid. + TODO: In the future this can be changed to just + getting PortInfo on this port instead of sweeping + the entire subnet. 
*/ + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || + run_heavy_sweep) { + OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, + "Forcing heavy sweep. Received trap:%u\n", + cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); - sm->p_subn->force_heavy_sweep = TRUE; + sm->p_subn->force_heavy_sweep = TRUE; + } + osm_sm_signal(sm, OSM_SIGNAL_SWEEP); } - osm_sm_signal(sm, OSM_SIGNAL_SWEEP); } /* If we reached here due to trap 129/130/131 - do not need to do From hnrose at comcast.net Tue Aug 4 05:50:09 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 4 Aug 2009 08:50:09 -0400 Subject: [ofa-general] [PATCH] infiniband-diags/ibsendtrap.c: Add support for link_speed_enabled_change trap Message-ID: <20090804125009.GB12236@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index 38305a2..c8c7ee8 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -87,6 +87,20 @@ static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port) TRAP_144_MASK_NODE_DESCRIPTION_CHANGE; } +static void build_trap144_2(ib_mad_notice_attr_t * n, ib_portid_t *port) +{ + n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO; + n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port)); + n->g_or_v.generic.trap_num = cl_hton16(144); + n->issuer_lid = cl_hton16((uint16_t) port->lid); + n->data_details.ntc_144.lid = n->issuer_lid; + n->data_details.ntc_144.new_cap_mask = cl_hton32(get_cap_mask(port)); + n->data_details.ntc_144.local_changes = + TRAP_144_MASK_OTHER_LOCAL_CHANGES; + n->data_details.ntc_144.change_flgs = + TRAP_144_MASK_LINK_SPEED_ENABLE_CHANGE; +} + static void build_trap129(ib_mad_notice_attr_t * n, ib_portid_t *port) { n->generic_type = 0x80 | IB_NOTICE_TYPE_URGENT; @@ -134,6 +148,7 @@ typedef struct _trap_def { trap_def_t traps[3] = { {"node_desc_change", 
build_trap144}, + {"link_speed_enabled_change", build_trap144_2}, {"local_link_integrity", build_trap129}, {NULL, NULL} }; From hnrose at comcast.net Tue Aug 4 06:18:36 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 4 Aug 2009 09:18:36 -0400 Subject: [ofa-general] [PATCH] opensm: Add initial support for optimized SLtoVLMappingTable programming Message-ID: <20090804131836.GA15226@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 6c20de8..8443763 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -4,6 +4,7 @@ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -204,6 +205,7 @@ typedef struct osm_subn_opt { boolean_t daemon; boolean_t sm_inactive; boolean_t babbling_port_policy; + boolean_t use_optimized_slvl; osm_qos_options_t qos_options; osm_qos_options_t qos_ca_options; osm_qos_options_t qos_sw0_options; @@ -428,6 +430,10 @@ typedef struct osm_subn_opt { * babbling_port_policy * OpenSM will enforce its "babbling" port policy. * +* use_optimized_slvl +* Use optimized SLtoVLMappingTable programming if +* device indicates it supports this. +* * perfmgr * Enable or disable the performance manager * diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c index e3dfb58..592e082 100644 --- a/opensm/opensm/osm_qos.c +++ b/opensm/opensm/osm_qos.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2006-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -150,7 +151,7 @@ static ib_api_status_t vlarb_update(osm_sm_t * sm, osm_physp_t * p, static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p, uint8_t in_port, uint8_t out_port, - unsigned force_update, + unsigned optimize, unsigned force_update, const ib_slvl_table_t * sl2vl_table) { osm_madw_context_t context; @@ -177,10 +178,18 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p, !memcmp(p_tbl, &tbl, sizeof(tbl))) return IB_SUCCESS; + /* both input port and output port wildcarded */ + if (optimize && (in_port != 1 || out_port != 1)) + return IB_SUCCESS; + context.slvl_context.node_guid = osm_node_get_node_guid(p_node); context.slvl_context.port_guid = osm_physp_get_port_guid(p); context.slvl_context.set_method = TRUE; - attr_mod = in_port << 8 | out_port; + if (optimize) + /* both input port and output port wildcarded */ + attr_mod = 0x30000; + else + attr_mod = in_port << 8 | out_port; return osm_req_set(sm, osm_physp_get_dr_path_ptr(p), (uint8_t *) & tbl, sizeof(tbl), IB_MAD_ATTR_SLVL_TABLE, cl_hton32(attr_mod), @@ -189,14 +198,17 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p, static ib_api_status_t sl2vl_update(osm_sm_t * sm, osm_port_t * p_port, osm_physp_t * p, uint8_t port_num, - unsigned force_update, + unsigned optimize, unsigned force_update, const struct qos_config *qcfg) { ib_api_status_t status; uint8_t i, num_ports; osm_physp_t *p_physp; + osm_node_t *p_node; + unsigned optimizesl2vl = 0; - if (osm_node_get_type(osm_physp_get_node_ptr(p)) == IB_NODE_TYPE_SWITCH) { + p_node = osm_physp_get_node_ptr(p); + if (osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH) { if (ib_port_info_get_vl_cap(&p->port_info) == 1) { /* Check port 0's capability mask */ p_physp = p_port->p_physp; @@ -205,7 +217,8 @@ static ib_api_status_t sl2vl_update(osm_sm_t * sm, osm_port_t * p_port, capability_mask & IB_PORT_CAP_HAS_SL_MAP)) return IB_SUCCESS; 
} - num_ports = osm_node_get_num_physp(osm_physp_get_node_ptr(p)); + num_ports = osm_node_get_num_physp(p_node); + optimizesl2vl = ib_switch_info_get_opt_sl2vlmapping(&p_node->sw->switch_info) & optimize; } else { if (!(p->port_info.capability_mask & IB_PORT_CAP_HAS_SL_MAP)) return IB_SUCCESS; @@ -213,8 +226,8 @@ static ib_api_status_t sl2vl_update(osm_sm_t * sm, osm_port_t * p_port, } for (i = 0; i < num_ports; i++) { - status = sl2vl_update_table(sm, p, i, port_num, force_update, - &qcfg->sl2vl); + status = sl2vl_update_table(sm, p, i, port_num, optimizesl2vl, + force_update, &qcfg->sl2vl); if (status != IB_SUCCESS) return status; } @@ -224,7 +237,8 @@ static ib_api_status_t sl2vl_update(osm_sm_t * sm, osm_port_t * p_port, static int qos_physp_setup(osm_log_t * p_log, osm_sm_t * sm, osm_port_t * p_port, osm_physp_t * p, - uint8_t port_num, unsigned force_update, + uint8_t port_num, unsigned optimize, + unsigned force_update, const struct qos_config *qcfg) { ib_api_status_t status; @@ -245,7 +259,8 @@ static int qos_physp_setup(osm_log_t * p_log, osm_sm_t * sm, } /* setup SL2VL tables */ - status = sl2vl_update(sm, p_port, p, port_num, force_update, qcfg); + status = sl2vl_update(sm, p_port, p, port_num, optimize, force_update, + qcfg); if (status != IB_SUCCESS) { OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 6203 : " "failed to update SL2VLMapping tables " @@ -307,6 +322,7 @@ int osm_qos_setup(osm_opensm_t * p_osm) p_osm->subn.need_update; if (qos_physp_setup(&p_osm->log, &p_osm->sm, p_port, p_physp, i, + p_osm->subn.opt.use_optimized_slvl, force_update, &swe_config)) ret = -1; } @@ -327,7 +343,7 @@ int osm_qos_setup(osm_opensm_t * p_osm) force_update = p_physp->need_update || p_osm->subn.need_update; if (qos_physp_setup(&p_osm->log, &p_osm->sm, p_port, p_physp, - 0, force_update, cfg)) + 0, 0, force_update, cfg)) ret = -1; } diff --git a/opensm/opensm/osm_slvl_map_rcv.c b/opensm/opensm/osm_slvl_map_rcv.c index 9c37442..67c71bd 100644 --- a/opensm/opensm/osm_slvl_map_rcv.c 
+++ b/opensm/opensm/osm_slvl_map_rcv.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -72,7 +73,9 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data) osm_slvl_context_t *p_context; ib_net64_t port_guid; ib_net64_t node_guid; - uint8_t out_port_num, in_port_num; + uint32_t attr_mod; + uint8_t out_port_num, in_port_num, startinport, startoutport, + endinport, endoutport; CL_ASSERT(sm); @@ -111,6 +114,9 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data) (uint8_t) cl_ntoh32(p_smp->attr_mod & 0xFF000000); in_port_num = (uint8_t) cl_ntoh32((p_smp->attr_mod & 0x00FF0000) << 8); + attr_mod = cl_ntoh32(p_smp->attr_mod); + if (attr_mod & 0x30000) + goto opt_sl2vl; p_physp = osm_node_get_physp_ptr(p_node, out_port_num); } else { p_physp = p_port->p_physp; @@ -123,7 +129,7 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data) all we want is to update the subnet. 
*/ OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, - "Got SLtoVL get response in_port_num %u out_port_num %u with " + "Received SLtoVL GetResp in_port_num %u out_port_num %u with " "GUID 0x%" PRIx64 " for parent node GUID 0x%" PRIx64 ", TID 0x%" PRIx64 "\n", in_port_num, out_port_num, cl_ntoh64(port_guid), cl_ntoh64(node_guid), cl_ntoh64(p_smp->trans_id)); @@ -142,6 +148,39 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data) out_port_num, p_slvl_tbl, OSM_LOG_DEBUG); osm_physp_set_slvl_tbl(p_physp, p_slvl_tbl, in_port_num); + goto Exit; + +opt_sl2vl: + OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, + "Got optimized SLtoVL get response in_port_num %u out_port_num " + "%u with GUID 0x%" PRIx64 " for parent node GUID 0x%" PRIx64 + ", TID 0x%" PRIx64 "\n", in_port_num, out_port_num, + cl_ntoh64(port_guid), cl_ntoh64(node_guid), + cl_ntoh64(p_smp->trans_id)); + + osm_dump_slvl_map_table(sm->p_log, port_guid, in_port_num, + out_port_num, p_slvl_tbl, OSM_LOG_DEBUG); + + if (attr_mod & 0x10000) { + startoutport = ib_switch_info_is_enhanced_port0(&p_node->sw->switch_info) ? 0 : 1; + endoutport = osm_node_get_num_physp(p_node); + } else + endoutport = startoutport = out_port_num; + if (attr_mod & 0x20000) { + startinport = ib_switch_info_is_enhanced_port0(&p_node->sw->switch_info) ? 0 : 1; + endinport = osm_node_get_num_physp(p_node); + } else + endinport = startinport = in_port_num; + + for (out_port_num = startoutport; out_port_num < endoutport; + out_port_num++) { + p_physp = osm_node_get_physp_ptr(p_node, out_port_num); + if (!p_physp) + continue; + for (in_port_num = startinport; in_port_num < endinport; + in_port_num++) + osm_physp_set_slvl_tbl(p_physp, p_slvl_tbl, in_port_num); + } Exit: cl_plock_release(sm->p_lock); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 0d11811..540165a 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -4,6 +4,7 @@ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -352,6 +353,7 @@ static const opt_rec_t opt_tbl[] = { { "daemon", OPT_OFFSET(daemon), opts_parse_boolean, NULL, 0 }, { "sm_inactive", OPT_OFFSET(sm_inactive), opts_parse_boolean, NULL, 1 }, { "babbling_port_policy", OPT_OFFSET(babbling_port_policy), opts_parse_boolean, NULL, 1 }, + {"use_optimized_slvl", OPT_OFFSET(use_optimized_slvl), opts_parse_boolean, NULL, 1 }, #ifdef ENABLE_OSM_PERF_MGR { "perfmgr", OPT_OFFSET(perfmgr), opts_parse_boolean, NULL, 0 }, { "perfmgr_redir", OPT_OFFSET(perfmgr_redir), opts_parse_boolean, NULL, 0 }, @@ -715,6 +717,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->daemon = FALSE; p_opt->sm_inactive = FALSE; p_opt->babbling_port_policy = FALSE; + p_opt->use_optimized_slvl = FALSE; #ifdef ENABLE_OSM_PERF_MGR p_opt->perfmgr = FALSE; p_opt->perfmgr_redir = TRUE; @@ -1501,10 +1504,13 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) "# SM Inactive\n" "sm_inactive %s\n\n" "# Babbling Port Policy\n" - "babbling_port_policy %s\n\n", + "babbling_port_policy %s\n\n" + "# Use Optimized SLtoVLMapping programming if supported by device\n" + "use_optimized_slvl %s\n\n", p_opts->daemon ? "TRUE" : "FALSE", p_opts->sm_inactive ? "TRUE" : "FALSE", - p_opts->babbling_port_policy ? "TRUE" : "FALSE"); + p_opts->babbling_port_policy ? "TRUE" : "FALSE", + p_opts->use_optimized_slvl ? 
"TRUE" : "FALSE"); #ifdef ENABLE_OSM_PERF_MGR fprintf(out, From eli at mellanox.co.il Tue Aug 4 06:24:08 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 4 Aug 2009 16:24:08 +0300 Subject: [ofa-general] [PATCH v2] cma: fix access to freed memory Message-ID: <20090804132408.GA11545@mtls03> rdma_join_multicast() allocates struct cma_multicast and then proceeds to join to a multicast address. However, the join operation completes in another context and the allocated struct could be released if the user destroys either the rdma_id object or decides to leave the multicast group while the join operation is in progress. This patch uses a kref object to maintain reference counting to avoid such situation. Signed-off-by: Eli Cohen --- Changes from previous version: I removed the protection of mc list manipulation using spinlocks becuase - a. In order to break into different patches b. I have doubts as for the necessity of this protection. drivers/infiniband/core/cma.c | 20 ++++++++++++++++---- 1 files changed, 16 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 851de83..aa62101 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -157,6 +157,7 @@ struct cma_multicast { struct list_head list; void *context; struct sockaddr_storage addr; + struct kref mcref; }; struct cma_work { @@ -290,6 +291,13 @@ static inline void cma_deref_dev(struct cma_device *cma_dev) complete(&cma_dev->comp); } +void release_mc(struct kref *kref) +{ + struct cma_multicast *mc = container_of(kref, struct cma_multicast, mcref); + + kfree(mc); +} + static void cma_detach_from_dev(struct rdma_id_private *id_priv) { list_del(&id_priv->list); @@ -827,7 +835,7 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv) struct cma_multicast, list); list_del(&mc->list); ib_sa_free_multicast(mc->multicast.ib); - kfree(mc); + kref_put(&mc->mcref, release_mc); } } @@ -2643,7 +2651,7 @@ static int 
cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast) id_priv = mc->id_priv; if (cma_disable_callback(id_priv, CMA_ADDR_BOUND) && cma_disable_callback(id_priv, CMA_ADDR_RESOLVED)) - return 0; + goto out; mutex_lock(&id_priv->qp_mutex); if (!status && id_priv->id.qp) @@ -2669,10 +2677,12 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast) cma_exch(id_priv, CMA_DESTROYING); mutex_unlock(&id_priv->handler_mutex); rdma_destroy_id(&id_priv->id); - return 0; + goto out; } mutex_unlock(&id_priv->handler_mutex); +out: + kref_put(&mc->mcref, release_mc); return 0; } @@ -2759,11 +2769,13 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, memcpy(&mc->addr, addr, ip_addr_size(addr)); mc->context = context; mc->id_priv = id_priv; + kref_init(&mc->mcref); spin_lock(&id_priv->lock); list_add(&mc->list, &id_priv->mc_list); spin_unlock(&id_priv->lock); + kref_get(&mc->mcref); switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: ret = cma_join_ib_multicast(id_priv, mc); @@ -2800,7 +2812,7 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr) &mc->multicast.ib->rec.mgid, mc->multicast.ib->rec.mlid); ib_sa_free_multicast(mc->multicast.ib); - kfree(mc); + kref_put(&mc->mcref, release_mc); return; } } -- 1.6.3.3 From hnrose at comcast.net Tue Aug 4 06:54:26 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 4 Aug 2009 09:54:26 -0400 Subject: [ofa-general] [PATCH] perftest/README: Add SL option Message-ID: <20090804135426.GA15784@comcast.net> perftest/README: Add SL option Signed-off-by: Hal Rosenstock --- diff --git a/README b/README index 8c0d558..e0acf2d 100755 --- a/README +++ b/README @@ -124,6 +124,7 @@ Common Options to all tests: -a, --all run sizes from 2 till 2^23 -t, --tx-depth= size of tx queue (default: 50) -n, --iters= number of exchanges (at least 100, default: 1000) + -S, --sl= SL (default 0) -C, --report-cycles report times in cpu cycle units (default: 
microseconds) -H, --report-histogram print out all results From hal.rosenstock at gmail.com Tue Aug 4 07:00:20 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 4 Aug 2009 10:00:20 -0400 Subject: [ofa-general] perftest for Chelsio RNIC adapters In-Reply-To: <351317.84709.qm@web111212.mail.gq1.yahoo.com> References: <351317.84709.qm@web111212.mail.gq1.yahoo.com> Message-ID: On Tue, Aug 4, 2009 at 6:08 AM, Bill N wrote: > Hi, > > Is performance tests of the perftest-1.2 supported for Chelsio and other > RNIC adapters? I'm not sure what 1.2 is exactly but I'm pretty sure the answer is currently no although it shouldn't be much work to add. I recently saw a patch for this supporting a gid option but it looked to me like it was implemented as IBxOE specific rather than also accommodating IB/iWARP. -- Hal > > > Regards, > Bill From hal.rosenstock at gmail.com Tue Aug 4 07:03:17 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 4 Aug 2009 10:03:17 -0400 Subject: [ofa-general] umad SLID and LMC In-Reply-To: <5AEC2602AE03EB46BFC16C6B9B200DA81653EF696B@MNEXMB2.qlogic.org> References: <356B6978-3308-4EE9-8C00-00199558BDEA@redhat.com> <200907231121.00140.jackm@dev.mellanox.co.il> <5AEC2602AE03EB46BFC16C6B9B200DA81653EF696B@MNEXMB2.qlogic.org> Message-ID: On Sun, Aug 2, 2009 at 5:45 PM, Todd Rimmer wrote: > What is the proper way to control the SLID used for outgoing umad sends? > > For example, when using LMC>0, the PathRecord returned from the SM for > talking to a given remote node may have a SLID which is not the BaseLid for > the sender. 
How does the sender ensure the correct SLID is used for the > outgoing mad? > > In reviewing the API it seems like the only way to do this is: > void *umad = umad_alloc(...); > > // call various umad calls to initialize address and contents > umad_get_mad_addr(umad)->path_bits = lower LMC bits of SLID; > > umad_send(..., umad, ...); > > Was path_bits an intentional omission in the API? No; it was an unintentional omission AFAICT. > It would seem that a function which could update the ib_mad_addr in a umad > given a path record would be appropriate. Seems reasonable to me. Care to supply a patch? -- Hal > > Todd Rimmer > Chief Architect > QLogic Network Systems Group > Voice: 610-233-4852 Fax: 610-233-4777 > Todd.Rimmer at QLogic.com www.QLogic.com > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From chien.tin.tung at intel.com Tue Aug 4 07:25:00 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Tue, 4 Aug 2009 07:25:00 -0700 Subject: [ofa-general] perftest for Chelsio RNIC adapters In-Reply-To: <351317.84709.qm@web111212.mail.gq1.yahoo.com> References: <351317.84709.qm@web111212.mail.gq1.yahoo.com> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383035F7A95D5@azsmsx501.amr.corp.intel.com> >Is performance tests of the perftest-1.2 supported for Chelsio >and other RNIC adapters? You can run ib_rdma_bw and ib_rdma_lat over iWarp adapters with the -c flag (use RDMA CM).
Chien From kliteyn at dev.mellanox.co.il Tue Aug 4 07:32:56 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 04 Aug 2009 17:32:56 +0300 Subject: [ofa-general] [PATCH] opensm: fixing handling of opt.max_wire_smps Message-ID: <4A784698.10803@dev.mellanox.co.il> opt.max_wire_smps is uint32, but then when it's propagated into the VL15 poller it's casted to int32. Fixing the parameter handling to protect it from wrong values. Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/main.c | 2 +- opensm/opensm/osm_subnet.c | 7 +++++++ 2 files changed, 8 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 296d5d5..9cb9990 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -722,7 +722,7 @@ int main(int argc, char *argv[]) case 'n': opt.max_wire_smps = strtol(optarg, NULL, 0); - if (opt.max_wire_smps <= 0) + if (opt.max_wire_smps > 0x7FFFFFFF) opt.max_wire_smps = 0x7FFFFFFF; printf(" Max wire smp's = %d\n", opt.max_wire_smps); break; diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index ec15f8a..c07d823 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1066,6 +1066,13 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts) p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK; } + if (p_opts->max_wire_smps > 0x7FFFFFFF) { + log_report(" Invalid Cached Option Value: max_wire_smps = %u," + " Using Default: %u\n", + p_opts->max_wire_smps, OSM_DEFAULT_SMP_MAX_ON_WIRE); + p_opts->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE; + } + if (strcmp(p_opts->console, OSM_DISABLE_CONSOLE) && strcmp(p_opts->console, OSM_LOCAL_CONSOLE) #ifdef ENABLE_OSM_CONSOLE_SOCKET -- 1.5.1.4 From hnrose at comcast.net Tue Aug 4 08:13:37 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 4 Aug 2009 11:13:37 -0400 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash.c: Added error numbers to some error log messages Message-ID: <20090804151337.GA6037@comcast.net> 
Also, made routine local which didn't need to be global Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 2715fe7..6210477 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -811,7 +811,7 @@ static int lash_core(lash_t * p_lash) OSM_LOG_ENTER(p_log); if (p_lash->p_osm->subn.opt.do_mesh_analysis && osm_do_mesh_analysis(p_lash)) { - OSM_LOG(p_log, OSM_LOG_ERROR, "Mesh analysis failed\n"); + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D05: Mesh analysis failed\n"); goto Exit; } @@ -820,7 +820,7 @@ static int lash_core(lash_t * p_lash) shortest_path(p_lash, i); if (generate_routing_func_for_mst(p_lash, i, &dests)) { status = -1; - OSM_LOG(p_log, OSM_LOG_ERROR, + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D06: " "generate_routing_func_for_mst failed\n"); goto Exit; } @@ -969,7 +969,7 @@ static unsigned get_lash_id(osm_switch_t * p_sw) return ((switch_t *) p_sw->priv)->id; } -int get_next_port(switch_t *sw, int link) +static int get_next_port(switch_t *sw, int link) { link_t *l = sw->node->links[link]; int port = l->next_port++; From sashak at voltaire.com Tue Aug 4 08:27:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Aug 2009 18:27:00 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: <20090730232848.GA22660@comcast.net> References: <20090730232848.GA22660@comcast.net> Message-ID: <20090804152700.GF7993@me> Hi, On 19:28 Thu 30 Jul , Hal Rosenstock wrote: > > Currently, MADs are pipelined to a single switch at a time which > effectively serializes these requests due to processing at the SMA. > This patch pipelines (stripes) them across the switches first before > proceeding with successive blocks. As a result of this striping, > multiple switches can process the set and respond concurrently > which results in an improvement to the subnet initialization time. The idea is nice. 
However I have some initial comments about the implementation. BTW, could there be a reason for an option to preserve the current behavior? (I don't know, just asking) > This patch also introduces a new config option (max_smps_per_node) > which indicates how deep the per node pipeline is (current default is 4). > This also has the effect of limiting the number of times that the switch > list is traversed. Maybe this embellishment is unnecessary. Then why is it needed? > All unicast routing protocols are updated for this with the exception > of file. > > A similar subsequent change will do this for MFTs. > > Yevgeny Kliteynik wrote: > > With a small cluster of 17 IS4 switches and 11 HCAs and > to artificially increase the cluster, LMC of 7 was used > including EnhancedSwitchPort 0 LMC. > > With the new code, LFT configuration is more than twice as > fast as with the old code :) > Current ucast manager ran on average for ~250msec, with the > new code - 110-120msec. > > Routing calculation phase of the ucast manager took ~1200 usec, > the rest was sending the blocks and waiting for no more pending > transactions. > > No noticeable difference between various max_smps_per_node values > was observed. What is the reason? And what was the value of 'max_wire_smps'?
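For readers following the thread, the difference between the current (serial) send order and the striped order described in the quoted patch text can be pictured with a toy model. This is only an illustrative sketch, not OpenSM code: struct lft_send, serial_order() and striped_order() are hypothetical names, every switch is assumed to need the same number of blocks, and the response-driven part of the patch (where the receive handler sends the next block) is left out.

```c
#include <stddef.h>

/* One queued LFT block write: which switch, which block (toy model). */
struct lft_send {
	unsigned sw;
	unsigned block;
};

/*
 * Serial order: all blocks of switch 0, then all blocks of switch 1, ...
 * Successive requests hit the same SMA, so they are processed one by one.
 */
static size_t serial_order(unsigned n_sw, unsigned n_blocks,
			   struct lft_send *out)
{
	size_t n = 0;
	unsigned sw, block;

	for (sw = 0; sw < n_sw; sw++)
		for (block = 0; block < n_blocks; block++) {
			out[n].sw = sw;
			out[n].block = block;
			n++;
		}
	return n;
}

/*
 * Striped order: block 0 of every switch, then block 1 of every switch, ...
 * Successive requests go to different switches, which can work concurrently.
 */
static size_t striped_order(unsigned n_sw, unsigned n_blocks,
			    struct lft_send *out)
{
	size_t n = 0;
	unsigned sw, block;

	for (block = 0; block < n_blocks; block++)
		for (sw = 0; sw < n_sw; sw++) {
			out[n].sw = sw;
			out[n].block = block;
			n++;
		}
	return n;
}
```

In this model the two orders contain the same set of writes; only the interleaving differs, which is where the concurrency described in the patch text would come from.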
> Here are some detailed results of different executions (the > number on the left is timer value in usec): > > Current ucast manager (w/o the optimization): > > 000000 [LFT]: osm_ucast_mgr_process() - START > 001131 [LFT]: ucast_mgr_process_tbl() - START > 032251 [LFT]: ucast_mgr_process_tbl() - END > 032263 [LFT]: osm_ucast_mgr_process() - END > 253416 [LFT]: Done wait_for_pending_transactions() > > New code, max_smps_per_node=0: > > 001417 [LFT]: osm_ucast_mgr_process() - START (0 max_smps_per_node) > 002690 [LFT]: ucast_mgr_process_tbl() - START > 032946 [LFT]: ucast_mgr_process_tbl() - END > 032948 [LFT]: osm_ucast_pipeline_tbl() - START > 033846 [LFT]: osm_ucast_pipeline_tbl() - END > 033858 [LFT]: osm_ucast_mgr_process() - END > 108203 [LFT]: Done wait_for_pending_transactions() > > New code, max_smps_per_node=1: > > 007474 [LFT]: osm_ucast_mgr_process() - START (1 max_smps_per_node) > 008735 [LFT]: ucast_mgr_process_tbl() - START > 040071 [LFT]: ucast_mgr_process_tbl() - END > 040074 [LFT]: osm_ucast_pipeline_tbl() - START > 040103 [LFT]: osm_ucast_pipeline_tbl() - END > 040114 [LFT]: osm_ucast_mgr_process() - END > 120097 [LFT]: Done wait_for_pending_transactions() > > New code, max_smps_per_node=4: > > 004137 [LFT]: osm_ucast_mgr_process() - START (4 max_smps_per_node) > 005380 [LFT]: ucast_mgr_process_tbl() - START > 037436 [LFT]: ucast_mgr_process_tbl() - END > 037439 [LFT]: osm_ucast_pipeline_tbl() - START > 037495 [LFT]: osm_ucast_pipeline_tbl() - END > 037506 [LFT]: osm_ucast_mgr_process() - END > 114983 [LFT]: Done wait_for_pending_transactions() > > > With IS3 based Qlogic switches, which do not handle DR packets forwarding > in HW, with a fabric of ~1100 HCAs, ~280 switches: > > Current OSM configures LFTs in ~2 seconds. > New algorithm does the same job in 1.4-1.6 seconds (30%-20% speed up), > depending on the max_smps_per_node value. 
> > As in case of IS4 switches, the shortest config time was obtained with > max_smps_per_node=0, which is unlimited pipeline. > > > Signed-off-by: Hal Rosenstock > --- > Changes since v1: > Added Yevgeny's performance data to patch description above > No change to actual patch > > diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h > index 0537002..617e8a9 100644 > --- a/opensm/include/opensm/osm_base.h > +++ b/opensm/include/opensm/osm_base.h > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. > * > @@ -449,6 +449,18 @@ BEGIN_C_DECLS > */ > #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4 > /***********/ > +/****d* OpenSM: Base/OSM_DEFAULT_SMP_MAX_PER_NODE > +* NAME > +* OSM_DEFAULT_SMP_MAX_PER_NODE > +* > +* DESCRIPTION > +* Specifies the default number of VL15 SMP MADs allowed > +* per node for certain attributes. > +* > +* SYNOPSIS > +*/ > +#define OSM_DEFAULT_SMP_MAX_PER_NODE 4 > +/***********/ > /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE > * NAME > * OSM_SM_DEFAULT_QP0_RCV_SIZE > diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h > index cc8321d..1776380 100644 > --- a/opensm/include/opensm/osm_sm.h > +++ b/opensm/include/opensm/osm_sm.h > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> * > * This software is available to you under a choice of one of two > @@ -130,6 +130,7 @@ typedef struct osm_sm { > osm_sm_mad_ctrl_t mad_ctrl; > osm_lid_mgr_t lid_mgr; > osm_ucast_mgr_t ucast_mgr; > + boolean_t lfts_updated; The name is unclear - actually this means "update in progress". > cl_disp_reg_handle_t sweep_fail_disp_h; > cl_disp_reg_handle_t ni_disp_h; > cl_disp_reg_handle_t pi_disp_h; > @@ -524,6 +525,45 @@ osm_resp_send(IN osm_sm_t * sm, > * > *********/ > > +/****f* OpenSM: SM/osm_sm_set_next_lft_block > +* NAME > +* osm_sm_set_next_lft_block > +* > +* DESCRIPTION > +* Set the next LFT (LinearForwardingTable) block in the indicated switch. > +* > +* SYNOPSIS > +*/ > +void > +osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw, > + IN uint8_t *p_block, IN osm_dr_path_t *p_path, > + IN osm_madw_context_t *p_context); Why should it be in osm_sm.[ch]? osm_ucast_mgr.c or osm_switch.c seem much more appropriate place for this. > +/* > +* PARAMETERS > +* p_sm > +* [in] Pointer to an osm_sm_t object. > +* > +* p_switch > +* [in] Pointer to the switch object. > +* > +* p_block > +* [in] Pointer to the forwarding table block. > +* > +* p_path > +* [in] Pointer to a directed route path object. > +* > +* p_context > +* [in] Mad wrapper context structure to be copied into the wrapper > +* context, and thus visible to the recipient of the response. > +* > +* RETURN VALUES > +* None > +* > +* NOTES > +* > +* SEE ALSO > +*********/ > + > /****f* OpenSM: SM/osm_sm_mcgrp_join > * NAME > * osm_sm_mcgrp_join > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > index 59a32ad..f12afae 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. 
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. > * > @@ -147,6 +147,7 @@ typedef struct osm_subn_opt { > uint32_t sweep_interval; > uint32_t max_wire_smps; > uint32_t transaction_timeout; > + uint32_t max_smps_per_node; > uint8_t sm_priority; > uint8_t lmc; > boolean_t lmc_esp0; > diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h > index 7ce28c5..e12113f 100644 > --- a/opensm/include/opensm/osm_switch.h > +++ b/opensm/include/opensm/osm_switch.h > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > @@ -102,6 +102,7 @@ typedef struct osm_switch { > osm_port_profile_t *p_prof; > uint8_t *lft; > uint8_t *new_lft; > + uint16_t lft_block_id_ho; > osm_mcast_tbl_t mcast_tbl; > unsigned endport_links; > unsigned need_update; > diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h > index a040476..fdea49a 100644 > --- a/opensm/include/opensm/osm_ucast_mgr.h > +++ b/opensm/include/opensm/osm_ucast_mgr.h > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> * > * This software is available to you under a choice of one of two > @@ -233,17 +233,42 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN struct osm_sm * sm); > * osm_ucast_mgr_destroy > *********/ > > -/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_table > +/****f* OpenSM: Unicast Manager/osm_ucast_pipeline_tbl > * NAME > -* osm_ucast_mgr_set_fwd_table > +* osm_ucast_pipeline_tbl > * > * DESCRIPTION > -* Setup forwarding table for the switch (from prepared new_lft). > +* The osm_ucast_pipeline_tbl function pipelines the LFT > +* (LinearForwardingTable) sets across the switches > +* (from prepared new_lft). > * > * SYNOPSIS > */ > -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, > - IN osm_switch_t * const p_sw); > +void osm_ucast_pipeline_tbl(IN osm_ucast_mgr_t * p_mgr); > +/* > +* PARAMETERS > +* p_mgr > +* [in] Pointer to an osm_ucast_mgr_t object. > +* > +* RETURN VALUES > +* None. > +* > +* NOTES > +* > +* SEE ALSO > +*********/ > + > +/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_tbl_top > +* NAME > +* osm_ucast_mgr_set_fwd_tbl_top > +* > +* DESCRIPTION > +* Setup LinearFDBTop for the switch. > +* > +* SYNOPSIS > +*/ > +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * const p_mgr, > + IN osm_switch_t * const p_sw); I don't really like such separation (osm_ucast_mgr_set_fwd_tbl_top and osm_ucast_pipeline_tbl). Why to not use a single function and update all routing engines appropriately (you need to do it anyway), so that this will only fill up new_lfts table? > /* > * PARAMETERS > * p_mgr > diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c > index 2edb8d3..cb131b4 100644 > --- a/opensm/opensm/osm_lin_fwd_rcv.c > +++ b/opensm/opensm/osm_lin_fwd_rcv.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. 
All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > @@ -36,7 +36,7 @@ > /* > * Abstract: > * Implementation of osm_lft_rcv_t. > - * This object represents the NodeDescription Receiver object. > + * This object represents the Linear Forwarding Table Receiver object. > * This object is part of the opensm family of objects. > */ > > @@ -55,6 +55,7 @@ void osm_lft_rcv_process(IN void *context, IN void *data) > { > osm_sm_t *sm = context; > osm_madw_t *p_madw = data; > + osm_dr_path_t *p_path; > ib_smp_t *p_smp; > uint32_t block_num; > osm_switch_t *p_sw; > @@ -62,6 +63,8 @@ void osm_lft_rcv_process(IN void *context, IN void *data) > uint8_t *p_block; > ib_net64_t node_guid; > ib_api_status_t status; > + uint8_t block[IB_SMP_DATA_SIZE]; > + osm_madw_context_t mad_context; > > CL_ASSERT(sm); > > @@ -94,6 +97,16 @@ void osm_lft_rcv_process(IN void *context, IN void *data) > "\n\t\t\t\tSwitch 0x%" PRIx64 "\n", > ib_get_err_str(status), cl_ntoh64(node_guid)); > } > + > + p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0)); > + > + mad_context.lft_context.node_guid = node_guid; > + mad_context.lft_context.set_method = TRUE; > + > + osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path, > + &mad_context); > + > + p_sw->lft_block_id_ho++; Wouldn't it be simpler to encode block_id in a mad context? > } > > CL_PLOCK_RELEASE(sm->p_lock); > diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c > index daa60ff..4e0fd2a 100644 > --- a/opensm/opensm/osm_sm.c > +++ b/opensm/opensm/osm_sm.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * Copyright (c) 2008 Xsigo Systems Inc. 
All rights reserved. > * > @@ -441,6 +441,45 @@ Exit: > > /********************************************************************** > **********************************************************************/ > +void osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw, > + IN uint8_t *p_block, IN osm_dr_path_t *p_path, > + IN osm_madw_context_t *context) > +{ > + ib_api_status_t status; > + > + for (; > + osm_switch_get_lft_block(p_sw, p_sw->lft_block_id_ho, p_block); > + p_sw->lft_block_id_ho++) { > + if (!p_sw->need_update && !p_sm->p_subn->need_update && > + !memcmp(p_block, > + p_sw->new_lft + p_sw->lft_block_id_ho * IB_SMP_DATA_SIZE, > + IB_SMP_DATA_SIZE)) > + continue; > + > + p_sm->lfts_updated = 1; > + > + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, > + "Writing FT block %u to switch 0x%" PRIx64 "\n", > + p_sw->lft_block_id_ho, > + cl_ntoh64(context->lft_context.node_guid)); > + > + status = osm_req_set(p_sm, p_path, > + p_sw->new_lft + > + p_sw->lft_block_id_ho * IB_SMP_DATA_SIZE, > + IB_SMP_DATA_SIZE, IB_MAD_ATTR_LIN_FWD_TBL, > + cl_hton32(p_sw->lft_block_id_ho), > + CL_DISP_MSGID_NONE, context); > + > + if (status != IB_SUCCESS) > + OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 2E11: " > + "Sending linear fwd. tbl. block failed (%s)\n", > + ib_get_err_str(status)); > + break; > + } > +} > + > +/********************************************************************** > + **********************************************************************/ > static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm, > IN osm_mgrp_t * p_mgrp) > { > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index ec15f8a..1964b7f 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. 
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. > * > @@ -295,6 +295,7 @@ static const opt_rec_t opt_tbl[] = { > { "m_key_lease_period", OPT_OFFSET(m_key_lease_period), opts_parse_net16, NULL, 1 }, > { "sweep_interval", OPT_OFFSET(sweep_interval), opts_parse_uint32, NULL, 1 }, > { "max_wire_smps", OPT_OFFSET(max_wire_smps), opts_parse_uint32, NULL, 1 }, > + { "max_smps_per_node", OPT_OFFSET(max_smps_per_node), opts_parse_uint32, NULL, 1 }, > { "console", OPT_OFFSET(console), opts_parse_charp, NULL, 0 }, > { "console_port", OPT_OFFSET(console_port), opts_parse_uint16, NULL, 0 }, > { "transaction_timeout", OPT_OFFSET(transaction_timeout), opts_parse_uint32, NULL, 1 }, > @@ -671,6 +672,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) > p_opt->m_key_lease_period = 0; > p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS; > p_opt->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE; > + p_opt->max_smps_per_node = OSM_DEFAULT_SMP_MAX_PER_NODE; > p_opt->console = strdup(OSM_DEFAULT_CONSOLE); > p_opt->console_port = OSM_DEFAULT_CONSOLE_PORT; > p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC; > @@ -1461,6 +1463,10 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) > "max_wire_smps %u\n\n" > "# The maximum time in [msec] allowed for a transaction to complete\n" > "transaction_timeout %u\n\n" > + "# Maximum number of SMPs per node sent in parallel\n" > + "# (0 means unlimited)\n" > + "# Only applies to certain attributes\n" > + "max_smps_per_node %u\n\n" > "# Maximal time in [msec] a message can stay in the incoming message queue.\n" > "# If there is more than one message in the queue and the last message\n" > "# stayed in the queue more than this value, any SA request will be\n" > @@ -1470,6 +1476,7 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) > "single_thread %s\n\n", > p_opts->max_wire_smps, > 
p_opts->transaction_timeout, > + p_opts->max_smps_per_node, > p_opts->max_msg_fifo_timeout, > p_opts->single_thread ? "TRUE" : "FALSE"); > > diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c > index 216b496..31c930b 100644 > --- a/opensm/opensm/osm_ucast_cache.c > +++ b/opensm/opensm/osm_ucast_cache.c > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -1085,9 +1085,11 @@ int osm_ucast_cache_process(osm_ucast_mgr_t * p_mgr) > memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); > } > > - osm_ucast_mgr_set_fwd_table(p_mgr, p_sw); > + osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw); > } > > + osm_ucast_pipeline_tbl(p_mgr); > + > return 0; > } > > diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c > index 2505c46..099e8ba 100644 > --- a/opensm/opensm/osm_ucast_file.c > +++ b/opensm/opensm/osm_ucast_file.c > @@ -168,8 +168,8 @@ static int do_ucast_file_load(void *context) > "routing algorithm\n"); > } else if (!strncmp(p, "Unicast lids", 12)) { > if (p_sw) > - osm_ucast_mgr_set_fwd_table(&p_osm->sm. > - ucast_mgr, p_sw); > + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm. > + ucast_mgr, p_sw); > q = strstr(p, " guid 0x"); > if (!q) { > OSM_LOG(&p_osm->log, OSM_LOG_ERROR, > @@ -247,7 +247,7 @@ static int do_ucast_file_load(void *context) > } > > if (p_sw) > - osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw); > + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw); > > fclose(file); > return 0; I suppose that this breaks 'file' routing engine (did you test it?) - instead of switch LFTs setup this will only update its TOPs. 
> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > index bde6dbd..d65c685 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -2,7 +2,7 @@ > * Copyright (c) 2009 Simula Research Laboratory. All rights reserved. > * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > @@ -1905,8 +1905,8 @@ static void set_sw_fwd_table(IN cl_map_item_t * const p_map_item, > ftree_fabric_t *p_ftree = (ftree_fabric_t *) context; > > p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid; > - osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, > - p_sw->p_osm_sw); > + osm_ucast_mgr_set_fwd_tbl_top(&p_ftree->p_osm->sm.ucast_mgr, > + p_sw->p_osm_sw); > } > > /*************************************************** > @@ -4005,6 +4005,8 @@ static int do_routing(IN void *context) > /* for each switch, set its fwd table */ > cl_qmap_apply_func(&p_ftree->sw_tbl, set_sw_fwd_table, (void *)p_ftree); > > + osm_ucast_pipeline_tbl(&p_ftree->p_osm->sm.ucast_mgr); > + > /* write out hca ordering file */ > fabric_dump_hca_ordering(p_ftree); > > diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c > index 12b5e34..adf5f6c 100644 > --- a/opensm/opensm/osm_ucast_lash.c > +++ b/opensm/opensm/osm_ucast_lash.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> * Copyright (c) 2007 Simula Research Laboratory. All rights reserved. > * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved. > @@ -1045,8 +1045,11 @@ static void populate_fwd_tbls(lash_t * p_lash) > physical_egress_port); > } > } /* for */ > - osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw); > + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw); > } > + > + osm_ucast_pipeline_tbl(&p_osm->sm.ucast_mgr); > + > OSM_LOG_EXIT(p_log); > } > > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index 78a7031..86d1c98 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > @@ -315,16 +315,14 @@ Exit: > > /********************************************************************** > **********************************************************************/ > -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr, > - IN osm_switch_t * p_sw) > +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr, > + IN osm_switch_t * p_sw) > { > osm_node_t *p_node; > osm_dr_path_t *p_path; > osm_madw_context_t context; > ib_api_status_t status; > ib_switch_info_t si; > - uint16_t block_id_ho = 0; > - uint8_t block[IB_SMP_DATA_SIZE]; > boolean_t set_swinfo_require = FALSE; > uint16_t lin_top; > uint8_t life_state; > @@ -382,48 +380,8 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr, > ib_get_err_str(status)); > } > > - /* > - Send linear forwarding table blocks to the switch > - as long as the switch indicates it has blocks needing > - configuration. 
> - */ > - > - context.lft_context.node_guid = osm_node_get_node_guid(p_node); > - context.lft_context.set_method = TRUE; > - > - if (!p_sw->new_lft) { > - /* any routing should provide the new_lft */ > - CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && > - p_mgr->cache_valid && !p_sw->need_update); > - goto Exit; > - } > - > - for (block_id_ho = 0; > - osm_switch_get_lft_block(p_sw, block_id_ho, block); > - block_id_ho++) { > - if (!p_sw->need_update && !p_mgr->p_subn->need_update && > - !memcmp(block, > - p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, > - IB_SMP_DATA_SIZE)) > - continue; > - > - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > - "Writing FT block %u\n", block_id_ho); > - > - status = osm_req_set(p_mgr->sm, p_path, > - p_sw->new_lft + > - block_id_ho * IB_SMP_DATA_SIZE, > - sizeof(block), IB_MAD_ATTR_LIN_FWD_TBL, > - cl_hton32(block_id_ho), CL_DISP_MSGID_NONE, > - &context); > + p_sw->lft_block_id_ho = 0; > > - if (status != IB_SUCCESS) > - OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: " > - "Sending linear fwd. tbl. 
block failed (%s)\n", > - ib_get_err_str(status)); > - } > - > -Exit: > OSM_LOG_EXIT(p_mgr->p_log); > return 0; > } > @@ -508,7 +466,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item, > } > } > > - osm_ucast_mgr_set_fwd_table(p_mgr, p_sw); > + osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw); > > if (p_mgr->p_subn->opt.lmc) > free_ports_priv(p_mgr); > @@ -516,6 +474,47 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item, > OSM_LOG_EXIT(p_mgr->p_log); > } > > +static void ucast_mgr_pipeline_tbl(IN osm_switch_t *p_sw, > + IN osm_ucast_mgr_t *p_mgr) > +{ > + osm_dr_path_t *p_path; > + osm_madw_context_t mad_context; > + uint8_t block[IB_SMP_DATA_SIZE]; > + > + OSM_LOG_ENTER(p_mgr->p_log); > + > + CL_ASSERT(p_sw && p_sw->p_node); > + > + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > + "Processing switch 0x%" PRIx64 "\n", > + cl_ntoh64(osm_node_get_node_guid(p_sw->p_node))); > + > + /* > + Send linear forwarding table blocks to the switch > + as long as the switch indicates it has blocks needing > + configuration. 
> + */ > + if (!p_sw->new_lft) { > + /* any routing should provide the new_lft */ > + CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && > + p_mgr->cache_valid && !p_sw->need_update); > + goto Exit; > + } > + > + p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0)); > + > + mad_context.lft_context.node_guid = osm_node_get_node_guid(p_sw->p_node); > + mad_context.lft_context.set_method = TRUE; > + > + osm_sm_set_next_lft_block(p_mgr->sm, p_sw, &block[0], p_path, > + &mad_context); > + > + p_sw->lft_block_id_ho++; > + > +Exit: > + OSM_LOG_EXIT(p_mgr->p_log); > +} > + > /********************************************************************** > **********************************************************************/ > static void ucast_mgr_process_neighbors(IN cl_map_item_t * p_map_item, > @@ -870,6 +869,28 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t * m) > add_sw_endports_to_order_list(s[i], m); > } > > +void osm_ucast_pipeline_tbl(osm_ucast_mgr_t * p_mgr) > +{ > + cl_qmap_t *p_sw_tbl; > + osm_switch_t *p_sw; > + int i; > + > + for (i = 0; > + !p_mgr->p_subn->opt.max_smps_per_node || > + i < p_mgr->p_subn->opt.max_smps_per_node; > + i++) { > + p_mgr->sm->lfts_updated = 0; > + p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl; > + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); > + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { > + ucast_mgr_pipeline_tbl(p_sw, p_mgr); > + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); > + } > + if (!p_mgr->sm->lfts_updated) > + break; > + } > +} Is it possible (for example in case of send errors) that "partial" LFT blocks sending will trigger wait_for_pending_transaction() completion? 
Sasha > + > static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr) > { > cl_qlist_init(&p_mgr->port_order_list); > @@ -904,6 +925,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr) > cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, ucast_mgr_process_tbl, > p_mgr); > > + osm_ucast_pipeline_tbl(p_mgr); > + > cl_qlist_remove_all(&p_mgr->port_order_list); > > return 0; > From sashak at voltaire.com Tue Aug 4 08:30:20 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Aug 2009 18:30:20 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_lash.c: Added error numbers to some error log messages In-Reply-To: <20090804151337.GA6037@comcast.net> References: <20090804151337.GA6037@comcast.net> Message-ID: <20090804153020.GG7993@me> On 11:13 Tue 04 Aug , Hal Rosenstock wrote: > > Also, made routine local which didn't need to be global > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue Aug 4 08:35:09 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Aug 2009 18:35:09 +0300 Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of opt.max_wire_smps In-Reply-To: <4A784698.10803@dev.mellanox.co.il> References: <4A784698.10803@dev.mellanox.co.il> Message-ID: <20090804153509.GH7993@me> On 17:32 Tue 04 Aug , Yevgeny Kliteynik wrote: > opt.max_wire_smps is uint32, but then when it's propagated > into the VL15 poller it's casted to int32. Fixing the > parameter handling to protect it from wrong values. > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/main.c | 2 +- > opensm/opensm/osm_subnet.c | 7 +++++++ > 2 files changed, 8 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > index 296d5d5..9cb9990 100644 > --- a/opensm/opensm/main.c > +++ b/opensm/opensm/main.c > @@ -722,7 +722,7 @@ int main(int argc, char *argv[]) > > case 'n': > opt.max_wire_smps = strtol(optarg, NULL, 0); Then you likely want to use strtoul(). 
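A minimal sketch of what the strtoul() variant could look like. The helper name parse_max_wire_smps() and the MAX_WIRE_SMPS_LIMIT macro are hypothetical and not part of the patch. It also assumes that 0 should keep mapping to 0x7FFFFFFF, as the pre-patch "<= 0" check did for the uint32 field; whether that is the desired behavior is exactly the open question in this review.

```c
#include <stdint.h>
#include <stdlib.h>

/* Largest value that survives the later conversion to int32 (assumption). */
#define MAX_WIRE_SMPS_LIMIT 0x7FFFFFFFu

/*
 * Parse the option argument as unsigned and clamp it.
 * - 0 (including unparsable input, for which strtoul() returns 0) is
 *   mapped to the limit, mirroring the old "<= 0" branch.
 * - Negative input such as "-1" is negated in unsigned long by strtoul()
 *   and therefore ends up above the limit, so it is clamped as well.
 * - On overflow strtoul() returns ULONG_MAX, which is also clamped.
 */
static uint32_t parse_max_wire_smps(const char *arg)
{
	unsigned long val = strtoul(arg, NULL, 0);

	if (val == 0 || val > MAX_WIRE_SMPS_LIMIT)
		return MAX_WIRE_SMPS_LIMIT;
	return (uint32_t) val;
}
```

With a helper like this the 'n' case would reduce to a single assignment, and the printf() format could become %u to match the unsigned type.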
> -			if (opt.max_wire_smps <= 0)
> +			if (opt.max_wire_smps > 0x7FFFFFFF)
>  				opt.max_wire_smps = 0x7FFFFFFF;

What about opt.max_wire_smps == 0?

Sasha

>  			printf(" Max wire smp's = %d\n", opt.max_wire_smps);
>  			break;
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index ec15f8a..c07d823 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1066,6 +1066,13 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts)
>  		p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK;
>  	}
>
> +	if (p_opts->max_wire_smps > 0x7FFFFFFF) {
> +		log_report(" Invalid Cached Option Value: max_wire_smps = %u,"
> +			   " Using Default: %u\n",
> +			   p_opts->max_wire_smps, OSM_DEFAULT_SMP_MAX_ON_WIRE);
> +		p_opts->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
> +	}

Ditto.

Sasha

From rdreier at cisco.com Tue Aug 4 09:05:13 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Aug 2009 09:05:13 -0700
Subject: [ofa-general] Re: [PATCH] cma: fix access to freed memory
In-Reply-To: <20090804033221.GA30949@mtls03> (Eli Cohen's message of "Tue, 4 Aug 2009 06:32:21 +0300")
References: <20090803092528.GA25528@mtls03> <20090804033221.GA30949@mtls03>
Message-ID:

 > Maybe it's just a loose connection but yet, it seems to me that
 > operations on id_priv->mc_list should be protected. Should I send a
 > different patch?

"seems ... should be" is very weak justification for locking. What
should they be protected from?

 - R.

From bart.vanassche at gmail.com Tue Aug 4 09:07:31 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Tue, 4 Aug 2009 18:07:31 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed
In-Reply-To:
References:
Message-ID:

On Mon, Aug 3, 2009 at 10:36 PM, Roland Dreier wrote:
> How do we end up in srp_reset_device() with req->scmnd->device == NULL?
> Presumably req->scmnd should match scmnd if I am understanding the code
> properly -- and then scmnd->device == NULL??

An update: apparently it is possible to trigger scmnd->device == NULL even
without triggering a prior IB CM disconnect. The following shell commands
are sufficient to trigger the WARN_ON statement in the patch below:

rmmod ib_srp
modprobe ib_srp
ibsrpdm -c | while read target_info; do echo "${target_info}"; echo "${target_info}" >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target; done
sg_reset -d ${srp_device}

So it should be analyzed why scmnd->device can be NULL before applying
any patches to fix the NULL pointer dereference.

Bart.

--- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c 2009-08-03 12:13:11.000000000 +0200
+++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c 2009-08-04 17:25:27.000000000 +0200
@@ -1330,6 +1330,8 @@ static int srp_send_tsk_mgmt(struct srp_
 	struct srp_iu *iu;
 	struct srp_tsk_mgmt *tsk_mgmt;
 
+	BUG_ON(!req->scmnd->device);
+
 	spin_lock_irq(target->scsi_host->host_lock);
 
 	if (target->state == SRP_TARGET_DEAD ||
@@ -1425,6 +1427,8 @@ static int srp_reset_device(struct scsi_
 		return FAILED;
 	if (srp_find_req(target, scmnd, &req))
 		return FAILED;
+	if (WARN_ON(!req->scmnd->device))
+		return FAILED;
 	if (srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET))
 		return FAILED;
 	if (req->tsk_status)

From rdreier at cisco.com Tue Aug 4 09:27:23 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Aug 2009 09:27:23 -0700
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed
In-Reply-To: (Bart Van Assche's message of "Tue, 4 Aug 2009 18:07:31 +0200")
References:
Message-ID:

 > An update: apparently it is possible to trigger scmnd->device == NULL even
 > without triggering a prior IB CM disconnect.
The following shell commands
 > are sufficient to trigger the WARN_ON statement in the patch below:

 > rmmod ib_srp
 > modprobe ib_srp
 > ibsrpdm -c | while read target_info; do echo "${target_info}"; echo
 > "${target_info}" >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target;
 > done
 > sg_reset -d ${srp_device}

So in other words, just sg_reset on an SRP device triggers the warning?

From hal.rosenstock at gmail.com Tue Aug 4 09:29:08 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 4 Aug 2009 12:29:08 -0400
Subject: [ofa-general] umad SLID and LMC
In-Reply-To:
References: <356B6978-3308-4EE9-8C00-00199558BDEA@redhat.com> <200907231121.00140.jackm@dev.mellanox.co.il> <5AEC2602AE03EB46BFC16C6B9B200DA81653EF696B@MNEXMB2.qlogic.org>
Message-ID:

On Tue, Aug 4, 2009 at 10:03 AM, Hal Rosenstock wrote:

>
>
> On Sun, Aug 2, 2009 at 5:45 PM, Todd Rimmer wrote:
>
>> What is the proper way to control the SLID used for outgoing umad sends?
>>
>> For example, when using LMC>0, the PathRecord returned from the SM for
>> talking to a given remote node may have a SLID which is not the BaseLid for
>> the sender. How does the sender ensure the correct SLID is used for the
>> outgoing mad?
>>
>> In reviewing the API it seems like the only way to do this is:
>> void *umad = umad_alloc(...);
>>
>> // call various umad calls to initialize address and contents
>> umad_get_mad_addr(umad)->path_bits = lower LMC bits of SLID;
>>
>> umad_send(..., umad, ...);
>>
>> Was path_bits an intentional omission in the API?
>
>
> No; it was an unintentional omission AFAIT.
>
>
>> It would seem that a function which could update the ib_mad_addr in a
>> umad given a path record would seem appropriate.
>
>
> Seems reasonable to me.
>

On second thought, umad is lower level than this and knows nothing of path
records (that is at a higher level). Some API like umad_set_addr handling path
bits would be another alternative for this.

-- Hal

> Care to supply a patch ?
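To make "lower LMC bits of SLID" in Todd's snippet concrete, here is a small sketch of the arithmetic only. `slid_path_bits` is a hypothetical helper, not a libibumad function. It assumes the usual LMC convention, in which a port owns 2^LMC consecutive LIDs starting at its base LID, so the path bits are simply the low LMC bits of the chosen source LID.

```c
#include <stdint.h>

/*
 * Hypothetical helper: extract the path bits from a source LID.
 * slid is in host byte order; lmc is the port's LMC (0..7).
 */
static uint8_t slid_path_bits(uint16_t slid, uint8_t lmc)
{
	/* LMC is a 3-bit field, so mask it before shifting. */
	uint16_t mask = (uint16_t)((1u << (lmc & 7u)) - 1u);

	return (uint8_t)(slid & mask);
}
```

Following the snippet quoted above, the result would then be stored with something like `umad_get_mad_addr(umad)->path_bits = slid_path_bits(slid, lmc);` before calling umad_send(). The SLID from a PathRecord is in network byte order and would need converting first.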
>
> -- Hal
>
>
>>
>> Todd Rimmer
>> Chief Architect
>> QLogic Network Systems Group
>> Voice: 610-233-4852 Fax: 610-233-4777
>> Todd.Rimmer at QLogic.com www.QLogic.com
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>
>
>

From bart.vanassche at gmail.com Tue Aug 4 09:30:18 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Tue, 4 Aug 2009 18:30:18 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed
In-Reply-To:
References:
Message-ID:

On Tue, Aug 4, 2009 at 6:27 PM, Roland Dreier wrote:
>
>  > An update: apparently it is possible to trigger scmnd->device == NULL even
>  > without triggering a prior IB CM disconnect. The following shell commands
>  > are sufficient to trigger the WARN_ON statement in the patch below:
>
>  > rmmod ib_srp
>  > modprobe ib_srp
>  > ibsrpdm -c | while read target_info; do echo "${target_info}"; echo
>  > "${target_info}" >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target;
>  > done
>  > sg_reset -d ${srp_device}
>
> So in other words, just sg_reset on an SRP device triggers the warning?

Yes, but only if no I/O has been performed after the ${srp_device} has
been created and before the sg_reset has been issued. When e.g. the
command dd if=${srp_device} of=/dev/null iflag=direct bs=1M is inserted
just before the sg_reset command, the kernel warning is not triggered.

Bart.
From hal.rosenstock at gmail.com Tue Aug 4 09:45:05 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 4 Aug 2009 12:45:05 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: <20090804152700.GF7993@me> References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> Message-ID: On Tue, Aug 4, 2009 at 11:27 AM, Sasha Khapyorsky wrote: > Hi, > > On 19:28 Thu 30 Jul , Hal Rosenstock wrote: > > > > Currently, MADs are pipelined to a single switch at a time which > > effectively serializes these requests due to processing at the SMA. > > This patch pipelines (stripes) them across the switches first before > > proceeding with successive blocks. As a result of this striping, > > multiple switches can process the set and respond concurrently > > which results in an improvement to the subnet initialization time. > > The idea is nice. However I have some initial comments about an > implementation. > > BTW should there be a reason for an option to preserve the current > behavior? (I don't know, just asking) I asked this in an email on the thread on this. It's up to you. I don't see a need but if we want to be conservative, it can be added. > > > > This patch also introduces a new config option (max_smps_per_node) > > which indicates how deep the per node pipeline is (current default is 4). > > This also has the effect of limiting the number of times that the switch > > list is traversed. Maybe this embellishment is unnecessary. > > Then why is it needed? Also, as was discussed in the thread on this, it gives a way to control possible VL15 overflow. > > > > All unicast routing protocols are updated for this with the exception > > of file. > > > > A similar subsequent change will do this for MFTs. > > > > Yevgeny Kliteynik wrote: > > > > With a small cluster of 17 IS4 switches and 11 HCAs and > > to artificially increase the cluster, LMC of 7 was used > > including EnhancedSwitchPort 0 LMC. 
> >
> > With the new code, LFT configuration is more than twice as
> > fast as with the old code :)
> > Current ucast manager ran on average for ~250msec, with the
> > new code - 110-120msec.
> >
> > Routing calculation phase of the ucast manager took ~1200 usec,
> > the rest was sending the blocks and waiting for no more pending
> > transactions.
> >
> > No noticeable difference between various max_smps_per_node values
> > was observed.
>
> What is the reason?

I think the reason was max_wire_smps may have kicked in but Yevgeny is best
to elaborate on this.

> And what was value of 'max_wire_smps'?
>
> Here are some detailed results of different executions (the
> number on the left is timer value in usec):
>
> Current ucast manager (w/o the optimization):
>
> 000000 [LFT]: osm_ucast_mgr_process() - START
> 001131 [LFT]: ucast_mgr_process_tbl() - START
> 032251 [LFT]: ucast_mgr_process_tbl() - END
> 032263 [LFT]: osm_ucast_mgr_process() - END
> 253416 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=0:
>
> 001417 [LFT]: osm_ucast_mgr_process() - START (0 max_smps_per_node)
> 002690 [LFT]: ucast_mgr_process_tbl() - START
> 032946 [LFT]: ucast_mgr_process_tbl() - END
> 032948 [LFT]: osm_ucast_pipeline_tbl() - START
> 033846 [LFT]: osm_ucast_pipeline_tbl() - END
> 033858 [LFT]: osm_ucast_mgr_process() - END
> 108203 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=1:
>
> 007474 [LFT]: osm_ucast_mgr_process() - START (1 max_smps_per_node)
> 008735 [LFT]: ucast_mgr_process_tbl() - START
> 040071 [LFT]: ucast_mgr_process_tbl() - END
> 040074 [LFT]: osm_ucast_pipeline_tbl() - START
> 040103 [LFT]: osm_ucast_pipeline_tbl() - END
> 040114 [LFT]: osm_ucast_mgr_process() - END
> 120097 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=4:
>
> 004137 [LFT]: osm_ucast_mgr_process() - START (4 max_smps_per_node)
> 005380 [LFT]: ucast_mgr_process_tbl() - START
> 037436 [LFT]: ucast_mgr_process_tbl() -
END > 037439 [LFT]: osm_ucast_pipeline_tbl() - START > 037495 [LFT]: osm_ucast_pipeline_tbl() - END > 037506 [LFT]: osm_ucast_mgr_process() - END > 114983 [LFT]: Done wait_for_pending_transactions() > > > With IS3 based Qlogic switches, which do not handle DR packets forwarding > in HW, with a fabric of ~1100 HCAs, ~280 switches: > > Current OSM configures LFTs in ~2 seconds. > New algorithm does the same job in 1.4-1.6 seconds (30%-20% speed up), > depending on the max_smps_per_node value. > > As in case of IS4 switches, the shortest config time was obtained with > max_smps_per_node=0, which is unlimited pipeline. > > > Signed-off-by: Hal Rosenstock > --- > Changes since v1: > Added Yevgeny's performance data to patch description above > No change to actual patch > > diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h > index 0537002..617e8a9 100644 > --- a/opensm/include/opensm/osm_base.h > +++ b/opensm/include/opensm/osm_base.h > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. > * > @@ -449,6 +449,18 @@ BEGIN_C_DECLS > */ > #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4 > /***********/ > +/****d* OpenSM: Base/OSM_DEFAULT_SMP_MAX_PER_NODE > +* NAME > +* OSM_DEFAULT_SMP_MAX_PER_NODE > +* > +* DESCRIPTION > +* Specifies the default number of VL15 SMP MADs allowed > +* per node for certain attributes. 
> +* > +* SYNOPSIS > +*/ > +#define OSM_DEFAULT_SMP_MAX_PER_NODE 4 > +/***********/ > /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE > * NAME > * OSM_SM_DEFAULT_QP0_RCV_SIZE > diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h > index cc8321d..1776380 100644 > --- a/opensm/include/opensm/osm_sm.h > +++ b/opensm/include/opensm/osm_sm.h > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > @@ -130,6 +130,7 @@ typedef struct osm_sm { > osm_sm_mad_ctrl_t mad_ctrl; > osm_lid_mgr_t lid_mgr; > osm_ucast_mgr_t ucast_mgr; > + boolean_t lfts_updated; The name is unclear - actually this means "update in progress". OK. > > > > cl_disp_reg_handle_t sweep_fail_disp_h; > > cl_disp_reg_handle_t ni_disp_h; > > cl_disp_reg_handle_t pi_disp_h; > > @@ -524,6 +525,45 @@ osm_resp_send(IN osm_sm_t * sm, > > * > > *********/ > > > > +/****f* OpenSM: SM/osm_sm_set_next_lft_block > > +* NAME > > +* osm_sm_set_next_lft_block > > +* > > +* DESCRIPTION > > +* Set the next LFT (LinearForwardingTable) block in the indicated > switch. > > +* > > +* SYNOPSIS > > +*/ > > +void > > +osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw, > > + IN uint8_t *p_block, IN osm_dr_path_t *p_path, > > + IN osm_madw_context_t *p_context); > > Why should it be in osm_sm.[ch]? osm_ucast_mgr.c or osm_switch.c seem > much more appropriate place for this. OK. > > > > +/* > > +* PARAMETERS > > +* p_sm > > +* [in] Pointer to an osm_sm_t object. > > +* > > +* p_switch > > +* [in] Pointer to the switch object. > > +* > > +* p_block > > +* [in] Pointer to the forwarding table block. > > +* > > +* p_path > > +* [in] Pointer to a directed route path object. 
> > +* > > +* p_context > > +* [in] Mad wrapper context structure to be copied into the > wrapper > > +* context, and thus visible to the recipient of the response. > > +* > > +* RETURN VALUES > > +* None > > +* > > +* NOTES > > +* > > +* SEE ALSO > > +*********/ > > + > > /****f* OpenSM: SM/osm_sm_mcgrp_join > > * NAME > > * osm_sm_mcgrp_join > > diff --git a/opensm/include/opensm/osm_subnet.h > b/opensm/include/opensm/osm_subnet.h > > index 59a32ad..f12afae 100644 > > --- a/opensm/include/opensm/osm_subnet.h > > +++ b/opensm/include/opensm/osm_subnet.h > > @@ -1,6 +1,6 @@ > > /* > > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights > reserved. > > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. > > * > > @@ -147,6 +147,7 @@ typedef struct osm_subn_opt { > > uint32_t sweep_interval; > > uint32_t max_wire_smps; > > uint32_t transaction_timeout; > > + uint32_t max_smps_per_node; > > uint8_t sm_priority; > > uint8_t lmc; > > boolean_t lmc_esp0; > > diff --git a/opensm/include/opensm/osm_switch.h > b/opensm/include/opensm/osm_switch.h > > index 7ce28c5..e12113f 100644 > > --- a/opensm/include/opensm/osm_switch.h > > +++ b/opensm/include/opensm/osm_switch.h > > @@ -1,6 +1,6 @@ > > /* > > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights > reserved. > > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> > * > > * This software is available to you under a choice of one of two > > @@ -102,6 +102,7 @@ typedef struct osm_switch { > > osm_port_profile_t *p_prof; > > uint8_t *lft; > > uint8_t *new_lft; > > + uint16_t lft_block_id_ho; > > osm_mcast_tbl_t mcast_tbl; > > unsigned endport_links; > > unsigned need_update; > > diff --git a/opensm/include/opensm/osm_ucast_mgr.h > b/opensm/include/opensm/osm_ucast_mgr.h > > index a040476..fdea49a 100644 > > --- a/opensm/include/opensm/osm_ucast_mgr.h > > +++ b/opensm/include/opensm/osm_ucast_mgr.h > > @@ -1,6 +1,6 @@ > > /* > > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights > reserved. > > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > @@ -233,17 +233,42 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const > p_mgr, IN struct osm_sm * sm); > > * osm_ucast_mgr_destroy > > *********/ > > > > -/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_table > > +/****f* OpenSM: Unicast Manager/osm_ucast_pipeline_tbl > > * NAME > > -* osm_ucast_mgr_set_fwd_table > > +* osm_ucast_pipeline_tbl > > * > > * DESCRIPTION > > -* Setup forwarding table for the switch (from prepared new_lft). > > +* The osm_ucast_pipeline_tbl function pipelines the LFT > > +* (LinearForwardingTable) sets across the switches > > +* (from prepared new_lft). > > * > > * SYNOPSIS > > */ > > -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, > > - IN osm_switch_t * const p_sw); > > +void osm_ucast_pipeline_tbl(IN osm_ucast_mgr_t * p_mgr); > > +/* > > +* PARAMETERS > > +* p_mgr > > +* [in] Pointer to an osm_ucast_mgr_t object. > > +* > > +* RETURN VALUES > > +* None. 
> > +* > > +* NOTES > > +* > > +* SEE ALSO > > +*********/ > > + > > +/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_tbl_top > > +* NAME > > +* osm_ucast_mgr_set_fwd_tbl_top > > +* > > +* DESCRIPTION > > +* Setup LinearFDBTop for the switch. > > +* > > +* SYNOPSIS > > +*/ > > +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * const p_mgr, > > + IN osm_switch_t * const p_sw); > > I don't really like such separation (osm_ucast_mgr_set_fwd_tbl_top and > osm_ucast_pipeline_tbl). Why not ? What's the matter with doing this ? > Why to not use a single function and update all > routing engines appropriately (you need to do it anyway), so that this > will only fill up new_lfts table? I'm not following what you're describing. set_fwd_tbl_top sets LinearFDBTop whereas pipeline_tbl starts the cascade of LFT sets based on max_smps_per_node. > > > > /* > > * PARAMETERS > > * p_mgr > > diff --git a/opensm/opensm/osm_lin_fwd_rcv.c > b/opensm/opensm/osm_lin_fwd_rcv.c > > index 2edb8d3..cb131b4 100644 > > --- a/opensm/opensm/osm_lin_fwd_rcv.c > > +++ b/opensm/opensm/osm_lin_fwd_rcv.c > > @@ -1,6 +1,6 @@ > > /* > > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights > reserved. > > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > @@ -36,7 +36,7 @@ > > /* > > * Abstract: > > * Implementation of osm_lft_rcv_t. > > - * This object represents the NodeDescription Receiver object. > > + * This object represents the Linear Forwarding Table Receiver object. > > * This object is part of the opensm family of objects. 
> > */ > > > > @@ -55,6 +55,7 @@ void osm_lft_rcv_process(IN void *context, IN void > *data) > > { > > osm_sm_t *sm = context; > > osm_madw_t *p_madw = data; > > + osm_dr_path_t *p_path; > > ib_smp_t *p_smp; > > uint32_t block_num; > > osm_switch_t *p_sw; > > @@ -62,6 +63,8 @@ void osm_lft_rcv_process(IN void *context, IN void > *data) > > uint8_t *p_block; > > ib_net64_t node_guid; > > ib_api_status_t status; > > + uint8_t block[IB_SMP_DATA_SIZE]; > > + osm_madw_context_t mad_context; > > > > CL_ASSERT(sm); > > > > @@ -94,6 +97,16 @@ void osm_lft_rcv_process(IN void *context, IN void > *data) > > "\n\t\t\t\tSwitch 0x%" PRIx64 "\n", > > ib_get_err_str(status), > cl_ntoh64(node_guid)); > > } > > + > > + p_path = > osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0)); > > + > > + mad_context.lft_context.node_guid = node_guid; > > + mad_context.lft_context.set_method = TRUE; > > + > > + osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path, > > + &mad_context); > > + > > + p_sw->lft_block_id_ho++; > > Wouldn't it be simpler to encode block_id in a mad context? Why simpler ? I think it complicates the receiver code to do that (assuming max_smps_per_node remains). > > > > } > > > > CL_PLOCK_RELEASE(sm->p_lock); > > diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c > > index daa60ff..4e0fd2a 100644 > > --- a/opensm/opensm/osm_sm.c > > +++ b/opensm/opensm/osm_sm.c > > @@ -1,6 +1,6 @@ > > /* > > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights > reserved. > > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. 
> > * > > @@ -441,6 +441,45 @@ Exit: > > > > /********************************************************************** > > **********************************************************************/ > > +void osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw, > > + IN uint8_t *p_block, IN osm_dr_path_t > *p_path, > > + IN osm_madw_context_t *context) > > +{ > > + ib_api_status_t status; > > + > > + for (; > > + osm_switch_get_lft_block(p_sw, p_sw->lft_block_id_ho, > p_block); > > + p_sw->lft_block_id_ho++) { > > + if (!p_sw->need_update && !p_sm->p_subn->need_update && > > + !memcmp(p_block, > > + p_sw->new_lft + p_sw->lft_block_id_ho * > IB_SMP_DATA_SIZE, > > + IB_SMP_DATA_SIZE)) > > + continue; > > + > > + p_sm->lfts_updated = 1; > > + > > + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, > > + "Writing FT block %u to switch 0x%" PRIx64 "\n", > > + p_sw->lft_block_id_ho, > > + cl_ntoh64(context->lft_context.node_guid)); > > + > > + status = osm_req_set(p_sm, p_path, > > + p_sw->new_lft + > > + p_sw->lft_block_id_ho * > IB_SMP_DATA_SIZE, > > + IB_SMP_DATA_SIZE, > IB_MAD_ATTR_LIN_FWD_TBL, > > + cl_hton32(p_sw->lft_block_id_ho), > > + CL_DISP_MSGID_NONE, context); > > + > > + if (status != IB_SUCCESS) > > + OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 2E11: " > > + "Sending linear fwd. tbl. block failed > (%s)\n", > > + ib_get_err_str(status)); > > + break; > > + } > > +} > > + > > +/********************************************************************** > > + **********************************************************************/ > > static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm, > > IN osm_mgrp_t * p_mgrp) > > { > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > > index ec15f8a..1964b7f 100644 > > --- a/opensm/opensm/osm_subnet.c > > +++ b/opensm/opensm/osm_subnet.c > > @@ -1,6 +1,6 @@ > > /* > > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. 
All rights > reserved. > > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. > > * > > @@ -295,6 +295,7 @@ static const opt_rec_t opt_tbl[] = { > > { "m_key_lease_period", OPT_OFFSET(m_key_lease_period), > opts_parse_net16, NULL, 1 }, > > { "sweep_interval", OPT_OFFSET(sweep_interval), opts_parse_uint32, > NULL, 1 }, > > { "max_wire_smps", OPT_OFFSET(max_wire_smps), opts_parse_uint32, > NULL, 1 }, > > + { "max_smps_per_node", OPT_OFFSET(max_smps_per_node), > opts_parse_uint32, NULL, 1 }, > > { "console", OPT_OFFSET(console), opts_parse_charp, NULL, 0 }, > > { "console_port", OPT_OFFSET(console_port), opts_parse_uint16, > NULL, 0 }, > > { "transaction_timeout", OPT_OFFSET(transaction_timeout), > opts_parse_uint32, NULL, 1 }, > > @@ -671,6 +672,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * > const p_opt) > > p_opt->m_key_lease_period = 0; > > p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS; > > p_opt->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE; > > + p_opt->max_smps_per_node = OSM_DEFAULT_SMP_MAX_PER_NODE; > > p_opt->console = strdup(OSM_DEFAULT_CONSOLE); > > p_opt->console_port = OSM_DEFAULT_CONSOLE_PORT; > > p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC; > > @@ -1461,6 +1463,10 @@ int osm_subn_output_conf(FILE *out, IN > osm_subn_opt_t *const p_opts) > > "max_wire_smps %u\n\n" > > "# The maximum time in [msec] allowed for a transaction to > complete\n" > > "transaction_timeout %u\n\n" > > + "# Maximum number of SMPs per node sent in parallel\n" > > + "# (0 means unlimited)\n" > > + "# Only applies to certain attributes\n" > > + "max_smps_per_node %u\n\n" > > "# Maximal time in [msec] a message can stay in the > incoming message queue.\n" > > "# If there is more than one message in the queue and the > last message\n" > > "# stayed in the queue more than this value, any 
SA request > will be\n" > > @@ -1470,6 +1476,7 @@ int osm_subn_output_conf(FILE *out, IN > osm_subn_opt_t *const p_opts) > > "single_thread %s\n\n", > > p_opts->max_wire_smps, > > p_opts->transaction_timeout, > > + p_opts->max_smps_per_node, > > p_opts->max_msg_fifo_timeout, > > p_opts->single_thread ? "TRUE" : "FALSE"); > > > > diff --git a/opensm/opensm/osm_ucast_cache.c > b/opensm/opensm/osm_ucast_cache.c > > index 216b496..31c930b 100644 > > --- a/opensm/opensm/osm_ucast_cache.c > > +++ b/opensm/opensm/osm_ucast_cache.c > > @@ -1,5 +1,5 @@ > > /* > > - * Copyright (c) 2008 Mellanox Technologies LTD. All rights > reserved. > > + * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights > reserved. > > * > > * This software is available to you under a choice of one of two > > * licenses. You may choose to be licensed under the terms of the GNU > > @@ -1085,9 +1085,11 @@ int osm_ucast_cache_process(osm_ucast_mgr_t * > p_mgr) > > memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO > + 1); > > } > > > > - osm_ucast_mgr_set_fwd_table(p_mgr, p_sw); > > + osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw); > > } > > > > + osm_ucast_pipeline_tbl(p_mgr); > > + > > return 0; > > } > > > > diff --git a/opensm/opensm/osm_ucast_file.c > b/opensm/opensm/osm_ucast_file.c > > index 2505c46..099e8ba 100644 > > --- a/opensm/opensm/osm_ucast_file.c > > +++ b/opensm/opensm/osm_ucast_file.c > > @@ -168,8 +168,8 @@ static int do_ucast_file_load(void *context) > > "routing algorithm\n"); > > } else if (!strncmp(p, "Unicast lids", 12)) { > > if (p_sw) > > - osm_ucast_mgr_set_fwd_table(&p_osm->sm. > > - ucast_mgr, > p_sw); > > + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm. 
> > + ucast_mgr, > p_sw); > > q = strstr(p, " guid 0x"); > > if (!q) { > > OSM_LOG(&p_osm->log, OSM_LOG_ERROR, > > @@ -247,7 +247,7 @@ static int do_ucast_file_load(void *context) > > } > > > > if (p_sw) > > - osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw); > > + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw); > > > > fclose(file); > > return 0; > > I suppose that this breaks 'file' routing engine (did you test it?) - > instead of switch LFTs setup this will only update its TOPs. At this point, I don't recall. > > > > diff --git a/opensm/opensm/osm_ucast_ftree.c > b/opensm/opensm/osm_ucast_ftree.c > > index bde6dbd..d65c685 100644 > > --- a/opensm/opensm/osm_ucast_ftree.c > > +++ b/opensm/opensm/osm_ucast_ftree.c > > @@ -2,7 +2,7 @@ > > * Copyright (c) 2009 Simula Research Laboratory. All rights reserved. > > * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. > > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > > - * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights > reserved. > > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> > * > > * This software is available to you under a choice of one of two > > @@ -1905,8 +1905,8 @@ static void set_sw_fwd_table(IN cl_map_item_t * > const p_map_item, > > ftree_fabric_t *p_ftree = (ftree_fabric_t *) context; > > > > p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid; > > - osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, > > - p_sw->p_osm_sw); > > + osm_ucast_mgr_set_fwd_tbl_top(&p_ftree->p_osm->sm.ucast_mgr, > > + p_sw->p_osm_sw); > > } > > > > /*************************************************** > > @@ -4005,6 +4005,8 @@ static int do_routing(IN void *context) > > /* for each switch, set its fwd table */ > > cl_qmap_apply_func(&p_ftree->sw_tbl, set_sw_fwd_table, (void > *)p_ftree); > > > > + osm_ucast_pipeline_tbl(&p_ftree->p_osm->sm.ucast_mgr); > > + > > /* write out hca ordering file */ > > fabric_dump_hca_ordering(p_ftree); > > > > diff --git a/opensm/opensm/osm_ucast_lash.c > b/opensm/opensm/osm_ucast_lash.c > > index 12b5e34..adf5f6c 100644 > > --- a/opensm/opensm/osm_ucast_lash.c > > +++ b/opensm/opensm/osm_ucast_lash.c > > @@ -1,6 +1,6 @@ > > /* > > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights > reserved. > > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > * Copyright (c) 2007 Simula Research Laboratory. All rights > reserved. > > * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved. 
> > @@ -1045,8 +1045,11 @@ static void populate_fwd_tbls(lash_t * p_lash) > > physical_egress_port); > > } > > } /* for */ > > - osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw); > > + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw); > > } > > + > > + osm_ucast_pipeline_tbl(&p_osm->sm.ucast_mgr); > > + > > OSM_LOG_EXIT(p_log); > > } > > > > diff --git a/opensm/opensm/osm_ucast_mgr.c > b/opensm/opensm/osm_ucast_mgr.c > > index 78a7031..86d1c98 100644 > > --- a/opensm/opensm/osm_ucast_mgr.c > > +++ b/opensm/opensm/osm_ucast_mgr.c > > @@ -1,6 +1,6 @@ > > /* > > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights > reserved. > > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > @@ -315,16 +315,14 @@ Exit: > > > > /********************************************************************** > > **********************************************************************/ > > -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr, > > - IN osm_switch_t * p_sw) > > +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr, > > + IN osm_switch_t * p_sw) > > { > > osm_node_t *p_node; > > osm_dr_path_t *p_path; > > osm_madw_context_t context; > > ib_api_status_t status; > > ib_switch_info_t si; > > - uint16_t block_id_ho = 0; > > - uint8_t block[IB_SMP_DATA_SIZE]; > > boolean_t set_swinfo_require = FALSE; > > uint16_t lin_top; > > uint8_t life_state; > > @@ -382,48 +380,8 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * > p_mgr, > > ib_get_err_str(status)); > > } > > > > - /* > > - Send linear forwarding table blocks to the switch > > - as long as the switch indicates it has blocks needing > > - configuration. 
> > - */ > > - > > - context.lft_context.node_guid = osm_node_get_node_guid(p_node); > > - context.lft_context.set_method = TRUE; > > - > > - if (!p_sw->new_lft) { > > - /* any routing should provide the new_lft */ > > - CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && > > - p_mgr->cache_valid && !p_sw->need_update); > > - goto Exit; > > - } > > - > > - for (block_id_ho = 0; > > - osm_switch_get_lft_block(p_sw, block_id_ho, block); > > - block_id_ho++) { > > - if (!p_sw->need_update && !p_mgr->p_subn->need_update && > > - !memcmp(block, > > - p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, > > - IB_SMP_DATA_SIZE)) > > - continue; > > - > > - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > > - "Writing FT block %u\n", block_id_ho); > > - > > - status = osm_req_set(p_mgr->sm, p_path, > > - p_sw->new_lft + > > - block_id_ho * IB_SMP_DATA_SIZE, > > - sizeof(block), > IB_MAD_ATTR_LIN_FWD_TBL, > > - cl_hton32(block_id_ho), > CL_DISP_MSGID_NONE, > > - &context); > > + p_sw->lft_block_id_ho = 0; > > > > - if (status != IB_SUCCESS) > > - OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: " > > - "Sending linear fwd. tbl. 
block failed > (%s)\n", > > - ib_get_err_str(status)); > > - } > > - > > -Exit: > > OSM_LOG_EXIT(p_mgr->p_log); > > return 0; > > } > > @@ -508,7 +466,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * > p_map_item, > > } > > } > > > > - osm_ucast_mgr_set_fwd_table(p_mgr, p_sw); > > + osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw); > > > > if (p_mgr->p_subn->opt.lmc) > > free_ports_priv(p_mgr); > > @@ -516,6 +474,47 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * > p_map_item, > > OSM_LOG_EXIT(p_mgr->p_log); > > } > > > > +static void ucast_mgr_pipeline_tbl(IN osm_switch_t *p_sw, > > + IN osm_ucast_mgr_t *p_mgr) > > +{ > > + osm_dr_path_t *p_path; > > + osm_madw_context_t mad_context; > > + uint8_t block[IB_SMP_DATA_SIZE]; > > + > > + OSM_LOG_ENTER(p_mgr->p_log); > > + > > + CL_ASSERT(p_sw && p_sw->p_node); > > + > > + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > > + "Processing switch 0x%" PRIx64 "\n", > > + cl_ntoh64(osm_node_get_node_guid(p_sw->p_node))); > > + > > + /* > > + Send linear forwarding table blocks to the switch > > + as long as the switch indicates it has blocks needing > > + configuration. 
> > + */ > > + if (!p_sw->new_lft) { > > + /* any routing should provide the new_lft */ > > + CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && > > + p_mgr->cache_valid && !p_sw->need_update); > > + goto Exit; > > + } > > + > > + p_path = > osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0)); > > + > > + mad_context.lft_context.node_guid = > osm_node_get_node_guid(p_sw->p_node); > > + mad_context.lft_context.set_method = TRUE; > > + > > + osm_sm_set_next_lft_block(p_mgr->sm, p_sw, &block[0], p_path, > > + &mad_context); > > + > > + p_sw->lft_block_id_ho++; > > + > > +Exit: > > + OSM_LOG_EXIT(p_mgr->p_log); > > +} > > + > > /********************************************************************** > > **********************************************************************/ > > static void ucast_mgr_process_neighbors(IN cl_map_item_t * p_map_item, > > @@ -870,6 +869,28 @@ static void > sort_ports_by_switch_load(osm_ucast_mgr_t * m) > > add_sw_endports_to_order_list(s[i], m); > > } > > > > +void osm_ucast_pipeline_tbl(osm_ucast_mgr_t * p_mgr) > > +{ > > + cl_qmap_t *p_sw_tbl; > > + osm_switch_t *p_sw; > > + int i; > > + > > + for (i = 0; > > + !p_mgr->p_subn->opt.max_smps_per_node || > > + i < p_mgr->p_subn->opt.max_smps_per_node; > > + i++) { > > + p_mgr->sm->lfts_updated = 0; > > + p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl; > > + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); > > + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { > > + ucast_mgr_pipeline_tbl(p_sw, p_mgr); > > + p_sw = (osm_switch_t *) > cl_qmap_next(&p_sw->map_item); > > + } > > + if (!p_mgr->sm->lfts_updated) > > + break; > > + } > > +} > > Is it possible (for example in case of send errors) that "partial" LFT > blocks sending will trigger wait_for_pending_transaction() completion? I don't know. Is this different from the original algorithm in the case of send errors ? 
-- Hal > > > Sasha > > > + > > static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr) > > { > > cl_qlist_init(&p_mgr->port_order_list); > > @@ -904,6 +925,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * > p_mgr) > > cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, > ucast_mgr_process_tbl, > > p_mgr); > > > > + osm_ucast_pipeline_tbl(p_mgr); > > + > > cl_qlist_remove_all(&p_mgr->port_order_list); > > > > return 0; > > > From ofedrnicuser at yahoo.com Tue Aug 4 10:46:07 2009 From: ofedrnicuser at yahoo.com (Bill N) Date: Tue, 4 Aug 2009 10:46:07 -0700 (PDT) Subject: [ofa-general] perftest for Chelsio RNIC adapters In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA383035F7A95D5@azsmsx501.amr.corp.intel.com> Message-ID: <641544.10718.qm@web111213.mail.gq1.yahoo.com> yes. I am able to run them. Thanks a lot. Bill --- On Tue, 8/4/09, Tung, Chien Tin wrote: > From: Tung, Chien Tin > Subject: RE: [ofa-general] perftest for Chelsio RNIC adapters > To: "Bill N" , "OFED General" > Date: Tuesday, August 4, 2009, 2:25 PM > > >Is performance tests of the perftest-1.2 supported for > Chelsio > >and other RNIC adapters? > > You can run ib_rdma_bw and ib_rdma_lat over iWarp adapters > with -c flag (use RDMA CM).
> > Chien From bart.vanassche at gmail.com Tue Aug 4 11:25:35 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Tue, 4 Aug 2009 20:25:35 +0200 Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed In-Reply-To: References: Message-ID: On Tue, Aug 4, 2009 at 6:27 PM, Roland Dreier wrote: > >  > An update: apparently it is possible to trigger scmnd->device == NULL even >  > without triggering a prior IB CM disconnect. The following shell commands >  > are sufficient to trigger the WARN_ON statement in the patch below: > >  > rmmod ib_srp >  > modprobe ib_srp >  > ibsrpdm -c | while read target_info; do echo "${target_info}"; echo >  > "${target_info}" >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target; >  > done >  > sg_reset -d ${srp_device} > > So in other words, just sg_reset on an SRP device triggers the warning? By the way, Vladislav Bolkhovitin was so kind to inform me that this issue is not specific to the SRP initiator. For more information, see also http://thread.gmane.org/gmane.linux.scsi/26166. Bart. From eli at dev.mellanox.co.il Tue Aug 4 12:41:25 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Tue, 4 Aug 2009 22:41:25 +0300 Subject: [ofa-general] Re: [PATCH] cma: fix access to freed memory In-Reply-To: References: <20090803092528.GA25528@mtls03> <20090804033221.GA30949@mtls03> Message-ID: <20090804194125.GA29370@mtls03> On Tue, Aug 04, 2009 at 09:05:13AM -0700, Roland Dreier wrote: > > > Maybe it's just a loose connection but yet, it seems to me that > > operations on id_priv->mc_list should be protected. Should I send a > > different patch? > > "seems ... should be" is very weak justification for locking. What > should they be protected from? > What if rdma_join_multicast() is called when rdma_destroy_id() - for example from cma_ib_handler() due to error returned from the handler? 
In this case list_add(&mc->list, &id_priv->mc_list) in rdma_join_multicast() may be executed concurrently with the list manipulation done in cma_leave_mc_groups(). Generally, it looks strange that in some places list handling is protected with a spinlock and in other places not. From sashak at voltaire.com Tue Aug 4 13:15:05 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Aug 2009 23:15:05 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> Message-ID: <20090804201505.GI7993@me> On 12:45 Tue 04 Aug , Hal Rosenstock wrote: > > > > > This patch also introduces a new config option (max_smps_per_node) > > > which indicates how deep the per node pipeline is (current default is 4). > > > This also has the effect of limiting the number of times that the switch > > > list is traversed. Maybe this embellishment is unnecessary. > > > > Then why is it needed? > > > Also, as was discussed in the thread on this, it gives a way to control > possible VL15 overflow. VL15 overflow is controlled by max_wire_smps, not by max_smps_per_node. > > I don't really like such separation (osm_ucast_mgr_set_fwd_tbl_top and > > osm_ucast_pipeline_tbl). > > > Why not ? What's the matter with doing this ? To avoid exposing this (LFTs setup) algorithm to routing engines, and to eliminate duplicated function calls. > > Why to not use a single function and update all > > routing engines appropriately (you need to do it anyway), so that this > > will only fill up new_lfts table? > > > I'm not following what you're describing. set_fwd_tbl_top sets LinearFDBTop > whereas pipeline_tbl starts the cascade of LFT sets based on > max_smps_per_node. You can set up the new_lfts arrays in the routing engines and, at the end of the cycle, call a single osm_*setup*_lfts() which will do everything: set up the TOPs and start the LFT block updates.
> > > + > > > + p_path = > > osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0)); > > > + > > > + mad_context.lft_context.node_guid = node_guid; > > > + mad_context.lft_context.set_method = TRUE; > > > + > > > + osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path, > > > + &mad_context); > > > + > > > + p_sw->lft_block_id_ho++; > > > > Wouldn't it be simpler to encode block_id in a mad context? > > > Why simpler ? I think it complicates the receiver code to do that (assuming > max_smps_per_node remains). Ok. > > I suppose that this breaks 'file' routing engine (did you test it?) - > > instead of switch LFTs setup this will only update its TOPs. > > At this point, I don't recall. You removed osm_ucast_mgr_set_fwd_table() calls and placed osm_ucast_mgr_set_fwd_tbl_top() instead - obviously nothing will run an actual LFT blocks setup. > > Is it possible (for example in case of send errors) that "partial" LFT > > blocks sending will trigger wait_for_pending_transaction() completion? > > > I don't know. Is this different from the original algorithm in the case of > send errors ? Yes, it is different - unlike the original code it leaves ucast mgr (and go to wait in wait_for_pending()) before all required LFT blocks update requests were sent. Sasha From hal.rosenstock at gmail.com Tue Aug 4 13:44:06 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 4 Aug 2009 16:44:06 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: <20090804201505.GI7993@me> References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <20090804201505.GI7993@me> Message-ID: On Tue, Aug 4, 2009 at 4:15 PM, Sasha Khapyorsky wrote: > On 12:45 Tue 04 Aug , Hal Rosenstock wrote: > > > > > > > This patch also introduces a new config option (max_smps_per_node) > > > > which indicates how deep the per node pipeline is (current default is > 4). 
> > > > This also has the effect of limiting the number of times that the > switch > > > > list is traversed. Maybe this embellishment is unnecessary. > > > > > > Then why is it needed? > > > > > > Also, as was discussed in the thread on this, it gives a way to control > > possible VL15 overflow. > > VL15 overflow is controlled by max_wire_smps not by max_smps_per_node. It's a different control on VL15 overflow. It can easily be eliminated if that's what you want. There's actually some minor simplification with doing this. > > > > > I don't really like such separation (osm_ucast_mgr_set_fwd_tbl_top and > > > osm_ucast_pipeline_tbl). > > > > > > Why not ? What's the matter with doing this ? > > To not expose this (LFTs setup) algorithm to routing engines. And to > eliminate duplicated function calls. > > > > Why to not use a single function and update all > > > routing engines appropriately (you need to do it anyway), so that this > > > will only fill up new_lfts table? > > > > > > I'm not following what you're describing. set_fwd_tbl_top sets > LinearFDBTop > > whereas pipeline_tbl starts the cascade of LFT sets based on > > max_smps_per_node. > > You can setup new_lfts arrays in routing engines and at the end of cycle > call single osm_*setup*_lfts() which will do everything - setup TOPs and > start to run LFT blocks update. > > > > > + > > > > + p_path = > > > osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0)); > > > > + > > > > + mad_context.lft_context.node_guid = node_guid; > > > > + mad_context.lft_context.set_method = TRUE; > > > > + > > > > + osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path, > > > > + &mad_context); > > > > + > > > > + p_sw->lft_block_id_ho++; > > > > > > Wouldn't it be simpler to encode block_id in a mad context? > > > > > > Why simpler ? I think it complicates the receiver code to do that > (assuming > > max_smps_per_node remains). > > Ok. > > > > I suppose that this breaks 'file' routing engine (did you test it?) 
- > > > instead of switch LFTs setup this will only update its TOPs. > > > > At this point, I don't recall. > > You removed osm_ucast_mgr_set_fwd_table() calls and placed > osm_ucast_mgr_set_fwd_tbl_top() instead - obviously nothing will run an > actual LFT blocks setup. > > > > Is it possible (for example in case of send errors) that "partial" LFT > > > blocks sending will trigger wait_for_pending_transaction() completion? > > > > > > I don't know. Is this different from the original algorithm in the case > of > > send errors ? > > Yes, it is different - unlike the original code it leaves ucast mgr (and > go to wait in wait_for_pending()) before all required LFT blocks update > requests were sent. This goes away if there is no max_smps_per_node support. So do you want to also preserve the original behavior/algorithm or you have no preference ? -- Hal > > > Sasha > From sashak at voltaire.com Tue Aug 4 13:59:50 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Aug 2009 23:59:50 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <20090804201505.GI7993@me> Message-ID: <20090804205950.GJ7993@me> On 16:44 Tue 04 Aug , Hal Rosenstock wrote: > > So do you want to also preserve the original behavior/algorithm or you have > no preference ? No need unless there is a reason for doing this.
Sasha From rdreier at cisco.com Tue Aug 4 14:39:01 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Aug 2009 14:39:01 -0700 Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed In-Reply-To: (Bart Van Assche's message of "Tue, 4 Aug 2009 20:25:35 +0200") References: Message-ID: > By the way, Vladislav Bolkhovitin was so kind to inform me that this > issue is not specific to the SRP initiator. For more information, see > also http://thread.gmane.org/gmane.linux.scsi/26166. I'm not sure I follow this exactly -- the idea is that sg_reset generates SCSI commands that are somehow different? What does the LLD have to do to handle them? Is the problem that we get a command with bogus host_scribble (since SRP never saw it before) and so srp_find_req() gets confused? - R. From hnrose at comcast.net Tue Aug 4 14:39:05 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 4 Aug 2009 17:39:05 -0400 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash.c: Directly call calloc/free rather than create/delete_cdg Message-ID: <20090804213905.GA23497@comcast.net> Reduce call stack by one call level Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 6210477..168a758 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -62,20 +62,6 @@ typedef struct _reachable_dest { struct _reachable_dest *next; } reachable_dest_t; -static cdg_vertex_t *create_cdg_vertex(unsigned num_switches) -{ - cdg_vertex_t *v; - - v = calloc(1, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0])); - - return v; -} - -static void delete_cdg_vertex(cdg_vertex_t *v) -{ - free(v); -} - static void connect_switches(lash_t * p_lash, int sw1, int sw2, int phy_port_1) { osm_log_t *p_log = &p_lash->p_osm->log; @@ -207,7 +193,7 @@ static void remove_semipermanent_depend_for_sp(lash_t * p_lash, int sw, 
cdg_vertex_matrix[lane][sw][i_next_switch] = NULL; - delete_cdg_vertex(v); + free(v); } else { v->num_using_vertex--; if (i_next_switch != dest_switch) { @@ -352,7 +338,7 @@ static void generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch, while (sw != dest_switch) { if (cdg_vertex_matrix[lane][sw][next_switch] == NULL) { - v = create_cdg_vertex(num_switches); + v = calloc(1, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0])); v->from = sw; v->to = next_switch; v->temp = 1; @@ -442,7 +428,7 @@ static void remove_temp_depend_for_sp(lash_t * p_lash, int sw, int dest_switch, if (v->temp == 1) { cdg_vertex_matrix[lane][sw][next_switch] = NULL; - delete_cdg_vertex(v); + free(v); } else { CL_ASSERT(v->num_temp_depend <= v->num_deps); v->num_deps = v->num_deps - v->num_temp_depend; @@ -684,7 +670,7 @@ static void free_lash_structures(lash_t * p_lash) for (j = 0; j < num_switches; j++) { for (k = 0; k < num_switches; k++) if (p_lash->cdg_vertex_matrix[i][j][k]) - delete_cdg_vertex(p_lash->cdg_vertex_matrix[i][j][k]); + free(p_lash->cdg_vertex_matrix[i][j][k]); if (p_lash->cdg_vertex_matrix[i][j]) free(p_lash->cdg_vertex_matrix[i][j]); } From hnrose at comcast.net Tue Aug 4 14:44:13 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 4 Aug 2009 17:44:13 -0400 Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_lin_fwd_rcv.c: Commentary change Message-ID: <20090804214413.GA24878@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c index 2edb8d3..ae40b0d 100644 --- a/opensm/opensm/osm_lin_fwd_rcv.c +++ b/opensm/opensm/osm_lin_fwd_rcv.c @@ -36,7 +36,7 @@ /* * Abstract: * Implementation of osm_lft_rcv_t. - * This object represents the NodeDescription Receiver object. + * This object represents the Linear Forwarding Table Receiver object. * This object is part of the opensm family of objects. 
*/ From arlin.r.davis at intel.com Tue Aug 4 22:32:03 2009 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 4 Aug 2009 22:32:03 -0700 Subject: [ofa-general] [PATCH] uDAPL v2: CNO pre-triggered events not delivered during cno_wait Message-ID: <53ED9F5E1BB14E13BDA0BE594E3F43B6@amr.corp.intel.com> CNO events, once triggered will not be returned during the cno wait. Check for triggered state before going to sleep in cno_wait. Reset triggered EVD reference after reporting. diff --git a/dapl/udapl/dapl_cno_wait.c b/dapl/udapl/dapl_cno_wait.c index e89317d..6bbd249 100644 --- a/dapl/udapl/dapl_cno_wait.c +++ b/dapl/udapl/dapl_cno_wait.c @@ -82,6 +82,14 @@ DAT_RETURN DAT_API dapl_cno_wait(IN DAT_CNO_HANDLE cno_handle, /* cno_handle */ } dapl_os_lock(&cno_ptr->header.lock); + if (cno_ptr->cno_state == DAPL_CNO_STATE_TRIGGERED) { + cno_ptr->cno_state = DAPL_CNO_STATE_UNTRIGGERED; + *evd_handle = cno_ptr->cno_evd_triggered; + cno_ptr->cno_evd_triggered = NULL; + dapl_os_unlock(&cno_ptr->header.lock); + goto bail; + } + while (cno_ptr->cno_state == DAPL_CNO_STATE_UNTRIGGERED && DAT_GET_TYPE(dat_status) != DAT_TIMEOUT_EXPIRED) { cno_ptr->cno_waiters++; @@ -104,6 +112,7 @@ DAT_RETURN DAT_API dapl_cno_wait(IN DAT_CNO_HANDLE cno_handle, /* cno_handle */ dapl_os_assert(cno_ptr->cno_state == DAPL_CNO_STATE_TRIGGERED); cno_ptr->cno_state = DAPL_CNO_STATE_UNTRIGGERED; *evd_handle = cno_ptr->cno_evd_triggered; + cno_ptr->cno_evd_triggered = NULL; } else if (DAT_GET_TYPE(dat_status) == DAT_TIMEOUT_EXPIRED) { cno_ptr->cno_state = DAPL_CNO_STATE_UNTRIGGERED; *evd_handle = NULL; From arlin.r.davis at intel.com Tue Aug 4 22:32:06 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Tue, 4 Aug 2009 22:32:06 -0700 Subject: [ofa-general] [PATCH] uDAPL v2: fix dtest to handle CNO events properly Message-ID: modify dtest.c to cleanup CNO wait code and consolidate into collect_event() call. After waking up from CNO wait the consumer must check all EVD's. 
The EVD's under the CNO could be dropped if already triggered or could come in any order. DT_RetToString changed to DT_RetToStr and DT_EventToSTr changed to DT_EventToStr for consistency. diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 739ccca..d868490 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -104,6 +104,7 @@ /* definitions */ #define SERVER_CONN_QUAL 45248 #define DTO_TIMEOUT (1000*1000*5) +#define CNO_TIMEOUT (1000*1000*1) #define DTO_FLUSH_TIMEOUT (1000*1000*2) #define CONN_TIMEOUT (1000*1000*10) #define SERVER_TIMEOUT DAT_TIMEOUT_INFINITE @@ -208,8 +209,8 @@ static int burst_msg_posted = 0; static int burst_msg_index = 0; /* forward prototypes */ -const char *DT_RetToString(DAT_RETURN ret_value); -const char *DT_EventToSTr(DAT_EVENT_NUMBER event_code); +const char *DT_RetToStr(DAT_RETURN ret_value); +const char *DT_EventToStr(DAT_EVENT_NUMBER event_code); void print_usage(void); double get_time(void); void init_data(void); @@ -262,6 +263,51 @@ void flush_evds(void) } } + +static inline DAT_RETURN +collect_event(DAT_EVD_HANDLE dto_evd, + DAT_EVENT *event, + DAT_TIMEOUT timeout, + int *counter) +{ + DAT_EVD_HANDLE evd = DAT_HANDLE_NULL; + DAT_COUNT nmore; + DAT_RETURN ret = DAT_SUCCESS; + + if (use_cno) { +retry: + /* CNO wait could return EVD's in any order and + * may drop some EVD notification's if already + * triggered. Once woken, simply dequeue the + * Evd the caller wants to collect and return. + * If notification without EVD, retry. 
+ */ + ret = dat_cno_wait(h_dto_cno, CNO_TIMEOUT, &evd); + if (dat_evd_dequeue(dto_evd, event) != DAT_SUCCESS) { + if (ret == DAT_SUCCESS) + printf(" WARNING: CNO notification:" + " without EVD?\n"); + goto retry; + } + ret = DAT_SUCCESS; /* cno timed out, but EVD dequeued */ + + } else if (!polling) { + + /* use wait to dequeue */ + ret = dat_evd_wait(dto_evd, timeout, 1, event, &nmore); + if (ret != DAT_SUCCESS) + fprintf(stderr, + "Error waiting on h_dto_evd %p: %s\n", + dto_evd, DT_RetToStr(ret)); + + } else { + while (dat_evd_dequeue(dto_evd, event) == DAT_QUEUE_EMPTY) + if (counter) + (*counter)++; + } + return (ret); +} + int main(int argc, char **argv) { int i, c; @@ -355,7 +401,7 @@ int main(int argc, char **argv) time.open += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d: Error Adaptor open: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); exit(1); } else LOGPRINTF("%d Opened Interface Adaptor\n", getpid()); @@ -368,7 +414,7 @@ int main(int argc, char **argv) time.pzc += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error creating Protection Zone: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); exit(1); } else LOGPRINTF("%d Created Protection Zone\n", getpid()); @@ -378,7 +424,7 @@ int main(int argc, char **argv) ret = register_rdma_memory(); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error registering RDMA memory: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); goto cleanup; } else LOGPRINTF("%d Register RDMA memory done\n", getpid()); @@ -387,7 +433,7 @@ int main(int argc, char **argv) ret = create_events(); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error creating events: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); goto cleanup; } else { LOGPRINTF("%d Create events done\n", getpid()); @@ -419,7 +465,7 @@ int main(int argc, char **argv) time.total += time.epc; if (ret != DAT_SUCCESS) { 
fprintf(stderr, "%d Error dat_ep_create: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); goto cleanup; } else LOGPRINTF("%d EP created %p \n", getpid(), h_ep); @@ -431,7 +477,7 @@ int main(int argc, char **argv) ret = connect_ep(hostname, SERVER_CONN_QUAL); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error connect_ep: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); goto cleanup; } else LOGPRINTF("%d connect_ep complete\n", getpid()); @@ -440,7 +486,7 @@ int main(int argc, char **argv) ret = dat_ep_query(h_ep, DAT_EP_FIELD_ALL, &ep_param); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_ep_query: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); goto cleanup; } else LOGPRINTF("%d EP queried %p \n", getpid(), h_ep); @@ -483,7 +529,7 @@ int main(int argc, char **argv) ret = do_rdma_write_with_msg(); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error do_rdma_write_with_msg: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); goto cleanup; } else LOGPRINTF("%d do_rdma_write_with_msg complete\n", getpid()); @@ -492,7 +538,7 @@ int main(int argc, char **argv) ret = do_rdma_read_with_msg(); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error do_rdma_read_with_msg: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); goto cleanup; } else LOGPRINTF("%d do_rdma_read_with_msg complete\n", getpid()); @@ -501,7 +547,7 @@ int main(int argc, char **argv) ret = do_ping_pong_msg(); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error do_ping_pong_msg: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); goto cleanup; } else { LOGPRINTF("%d do_ping_pong_msg complete\n", getpid()); @@ -528,7 +574,7 @@ complete: time.total += time.epf; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing EP: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); } else { LOGPRINTF("%d Freed EP\n", getpid()); h_ep = DAT_HANDLE_NULL; @@ -540,7 
+586,7 @@ complete: ret = destroy_events(); if (ret != DAT_SUCCESS) fprintf(stderr, "%d Error destroy_events: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); else LOGPRINTF("%d destroy events done\n", getpid()); @@ -548,7 +594,7 @@ complete: LOGPRINTF("%d unregister_rdma_memory \n", getpid()); if (ret != DAT_SUCCESS) fprintf(stderr, "%d Error unregister_rdma_memory: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); else LOGPRINTF("%d unregister_rdma_memory done\n", getpid()); @@ -560,7 +606,7 @@ complete: time.pzf += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing PZ: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); } else { LOGPRINTF("%d Freed pz\n", getpid()); h_pz = NULL; @@ -574,7 +620,7 @@ complete: time.close += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d: Error Adaptor close: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); } else LOGPRINTF("%d Closed Interface Adaptor\n", getpid()); @@ -652,7 +698,6 @@ send_msg(void *data, { DAT_LMR_TRIPLET iov; DAT_EVENT event; - DAT_COUNT nmore; DAT_RETURN ret; iov.lmr_context = context; @@ -669,47 +714,23 @@ send_msg(void *data, if (ret != DAT_SUCCESS) { fprintf(stderr, "%d: ERROR: dat_ep_post_send() %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return ret; } if (!(flags & DAT_COMPLETION_SUPPRESS_FLAG)) { - if (polling) { - printf("%d Polling post send completion...\n", - getpid()); - while (dat_evd_dequeue(h_dto_req_evd, &event) == - DAT_QUEUE_EMPTY) ; - } else { - LOGPRINTF("%d waiting for post_send completion event\n", - getpid()); - if (use_cno) { - DAT_EVD_HANDLE evd = DAT_HANDLE_NULL; - ret = - dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd); - LOGPRINTF("%d cno wait return evd_handle=%p\n", - getpid(), evd); - if (evd != h_dto_req_evd) { - /* CNO timeout, already on EVD */ - if (evd != NULL) - return (ret); - } - } - /* use wait to 
dequeue */ - ret = - dat_evd_wait(h_dto_req_evd, DTO_TIMEOUT, 1, &event, - &nmore); - if (ret != DAT_SUCCESS) { - fprintf(stderr, - "%d: ERROR: DTO dat_evd_wait() %s\n", - getpid(), DT_RetToString(ret)); - return ret; - } - } + + if (collect_event(h_dto_req_evd, + &event, + DTO_TIMEOUT, + &poll_count) != DAT_SUCCESS) + return (DAT_ABORT); /* validate event number, len, cookie, and status */ if (event.event_number != DAT_DTO_COMPLETION_EVENT) { fprintf(stderr, "%d: ERROR: DTO event number %s\n", - getpid(), DT_EventToSTr(event.event_number)); + getpid(), + DT_EventToStr(event.event_number)); return (DAT_ABORT); } @@ -730,7 +751,7 @@ send_msg(void *data, if (event.event_data.dto_completion_event_data.status != DAT_SUCCESS) { fprintf(stderr, "%d: ERROR: DTO event status %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (DAT_ABORT); } } @@ -772,7 +793,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error registering send msg buffer: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else LOGPRINTF("%d Registered send Message Buffer %p \n", @@ -796,7 +817,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) &registered_addr_recv_msg); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error registering recv msg buffer: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else LOGPRINTF("%d Registered Receive Message Buffer %p\n", @@ -823,7 +844,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error registering recv msg buffer: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else LOGPRINTF("%d Registered Receive Message Buffer %p\n", @@ -846,7 +867,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) h_cr_evd, DAT_PSP_CONSUMER_FLAG, &h_psp); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d
Error dat_psp_create: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else LOGPRINTF("%d dat_psp_created for server listen\n", @@ -858,7 +879,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) ret = dat_evd_wait(h_cr_evd, SERVER_TIMEOUT, 1, &event, &nmore); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_evd_wait: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else LOGPRINTF("%d dat_evd_wait for cr_evd completed\n", @@ -866,7 +887,8 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) if (event.event_number != DAT_CONNECTION_REQUEST_EVENT) { fprintf(stderr, "%d Error unexpected cr event : %s\n", - getpid(), DT_EventToSTr(event.event_number)); + getpid(), + DT_EventToStr(event.event_number)); return (DAT_ABORT); } if ((event.event_data.cr_arrival_event_data.conn_qual != @@ -874,7 +896,8 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) || (event.event_data.cr_arrival_event_data.sp_handle. 
psp_handle != h_psp)) { fprintf(stderr, "%d Error wrong cr event data : %s\n", - getpid(), DT_EventToSTr(event.event_number)); + getpid(), + DT_EventToStr(event.event_number)); return (DAT_ABORT); } @@ -922,7 +945,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_cr_accept: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else LOGPRINTF("%d dat_cr_accept completed\n", getpid()); @@ -966,7 +989,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) 0, DAT_CONNECT_DEFAULT_FLAG); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_ep_connect: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else LOGPRINTF("%d dat_ep_connect completed\n", getpid()); @@ -990,7 +1013,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) #ifdef TEST_REJECT_WITH_PRIVATE_DATA if (event.event_number != DAT_CONNECTION_EVENT_PEER_REJECTED) { fprintf(stderr, "%d expected conn reject event : %s\n", - getpid(), DT_EventToSTr(event.event_number)); + getpid(), DT_EventToStr(event.event_number)); return (DAT_ABORT); } /* get the reject private data and validate */ @@ -1013,7 +1036,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) if (event.event_number != DAT_CONNECTION_EVENT_ESTABLISHED) { fprintf(stderr, "%d Error unexpected conn event : 0x%x %s\n", getpid(), event.event_number, - DT_EventToSTr(event.event_number)); + DT_EventToStr(event.event_number)); return (DAT_ABORT); } @@ -1064,7 +1087,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error send_msg: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else LOGPRINTF("%d send_msg completed\n", getpid()); @@ -1072,42 +1095,17 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) /* * Wait for remote RMR information for RDMA */ - if (polling) { - 
printf("%d Polling for remote to send RMR data\n", getpid()); - while (dat_evd_dequeue(h_dto_rcv_evd, &event) == - DAT_QUEUE_EMPTY) ; - } else { - printf("%d Waiting for remote to send RMR data\n", getpid()); - if (use_cno) { - DAT_EVD_HANDLE evd = DAT_HANDLE_NULL; - ret = dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd); - LOGPRINTF("%d cno wait return evd_handle=%p\n", - getpid(), evd); - if (evd != h_dto_rcv_evd) { - /* CNO timeout, already on EVD */ - if (evd != NULL) - return (ret); - } - } - /* use wait to dequeue */ - ret = - dat_evd_wait(h_dto_rcv_evd, DTO_TIMEOUT, 1, &event, &nmore); - if (ret != DAT_SUCCESS) { - fprintf(stderr, - "%d Error waiting on h_dto_rcv_evd: %s\n", - getpid(), DT_RetToString(ret)); - return (ret); - } else { - LOGPRINTF("%d dat_evd_wait h_dto_rcv_evd completed\n", - getpid()); - } - } - + if (collect_event(h_dto_rcv_evd, + &event, + DTO_TIMEOUT, + &poll_count) != DAT_SUCCESS) + return (DAT_ABORT); + printf("%d remote RMR data arrived!\n", getpid()); if (event.event_number != DAT_DTO_COMPLETION_EVENT) { fprintf(stderr, "%d Error unexpected DTO event : %s\n", - getpid(), DT_EventToSTr(event.event_number)); + getpid(), DT_EventToStr(event.event_number)); return (DAT_ABORT); } if ((event.event_data.dto_completion_event_data.transfered_length != @@ -1162,7 +1160,7 @@ void disconnect_ep(void) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_ep_disconnect: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); } else { LOGPRINTF("%d dat_ep_disconnect completed\n", getpid()); @@ -1177,7 +1175,7 @@ void disconnect_ep(void) &nmore); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_evd_wait: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); } else { LOGPRINTF("%d dat_evd_wait for h_conn_evd completed\n", getpid()); @@ -1189,7 +1187,7 @@ void disconnect_ep(void) ret = dat_psp_free(h_psp); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_psp_free: %s\n", - getpid(), DT_RetToString(ret)); + 
getpid(), DT_RetToStr(ret)); } else { LOGPRINTF("%d dat_psp_free completed\n", getpid()); } @@ -1203,7 +1201,7 @@ void disconnect_ep(void) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error deregistering send msg mr: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); } else { LOGPRINTF("%d Unregistered send message Buffer\n", getpid()); @@ -1219,7 +1217,7 @@ void disconnect_ep(void) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error deregistering recv msg mr: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); } else { LOGPRINTF("%d Unregistered recv message Buffer\n", getpid()); @@ -1232,7 +1230,6 @@ void disconnect_ep(void) DAT_RETURN do_rdma_write_with_msg(void) { DAT_EVENT event; - DAT_COUNT nmore; DAT_LMR_TRIPLET l_iov[MSG_IOV_COUNT]; DAT_RMR_TRIPLET r_iov; DAT_DTO_COOKIE cookie; @@ -1277,7 +1274,7 @@ DAT_RETURN do_rdma_write_with_msg(void) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d: ERROR: dat_ep_post_rdma_write() %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (DAT_ABORT); } LOGPRINTF("%d rdma_write # %d completed\n", getpid(), i + 1); @@ -1296,41 +1293,19 @@ DAT_RETURN do_rdma_write_with_msg(void) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error send_msg: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d send_msg completed\n", getpid()); } - /* - * Collect first event, write completion or the inbound recv - */ - if (polling) { - while (dat_evd_dequeue(h_dto_rcv_evd, &event) == - DAT_QUEUE_EMPTY) - rdma_wr_poll_count++; - } else { - LOGPRINTF("%d waiting for message receive event\n", getpid()); - if (use_cno) { - DAT_EVD_HANDLE evd = DAT_HANDLE_NULL; - ret = dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd); - LOGPRINTF("%d cno wait return evd_handle=%p\n", - getpid(), evd); - if (evd != h_dto_rcv_evd) { - /* CNO timeout, already on EVD */ - if (evd != NULL) - return (ret); - } - } - /* use wait to dequeue */ - ret = - 
dat_evd_wait(h_dto_rcv_evd, DTO_TIMEOUT, 1, &event, &nmore); - if (ret != DAT_SUCCESS) { - fprintf(stderr, "%d: ERROR: DTO dat_evd_wait() %s\n", - getpid(), DT_RetToString(ret)); - return (ret); - } - } + /* inbound recv event, send completion's suppressed */ + if (collect_event(h_dto_rcv_evd, + &event, + DTO_TIMEOUT, + &rdma_wr_poll_count) != DAT_SUCCESS) + return (DAT_ABORT); + stop = get_time(); time.rdma_wr = ((stop - start) * 1.0e6); @@ -1338,7 +1313,7 @@ DAT_RETURN do_rdma_write_with_msg(void) printf("%d inbound rdma_write; send message arrived!\n", getpid()); if (event.event_number != DAT_DTO_COMPLETION_EVENT) { fprintf(stderr, "%d Error unexpected DTO event : %s\n", - getpid(), DT_EventToSTr(event.event_number)); + getpid(), DT_EventToStr(event.event_number)); return (DAT_ABORT); } @@ -1386,7 +1361,6 @@ DAT_RETURN do_rdma_write_with_msg(void) DAT_RETURN do_rdma_read_with_msg(void) { DAT_EVENT event; - DAT_COUNT nmore; DAT_LMR_TRIPLET l_iov; DAT_RMR_TRIPLET r_iov; DAT_DTO_COOKIE cookie; @@ -1425,44 +1399,21 @@ DAT_RETURN do_rdma_read_with_msg(void) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d: ERROR: dat_ep_post_rdma_read() %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (DAT_ABORT); } - if (polling) { - while (dat_evd_dequeue(h_dto_req_evd, &event) == - DAT_QUEUE_EMPTY) - rdma_rd_poll_count[i]++; - } else { - LOGPRINTF("%d waiting for rdma_read completion event\n", - getpid()); - if (use_cno) { - DAT_EVD_HANDLE evd = DAT_HANDLE_NULL; - ret = - dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd); - LOGPRINTF("%d cno wait return evd_handle=%p\n", - getpid(), evd); - if (evd != h_dto_req_evd) { - /* CNO timeout, already on EVD */ - if (evd != NULL) - return (ret); - } - } - /* use wait to dequeue */ - ret = - dat_evd_wait(h_dto_req_evd, DTO_TIMEOUT, 1, &event, - &nmore); - if (ret != DAT_SUCCESS) { - fprintf(stderr, - "%d: ERROR: DTO dat_evd_wait() %s\n", - getpid(), DT_RetToString(ret)); - return ret; - } - } + /* RDMA read completion 
event */ + if (collect_event(h_dto_req_evd, + &event, + DTO_TIMEOUT, + &rdma_rd_poll_count[i]) != DAT_SUCCESS) + return (DAT_ABORT); + /* validate event number, len, cookie, and status */ if (event.event_number != DAT_DTO_COMPLETION_EVENT) { fprintf(stderr, "%d: ERROR: DTO event number %s\n", - getpid(), DT_EventToSTr(event.event_number)); + getpid(), DT_EventToStr(event.event_number)); return (DAT_ABORT); } if ((event.event_data.dto_completion_event_data. @@ -1481,7 +1432,7 @@ DAT_RETURN do_rdma_read_with_msg(void) if (event.event_data.dto_completion_event_data.status != DAT_SUCCESS) { fprintf(stderr, "%d: ERROR: DTO event status %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (DAT_ABORT); } stop = get_time(); @@ -1513,48 +1464,25 @@ DAT_RETURN do_rdma_read_with_msg(void) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error send_msg: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d send_msg completed\n", getpid()); } - /* - * Collect first event, write completion or the inbound recv with immed - */ printf("%d Waiting for inbound message....\n", getpid()); - if (polling) { - while (dat_evd_dequeue(h_dto_rcv_evd, &event) == - DAT_QUEUE_EMPTY) ; - } else { - LOGPRINTF("%d waiting for message receive event\n", getpid()); - if (use_cno) { - DAT_EVD_HANDLE evd = DAT_HANDLE_NULL; - - ret = dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd); - LOGPRINTF("%d cno wait return evd_handle=%p\n", - getpid(), evd); - if (evd != h_dto_rcv_evd) { - /* CNO timeout, already on EVD */ - if (evd != NULL) - return (ret); - } - } - /* use wait to dequeue */ - ret = - dat_evd_wait(h_dto_rcv_evd, DTO_TIMEOUT, 1, &event, &nmore); - if (ret != DAT_SUCCESS) { - fprintf(stderr, "%d: ERROR: DTO dat_evd_wait() %s\n", - getpid(), DT_RetToString(ret)); - return (ret); - } - } + + if (collect_event(h_dto_rcv_evd, + &event, + DTO_TIMEOUT, + &poll_count) != DAT_SUCCESS) + return (DAT_ABORT); /* validate event number 
and status */ printf("%d inbound rdma_read; send message arrived!\n", getpid()); if (event.event_number != DAT_DTO_COMPLETION_EVENT) { fprintf(stderr, "%d Error unexpected DTO event : %s\n", - getpid(), DT_EventToSTr(event.event_number)); + getpid(), DT_EventToStr(event.event_number)); return (DAT_ABORT); } @@ -1603,7 +1531,6 @@ DAT_RETURN do_rdma_read_with_msg(void) DAT_RETURN do_ping_pong_msg() { DAT_EVENT event; - DAT_COUNT nmore; DAT_DTO_COOKIE cookie; DAT_LMR_TRIPLET l_iov; DAT_RETURN ret; @@ -1635,7 +1562,7 @@ DAT_RETURN do_ping_pong_msg() if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error posting recv msg buffer: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d Posted Receive Message Buffer %p\n", @@ -1673,47 +1600,21 @@ DAT_RETURN do_ping_pong_msg() if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error send_msg: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d send_msg completed\n", getpid()); } } - /* Wait for recv message */ - if (polling) { - poll_count = 0; - LOGPRINTF("%d Polling for message receive event\n", - getpid()); - while (dat_evd_dequeue(h_dto_rcv_evd, &event) == - DAT_QUEUE_EMPTY) - poll_count++; - } else { - LOGPRINTF("%d waiting for message receive event\n", - getpid()); - if (use_cno) { - DAT_EVD_HANDLE evd = DAT_HANDLE_NULL; - ret = - dat_cno_wait(h_dto_cno, DTO_TIMEOUT, &evd); - LOGPRINTF("%d cno wait return evd_handle=%p\n", - getpid(), evd); - if (evd != h_dto_rcv_evd) { - /* CNO timeout, already on EVD */ - if (evd != NULL) - return (ret); - } - } - /* use wait to dequeue */ - ret = - dat_evd_wait(h_dto_rcv_evd, DTO_TIMEOUT, 1, &event, - &nmore); - if (ret != DAT_SUCCESS) { - fprintf(stderr, - "%d: ERROR: DTO dat_evd_wait() %s\n", - getpid(), DT_RetToString(ret)); - return (ret); - } - } + /* recv message, send completions suppressed */ + if (collect_event(h_dto_rcv_evd, + &event, + DTO_TIMEOUT, + &poll_count) != 
DAT_SUCCESS) + return (DAT_ABORT); + + /* start timer after first message arrives on server */ if (i == 0) { start = get_time(); @@ -1722,7 +1623,7 @@ DAT_RETURN do_ping_pong_msg() LOGPRINTF("%d inbound message; message arrived!\n", getpid()); if (event.event_number != DAT_DTO_COMPLETION_EVENT) { fprintf(stderr, "%d Error unexpected DTO event : %s\n", - getpid(), DT_EventToSTr(event.event_number)); + getpid(), DT_EventToStr(event.event_number)); return (DAT_ABORT); } if ((event.event_data.dto_completion_event_data. @@ -1762,7 +1663,7 @@ DAT_RETURN do_ping_pong_msg() if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error send_msg: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d send_msg completed\n", getpid()); @@ -1805,7 +1706,7 @@ DAT_RETURN register_rdma_memory(void) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error registering Receive RDMA buffer: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d Registered Receive RDMA Buffer %p\n", @@ -1827,7 +1728,7 @@ DAT_RETURN register_rdma_memory(void) ®istered_size_send, ®istered_addr_send); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error registering send RDMA buffer: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d Registered Send RDMA Buffer %p\n", @@ -1854,7 +1755,7 @@ DAT_RETURN unregister_rdma_memory(void) time.total += time.unreg; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error deregistering recv mr: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d Unregistered Recv Buffer\n", getpid()); @@ -1868,7 +1769,7 @@ DAT_RETURN unregister_rdma_memory(void) ret = dat_lmr_free(h_lmr_send); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error deregistering send mr: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d Unregistered 
send Buffer\n", getpid()); @@ -1904,7 +1805,7 @@ DAT_RETURN create_events(void) time.total += time.cnoc; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_cno_create: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d cr_evd created, %p\n", getpid(), @@ -1922,7 +1823,7 @@ DAT_RETURN create_events(void) time.total += time.evdc; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_evd_create: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d cr_evd created %p\n", getpid(), h_cr_evd); @@ -1935,7 +1836,7 @@ DAT_RETURN create_events(void) DAT_EVD_CONNECTION_FLAG, &h_conn_evd); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_evd_create: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d con_evd created %p\n", getpid(), h_conn_evd); @@ -1947,7 +1848,7 @@ DAT_RETURN create_events(void) h_dto_cno, DAT_EVD_DTO_FLAG, &h_dto_req_evd); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_evd_create REQ: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d dto_req_evd created %p\n", getpid(), @@ -1960,7 +1861,7 @@ DAT_RETURN create_events(void) h_dto_cno, DAT_EVD_DTO_FLAG, &h_dto_rcv_evd); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_evd_create RCV: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d dto_rcv_evd created %p\n", getpid(), @@ -1971,7 +1872,7 @@ DAT_RETURN create_events(void) ret = dat_evd_query(h_dto_req_evd, DAT_EVD_FIELD_EVD_QLEN, ¶m); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_evd_query request evd: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else if (param.evd_qlen < (MSG_BUF_COUNT + MAX_RDMA_RD + burst) * 2) { fprintf(stderr, "%d Error dat_evd qsize too small: %d < %d\n", @@ -2001,7 +1902,7 
@@ DAT_RETURN destroy_events(void) ret = dat_evd_free(h_cr_evd); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing cr EVD: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d Freed cr EVD\n", getpid()); @@ -2015,7 +1916,7 @@ DAT_RETURN destroy_events(void) ret = dat_evd_free(h_conn_evd); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing conn EVD: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d Freed conn EVD\n", getpid()); @@ -2033,7 +1934,7 @@ DAT_RETURN destroy_events(void) time.total += time.evdf; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing dto EVD: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d Freed dto EVD\n", getpid()); @@ -2047,7 +1948,7 @@ DAT_RETURN destroy_events(void) ret = dat_evd_free(h_dto_req_evd); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing dto EVD: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d Freed dto EVD\n", getpid()); @@ -2065,7 +1966,7 @@ DAT_RETURN destroy_events(void) time.total += time.cnof; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing dto CNO: %s\n", - getpid(), DT_RetToString(ret)); + getpid(), DT_RetToStr(ret)); return (ret); } else { LOGPRINTF("%d Freed dto CNO\n", getpid()); @@ -2080,7 +1981,7 @@ DAT_RETURN destroy_events(void) * but don't assume the values are zero-based or contiguous. 
*/ char errmsg[512] = { 0 }; -const char *DT_RetToString(DAT_RETURN ret_value) +const char *DT_RetToStr(DAT_RETURN ret_value) { const char *major_msg, *minor_msg; @@ -2096,7 +1997,7 @@ const char *DT_RetToString(DAT_RETURN ret_value) /* * Map DAT_EVENT_CODE values to readable strings */ -const char *DT_EventToSTr(DAT_EVENT_NUMBER event_code) +const char *DT_EventToStr(DAT_EVENT_NUMBER event_code) { unsigned int i; static struct { From arlin.r.davis at intel.com Tue Aug 4 22:40:18 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Tue, 4 Aug 2009 22:40:18 -0700 Subject: [ofa-general] [PATCH] uDAPL v2: scm: transistion QP to error state when disconnecting instead of reset/init. Message-ID: SCM: Fix disconnect. QP's need to move to ERROR state in order to flush work requests and notify consumer. Moving to RESET removed all requests but did not notify consumer. diff --git a/dapl/openib_scm/cm.c b/dapl/openib_scm/cm.c index 164cc4e..416ee71 100644 --- a/dapl/openib_scm/cm.c +++ b/dapl/openib_scm/cm.c @@ -773,7 +773,7 @@ ud_bail: bail: /* close socket, and post error event */ - dapls_ib_reinit_ep(ep_ptr); /* reset QP state */ + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0); closesocket(cm_ptr->socket); cm_ptr->socket = DAPL_INVALID_SOCKET; dapl_evd_connection_callback(NULL, event, cm_ptr->p_data, ep_ptr); @@ -1107,7 +1107,7 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr, return DAT_SUCCESS; bail: dapls_ib_cm_free(cm_ptr, cm_ptr->ep); - dapls_ib_reinit_ep(ep_ptr); /* reset QP state */ + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0); return DAT_INTERNAL_ERROR; } @@ -1169,7 +1169,7 @@ void dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr) return; bail: - dapls_ib_reinit_ep(cm_ptr->ep); /* reset QP state */ + dapls_modify_qp_state(cm_ptr->ep->qp_handle, IBV_QPS_ERR, 0); dapls_ib_cm_free(cm_ptr, cm_ptr->ep); dapls_cr_callback(cm_ptr, IB_CME_DESTINATION_REJECT, NULL, cm_ptr->sp); } @@ -1236,9 +1236,9 @@ dapls_ib_disconnect(IN DAPL_EP * ep_ptr, IN 
DAT_CLOSE_FLAGS close_flags) dapl_dbg_log(DAPL_DBG_TYPE_EP, "dapls_ib_disconnect(ep_handle %p ....)\n", ep_ptr); - /* reinit to modify QP state */ - dapls_ib_reinit_ep(ep_ptr); - + /* Transition to error state to flush queue */ + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0); + if (ep_ptr->cm_handle == NULL || ep_ptr->param.ep_state == DAT_EP_STATE_DISCONNECTED) return DAT_SUCCESS; From yosefe at voltaire.com Tue Aug 4 23:06:18 2009 From: yosefe at voltaire.com (Yossi Etigin) Date: Wed, 05 Aug 2009 09:06:18 +0300 Subject: [ofa-general] [PATCH] ipoib: refresh path when remote lid changes In-Reply-To: <20090804045647.GK24282@obsidianresearch.com> References: <4A705B3A.7060404@Voltaire.COM> <4A731818.3060500@voltaire.com> <4A733D24.3040201@voltaire.com> <4A742E94.2070002@gmail.com> <4A771852.1010606@voltaire.com> <20090804045647.GK24282@obsidianresearch.com> Message-ID: <4A79215A.3030807@voltaire.com> On 04/08/09 07:56, Jason Gunthorpe wrote: > On Mon, Aug 03, 2009 at 08:03:14PM +0300, Yossi Etigin wrote: > >> The ARP stuff works this way: Remote LID changes. In some point, either the remote >> node will send an ARP reply (gratuitous), or (more likely) the local network stack >> will start sending solicited ARPs, unicast, using the invalid path. They will fail, >> so the stack will send broadcast ARP. > > Erm.. Maybe a little tighter integration with the ARP/ND layer is in > order. If it knows unicast isn't working thats a pretty damn good clue > to discard the PR. > > Jason I agree with that. If the network stack told ipoib when the neighbour became unreachable, life would be a lot easier. Unfortunately, the closest thing available now is neigh_cleanup, and it is only called when the neighbour entry is removed from the table (which may be quite some time after it becomes unreachable).
--Yossi From sashak at voltaire.com Wed Aug 5 00:17:04 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 10:17:04 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_lash.c: Directly call calloc/free rather than create/delete_cdg In-Reply-To: <20090804213905.GA23497@comcast.net> References: <20090804213905.GA23497@comcast.net> Message-ID: <20090805071704.GK7993@me> On 17:39 Tue 04 Aug , Hal Rosenstock wrote: > > Reduce call stack by one call level > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Aug 5 00:17:33 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 10:17:33 +0300 Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_lin_fwd_rcv.c: Commentary change In-Reply-To: <20090804214413.GA24878@comcast.net> References: <20090804214413.GA24878@comcast.net> Message-ID: <20090805071733.GL7993@me> On 17:44 Tue 04 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Aug 5 00:18:57 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 10:18:57 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibsendtrap.c: Add support for link_speed_enabled_change trap In-Reply-To: <20090804125009.GB12236@comcast.net> References: <20090804125009.GB12236@comcast.net> Message-ID: <20090805071857.GM7993@me> On 08:50 Tue 04 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From kliteyn at dev.mellanox.co.il Wed Aug 5 00:25:00 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 05 Aug 2009 10:25:00 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> Message-ID: <4A7933CC.6080503@dev.mellanox.co.il> Hal Rosenstock wrote: > > > Routing calculation phase of the ucast manager took ~1200 usec, > > the rest was sending the blocks and waiting for no more pending > > transactions. > > > > No noticeable difference between various max_smps_per_node values > > was observed. > > What is the reason? > > > I think the reason was max_wire_smps may have kicked in but Yevgeny is > best to elaborate on this. > Correct, this was because of max_wire_smps -- Yevgeny From sashak at voltaire.com Wed Aug 5 00:33:30 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 10:33:30 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: <4A7933CC.6080503@dev.mellanox.co.il> References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <4A7933CC.6080503@dev.mellanox.co.il> Message-ID: <20090805073330.GN7993@me> On 10:25 Wed 05 Aug , Yevgeny Kliteynik wrote: > > > > > > No noticeable difference between various max_smps_per_node values > > > was observed. > > > > What is the reason? > > > > > > I think the reason was max_wire_smps may have kicked in but Yevgeny is > > best to elaborate on this. > > > > Correct, this was because of max_wire_smps What was 'max_wire_smps' value used in the tests? 
Sasha From kliteyn at dev.mellanox.co.il Wed Aug 5 00:37:00 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 05 Aug 2009 10:37:00 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: <20090805073330.GN7993@me> References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <4A7933CC.6080503@dev.mellanox.co.il> <20090805073330.GN7993@me> Message-ID: <4A79369C.4090108@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 10:25 Wed 05 Aug , Yevgeny Kliteynik wrote: >>> > >>> > No noticeable difference between various max_smps_per_node values >>> > was observed. >>> >>> What is the reason? >>> >>> >>> I think the reason was max_wire_smps may have kicked in but Yevgeny is >>> best to elaborate on this. >>> >> Correct, this was because of max_wire_smps > > What was 'max_wire_smps' value used in the tests? The numbers that I posted refer to default max_wire_smps, which is 4. I didn't try to bump it up, though I guess that it might improve LFT config time. -- Yevgeny > Sasha > From eli at dev.mellanox.co.il Wed Aug 5 01:27:51 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:27:51 +0300 Subject: [ofa-general] [PATCHv4 0/10] RDMAoE support Message-ID: <20090805082751.GA5599@mtls03> RDMA over Ethernet (RDMAoE) allows running the IB transport protocol using Ethernet frames, enabling the deployment of IB semantics on lossless Ethernet fabrics. RDMAoE packets are standard Ethernet frames with an IEEE assigned Ethertype, a GRH, unmodified IB transport headers and payload. IB subnet management and SA services are not required for RDMAoE operation; Ethernet management practices are used instead. RDMAoE encodes IP addresses into its GIDs and resolves MAC addresses using the host IP stack. For multicast GIDs, standard IP to MAC mappings apply. To support RDMAoE, a new transport protocol was added to the IB core. 
An RDMA device can have ports with different transports, which are identified by a port transport attribute. The RDMA Verbs API is syntactically unmodified. When referring to RDMAoE ports, address handles are required to contain GIDs while LID fields are ignored. The Ethernet L2 information is subsequently obtained by the vendor-specific driver (both in kernel- and user-space) while modifying QPs to RTR and creating address handles. As there is no SA in RDMAoE, the CMA code is modified to fill the necessary path record attributes locally before sending CM packets. Similarly, the CMA provides to the user the required address handle attributes when processing SIDR requests and joining multicast groups.

In this patch set, an RDMAoE port is currently assigned a single GID, encoding the IPv6 link-local address of the corresponding netdev; the CMA RDMAoE code temporarily uses IPv6 link-local addresses as GIDs instead of the IP address provided by the user, thereby supporting any IP address. In addition, multicast packets currently use the broadcast MAC.

To enable RDMAoE with the mlx4 driver stack, both the mlx4_en and mlx4_ib drivers must be loaded, and the netdevice for the corresponding RDMAoE port must be running. Individual ports of a multi-port HCA can be independently configured as Ethernet (with support for RDMAoE) or IB, as is already the case.

We have successfully tested MPI, SDP, RDS, and native Verbs applications over RDMAoE.

Following is a series of 10 patches based on version 2.6.30 of the Linux kernel. This new series reflects changes based on feedback from the community on the previous set of patches, and is tagged v4.

Changes from v3:

1. RDMA transport is determined on a per-port basis instead of the link-type notion.
2. SA services are not provided for RDMAoE clients. CMA code is modified to support RDMAoE transport types.
3. For brevity, GID to MAC resolution is currently restricted to link-local addresses.
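
To illustrate the link-local GID encoding described above, here is a minimal user-space sketch. It assumes the GID carries a standard IPv6 link-local address built from the port's MAC in modified EUI-64 form (fe80::/64 prefix, ff:fe inserted in the middle, universal/local bit flipped). The helper names mac_to_ll_gid() and ll_gid_to_mac() are hypothetical and are not taken from the patches; the actual kernel helpers in this series may differ.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helpers, for illustration only (not from the patch set).
 * A GID is treated here as a raw 16-byte IPv6 address.
 */

/* Build a link-local GID (fe80::/64 + modified EUI-64) from a MAC. */
static void mac_to_ll_gid(const uint8_t mac[6], uint8_t gid[16])
{
	memset(gid, 0, 16);
	gid[0] = 0xfe;			/* fe80::/64 link-local prefix */
	gid[1] = 0x80;
	gid[8] = mac[0] ^ 0x02;		/* flip the universal/local bit */
	gid[9] = mac[1];
	gid[10] = mac[2];
	gid[11] = 0xff;			/* ff:fe filler between OUI and NIC */
	gid[12] = 0xfe;
	gid[13] = mac[3];
	gid[14] = mac[4];
	gid[15] = mac[5];
}

/*
 * Recover the MAC from a link-local GID.
 * Returns 0 on success, -1 if the GID is not in the expected
 * fe80::/64 + modified EUI-64 form.
 */
static int ll_gid_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
	static const uint8_t zeros[6];

	if (gid[0] != 0xfe || gid[1] != 0x80 || memcmp(gid + 2, zeros, 6))
		return -1;
	if (gid[11] != 0xff || gid[12] != 0xfe)
		return -1;

	mac[0] = gid[8] ^ 0x02;		/* undo the universal/local flip */
	mac[1] = gid[9];
	mac[2] = gid[10];
	mac[3] = gid[13];
	mac[4] = gid[14];
	mac[5] = gid[15];
	return 0;
}
```

Under these assumptions, a MAC of 00:02:c9:01:02:03 maps to the GID fe80::202:c9ff:fe01:203, and the reverse lookup needs no neighbour discovery, which is presumably why resolution is restricted to link-local addresses for now.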
Signed-off-by: Eli Cohen --- b/drivers/infiniband/core/agent.c | 12 - b/drivers/infiniband/core/cm.c | 26 +- b/drivers/infiniband/core/cma.c | 54 ++-- b/drivers/infiniband/core/mad.c | 42 ++- b/drivers/infiniband/core/multicast.c | 5 b/drivers/infiniband/core/sa_query.c | 40 ++- b/drivers/infiniband/core/ucm.c | 8 b/drivers/infiniband/core/ucma.c | 2 b/drivers/infiniband/core/ud_header.c | 111 ++++++++++ b/drivers/infiniband/core/user_mad.c | 7 b/drivers/infiniband/core/uverbs.h | 1 b/drivers/infiniband/core/uverbs_cmd.c | 32 ++ b/drivers/infiniband/core/uverbs_main.c | 1 b/drivers/infiniband/core/verbs.c | 12 - b/drivers/infiniband/hw/mlx4/ah.c | 187 +++++++++++++--- b/drivers/infiniband/hw/mlx4/main.c | 309 +++++++++++++++++++++++++--- b/drivers/infiniband/hw/mlx4/mlx4_ib.h | 19 + b/drivers/infiniband/hw/mlx4/qp.c | 172 ++++++++++----- b/drivers/infiniband/ulp/ipoib/ipoib_main.c | 12 - b/drivers/net/mlx4/en_main.c | 15 + b/drivers/net/mlx4/en_port.c | 4 b/drivers/net/mlx4/en_port.h | 3 b/drivers/net/mlx4/fw.c | 3 b/drivers/net/mlx4/intf.c | 20 + b/drivers/net/mlx4/main.c | 6 b/drivers/net/mlx4/mlx4.h | 1 b/include/linux/mlx4/cmd.h | 1 b/include/linux/mlx4/device.h | 31 ++ b/include/linux/mlx4/driver.h | 16 + b/include/linux/mlx4/qp.h | 8 b/include/rdma/ib_addr.h | 87 +++++++ b/include/rdma/ib_pack.h | 26 ++ b/include/rdma/ib_user_verbs.h | 21 + b/include/rdma/ib_verbs.h | 9 b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3 b/net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 drivers/infiniband/core/cm.c | 2 drivers/infiniband/core/cma.c | 150 +++++++++++++ drivers/infiniband/core/mad.c | 55 +++- drivers/infiniband/core/ucm.c | 12 - drivers/infiniband/core/ucma.c | 25 +- drivers/infiniband/core/user_mad.c | 27 +- drivers/infiniband/core/verbs.c | 10 include/rdma/ib_verbs.h | 15 + 44 files changed, 1333 insertions(+), 271 deletions(-) From eli at mellanox.co.il Wed Aug 5 01:28:08 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:28:08 +0300 Subject: 
[ofa-general] [PATCHv4 01/10] ib_core: Refine device personality from node type to port type Message-ID: <20090805082808.GB5599@mtls03> In preparation for devices that, in general, support a different transport protocol on each port, specifically RDMAoE, this patch defines a transport type for each of a device's ports. As a result, rdma_node_get_transport() has been unexported and is used internally by the implementation of the new API, rdma_port_get_transport(), which gives the transport protocol of the queried port. All references to rdma_node_get_transport() are changed to use rdma_port_get_transport(). Also, ib_port_attr is extended to contain enum rdma_transport_type. Signed-off-by: Eli Cohen --- drivers/infiniband/core/cm.c | 26 ++++++++----- drivers/infiniband/core/cma.c | 54 +++++++++++++++-------------- drivers/infiniband/core/mad.c | 42 +++++++++++++--------- drivers/infiniband/core/multicast.c | 5 +-- drivers/infiniband/core/sa_query.c | 40 ++++++++++++--------- drivers/infiniband/core/ucm.c | 8 +++- drivers/infiniband/core/ucma.c | 2 +- drivers/infiniband/core/user_mad.c | 7 ++-- drivers/infiniband/core/verbs.c | 12 +++++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 12 +++--- include/rdma/ib_verbs.h | 9 +++-- net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3 +- net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +- 13 files changed, 128 insertions(+), 94 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 5130fc5..f930f1d 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -3678,9 +3678,7 @@ static void cm_add_one(struct ib_device *ib_device) unsigned long flags; int ret; u8 i; - - if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB) - return; + enum rdma_transport_type tt; cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) * ib_device->phys_port_cnt, GFP_KERNEL); @@ -3700,6 +3698,10 @@ static void cm_add_one(struct ib_device *ib_device) set_bit(IB_MGMT_METHOD_SEND, 
reg_req.method_mask); for (i = 1; i <= ib_device->phys_port_cnt; i++) { + tt = rdma_port_get_transport(ib_device, i); + if (tt != RDMA_TRANSPORT_IB) + continue; + port = kzalloc(sizeof *port, GFP_KERNEL); if (!port) goto error1; @@ -3742,9 +3744,11 @@ error1: port_modify.clr_port_cap_mask = IB_PORT_CM_SUP; while (--i) { port = cm_dev->port[i-1]; - ib_modify_port(ib_device, port->port_num, 0, &port_modify); - ib_unregister_mad_agent(port->mad_agent); - cm_remove_port_fs(port); + if (port) { + ib_modify_port(ib_device, port->port_num, 0, &port_modify); + ib_unregister_mad_agent(port->mad_agent); + cm_remove_port_fs(port); + } } device_unregister(cm_dev->device); kfree(cm_dev); @@ -3770,10 +3774,12 @@ static void cm_remove_one(struct ib_device *ib_device) for (i = 1; i <= ib_device->phys_port_cnt; i++) { port = cm_dev->port[i-1]; - ib_modify_port(ib_device, port->port_num, 0, &port_modify); - ib_unregister_mad_agent(port->mad_agent); - flush_workqueue(cm.wq); - cm_remove_port_fs(port); + if (port) { + ib_modify_port(ib_device, port->port_num, 0, &port_modify); + ib_unregister_mad_agent(port->mad_agent); + flush_workqueue(cm.wq); + cm_remove_port_fs(port); + } } device_unregister(cm_dev->device); kfree(cm_dev); diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index aa62101..866ff7f 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -337,24 +337,26 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv) struct cma_device *cma_dev; union ib_gid gid; int ret = -ENODEV; - - switch (rdma_node_get_transport(dev_addr->dev_type)) { - case RDMA_TRANSPORT_IB: - ib_addr_get_sgid(dev_addr, &gid); - break; - case RDMA_TRANSPORT_IWARP: - iw_addr_get_sgid(dev_addr, &gid); - break; - default: - return -ENODEV; - } + int port; list_for_each_entry(cma_dev, &dev_list, list) { - ret = ib_find_cached_gid(cma_dev->device, &gid, - &id_priv->id.port_num, NULL); - if (!ret) { - cma_attach_to_dev(id_priv, cma_dev); - break; + for 
(port = 1; port <= cma_dev->device->phys_port_cnt; ++port) { + switch (rdma_port_get_transport(cma_dev->device, port)) { + case RDMA_TRANSPORT_IB: + ib_addr_get_sgid(dev_addr, &gid); + break; + case RDMA_TRANSPORT_IWARP: + iw_addr_get_sgid(dev_addr, &gid); + break; + default: + return -ENODEV; + } + ret = ib_find_cached_gid(cma_dev->device, &gid, + &id_priv->id.port_num, NULL); + if (!ret) { + cma_attach_to_dev(id_priv, cma_dev); + return ret; + } } } return ret; @@ -605,7 +607,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr, int ret = 0; id_priv = container_of(id, struct rdma_id_private, id); - switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) { case RDMA_TRANSPORT_IB: if (!id_priv->cm_id.ib || cma_is_ud_ps(id_priv->id.ps)) ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask); @@ -755,7 +757,7 @@ static inline int cma_user_data_offset(enum rdma_port_space ps) static void cma_cancel_route(struct rdma_id_private *id_priv) { - switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) { case RDMA_TRANSPORT_IB: if (id_priv->query) ib_sa_cancel_query(id_priv->query_id, id_priv->query); @@ -851,7 +853,7 @@ void rdma_destroy_id(struct rdma_cm_id *id) mutex_lock(&lock); if (id_priv->cma_dev) { mutex_unlock(&lock); - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) { case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); @@ -1508,7 +1510,7 @@ int rdma_listen(struct rdma_cm_id *id, int backlog) id_priv->backlog = backlog; if (id->device) { - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: ret = cma_ib_listen(id_priv); if (ret) @@ 
-1735,7 +1737,7 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) return -EINVAL; atomic_inc(&id_priv->refcount); - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; @@ -2415,7 +2417,7 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) id_priv->srq = conn_param->srq; } - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: if (cma_is_ud_ps(id->ps)) ret = cma_resolve_ib_udp(id_priv, conn_param); @@ -2528,7 +2530,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) id_priv->srq = conn_param->srq; } - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: if (cma_is_ud_ps(id->ps)) ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS, @@ -2589,7 +2591,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data, if (!cma_has_cm_dev(id_priv)) return -EINVAL; - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: if (cma_is_ud_ps(id->ps)) ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, @@ -2620,7 +2622,7 @@ int rdma_disconnect(struct rdma_cm_id *id) if (!cma_has_cm_dev(id_priv)) return -EINVAL; - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: ret = cma_modify_qp_err(id_priv); if (ret) @@ -2776,7 +2778,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, spin_unlock(&id_priv->lock); kref_get(&mc->mcref); - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: ret = 
cma_join_ib_multicast(id_priv, mc); break; diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index de922a0..7b737c4 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2905,9 +2905,7 @@ static int ib_mad_port_close(struct ib_device *device, int port_num) static void ib_mad_init_device(struct ib_device *device) { int start, end, i; - - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) - return; + enum rdma_transport_type tt; if (device->node_type == RDMA_NODE_IB_SWITCH) { start = 0; @@ -2918,6 +2916,10 @@ static void ib_mad_init_device(struct ib_device *device) } for (i = start; i <= end; i++) { + tt = rdma_port_get_transport(device, i); + if (tt != RDMA_TRANSPORT_IB) + continue; + if (ib_mad_port_open(device, i)) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, i); @@ -2941,13 +2943,15 @@ error: i--; while (i >= start) { - if (ib_agent_port_close(device, i)) - printk(KERN_ERR PFX "Couldn't close %s port %d " - "for agents\n", - device->name, i); - if (ib_mad_port_close(device, i)) - printk(KERN_ERR PFX "Couldn't close %s port %d\n", - device->name, i); + if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) { + if (ib_agent_port_close(device, i)) + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, i); + if (ib_mad_port_close(device, i)) + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, i); + } i--; } } @@ -2955,6 +2959,7 @@ error: static void ib_mad_remove_device(struct ib_device *device) { int i, num_ports, cur_port; + enum rdma_transport_type tt; if (device->node_type == RDMA_NODE_IB_SWITCH) { num_ports = 1; @@ -2964,13 +2969,16 @@ static void ib_mad_remove_device(struct ib_device *device) cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - if (ib_agent_port_close(device, cur_port)) - printk(KERN_ERR PFX "Couldn't close %s port %d " - "for agents\n", - device->name, cur_port); - if 
(ib_mad_port_close(device, cur_port)) - printk(KERN_ERR PFX "Couldn't close %s port %d\n", - device->name, cur_port); + tt = rdma_port_get_transport(device, i); + if (tt == RDMA_TRANSPORT_IB) { + if (ib_agent_port_close(device, cur_port)) + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + if (ib_mad_port_close(device, cur_port)) + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + } } } diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c index 107f170..3a4c6f8 100644 --- a/drivers/infiniband/core/multicast.c +++ b/drivers/infiniband/core/multicast.c @@ -788,10 +788,7 @@ static void mcast_add_one(struct ib_device *device) struct mcast_port *port; int i; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) - return; - - dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, + dev = kzalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, GFP_KERNEL); if (!dev) return; diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 1865049..834ea14 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -416,14 +416,16 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event struct ib_sa_port *port = &sa_dev->port[event->element.port_num - sa_dev->start_port]; - spin_lock_irqsave(&port->ah_lock, flags); - if (port->sm_ah) - kref_put(&port->sm_ah->ref, free_sm_ah); - port->sm_ah = NULL; - spin_unlock_irqrestore(&port->ah_lock, flags); - - schedule_work(&sa_dev->port[event->element.port_num - - sa_dev->start_port].update_task); + if (rdma_port_get_transport(handler->device, port->port_num) == RDMA_TRANSPORT_IB) { + spin_lock_irqsave(&port->ah_lock, flags); + if (port->sm_ah) + kref_put(&port->sm_ah->ref, free_sm_ah); + port->sm_ah = NULL; + spin_unlock_irqrestore(&port->ah_lock, flags); + + schedule_work(&sa_dev->port[event->element.port_num 
- + sa_dev->start_port].update_task); + } } } @@ -991,9 +993,6 @@ static void ib_sa_add_one(struct ib_device *device) struct ib_sa_device *sa_dev; int s, e, i; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) - return; - if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { @@ -1001,7 +1000,7 @@ static void ib_sa_add_one(struct ib_device *device) e = device->phys_port_cnt; } - sa_dev = kmalloc(sizeof *sa_dev + + sa_dev = kzalloc(sizeof *sa_dev + (e - s + 1) * sizeof (struct ib_sa_port), GFP_KERNEL); if (!sa_dev) @@ -1011,6 +1010,9 @@ static void ib_sa_add_one(struct ib_device *device) sa_dev->end_port = e; for (i = 0; i <= e - s; ++i) { + if (rdma_port_get_transport(device, i + 1) != RDMA_TRANSPORT_IB) + continue; + sa_dev->port[i].sm_ah = NULL; sa_dev->port[i].port_num = i + s; spin_lock_init(&sa_dev->port[i].ah_lock); @@ -1039,13 +1041,15 @@ static void ib_sa_add_one(struct ib_device *device) goto err; for (i = 0; i <= e - s; ++i) - update_sm_ah(&sa_dev->port[i].update_task); + if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) + update_sm_ah(&sa_dev->port[i].update_task); return; err: while (--i >= 0) - ib_unregister_mad_agent(sa_dev->port[i].agent); + if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB) + ib_unregister_mad_agent(sa_dev->port[i].agent); kfree(sa_dev); @@ -1065,9 +1069,11 @@ static void ib_sa_remove_one(struct ib_device *device) flush_scheduled_work(); for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { - ib_unregister_mad_agent(sa_dev->port[i].agent); - if (sa_dev->port[i].sm_ah) - kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + if (sa_dev->port[i].sm_ah) + kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + } } kfree(sa_dev); diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index 51bd966..4f5096d 100644 --- 
a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -1239,11 +1239,15 @@ static DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); static void ib_ucm_add_one(struct ib_device *device) { struct ib_ucm_device *ucm_dev; + int i; - if (!device->alloc_ucontext || - rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + if (!device->alloc_ucontext || device->node_type == RDMA_NODE_IB_SWITCH) return; + for (i = 1; i <= device->phys_port_cnt; ++i) + if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB) + return; + ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); if (!ucm_dev) return; diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index 4346a24..24d9510 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -614,7 +614,7 @@ static ssize_t ucma_query_route(struct ucma_file *file, resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid; resp.port_num = ctx->cm_id->port_num; - switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) { + switch (rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num)) { case RDMA_TRANSPORT_IB: ucma_copy_ib_route(&resp, &ctx->cm_id->route); break; diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 8c46f22..3e58fc0 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -1113,9 +1113,6 @@ static void ib_umad_add_one(struct ib_device *device) struct ib_umad_device *umad_dev; int s, e, i; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) - return; - if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; else { @@ -1123,6 +1120,10 @@ static void ib_umad_add_one(struct ib_device *device) e = device->phys_port_cnt; } + for (i = s; i <= e; ++i) + if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB) + return; + umad_dev = kzalloc(sizeof *umad_dev + (e - s + 1) * sizeof (struct ib_umad_port), GFP_KERNEL); diff --git 
a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index a7da9be..3b2f00b 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -77,7 +77,7 @@ enum ib_rate mult_to_ib_rate(int mult) } EXPORT_SYMBOL(mult_to_ib_rate); -enum rdma_transport_type +static enum rdma_transport_type rdma_node_get_transport(enum rdma_node_type node_type) { switch (node_type) { @@ -92,7 +92,15 @@ rdma_node_get_transport(enum rdma_node_type node_type) return 0; } } -EXPORT_SYMBOL(rdma_node_get_transport); + +enum rdma_transport_type rdma_port_get_transport(struct ib_device *device, + u8 port_num) +{ + return device->get_port_transport ? + device->get_port_transport(device, port_num) : + rdma_node_get_transport(device->node_type); +} +EXPORT_SYMBOL(rdma_port_get_transport); /* Protection domains */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index ab2c192..39df0f7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1337,9 +1337,6 @@ static void ipoib_add_one(struct ib_device *device) struct ipoib_dev_priv *priv; int s, e, p; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) - return; - dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; @@ -1355,6 +1352,9 @@ static void ipoib_add_one(struct ib_device *device) } for (p = s; p <= e; ++p) { + if (rdma_port_get_transport(device, p) != RDMA_TRANSPORT_IB) + continue; + dev = ipoib_add_port("ib%d", device, p); if (!IS_ERR(dev)) { priv = netdev_priv(dev); @@ -1370,12 +1370,12 @@ static void ipoib_remove_one(struct ib_device *device) struct ipoib_dev_priv *priv, *tmp; struct list_head *dev_list; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) - return; - dev_list = ib_get_client_data(device, &ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { + if (rdma_port_get_transport(device, priv->port) != RDMA_TRANSPORT_IB) + 
continue; + ib_unregister_event_handler(&priv->event_handler); rtnl_lock(); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index c179318..b557129 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -72,9 +72,6 @@ enum rdma_transport_type { RDMA_TRANSPORT_IWARP }; -enum rdma_transport_type -rdma_node_get_transport(enum rdma_node_type node_type) __attribute_const__; - enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), @@ -298,6 +295,7 @@ struct ib_port_attr { u8 active_width; u8 active_speed; u8 phys_state; + enum rdma_transport_type transport; }; enum ib_device_modify_flags { @@ -1003,6 +1001,8 @@ struct ib_device { int (*query_port)(struct ib_device *device, u8 port_num, struct ib_port_attr *port_attr); + enum rdma_transport_type (*get_port_transport)(struct ib_device *device, + u8 port_num); int (*query_gid)(struct ib_device *device, u8 port_num, int index, union ib_gid *gid); @@ -1213,6 +1213,9 @@ int ib_query_device(struct ib_device *device, int ib_query_port(struct ib_device *device, u8 port_num, struct ib_port_attr *port_attr); +enum rdma_transport_type rdma_port_get_transport(struct ib_device *device, + u8 port_num); + int ib_query_gid(struct ib_device *device, u8 port_num, int index, union ib_gid *gid); diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index 42a6f9f..769dc18 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -338,8 +338,7 @@ static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt, static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count) { if ((RDMA_TRANSPORT_IWARP == - rdma_node_get_transport(xprt->sc_cm_id-> - device->node_type)) + rdma_port_get_transport(xprt->sc_cm_id->device, xprt->sc_cm_id->port_num)) && sge_count > 1) return 1; else diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 5151f9f..a5a4162 100644 
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -976,7 +976,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt) /* * Determine if a DMA MR is required and if so, what privs are required */ - switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) { + switch (rdma_port_get_transport(newxprt->sc_cm_id->device, newxprt->sc_cm_id->port_num)) { case RDMA_TRANSPORT_IWARP: newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV; if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) { -- 1.6.3.3 From eli at mellanox.co.il Wed Aug 5 01:28:23 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:28:23 +0300 Subject: [ofa-general] [PATCHv4 02/10] ib_core: Add RDMAoE transport protocol Message-ID: <20090805082823.GC5599@mtls03> Add a new transport protocol, RDMAoE, used for transporting Infiniband traffic over Ethernet fabrics. Signed-off-by: Eli Cohen --- include/rdma/ib_verbs.h | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index b557129..4eec70f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -69,7 +69,8 @@ enum rdma_node_type { enum rdma_transport_type { RDMA_TRANSPORT_IB, - RDMA_TRANSPORT_IWARP + RDMA_TRANSPORT_IWARP, + RDMA_TRANSPORT_RDMAOE }; enum ib_device_cap_flags { -- 1.6.3.3 From eli at mellanox.co.il Wed Aug 5 01:28:54 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:28:54 +0300 Subject: [ofa-general] [PATCHv4 03/10] ib_core: RDMAoE support only QP1 Message-ID: <20090805082854.GD5599@mtls03> Since RDMAoE is using Ethernet as its link layer, there is no need for QP0. QP1 is still needed since it handles communications between CM agents. This patch will create only QP1 for RDMAoE ports. 
Signed-off-by: Eli Cohen --- drivers/infiniband/core/agent.c | 12 +++++--- drivers/infiniband/core/mad.c | 55 +++++++++++++++++++++++++++++---------- 2 files changed, 49 insertions(+), 18 deletions(-) diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c index ae7c288..c3f2048 100644 --- a/drivers/infiniband/core/agent.c +++ b/drivers/infiniband/core/agent.c @@ -48,6 +48,8 @@ struct ib_agent_port_private { struct list_head port_list; struct ib_mad_agent *agent[2]; + struct ib_device *device; + u8 port_num; }; static DEFINE_SPINLOCK(ib_agent_port_list_lock); @@ -58,11 +60,10 @@ __ib_get_agent_port(struct ib_device *device, int port_num) { struct ib_agent_port_private *entry; - list_for_each_entry(entry, &ib_agent_port_list, port_list) { - if (entry->agent[0]->device == device && - entry->agent[0]->port_num == port_num) + list_for_each_entry(entry, &ib_agent_port_list, port_list) + if (entry->device == device && entry->port_num == port_num) return entry; - } + return NULL; } @@ -175,6 +176,9 @@ int ib_agent_port_open(struct ib_device *device, int port_num) goto error3; } + port_priv->device = device; + port_priv->port_num = port_num; + spin_lock_irqsave(&ib_agent_port_list_lock, flags); list_add_tail(&port_priv->port_list, &ib_agent_port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 7b737c4..de83c71 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -199,6 +199,16 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, unsigned long flags; u8 mgmt_class, vclass; + /* Validate device and port */ + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + + if (!port_priv->qp_info[qp_type].qp) + return NULL; + /* Validate parameters */ qpn = get_spl_qp_index(qp_type); if (qpn == -1) @@ -260,13 +270,6 @@ struct ib_mad_agent 
*ib_register_mad_agent(struct ib_device *device, goto error1; } - /* Validate device and port */ - port_priv = ib_get_mad_port(device, port_num); - if (!port_priv) { - ret = ERR_PTR(-ENODEV); - goto error1; - } - /* Allocate structures */ mad_agent_priv = kzalloc(sizeof *mad_agent_priv, GFP_KERNEL); if (!mad_agent_priv) { @@ -556,6 +559,9 @@ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_snoop_private *mad_snoop_priv; + if (!mad_agent) + return 0; + /* If the TID is zero, the agent can only snoop. */ if (mad_agent->hi_tid) { mad_agent_priv = container_of(mad_agent, @@ -2602,6 +2608,9 @@ static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) struct ib_mad_private *recv; struct ib_mad_list_head *mad_list; + if (!qp_info->qp) + return; + while (!list_empty(&qp_info->recv_queue.list)) { mad_list = list_entry(qp_info->recv_queue.list.next, @@ -2643,6 +2652,9 @@ static int ib_mad_port_start(struct ib_mad_port_private *port_priv) for (i = 0; i < IB_MAD_QPS_CORE; i++) { qp = port_priv->qp_info[i].qp; + if (!qp) + continue; + /* * PKey index for QP1 is irrelevant but * one is needed for the Reset to Init transition @@ -2684,6 +2696,9 @@ static int ib_mad_port_start(struct ib_mad_port_private *port_priv) } for (i = 0; i < IB_MAD_QPS_CORE; i++) { + if (!port_priv->qp_info[i].qp) + continue; + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { printk(KERN_ERR PFX "Couldn't post receive WRs\n"); @@ -2762,6 +2777,9 @@ error: static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) { + if (!qp_info->qp) + return; + ib_destroy_qp(qp_info->qp); kfree(qp_info->snoop_table); } @@ -2777,6 +2795,7 @@ static int ib_mad_port_open(struct ib_device *device, struct ib_mad_port_private *port_priv; unsigned long flags; char name[sizeof "ib_mad123"]; + int has_smi; /* Create new device info */ port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL); @@ -2793,6 +2812,10 @@ static int 
ib_mad_port_open(struct ib_device *device, init_mad_qp(port_priv, &port_priv->qp_info[1]); cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + has_smi = rdma_port_get_transport(device, port_num) == RDMA_TRANSPORT_IB; + if (has_smi) + cq_size *= 2; + port_priv->cq = ib_create_cq(port_priv->device, ib_mad_thread_completion_handler, NULL, port_priv, cq_size, 0); @@ -2816,9 +2839,11 @@ static int ib_mad_port_open(struct ib_device *device, goto error5; } - ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); - if (ret) - goto error6; + if (has_smi) { + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + } ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); if (ret) goto error7; @@ -2852,7 +2877,8 @@ error9: error8: destroy_mad_qp(&port_priv->qp_info[1]); error7: - destroy_mad_qp(&port_priv->qp_info[0]); + if (has_smi) + destroy_mad_qp(&port_priv->qp_info[0]); error6: ib_dereg_mr(port_priv->mr); error5: @@ -2917,7 +2943,7 @@ static void ib_mad_init_device(struct ib_device *device) for (i = start; i <= end; i++) { tt = rdma_port_get_transport(device, i); - if (tt != RDMA_TRANSPORT_IB) + if (tt != RDMA_TRANSPORT_IB && tt != RDMA_TRANSPORT_RDMAOE) continue; if (ib_mad_port_open(device, i)) { @@ -2943,7 +2969,8 @@ error: i--; while (i >= start) { - if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) { + tt = rdma_port_get_transport(device, i); + if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) { if (ib_agent_port_close(device, i)) printk(KERN_ERR PFX "Couldn't close %s port %d " "for agents\n", @@ -2970,7 +2997,7 @@ static void ib_mad_remove_device(struct ib_device *device) } for (i = 0; i < num_ports; i++, cur_port++) { tt = rdma_port_get_transport(device, i); - if (tt == RDMA_TRANSPORT_IB) { + if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) { if (ib_agent_port_close(device, cur_port)) printk(KERN_ERR PFX "Couldn't close %s port %d " "for agents\n", -- 1.6.3.3 From eli at mellanox.co.il Wed 
Aug 5 01:29:10 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:29:10 +0300 Subject: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for RDMAoE ports Message-ID: <20090805082910.GE5599@mtls03> Initialize umad context for devices that have any of their ports either IB or RDMAoE so as to allow user space apps to send and receive MADs on QP1. Signed-off-by: Eli Cohen --- drivers/infiniband/core/user_mad.c | 27 ++++++++++++++++++++------- 1 files changed, 20 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 3e58fc0..2189e65 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -1112,6 +1112,7 @@ static void ib_umad_add_one(struct ib_device *device) { struct ib_umad_device *umad_dev; int s, e, i; + enum rdma_transport_type tt; if (device->node_type == RDMA_NODE_IB_SWITCH) s = e = 0; @@ -1120,9 +1121,14 @@ static void ib_umad_add_one(struct ib_device *device) e = device->phys_port_cnt; } - for (i = s; i <= e; ++i) - if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB) - return; + for (i = s; i <= e; ++i) { + tt = rdma_port_get_transport(device, i); + if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) + break; + } + + if (i > e) + return; umad_dev = kzalloc(sizeof *umad_dev + (e - s + 1) * sizeof (struct ib_umad_port), @@ -1147,8 +1153,11 @@ static void ib_umad_add_one(struct ib_device *device) return; err: - while (--i >= s) - ib_umad_kill_port(&umad_dev->port[i - s]); + while (--i >= s) { + tt = rdma_port_get_transport(device, i); + if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) + ib_umad_kill_port(&umad_dev->port[i - s]); + } kref_put(&umad_dev->ref, ib_umad_release_dev); } @@ -1157,12 +1166,16 @@ static void ib_umad_remove_one(struct ib_device *device) { struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); int i; + enum rdma_transport_type tt; if (!umad_dev) return; - for (i = 0; i 
<= umad_dev->end_port - umad_dev->start_port; ++i) - ib_umad_kill_port(&umad_dev->port[i]); + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) { + tt = rdma_port_get_transport(device, i); + if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) + ib_umad_kill_port(&umad_dev->port[i]); + } kref_put(&umad_dev->ref, ib_umad_release_dev); } -- 1.6.3.3 From eli at mellanox.co.il Wed Aug 5 01:29:19 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:29:19 +0300 Subject: [ofa-general] [PATCHv4 05/10] ib/cm: Enable CM support for RDMAoE Message-ID: <20090805082919.GF5599@mtls03> CM messages can be transported on RDMAoE protocol ports so they are enabled here. Signed-off-by: Eli Cohen --- drivers/infiniband/core/cm.c | 2 +- drivers/infiniband/core/ucm.c | 12 +++++++++--- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index f930f1d..63d6de3 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -3699,7 +3699,7 @@ static void cm_add_one(struct ib_device *ib_device) set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); for (i = 1; i <= ib_device->phys_port_cnt; i++) { tt = rdma_port_get_transport(ib_device, i); - if (tt != RDMA_TRANSPORT_IB) + if (tt != RDMA_TRANSPORT_IB && tt != RDMA_TRANSPORT_RDMAOE) continue; port = kzalloc(sizeof *port, GFP_KERNEL); diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index 4f5096d..21c78f5 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -1240,13 +1240,19 @@ static void ib_ucm_add_one(struct ib_device *device) { struct ib_ucm_device *ucm_dev; int i; + enum rdma_transport_type tt; if (!device->alloc_ucontext || device->node_type == RDMA_NODE_IB_SWITCH) return; - for (i = 1; i <= device->phys_port_cnt; ++i) - if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB) - return; + for (i = 1; i <= device->phys_port_cnt; ++i) { + tt = 
rdma_port_get_transport(device, i); + if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) + break; + } + + if (i > device->phys_port_cnt) + return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); if (!ucm_dev) -- 1.6.3.3 From eli at mellanox.co.il Wed Aug 5 01:29:29 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:29:29 +0300 Subject: [ofa-general] [PATCHv4 06/10] ib_core: CMA device binding Message-ID: <20090805082929.GG5599@mtls03> Add support for RDMAoE device binding and IP --> GID resolution. Path resolving and multicast joining are implemented within cma.c by filling the responses and pushing the callbacks to the cma work queue. IP->GID resolution always yields IPv6 link-local addresses; remote GIDs are derived from the destination MAC address of the remote port. Multicast GIDs are always mapped to the broadcast MAC (all FFs). Some helper functions are added to ib_addr.h. Signed-off-by: Eli Cohen --- drivers/infiniband/core/cma.c | 150 ++++++++++++++++++++++++++++++++++++++- drivers/infiniband/core/ucma.c | 25 +++++-- include/rdma/ib_addr.h | 87 +++++++++++++++++++++++ 3 files changed, 251 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 866ff7f..8f5675b 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -58,6 +58,7 @@ MODULE_LICENSE("Dual BSD/GPL"); #define CMA_CM_RESPONSE_TIMEOUT 20 #define CMA_MAX_CM_RETRIES 15 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24) +#define RDMAOE_PACKET_LIFETIME 18 static void cma_add_one(struct ib_device *device); static void cma_remove_one(struct ib_device *device); @@ -174,6 +175,12 @@ struct cma_ndev_work { struct rdma_cm_event event; }; +struct rdmaoe_mcast_work { + struct work_struct work; + struct rdma_id_private *id; + struct cma_multicast *mc; +}; + union cma_ip_addr { struct in6_addr ip6; struct { @@ -348,6 +355,9 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv) case
RDMA_TRANSPORT_IWARP: iw_addr_get_sgid(dev_addr, &gid); break; + case RDMA_TRANSPORT_RDMAOE: + rdmaoe_addr_get_sgid(dev_addr, &gid); + break; default: return -ENODEV; } @@ -576,10 +586,16 @@ static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv, { struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; int ret; + u16 pkey; + + if (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num) == + RDMA_TRANSPORT_IB) + pkey = ib_addr_get_pkey(dev_addr); + else + pkey = 0xffff; ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, - ib_addr_get_pkey(dev_addr), - &qp_attr->pkey_index); + pkey, &qp_attr->pkey_index); if (ret) return ret; @@ -609,6 +625,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr, id_priv = container_of(id, struct rdma_id_private, id); switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: if (!id_priv->cm_id.ib || cma_is_ud_ps(id_priv->id.ps)) ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask); else @@ -836,7 +853,9 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv) mc = container_of(id_priv->mc_list.next, struct cma_multicast, list); list_del(&mc->list); - ib_sa_free_multicast(mc->multicast.ib); + if (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num) == + RDMA_TRANSPORT_IB) + ib_sa_free_multicast(mc->multicast.ib); kref_put(&mc->mcref, release_mc); } } @@ -855,6 +874,7 @@ void rdma_destroy_id(struct rdma_cm_id *id) mutex_unlock(&lock); switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; @@ -1512,6 +1532,7 @@ int rdma_listen(struct rdma_cm_id *id, int backlog) if (id->device) { switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: 
ret = cma_ib_listen(id_priv); if (ret) goto err; @@ -1727,6 +1748,65 @@ static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) return 0; } +static int cma_resolve_rdmaoe_route(struct rdma_id_private *id_priv) +{ + struct rdma_route *route = &id_priv->id.route; + struct rdma_addr *addr = &route->addr; + struct cma_work *work; + int ret; + struct sockaddr_in *src_addr = (struct sockaddr_in *)&route->addr.src_addr; + struct sockaddr_in *dst_addr = (struct sockaddr_in *)&route->addr.dst_addr; + + if (src_addr->sin_family != dst_addr->sin_family) + return -EINVAL; + + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler); + + route->path_rec = kzalloc(sizeof *route->path_rec, GFP_KERNEL); + if (!route->path_rec) { + ret = -ENOMEM; + goto err; + } + + route->num_paths = 1; + + rdmaoe_mac_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr); + rdmaoe_mac_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr); + + route->path_rec->hop_limit = 2; + route->path_rec->reversible = 1; + route->path_rec->pkey = cpu_to_be16(0xffff); + route->path_rec->mtu_selector = 2; + route->path_rec->mtu = rdmaoe_get_mtu(addr->dev_addr.src_dev->mtu); + route->path_rec->rate_selector = 2; + route->path_rec->rate = rdmaoe_get_rate(addr->dev_addr.src_dev); + route->path_rec->packet_life_time_selector = 2; + route->path_rec->packet_life_time = RDMAOE_PACKET_LIFETIME; + + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ROUTE_RESOLVED; + if (!route->path_rec->mtu || !route->path_rec->rate) { + work->event.event = RDMA_CM_EVENT_ROUTE_ERROR; + work->event.status = -1; + } else { + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; + work->event.status = 0; + } + + queue_work(cma_wq, &work->work); + + return 0; + +err: + kfree(work); + return ret; +} + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1744,6 +1824,9 @@ 
int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) case RDMA_TRANSPORT_IWARP: ret = cma_resolve_iw_route(id_priv, timeout_ms); break; + case RDMA_TRANSPORT_RDMAOE: + ret = cma_resolve_rdmaoe_route(id_priv); + break; default: ret = -ENOSYS; break; @@ -2419,6 +2502,7 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: if (cma_is_ud_ps(id->ps)) ret = cma_resolve_ib_udp(id_priv, conn_param); else @@ -2532,6 +2616,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: if (cma_is_ud_ps(id->ps)) ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS, conn_param->private_data, @@ -2593,6 +2678,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data, switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: if (cma_is_ud_ps(id->ps)) ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, private_data, private_data_len); @@ -2624,6 +2710,7 @@ int rdma_disconnect(struct rdma_cm_id *id) switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: ret = cma_modify_qp_err(id_priv); if (ret) goto out; @@ -2752,6 +2839,55 @@ static int cma_join_ib_multicast(struct rdma_id_private *id_priv, return 0; } + +static void rdmaoe_mcast_work_handler(struct work_struct *work) +{ + struct rdmaoe_mcast_work *mw = container_of(work, struct rdmaoe_mcast_work, work); + struct cma_multicast *mc = mw->mc; + struct ib_sa_multicast *m = mc->multicast.ib; + + mc->multicast.ib->context = mc; + cma_ib_mc_handler(0, m); + kfree(m); + kfree(mw); +} + +static int cma_rdmaoe_join_multicast(struct rdma_id_private *id_priv, + struct cma_multicast *mc) +{ + struct rdmaoe_mcast_work *work; + struct rdma_dev_addr 
*dev_addr = &id_priv->id.route.addr.dev_addr; + + if (cma_zero_addr((struct sockaddr *)&mc->addr)) + return -EINVAL; + + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + mc->multicast.ib = kzalloc(sizeof(struct ib_sa_multicast), GFP_KERNEL); + if (!mc->multicast.ib) { + kfree(work); + return -ENOMEM; + } + + cma_set_mgid(id_priv, (struct sockaddr *)&mc->addr, &mc->multicast.ib->rec.mgid); + mc->multicast.ib->rec.pkey = cpu_to_be16(0xffff); + if (id_priv->id.ps == RDMA_PS_UDP) + mc->multicast.ib->rec.qkey = cpu_to_be32(RDMA_UDP_QKEY); + mc->multicast.ib->rec.rate = rdmaoe_get_rate(dev_addr->src_dev); + mc->multicast.ib->rec.hop_limit = 1; + mc->multicast.ib->rec.mtu = rdmaoe_get_mtu(dev_addr->src_dev->mtu); + rdmaoe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid); + work->id = id_priv; + work->mc = mc; + INIT_WORK(&work->work, rdmaoe_mcast_work_handler); + + queue_work(cma_wq, &work->work); + + return 0; +} + int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, void *context) { @@ -2782,6 +2918,9 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, case RDMA_TRANSPORT_IB: ret = cma_join_ib_multicast(id_priv, mc); break; + case RDMA_TRANSPORT_RDMAOE: + ret = cma_rdmaoe_join_multicast(id_priv, mc); + break; default: ret = -ENOSYS; break; @@ -2793,6 +2932,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, spin_unlock_irq(&id_priv->lock); kfree(mc); } + return ret; } EXPORT_SYMBOL(rdma_join_multicast); @@ -2813,7 +2953,9 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr) ib_detach_mcast(id->qp, &mc->multicast.ib->rec.mgid, mc->multicast.ib->rec.mlid); - ib_sa_free_multicast(mc->multicast.ib); + if (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num) == + RDMA_TRANSPORT_IB) + ib_sa_free_multicast(mc->multicast.ib); kref_put(&mc->mcref, release_mc); return; } diff --git a/drivers/infiniband/core/ucma.c 
b/drivers/infiniband/core/ucma.c index 24d9510..c7c9e92 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -553,7 +553,8 @@ static ssize_t ucma_resolve_route(struct ucma_file *file, } static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp, - struct rdma_route *route) + struct rdma_route *route, + enum rdma_transport_type tt) { struct rdma_dev_addr *dev_addr; @@ -561,10 +562,17 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp, switch (route->num_paths) { case 0: dev_addr = &route->addr.dev_addr; - ib_addr_get_dgid(dev_addr, - (union ib_gid *) &resp->ib_route[0].dgid); - ib_addr_get_sgid(dev_addr, - (union ib_gid *) &resp->ib_route[0].sgid); + if (tt == RDMA_TRANSPORT_IB) { + ib_addr_get_dgid(dev_addr, + (union ib_gid *) &resp->ib_route[0].dgid); + ib_addr_get_sgid(dev_addr, + (union ib_gid *) &resp->ib_route[0].sgid); + } else { + rdmaoe_mac_to_ll((union ib_gid *) &resp->ib_route[0].dgid, + dev_addr->dst_dev_addr); + rdmaoe_addr_get_sgid(dev_addr, + (union ib_gid *) &resp->ib_route[0].sgid); + } resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); break; case 2: @@ -589,6 +597,7 @@ static ssize_t ucma_query_route(struct ucma_file *file, struct ucma_context *ctx; struct sockaddr *addr; int ret = 0; + enum rdma_transport_type tt; if (out_len < sizeof(resp)) return -ENOSPC; @@ -614,9 +623,11 @@ static ssize_t ucma_query_route(struct ucma_file *file, resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid; resp.port_num = ctx->cm_id->port_num; - switch (rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num)) { + tt = rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num); + switch (tt) { case RDMA_TRANSPORT_IB: - ucma_copy_ib_route(&resp, &ctx->cm_id->route); + case RDMA_TRANSPORT_RDMAOE: + ucma_copy_ib_route(&resp, &ctx->cm_id->route, tt); break; default: break; diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index 483057b..66a848e 100644 --- 
a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -39,6 +39,8 @@ #include #include #include +#include +#include struct rdma_addr_client { atomic_t refcount; @@ -157,4 +159,89 @@ static inline void iw_addr_get_dgid(struct rdma_dev_addr *dev_addr, memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid); } +static inline void rdmaoe_mac_to_ll(union ib_gid *gid, u8 *mac) +{ + memset(gid->raw, 0, 16); + *((u32 *)gid->raw) = cpu_to_be32(0xfe800000); + gid->raw[12] = 0xfe; + gid->raw[11] = 0xff; + memcpy(gid->raw + 13, mac + 3, 3); + memcpy(gid->raw + 8, mac, 3); + gid->raw[8] ^= 2; +} + +static inline void rdmaoe_addr_get_sgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + rdmaoe_mac_to_ll(gid, dev_addr->src_dev_addr); +} + +static inline enum ib_mtu rdmaoe_get_mtu(int mtu) +{ + /* + * reduce IB headers from effective RDMAoE MTU. 28 stands for + * atomic header which is the biggest possible header after BTH + */ + mtu = mtu - IB_GRH_BYTES - IB_BTH_BYTES - 28; + + if (mtu >= ib_mtu_enum_to_int(IB_MTU_4096)) + return IB_MTU_4096; + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_2048)) + return IB_MTU_2048; + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_1024)) + return IB_MTU_1024; + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_512)) + return IB_MTU_512; + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_256)) + return IB_MTU_256; + else + return 0; +} + +static inline int rdmaoe_get_rate(struct net_device *dev) +{ + struct ethtool_cmd cmd; + + if (!dev->ethtool_ops || !dev->ethtool_ops->get_settings || + dev->ethtool_ops->get_settings(dev, &cmd)) + return IB_RATE_PORT_CURRENT; + + if (cmd.speed >= 40000) + return IB_RATE_40_GBPS; + else if (cmd.speed >= 30000) + return IB_RATE_30_GBPS; + else if (cmd.speed >= 20000) + return IB_RATE_20_GBPS; + else if (cmd.speed >= 10000) + return IB_RATE_10_GBPS; + else + return IB_RATE_PORT_CURRENT; +} + +static inline int rdma_link_local_addr(struct in6_addr *addr) +{ + if (addr->s6_addr32[0] == cpu_to_be32(0xfe800000) && + addr->s6_addr32[1] 
== 0) + return 1; + else + return 0; +} + +static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac) +{ + memcpy(mac, &addr->s6_addr[8], 3); + memcpy(mac + 3, &addr->s6_addr[13], 3); + mac[0] ^= 2; +} + +static inline int rdma_is_multicast_addr(struct in6_addr *addr) +{ + return addr->s6_addr[0] == 0xff ? 1 : 0; +} + +static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac) +{ + memset(mac, 0xff, 6); +} + #endif /* IB_ADDR_H */ -- 1.6.3.3 From eli at mellanox.co.il Wed Aug 5 01:29:37 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:29:37 +0300 Subject: [ofa-general] [PATCHv4 07/10] ib_core: RDMAoE UD packet packing support Message-ID: <20090805082937.GH5599@mtls03> Add support functions to aid in packing RDMAoE packets. Signed-off-by: Eli Cohen --- drivers/infiniband/core/ud_header.c | 111 +++++++++++++++++++++++++++++++++++ include/rdma/ib_pack.h | 26 ++++++++ 2 files changed, 137 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/ud_header.c b/drivers/infiniband/core/ud_header.c index 8ec7876..d04b6f2 100644 --- a/drivers/infiniband/core/ud_header.c +++ b/drivers/infiniband/core/ud_header.c @@ -80,6 +80,29 @@ static const struct ib_field lrh_table[] = { .size_bits = 16 } }; +static const struct ib_field eth_table[] = { + { STRUCT_FIELD(eth, dmac_h), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { STRUCT_FIELD(eth, dmac_l), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { STRUCT_FIELD(eth, smac_h), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 }, + { STRUCT_FIELD(eth, smac_l), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 32 }, + { STRUCT_FIELD(eth, type), + .offset_words = 3, + .offset_bits = 0, + .size_bits = 16 } +}; + static const struct ib_field grh_table[] = { { STRUCT_FIELD(grh, ip_version), .offset_words = 0, @@ -241,6 +264,53 @@ void ib_ud_header_init(int payload_bytes, EXPORT_SYMBOL(ib_ud_header_init); /** + * ib_rdmaoe_ud_header_init - 
Initialize UD header structure + * @payload_bytes:Length of packet payload + * @grh_present:GRH flag (if non-zero, GRH will be included) + * @header:Structure to initialize + * + * ib_rdmaoe_ud_header_init() initializes the grh.ip_version, grh.payload_length, + * grh.next_header, bth.opcode, bth.pad_count and + * bth.transport_header_version fields of a &struct eth_ud_header given + * the payload length and whether a GRH will be included. + */ +void ib_rdmaoe_ud_header_init(int payload_bytes, + int grh_present, + struct eth_ud_header *header) +{ + int header_len; + + memset(header, 0, sizeof *header); + + header_len = + sizeof header->eth + + IB_BTH_BYTES + + IB_DETH_BYTES; + if (grh_present) + header_len += IB_GRH_BYTES; + + header->grh_present = grh_present; + if (grh_present) { + header->grh.ip_version = 6; + header->grh.payload_length = + cpu_to_be16((IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) & ~3); /* round up */ + header->grh.next_header = 0x1b; + } + + if (header->immediate_present) + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + else + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY; + header->bth.pad_count = (4 - payload_bytes) & 3; + header->bth.transport_header_version = 0; +} +EXPORT_SYMBOL(ib_rdmaoe_ud_header_init); + +/** * ib_ud_header_pack - Pack UD header struct into wire format * @header:UD header struct * @buf:Buffer to pack into @@ -281,6 +351,47 @@ int ib_ud_header_pack(struct ib_ud_header *header, EXPORT_SYMBOL(ib_ud_header_pack); /** + * rdmaoe_ud_header_pack - Pack UD header struct into eth wire format + * @header:UD header struct + * @buf:Buffer to pack into + * + * ib_ud_header_pack() packs the UD header structure @header into wire + * format in the buffer @buf. 
+ */ +int rdmaoe_ud_header_pack(struct eth_ud_header *header, + void *buf) +{ + int len = 0; + + ib_pack(eth_table, ARRAY_SIZE(eth_table), + &header->eth, buf); + len += IB_ETH_BYTES; + + if (header->grh_present) { + ib_pack(grh_table, ARRAY_SIZE(grh_table), + &header->grh, buf + len); + len += IB_GRH_BYTES; + } + + ib_pack(bth_table, ARRAY_SIZE(bth_table), + &header->bth, buf + len); + len += IB_BTH_BYTES; + + ib_pack(deth_table, ARRAY_SIZE(deth_table), + &header->deth, buf + len); + len += IB_DETH_BYTES; + + if (header->immediate_present) { + memcpy(buf + len, &header->immediate_data, + sizeof header->immediate_data); + len += sizeof header->immediate_data; + } + + return len; +} +EXPORT_SYMBOL(rdmaoe_ud_header_pack); + +/** * ib_ud_header_unpack - Unpack UD header struct from wire format * @header:UD header struct * @buf:Buffer to pack into diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h index d7fc45c..bf199eb 100644 --- a/include/rdma/ib_pack.h +++ b/include/rdma/ib_pack.h @@ -37,6 +37,7 @@ enum { IB_LRH_BYTES = 8, + IB_ETH_BYTES = 14, IB_GRH_BYTES = 40, IB_BTH_BYTES = 12, IB_DETH_BYTES = 8 @@ -210,6 +211,14 @@ struct ib_unpacked_deth { __be32 source_qpn; }; +struct ib_unpacked_eth { + u8 dmac_h[4]; + u8 dmac_l[2]; + u8 smac_h[2]; + u8 smac_l[4]; + __be16 type; +}; + struct ib_ud_header { struct ib_unpacked_lrh lrh; int grh_present; @@ -220,6 +229,16 @@ struct ib_ud_header { __be32 immediate_data; }; +struct eth_ud_header { + struct ib_unpacked_eth eth; + int grh_present; + struct ib_unpacked_grh grh; + struct ib_unpacked_bth bth; + struct ib_unpacked_deth deth; + int immediate_present; + __be32 immediate_data; +}; + void ib_pack(const struct ib_field *desc, int desc_len, void *structure, @@ -234,10 +253,17 @@ void ib_ud_header_init(int payload_bytes, int grh_present, struct ib_ud_header *header); +void ib_rdmaoe_ud_header_init(int payload_bytes, + int grh_present, + struct eth_ud_header *header); + int ib_ud_header_pack(struct ib_ud_header 
*header, void *buf); int ib_ud_header_unpack(void *buf, struct ib_ud_header *header); +int rdmaoe_ud_header_pack(struct eth_ud_header *header, + void *buf); + #endif /* IB_PACK_H */ -- 1.6.3.3 From eli at mellanox.co.il Wed Aug 5 01:29:50 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:29:50 +0300 Subject: [ofa-general] [PATCHv4 08/10] ib_core: Add API to support RDMAoE from userspace Message-ID: <20090805082950.GI5599@mtls03> Add ib_uverbs_get_mac() to be used by ibv_create_ah() to retrieve the remote port's MAC address. Port transport is also returned by ibv_query_port(). ABI version is incremented from 6 to 7. Signed-off-by: Eli Cohen --- drivers/infiniband/core/uverbs.h | 1 + drivers/infiniband/core/uverbs_cmd.c | 32 ++++++++++++++++++++++++++++++++ drivers/infiniband/core/uverbs_main.c | 1 + drivers/infiniband/core/verbs.c | 10 ++++++++++ include/rdma/ib_user_verbs.h | 21 ++++++++++++++++++--- include/rdma/ib_verbs.h | 12 ++++++++++++ 6 files changed, 74 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index b3ea958..e69b04c 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -194,5 +194,6 @@ IB_UVERBS_DECLARE_CMD(create_srq); IB_UVERBS_DECLARE_CMD(modify_srq); IB_UVERBS_DECLARE_CMD(query_srq); IB_UVERBS_DECLARE_CMD(destroy_srq); +IB_UVERBS_DECLARE_CMD(get_mac); #endif /* UVERBS_H */ diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 56feab6..012aadf 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -452,6 +452,7 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file, resp.active_width = attr.active_width; resp.active_speed = attr.active_speed; resp.phys_state = attr.phys_state; + resp.transport = attr.transport; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) @@ -1824,6 +1825,37 @@ err: return ret; } +ssize_t 
ib_uverbs_get_mac(struct ib_uverbs_file *file, const char __user *buf, + int in_len, int out_len) +{ + struct ib_uverbs_get_mac cmd; + struct ib_uverbs_get_mac_resp resp; + int ret; + struct ib_pd *pd; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) + return -EINVAL; + + ret = ib_get_mac(pd->device, cmd.port, cmd.gid, resp.mac); + put_pd_read(pd); + if (!ret) { + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + return -EFAULT; + + return in_len; + } + + return ret; +} + ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) { diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index eb36a81..2641845 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -108,6 +108,7 @@ static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file, [IB_USER_VERBS_CMD_MODIFY_SRQ] = ib_uverbs_modify_srq, [IB_USER_VERBS_CMD_QUERY_SRQ] = ib_uverbs_query_srq, [IB_USER_VERBS_CMD_DESTROY_SRQ] = ib_uverbs_destroy_srq, + [IB_USER_VERBS_CMD_GET_MAC] = ib_uverbs_get_mac, }; static struct vfsmount *uverbs_event_mnt; diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 3b2f00b..7cce5d6 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -912,3 +912,13 @@ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) return qp->device->detach_mcast(qp, gid, lid); } EXPORT_SYMBOL(ib_detach_mcast); + +int ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac) +{ + if (!device->get_mac) + return -ENOSYS; + + return device->get_mac(device, port, gid, mac); +} +EXPORT_SYMBOL(ib_get_mac); + diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h index a17f771..49eee8a 100644 --- 
a/include/rdma/ib_user_verbs.h +++ b/include/rdma/ib_user_verbs.h @@ -42,7 +42,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. */ -#define IB_USER_VERBS_ABI_VERSION 6 +#define IB_USER_VERBS_ABI_VERSION 7 enum { IB_USER_VERBS_CMD_GET_CONTEXT, @@ -81,7 +81,8 @@ enum { IB_USER_VERBS_CMD_MODIFY_SRQ, IB_USER_VERBS_CMD_QUERY_SRQ, IB_USER_VERBS_CMD_DESTROY_SRQ, - IB_USER_VERBS_CMD_POST_SRQ_RECV + IB_USER_VERBS_CMD_POST_SRQ_RECV, + IB_USER_VERBS_CMD_GET_MAC }; /* @@ -205,7 +206,8 @@ struct ib_uverbs_query_port_resp { __u8 active_width; __u8 active_speed; __u8 phys_state; - __u8 reserved[3]; + __u8 transport; + __u8 reserved[2]; }; struct ib_uverbs_alloc_pd { @@ -621,6 +623,19 @@ struct ib_uverbs_destroy_ah { __u32 ah_handle; }; +struct ib_uverbs_get_mac { + __u64 response; + __u32 pd_handle; + __u8 port; + __u8 reserved[3]; + __u8 gid[16]; +}; + +struct ib_uverbs_get_mac_resp { + __u8 mac[6]; + __u16 reserved; +}; + struct ib_uverbs_attach_mcast { __u8 gid[16]; __u32 qp_handle; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 4eec70f..9470e1a 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1131,6 +1131,9 @@ struct ib_device { struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad); + int (*get_mac)(struct ib_device *device, u8 port, + u8 *gid, u8 *mac); + struct ib_dma_mapping_ops *dma_ops; @@ -2035,4 +2038,13 @@ int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); */ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); +/** + * ib_get_mac - get the mac address for the specified gid + * @device: IB device used for traffic + * @port: port number used. 
+ * @gid: gid to be resolved into mac + * @mac: mac of the port bearing this gid + */ +int ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac); + #endif /* IB_VERBS_H */ -- 1.6.3.3 From eli at mellanox.co.il Wed Aug 5 01:30:08 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:30:08 +0300 Subject: [ofa-general] [PATCHv4 09/10] mlx4: Add support for RDMAoE - address resolution Message-ID: <20090805083008.GJ5599@mtls03> The following patch handles address vector creation for RDMAoE ports. mlx4 needs the MAC address of the remote node to include it in the WQE of a UD QP or in the QP context of connected QPs. Address resolution is done atomically in the case of a link local address or a multicast GID and otherwise -EINVAL is returned. mlx4 transport packets were changed too to accommodate RDMAoE. Signed-off-by: Eli Cohen --- drivers/infiniband/hw/mlx4/ah.c | 187 ++++++++++++++++++++++++++++------ drivers/infiniband/hw/mlx4/mlx4_ib.h | 19 +++- drivers/infiniband/hw/mlx4/qp.c | 172 +++++++++++++++++++++---------- drivers/net/mlx4/fw.c | 3 +- include/linux/mlx4/device.h | 31 ++++++- include/linux/mlx4/qp.h | 8 +- 6 files changed, 327 insertions(+), 93 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c index c75ac94..0a015c3 100644 --- a/drivers/infiniband/hw/mlx4/ah.c +++ b/drivers/infiniband/hw/mlx4/ah.c @@ -31,63 +31,166 @@ */ #include "mlx4_ib.h" +#include +#include +#include -struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr, + u8 *mac, int *is_mcast) { - struct mlx4_dev *dev = to_mdev(pd->device)->dev; - struct mlx4_ib_ah *ah; + struct mlx4_ib_rdmaoe *rdmaoe = &dev->rdmaoe; + struct sockaddr_in6 s6 = {0}; + struct net_device *netdev; + int ifidx; - ah = kmalloc(sizeof *ah, GFP_ATOMIC); - if (!ah) - return ERR_PTR(-ENOMEM); + *is_mcast = 0; + spin_lock(&rdmaoe->lock); + netdev = 
rdmaoe->netdevs[ah_attr->port_num - 1]; + if (!netdev) { + spin_unlock(&rdmaoe->lock); + return -EINVAL; + } + ifidx = netdev->ifindex; + spin_unlock(&rdmaoe->lock); - memset(&ah->av, 0, sizeof ah->av); + memcpy(s6.sin6_addr.s6_addr, ah_attr->grh.dgid.raw, sizeof ah_attr->grh); + s6.sin6_family = AF_INET6; + s6.sin6_scope_id = ifidx; + if (rdma_link_local_addr(&s6.sin6_addr)) + rdma_get_ll_mac(&s6.sin6_addr, mac); + else if (rdma_is_multicast_addr(&s6.sin6_addr)) { + rdma_get_mcast_mac(&s6.sin6_addr, mac); + *is_mcast = 1; + } else + return -EINVAL; - ah->av.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24)); - ah->av.g_slid = ah_attr->src_path_bits; - ah->av.dlid = cpu_to_be16(ah_attr->dlid); - if (ah_attr->static_rate) { - ah->av.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET; - while (ah->av.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET && - !(1 << ah->av.stat_rate & dev->caps.stat_rate_support)) - --ah->av.stat_rate; - } - ah->av.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + return 0; +} + +static struct ib_ah *create_ib_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr, + struct mlx4_ib_ah *ah) +{ + struct mlx4_dev *dev = to_mdev(pd->device)->dev; + + ah->av.ib.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24)); + ah->av.ib.g_slid = ah_attr->src_path_bits; if (ah_attr->ah_flags & IB_AH_GRH) { - ah->av.g_slid |= 0x80; - ah->av.gid_index = ah_attr->grh.sgid_index; - ah->av.hop_limit = ah_attr->grh.hop_limit; - ah->av.sl_tclass_flowlabel |= + ah->av.ib.g_slid |= 0x80; + ah->av.ib.gid_index = ah_attr->grh.sgid_index; + ah->av.ib.hop_limit = ah_attr->grh.hop_limit; + ah->av.ib.sl_tclass_flowlabel |= cpu_to_be32((ah_attr->grh.traffic_class << 20) | ah_attr->grh.flow_label); - memcpy(ah->av.dgid, ah_attr->grh.dgid.raw, 16); + memcpy(ah->av.ib.dgid, ah_attr->grh.dgid.raw, 16); + } + + ah->av.ib.dlid = cpu_to_be16(ah_attr->dlid); + if (ah_attr->static_rate) { + ah->av.ib.stat_rate = ah_attr->static_rate + 
MLX4_STAT_RATE_OFFSET; + while (ah->av.ib.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET && + !(1 << ah->av.ib.stat_rate & dev->caps.stat_rate_support)) + --ah->av.ib.stat_rate; } + ah->av.ib.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); return &ah->ibah; } +static struct ib_ah *create_rdmaoe_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr, + struct mlx4_ib_ah *ah) +{ + struct mlx4_ib_dev *ibdev = to_mdev(pd->device); + struct mlx4_dev *dev = ibdev->dev; + u8 mac[6]; + int err; + int is_mcast; + + err = mlx4_ib_resolve_grh(ibdev, ah_attr, mac, &is_mcast); + if (err) + return ERR_PTR(err); + + memcpy(ah->av.eth.mac_0_1, mac, 2); + memcpy(ah->av.eth.mac_2_5, mac + 2, 4); + ah->av.ib.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24)); + ah->av.ib.g_slid = 0x80; + if (ah_attr->static_rate) { + ah->av.ib.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET; + while (ah->av.ib.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET && + !(1 << ah->av.ib.stat_rate & dev->caps.stat_rate_support)) + --ah->av.ib.stat_rate; + } + + /* + * HW requires multicast LID so we just choose one. + */ + if (is_mcast) + ah->av.ib.dlid = cpu_to_be16(0xc000); + + memcpy(ah->av.ib.dgid, ah_attr->grh.dgid.raw, 16); + ah->av.ib.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + + return &ah->ibah; +} + +struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct mlx4_ib_ah *ah; + enum rdma_transport_type transport; + struct ib_ah *ret; + + ah = kzalloc(sizeof *ah, GFP_ATOMIC); + if (!ah) + return ERR_PTR(-ENOMEM); + + transport = rdma_port_get_transport(pd->device, ah_attr->port_num); + if (transport == RDMA_TRANSPORT_RDMAOE) { + if (!(ah_attr->ah_flags & IB_AH_GRH)) { + ret = ERR_PTR(-EINVAL); + goto out; + } else { + /* TBD: need to handle the case when we get called + in an atomic context and there we might sleep. 
We + don't expect this currently since we're working with + link local addresses which we can translate without + going to sleep */ + ret = create_rdmaoe_ah(pd, ah_attr, ah); + if (IS_ERR(ret)) + goto out; + else + return ret; + } + } else + return create_ib_ah(pd, ah_attr, ah); /* never fails */ + +out: + kfree(ah); + return ret; +} + int mlx4_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr) { struct mlx4_ib_ah *ah = to_mah(ibah); + enum rdma_transport_type transport; + transport = rdma_port_get_transport(ibah->device, ah_attr->port_num); memset(ah_attr, 0, sizeof *ah_attr); - ah_attr->dlid = be16_to_cpu(ah->av.dlid); - ah_attr->sl = be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28; - ah_attr->port_num = be32_to_cpu(ah->av.port_pd) >> 24; - if (ah->av.stat_rate) - ah_attr->static_rate = ah->av.stat_rate - MLX4_STAT_RATE_OFFSET; - ah_attr->src_path_bits = ah->av.g_slid & 0x7F; + ah_attr->dlid = transport == RDMA_TRANSPORT_IB ? be16_to_cpu(ah->av.ib.dlid) : 0; + ah_attr->sl = be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28; + ah_attr->port_num = be32_to_cpu(ah->av.ib.port_pd) >> 24; + if (ah->av.ib.stat_rate) + ah_attr->static_rate = ah->av.ib.stat_rate - MLX4_STAT_RATE_OFFSET; + ah_attr->src_path_bits = ah->av.ib.g_slid & 0x7F; if (mlx4_ib_ah_grh_present(ah)) { ah_attr->ah_flags = IB_AH_GRH; ah_attr->grh.traffic_class = - be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20; + be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 20; ah_attr->grh.flow_label = - be32_to_cpu(ah->av.sl_tclass_flowlabel) & 0xfffff; - ah_attr->grh.hop_limit = ah->av.hop_limit; - ah_attr->grh.sgid_index = ah->av.gid_index; - memcpy(ah_attr->grh.dgid.raw, ah->av.dgid, 16); + be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) & 0xfffff; + ah_attr->grh.hop_limit = ah->av.ib.hop_limit; + ah_attr->grh.sgid_index = ah->av.ib.gid_index; + memcpy(ah_attr->grh.dgid.raw, ah->av.ib.dgid, 16); } return 0; @@ -98,3 +201,21 @@ int mlx4_ib_destroy_ah(struct ib_ah *ah) kfree(to_mah(ah)); return 0; } + +int 
mlx4_ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac) +{ + int err; + struct mlx4_ib_dev *ibdev = to_mdev(device); + struct ib_ah_attr ah_attr = { + .port_num = port, + }; + int is_mcast; + + memcpy(ah_attr.grh.dgid.raw, gid, 16); + err = mlx4_ib_resolve_grh(ibdev, &ah_attr, mac, &is_mcast); + if (err) + ERR_PTR(err); + + return 0; +} + diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 8a7dd67..c644cac 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -138,6 +138,7 @@ struct mlx4_ib_qp { u8 resp_depth; u8 sq_no_prefetch; u8 state; + int mlx_type; }; struct mlx4_ib_srq { @@ -157,7 +158,14 @@ struct mlx4_ib_srq { struct mlx4_ib_ah { struct ib_ah ibah; - struct mlx4_av av; + union mlx4_ext_av av; +}; + +struct mlx4_ib_rdmaoe { + spinlock_t lock; + struct net_device *netdevs[MLX4_MAX_PORTS]; + struct notifier_block nb; + union ib_gid gid_table[MLX4_MAX_PORTS][128]; }; struct mlx4_ib_dev { @@ -175,6 +183,8 @@ struct mlx4_ib_dev { spinlock_t sm_lock; struct mutex cap_mask_mutex; + + struct mlx4_ib_rdmaoe rdmaoe; }; static inline struct mlx4_ib_dev *to_mdev(struct ib_device *ibdev) @@ -313,9 +323,14 @@ int mlx4_ib_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, int npages, int mlx4_ib_unmap_fmr(struct list_head *fmr_list); int mlx4_ib_fmr_dealloc(struct ib_fmr *fmr); +int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr, + u8 *mac, int *is_mcast); + +int mlx4_ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac); + static inline int mlx4_ib_ah_grh_present(struct mlx4_ib_ah *ah) { - return !!(ah->av.g_slid & 0x80); + return !!(ah->av.ib.g_slid & 0x80); } #endif /* MLX4_IB_H */ diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 20724ae..4b391fa 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -32,6 +32,7 @@ */ #include +#include #include #include @@ -47,14 
+48,21 @@ enum { enum { MLX4_IB_DEFAULT_SCHED_QUEUE = 0x83, - MLX4_IB_DEFAULT_QP0_SCHED_QUEUE = 0x3f + MLX4_IB_DEFAULT_QP0_SCHED_QUEUE = 0x3f, + MLX4_IB_LINK_TYPE_IB = 0, + MLX4_IB_LINK_TYPE_ETH = 1 }; enum { /* * Largest possible UD header: send with GRH and immediate data. + * 4 bytes added to accommodate for eth header instead of lrh */ - MLX4_IB_UD_HEADER_SIZE = 72 + MLX4_IB_UD_HEADER_SIZE = 76 +}; + +enum { + MLX4_RDMAOE_ETHERTYPE = 0x8915 }; struct mlx4_ib_sqp { @@ -62,7 +70,10 @@ struct mlx4_ib_sqp { int pkey_index; u32 qkey; u32 send_psn; - struct ib_ud_header ud_header; + union { + struct ib_ud_header ib; + struct eth_ud_header eth; + } hdr; u8 header_buf[MLX4_IB_UD_HEADER_SIZE]; }; @@ -546,9 +557,9 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, } } - if (sqpn) { + if (sqpn) qpn = sqpn; - } else { + else { err = mlx4_qp_reserve_range(dev->dev, 1, 1, &qpn); if (err) goto err_wrid; @@ -843,6 +854,12 @@ static void mlx4_set_sched(struct mlx4_qp_path *path, u8 port) static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah, struct mlx4_qp_path *path, u8 port) { + int err; + int is_eth = rdma_port_get_transport(&dev->ib_dev, port) == + RDMA_TRANSPORT_RDMAOE ? 
1 : 0; + u8 mac[6]; + int is_mcast; + path->grh_mylmc = ah->src_path_bits & 0x7f; path->rlid = cpu_to_be16(ah->dlid); if (ah->static_rate) { @@ -873,6 +890,21 @@ static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah, path->sched_queue = MLX4_IB_DEFAULT_SCHED_QUEUE | ((port - 1) << 6) | ((ah->sl & 0xf) << 2); + if (is_eth) { + if (!(ah->ah_flags & IB_AH_GRH)) + return -1; + + err = mlx4_ib_resolve_grh(dev, ah, mac, &is_mcast); + if (err) + return err; + + memcpy(path->dmac_h, mac, 2); + memcpy(path->dmac_l, mac + 2, 4); + path->ackto = MLX4_IB_LINK_TYPE_ETH; + /* use index 0 into MAC table for RDMAoE */ + path->grh_mylmc &= 0x80; + } + return 0; } @@ -972,7 +1004,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_TIMEOUT) { - context->pri_path.ackto = attr->timeout << 3; + context->pri_path.ackto |= (attr->timeout << 3); optpar |= MLX4_QP_OPTPAR_ACK_TIMEOUT; } @@ -1218,79 +1250,109 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr, int header_size; int spc; int i; + void *tmp; + struct ib_ud_header *ib = NULL; + struct eth_ud_header *eth = NULL; + struct ib_unpacked_grh *grh; + struct ib_unpacked_bth *bth; + struct ib_unpacked_deth *deth; send_size = 0; for (i = 0; i < wr->num_sge; ++i) send_size += wr->sg_list[i].length; - ib_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), &sqp->ud_header); + if (rdma_port_get_transport(sqp->qp.ibqp.device, sqp->qp.port) == RDMA_TRANSPORT_IB) { + ib = &sqp->hdr.ib; + grh = &ib->grh; + bth = &ib->bth; + deth = &ib->deth; + ib_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), ib); + ib->lrh.service_level = + be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28; + ib->lrh.destination_lid = ah->av.ib.dlid; + ib->lrh.source_lid = cpu_to_be16(ah->av.ib.g_slid & 0x7f); + } else { + eth = &sqp->hdr.eth; + grh = &eth->grh; + bth = &eth->bth; + deth = &eth->deth; + ib_rdmaoe_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), eth); + } - 
sqp->ud_header.lrh.service_level = - be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28; - sqp->ud_header.lrh.destination_lid = ah->av.dlid; - sqp->ud_header.lrh.source_lid = cpu_to_be16(ah->av.g_slid & 0x7f); if (mlx4_ib_ah_grh_present(ah)) { - sqp->ud_header.grh.traffic_class = - (be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20) & 0xff; - sqp->ud_header.grh.flow_label = - ah->av.sl_tclass_flowlabel & cpu_to_be32(0xfffff); - sqp->ud_header.grh.hop_limit = ah->av.hop_limit; - ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.port_pd) >> 24, - ah->av.gid_index, &sqp->ud_header.grh.source_gid); - memcpy(sqp->ud_header.grh.destination_gid.raw, - ah->av.dgid, 16); + grh->traffic_class = + (be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 20) & 0xff; + grh->flow_label = + ah->av.ib.sl_tclass_flowlabel & cpu_to_be32(0xfffff); + grh->hop_limit = ah->av.ib.hop_limit; + ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.ib.port_pd) >> 24, + ah->av.ib.gid_index, &grh->source_gid); + memcpy(grh->destination_gid.raw, + ah->av.ib.dgid, 16); } mlx->flags &= cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MLX4_WQE_MLX_VL15 : 0) | - (sqp->ud_header.lrh.destination_lid == - IB_LID_PERMISSIVE ? MLX4_WQE_MLX_SLR : 0) | - (sqp->ud_header.lrh.service_level << 8)); - mlx->rlid = sqp->ud_header.lrh.destination_lid; + + if (ib) { + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MLX4_WQE_MLX_VL15 : 0) | + (ib->lrh.destination_lid == + IB_LID_PERMISSIVE ? 
MLX4_WQE_MLX_SLR : 0) | + (ib->lrh.service_level << 8)); + mlx->rlid = ib->lrh.destination_lid; + } switch (wr->opcode) { case IB_WR_SEND: - sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; - sqp->ud_header.immediate_present = 0; + bth->opcode = IB_OPCODE_UD_SEND_ONLY; + if (ib) + ib->immediate_present = 0; + else + eth->immediate_present = 0; break; case IB_WR_SEND_WITH_IMM: - sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; - sqp->ud_header.immediate_present = 1; - sqp->ud_header.immediate_data = wr->ex.imm_data; + bth->opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + if (ib) { + ib->immediate_present = 1; + ib->immediate_data = wr->ex.imm_data; + } else { + eth->immediate_present = 1; + eth->immediate_data = wr->ex.imm_data; + } break; default: return -EINVAL; } - sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; - if (sqp->ud_header.lrh.destination_lid == IB_LID_PERMISSIVE) - sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE; - sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (ib) { + ib->lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 
15 : 0; + if (ib->lrh.destination_lid == IB_LID_PERMISSIVE) + ib->lrh.source_lid = IB_LID_PERMISSIVE; + } else { + memcpy(eth->eth.dmac_h, ah->av.eth.mac_0_1, 2); + memcpy(eth->eth.dmac_h + 2, ah->av.eth.mac_2_5, 2); + memcpy(eth->eth.dmac_l, ah->av.eth.mac_2_5 + 2, 2); + tmp = to_mdev(sqp->qp.ibqp.device)->rdmaoe.netdevs[sqp->qp.port - 1]->dev_addr; + memcpy(eth->eth.smac_h, tmp, 2); + memcpy(eth->eth.smac_l, tmp + 2, 4); + eth->eth.type = cpu_to_be16(MLX4_RDMAOE_ETHERTYPE); + } + bth->solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) ib_get_cached_pkey(ib_dev, sqp->qp.port, sqp->pkey_index, &pkey); else ib_get_cached_pkey(ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey); - sqp->ud_header.bth.pkey = cpu_to_be16(pkey); - sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); - sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); - sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? + bth->pkey = cpu_to_be16(pkey); + bth->destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + bth->psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + deth->qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? sqp->qkey : wr->wr.ud.remote_qkey); - sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); - - header_size = ib_ud_header_pack(&sqp->ud_header, sqp->header_buf); - - if (0) { - printk(KERN_ERR "built UD header of size %d:\n", header_size); - for (i = 0; i < header_size / 4; ++i) { - if (i % 8 == 0) - printk(" [%02x] ", i * 4); - printk(" %08x", - be32_to_cpu(((__be32 *) sqp->header_buf)[i])); - if ((i + 1) % 8 == 0) - printk("\n"); - } - printk("\n"); - } + deth->source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + if (ib) + header_size = ib_ud_header_pack(ib, sqp->header_buf); + else + header_size = rdmaoe_ud_header_pack(eth, sqp->header_buf); /* * Inline data segments may not cross a 64 byte boundary. 
If @@ -1414,6 +1476,8 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg *dseg, memcpy(dseg->av, &to_mah(wr->wr.ud.ah)->av, sizeof (struct mlx4_av)); dseg->dqpn = cpu_to_be32(wr->wr.ud.remote_qpn); dseg->qkey = cpu_to_be32(wr->wr.ud.remote_qkey); + dseg->vlan = to_mah(wr->wr.ud.ah)->av.eth.vlan; + memcpy(dseg->mac_0_1, to_mah(wr->wr.ud.ah)->av.eth.mac_0_1, 6); } static void set_mlx_icrc_seg(void *dseg) diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index cee199c..20526ce 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -96,7 +96,8 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags) [20] = "Address vector port checking support", [21] = "UD multicast support", [24] = "Demand paging support", - [25] = "Router support" + [25] = "Router support", + [30] = "RDMAoE support" }; int i; diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index 3aff8a6..b73b5f0 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -66,7 +66,8 @@ enum { MLX4_DEV_CAP_FLAG_ATOMIC = 1 << 18, MLX4_DEV_CAP_FLAG_RAW_MCAST = 1 << 19, MLX4_DEV_CAP_FLAG_UD_AV_PORT = 1 << 20, - MLX4_DEV_CAP_FLAG_UD_MCAST = 1 << 21 + MLX4_DEV_CAP_FLAG_UD_MCAST = 1 << 21, + MLX4_DEV_CAP_FLAG_RDMAOE = 1 << 30 }; enum { @@ -371,6 +372,28 @@ struct mlx4_av { u8 dgid[16]; }; +struct mlx4_eth_av { + __be32 port_pd; + u8 reserved1; + u8 smac_idx; + u16 reserved2; + u8 reserved3; + u8 gid_index; + u8 stat_rate; + u8 hop_limit; + __be32 sl_tclass_flowlabel; + u8 dgid[16]; + u32 reserved4[2]; + __be16 vlan; + u8 mac_0_1[2]; + u8 mac_2_5[4]; +}; + +union mlx4_ext_av { + struct mlx4_av ib; + struct mlx4_eth_av eth; +}; + struct mlx4_dev { struct pci_dev *pdev; unsigned long flags; @@ -399,6 +422,12 @@ struct mlx4_init_port_param { if (((type) == MLX4_PORT_TYPE_IB ? 
(dev)->caps.port_mask : \ ~(dev)->caps.port_mask) & 1 << ((port) - 1)) +#define mlx4_foreach_ib_transport_port(port, dev) \ + for ((port) = 1; (port) <= (dev)->caps.num_ports; (port)++) \ + if (((dev)->caps.port_mask & 1 << ((port) - 1)) || \ + ((dev)->caps.flags & MLX4_DEV_CAP_FLAG_RDMAOE)) + + int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, struct mlx4_buf *buf); void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf); diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h index bf8f119..d73534f 100644 --- a/include/linux/mlx4/qp.h +++ b/include/linux/mlx4/qp.h @@ -112,7 +112,9 @@ struct mlx4_qp_path { u8 snooper_flags; u8 reserved3[2]; u8 counter_index; - u8 reserved4[7]; + u8 reserved4; + u8 dmac_h[2]; + u8 dmac_l[4]; }; struct mlx4_qp_context { @@ -218,7 +220,9 @@ struct mlx4_wqe_datagram_seg { __be32 av[8]; __be32 dqpn; __be32 qkey; - __be32 reservd[2]; + __be16 vlan; + u8 mac_0_1[2]; + u8 mac_2_5[4]; }; struct mlx4_wqe_lso_seg { -- 1.6.3.3 From eli at mellanox.co.il Wed Aug 5 01:30:23 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:30:23 +0300 Subject: [ofa-general] [PATCHv4 10/10] mlx4: Add RDMAoE support - allow interfaces to correspond to each other Message-ID: <20090805083023.GK5599@mtls03> This patch adds RDMAoE support for mlx4. Since mlx4_ib now needs to reference mlx4_en netdevices, a new mechanism was added. Two new fields were added to struct mlx4_interface to define a protocol and a get_prot_dev method to retrieve the corresponding protocol's net device. An implementation of the new verb ib_get_port_link_type() - mlx4_ib_get_port_link_type - was added. mlx4_ib_query_port() has been modified to support eth link types. An interface is considered to be active if its corresponding eth interface is active. Code for setting the GID table of a port has been added. Currently, each IB port has a single GID entry in its table, and that GID entry equals the link-local IPv6 address.
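For readers following the GID table description, here is a minimal user-space sketch of how a link-local IPv6 GID can be derived from a 6-byte MAC address. It mirrors the byte layout the patch uses (fe80::/64 prefix followed by a modified EUI-64 interface ID), but the function name mac_to_link_local_gid() and the flat 16-byte array are assumptions made for illustration; they are not part of the patch.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): build the 16-byte link-local
 * IPv6 address for a MAC, i.e. prefix fe80::/64 plus a modified EUI-64
 * interface identifier.
 */
static void mac_to_link_local_gid(const uint8_t mac[6], uint8_t gid[16])
{
	memset(gid, 0, 16);

	/* Subnet prefix fe80:0000:0000:0000 in network byte order. */
	gid[0] = 0xfe;
	gid[1] = 0x80;

	/* Interface ID: first three MAC bytes, ff:fe, last three MAC bytes. */
	memcpy(gid + 8, mac, 3);
	gid[11] = 0xff;
	gid[12] = 0xfe;
	memcpy(gid + 13, mac + 3, 3);

	/* Flip the universal/local bit of the first interface ID byte. */
	gid[8] ^= 2;
}
```

Under these assumptions, the MAC 00:02:c9:00:00:01 maps to the GID fe80::202:c9ff:fe00:1.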
Signed-off-by: Eli Cohen --- drivers/infiniband/hw/mlx4/main.c | 309 +++++++++++++++++++++++++++++++++---- drivers/net/mlx4/en_main.c | 15 ++- drivers/net/mlx4/en_port.c | 4 +- drivers/net/mlx4/en_port.h | 3 +- drivers/net/mlx4/intf.c | 20 +++ drivers/net/mlx4/main.c | 6 + drivers/net/mlx4/mlx4.h | 1 + include/linux/mlx4/cmd.h | 1 + include/linux/mlx4/driver.h | 16 ++- 9 files changed, 335 insertions(+), 40 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index ae3d759..737c6b9 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -34,9 +34,12 @@ #include #include #include +#include +#include #include #include +#include #include #include @@ -57,6 +60,15 @@ static const char mlx4_ib_version[] = DRV_NAME ": Mellanox ConnectX InfiniBand driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; +struct update_gid_work { + struct work_struct work; + union ib_gid gids[128]; + int port; + struct mlx4_ib_dev *dev; +}; + +static struct workqueue_struct *wq; + static void init_query_mad(struct ib_smp *mad) { mad->base_version = 1; @@ -152,28 +164,19 @@ out: return err; } -static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port, - struct ib_port_attr *props) +static enum rdma_transport_type +mlx4_ib_port_get_transport(struct ib_device *device, u8 port_num) { - struct ib_smp *in_mad = NULL; - struct ib_smp *out_mad = NULL; - int err = -ENOMEM; - - in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); - out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); - if (!in_mad || !out_mad) - goto out; - - memset(props, 0, sizeof *props); - - init_query_mad(in_mad); - in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; - in_mad->attr_mod = cpu_to_be32(port); + struct mlx4_dev *dev = to_mdev(device)->dev; - err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad); - if (err) - goto out; + return dev->caps.port_mask & (1 << (port_num - 1)) ? 
+ RDMA_TRANSPORT_IB : RDMA_TRANSPORT_RDMAOE; +} +static void ib_link_query_port(struct ib_device *ibdev, u8 port, + struct ib_port_attr *props, + struct ib_smp *out_mad) +{ props->lid = be16_to_cpup((__be16 *) (out_mad->data + 16)); props->lmc = out_mad->data[34] & 0x7; props->sm_lid = be16_to_cpup((__be16 *) (out_mad->data + 18)); @@ -193,6 +196,67 @@ static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port, props->subnet_timeout = out_mad->data[51] & 0x1f; props->max_vl_num = out_mad->data[37] >> 4; props->init_type_reply = out_mad->data[41] >> 4; + props->transport = RDMA_TRANSPORT_IB; +} + +static void eth_link_query_port(struct ib_device *ibdev, u8 port, + struct ib_port_attr *props, + struct ib_smp *out_mad) +{ + struct mlx4_ib_rdmaoe *rdmaoe = &to_mdev(ibdev)->rdmaoe; + struct net_device *ndev; + + props->port_cap_flags = IB_PORT_CM_SUP; + props->gid_tbl_len = to_mdev(ibdev)->dev->caps.gid_table_len[port]; + props->max_msg_sz = to_mdev(ibdev)->dev->caps.max_msg_sz; + props->pkey_tbl_len = 1; + props->bad_pkey_cntr = be16_to_cpup((__be16 *) (out_mad->data + 46)); + props->qkey_viol_cntr = be16_to_cpup((__be16 *) (out_mad->data + 48)); + props->active_width = 0; + props->active_speed = 0; + props->max_mtu = out_mad->data[41] & 0xf; + props->subnet_timeout = 0; + props->max_vl_num = out_mad->data[37] >> 4; + props->init_type_reply = 0; + props->transport = RDMA_TRANSPORT_RDMAOE; + spin_lock(&rdmaoe->lock); + ndev = rdmaoe->netdevs[port - 1]; + if (!ndev) + goto out; + + props->active_mtu = rdmaoe_get_mtu(ndev->mtu); + props->state = netif_running(ndev) && netif_oper_up(ndev) ? 
+ IB_PORT_ACTIVE : IB_PORT_DOWN; + props->phys_state = props->state; +out: + spin_unlock(&rdmaoe->lock); +} + +static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port, + struct ib_port_attr *props) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(props, 0, sizeof *props); + + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); + + err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad); + if (err) + goto out; + + mlx4_ib_port_get_transport(ibdev, port) == RDMA_TRANSPORT_IB ? + ib_link_query_port(ibdev, port, props, out_mad) : + eth_link_query_port(ibdev, port, props, out_mad); out: kfree(in_mad); @@ -201,8 +265,8 @@ out: return err; } -static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index, - union ib_gid *gid) +static int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index, + union ib_gid *gid) { struct ib_smp *in_mad = NULL; struct ib_smp *out_mad = NULL; @@ -239,6 +303,25 @@ out: return err; } +static int rdmaoe_query_gid(struct ib_device *ibdev, u8 port, int index, + union ib_gid *gid) +{ + struct mlx4_ib_dev *dev = to_mdev(ibdev); + + *gid = dev->rdmaoe.gid_table[port - 1][index]; + + return 0; +} + +static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index, + union ib_gid *gid) +{ + if (rdma_port_get_transport(ibdev, port) == RDMA_TRANSPORT_IB) + return __mlx4_ib_query_gid(ibdev, port, index, gid); + else + return rdmaoe_query_gid(ibdev, port, index, gid); +} + static int mlx4_ib_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey) { @@ -287,6 +370,7 @@ static int mlx4_SET_PORT(struct mlx4_ib_dev *dev, u8 port, int reset_qkey_viols, { struct mlx4_cmd_mailbox *mailbox; int err; + u8 is_eth = dev->dev->caps.port_type[port] == 
MLX4_PORT_TYPE_ETH; mailbox = mlx4_alloc_cmd_mailbox(dev->dev); if (IS_ERR(mailbox)) @@ -302,7 +386,7 @@ static int mlx4_SET_PORT(struct mlx4_ib_dev *dev, u8 port, int reset_qkey_viols, ((__be32 *) mailbox->buf)[1] = cpu_to_be32(cap_mask); } - err = mlx4_cmd(dev->dev, mailbox->dma, port, 0, MLX4_CMD_SET_PORT, + err = mlx4_cmd(dev->dev, mailbox->dma, port, is_eth, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B); mlx4_free_cmd_mailbox(dev->dev, mailbox); @@ -538,19 +622,146 @@ static struct device_attribute *mlx4_class_attributes[] = { &dev_attr_board_id }; +static void mlx4_addrconf_ifid_eui48(u8 *eui, struct net_device *dev) +{ + memcpy(eui, dev->dev_addr, 3); + memcpy(eui + 5, dev->dev_addr + 3, 3); + eui[3] = 0xFF; + eui[4] = 0xFE; + eui[0] ^= 2; +} + +static void update_gids_task(struct work_struct *work) +{ + struct update_gid_work *gw = container_of(work, struct update_gid_work, work); + struct mlx4_cmd_mailbox *mailbox; + union ib_gid *gids; + int err; + struct mlx4_dev *dev = gw->dev->dev; + struct ib_event event; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) { + printk(KERN_WARNING "update gid table failed %ld\n", PTR_ERR(mailbox)); + return; + } + + gids = mailbox->buf; + memcpy(gids, gw->gids, sizeof gw->gids); + + err = mlx4_cmd(dev, mailbox->dma, MLX4_SET_PORT_GID_TABLE << 8 | gw->port, + 1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B); + if (err) + printk(KERN_WARNING "set port command failed\n"); + else { + memcpy(gw->dev->rdmaoe.gid_table[gw->port - 1], gw->gids, sizeof gw->gids); + event.device = &gw->dev->ib_dev; + event.element.port_num = gw->port; + event.event = IB_EVENT_LID_CHANGE; + ib_dispatch_event(&event); + } + + mlx4_free_cmd_mailbox(dev, mailbox); + kfree(gw); +} + +static int update_ipv6_gids(struct mlx4_ib_dev *dev, int port, int clear) +{ + struct net_device *ndev = dev->rdmaoe.netdevs[port - 1]; + struct update_gid_work *work; + + work = kzalloc(sizeof *work, GFP_ATOMIC); + if (!work) + return -ENOMEM; + + if (!clear) { + 
mlx4_addrconf_ifid_eui48(&work->gids[0].raw[8], ndev); + work->gids[0].global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL); + } + + INIT_WORK(&work->work, update_gids_task); + work->port = port; + work->dev = dev; + queue_work(wq, &work->work); + + return 0; +} + +static void handle_en_event(struct mlx4_ib_dev *dev, int port, unsigned long event) +{ + switch (event) { + case NETDEV_UP: + update_ipv6_gids(dev, port, 0); + break; + + case NETDEV_DOWN: + update_ipv6_gids(dev, port, 1); + } +} + +static void netdev_added(struct mlx4_ib_dev *dev, int port) +{ + update_ipv6_gids(dev, port, 0); +} + +static void netdev_removed(struct mlx4_ib_dev *dev, int port) +{ + update_ipv6_gids(dev, port, 1); +} + +static int mlx4_ib_netdev_event(struct notifier_block *this, unsigned long event, + void *ptr) +{ + struct net_device *dev = ptr; + struct mlx4_ib_dev *ibdev; + struct net_device *oldnd; + struct mlx4_ib_rdmaoe *rdmaoe; + int port; + + if (!net_eq(dev_net(dev), &init_net)) + return NOTIFY_DONE; + + ibdev = container_of(this, struct mlx4_ib_dev, rdmaoe.nb); + rdmaoe = &ibdev->rdmaoe; + + spin_lock(&rdmaoe->lock); + mlx4_foreach_ib_transport_port(port, ibdev->dev) { + oldnd = rdmaoe->netdevs[port - 1]; + rdmaoe->netdevs[port - 1] = mlx4_get_prot_dev(ibdev->dev, MLX4_PROT_EN, port); + if (oldnd != rdmaoe->netdevs[port - 1]) { + if (rdmaoe->netdevs[port - 1]) + netdev_added(ibdev, port); + else + netdev_removed(ibdev, port); + } + } + + if (dev == rdmaoe->netdevs[0]) + handle_en_event(ibdev, 1, event); + else if (dev == rdmaoe->netdevs[1]) + handle_en_event(ibdev, 2, event); + + spin_unlock(&rdmaoe->lock); + + return NOTIFY_DONE; +} + static void *mlx4_ib_add(struct mlx4_dev *dev) { static int mlx4_ib_version_printed; struct mlx4_ib_dev *ibdev; int num_ports = 0; int i; + int err; + int port; + struct mlx4_ib_rdmaoe *rdmaoe; if (!mlx4_ib_version_printed) { printk(KERN_INFO "%s", mlx4_ib_version); ++mlx4_ib_version_printed; } - mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) 
+ mlx4_foreach_ib_transport_port(i, dev) num_ports++; /* No point in registering a device with no ports... */ @@ -563,6 +774,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) return NULL; } + rdmaoe = &ibdev->rdmaoe; + if (mlx4_pd_alloc(dev, &ibdev->priv_pdn)) goto err_dealloc; @@ -607,10 +820,12 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | - (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | + (1ull << IB_USER_VERBS_CMD_GET_MAC); ibdev->ib_dev.query_device = mlx4_ib_query_device; ibdev->ib_dev.query_port = mlx4_ib_query_port; + ibdev->ib_dev.get_port_transport = mlx4_ib_port_get_transport; ibdev->ib_dev.query_gid = mlx4_ib_query_gid; ibdev->ib_dev.query_pkey = mlx4_ib_query_pkey; ibdev->ib_dev.modify_device = mlx4_ib_modify_device; @@ -654,15 +869,26 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) ibdev->ib_dev.map_phys_fmr = mlx4_ib_map_phys_fmr; ibdev->ib_dev.unmap_fmr = mlx4_ib_unmap_fmr; ibdev->ib_dev.dealloc_fmr = mlx4_ib_fmr_dealloc; + ibdev->ib_dev.get_mac = mlx4_ib_get_mac; + + mlx4_foreach_ib_transport_port(port, dev) + rdmaoe->netdevs[port - 1] = mlx4_get_prot_dev(dev, MLX4_PROT_EN, port); + spin_lock_init(&rdmaoe->lock); + if (dev->caps.flags & MLX4_DEV_CAP_FLAG_RDMAOE && !rdmaoe->nb.notifier_call) { + rdmaoe->nb.notifier_call = mlx4_ib_netdev_event; + err = register_netdevice_notifier(&rdmaoe->nb); + if (err) + goto err_map; + } if (init_node_data(ibdev)) - goto err_map; + goto err_notif; spin_lock_init(&ibdev->sm_lock); mutex_init(&ibdev->cap_mask_mutex); if (ib_register_device(&ibdev->ib_dev)) - goto err_map; + goto err_notif; if (mlx4_ib_mad_init(ibdev)) goto err_reg; @@ -678,6 +904,10 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) err_reg: ib_unregister_device(&ibdev->ib_dev); +err_notif: + flush_workqueue(wq); + unregister_netdevice_notifier(&rdmaoe->nb); + err_map: 
iounmap(ibdev->uar_map); @@ -700,11 +930,16 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr) mlx4_ib_mad_cleanup(ibdev); ib_unregister_device(&ibdev->ib_dev); + if (ibdev->rdmaoe.nb.notifier_call) { + flush_workqueue(wq); + unregister_netdevice_notifier(&ibdev->rdmaoe.nb); + ibdev->rdmaoe.nb.notifier_call = NULL; + } + iounmap(ibdev->uar_map); - for (p = 1; p <= ibdev->num_ports; ++p) + mlx4_foreach_port(p, dev, MLX4_PORT_TYPE_IB) mlx4_CLOSE_PORT(dev, p); - iounmap(ibdev->uar_map); mlx4_uar_free(dev, &ibdev->priv_uar); mlx4_pd_free(dev, ibdev->priv_pdn); ib_dealloc_device(&ibdev->ib_dev); @@ -745,17 +980,31 @@ static void mlx4_ib_event(struct mlx4_dev *dev, void *ibdev_ptr, static struct mlx4_interface mlx4_ib_interface = { .add = mlx4_ib_add, .remove = mlx4_ib_remove, - .event = mlx4_ib_event + .event = mlx4_ib_event, + .protocol = MLX4_PROT_IB }; static int __init mlx4_ib_init(void) { - return mlx4_register_interface(&mlx4_ib_interface); + int err; + + wq = create_singlethread_workqueue("mlx4_ib"); + if (!wq) + return -ENOMEM; + + err = mlx4_register_interface(&mlx4_ib_interface); + if (err) { + destroy_workqueue(wq); + return err; + } + + return 0; } static void __exit mlx4_ib_cleanup(void) { mlx4_unregister_interface(&mlx4_ib_interface); + destroy_workqueue(wq); } module_init(mlx4_ib_init); diff --git a/drivers/net/mlx4/en_main.c b/drivers/net/mlx4/en_main.c index 510633f..6f30eca 100644 --- a/drivers/net/mlx4/en_main.c +++ b/drivers/net/mlx4/en_main.c @@ -51,6 +51,13 @@ static const char mlx4_en_version[] = DRV_NAME ": Mellanox ConnectX HCA Ethernet driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; +static void *get_netdev(struct mlx4_dev *dev, void *ctx, u8 port) +{ + struct mlx4_en_dev *endev = ctx; + + return endev->pndev[port]; +} + static void mlx4_en_event(struct mlx4_dev *dev, void *endev_ptr, enum mlx4_dev_event event, int port) { @@ -229,9 +236,11 @@ err_free_res: } static struct mlx4_interface mlx4_en_interface = { - .add = 
mlx4_en_add, - .remove = mlx4_en_remove, - .event = mlx4_en_event, + .add = mlx4_en_add, + .remove = mlx4_en_remove, + .event = mlx4_en_event, + .get_prot_dev = get_netdev, + .protocol = MLX4_PROT_EN, }; static int __init mlx4_en_init(void) diff --git a/drivers/net/mlx4/en_port.c b/drivers/net/mlx4/en_port.c index a29abe8..a249887 100644 --- a/drivers/net/mlx4/en_port.c +++ b/drivers/net/mlx4/en_port.c @@ -127,8 +127,8 @@ int mlx4_SET_PORT_qpn_calc(struct mlx4_dev *dev, u8 port, u32 base_qpn, memset(context, 0, sizeof *context); context->base_qpn = cpu_to_be32(base_qpn); - context->promisc = cpu_to_be32(promisc << SET_PORT_PROMISC_SHIFT | base_qpn); - context->mcast = cpu_to_be32(1 << SET_PORT_PROMISC_SHIFT | base_qpn); + context->promisc = cpu_to_be32(promisc << SET_PORT_PROMISC_EN_SHIFT | base_qpn); + context->mcast = cpu_to_be32(1 << SET_PORT_PROMISC_MODE_SHIFT | base_qpn); context->intra_no_vlan = 0; context->no_vlan = MLX4_NO_VLAN_IDX; context->intra_vlan_miss = 0; diff --git a/drivers/net/mlx4/en_port.h b/drivers/net/mlx4/en_port.h index e6477f1..9354891 100644 --- a/drivers/net/mlx4/en_port.h +++ b/drivers/net/mlx4/en_port.h @@ -36,7 +36,8 @@ #define SET_PORT_GEN_ALL_VALID 0x7 -#define SET_PORT_PROMISC_SHIFT 31 +#define SET_PORT_PROMISC_EN_SHIFT 31 +#define SET_PORT_PROMISC_MODE_SHIFT 30 enum { MLX4_CMD_SET_VLAN_FLTR = 0x47, diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c index 0e7eb10..d64530e 100644 --- a/drivers/net/mlx4/intf.c +++ b/drivers/net/mlx4/intf.c @@ -159,3 +159,23 @@ void mlx4_unregister_device(struct mlx4_dev *dev) mutex_unlock(&intf_mutex); } + +void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_device_context *dev_ctx; + unsigned long flags; + void *result = NULL; + + spin_lock_irqsave(&priv->ctx_lock, flags); + + list_for_each_entry(dev_ctx, &priv->ctx_list, list) + if (dev_ctx->intf->protocol == proto && dev_ctx->intf->get_prot_dev) { 
+ result = dev_ctx->intf->get_prot_dev(dev, dev_ctx->context, port); + break; + } + + spin_unlock_irqrestore(&priv->ctx_lock, flags); + + return result; +} diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 30bea96..c72af51 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -100,6 +100,12 @@ module_param_named(use_prio, use_prio, bool, 0444); MODULE_PARM_DESC(use_prio, "Enable steering by VLAN priority on ETH ports " "(0/1, default 0)"); +void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port) +{ + return mlx4_find_get_prot_dev(dev, proto, port); +} +EXPORT_SYMBOL(mlx4_get_prot_dev); + int mlx4_check_port_params(struct mlx4_dev *dev, enum mlx4_port_type *port_type) { diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 5bd79c2..db068c9 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -364,6 +364,7 @@ int mlx4_restart_one(struct pci_dev *pdev); int mlx4_register_device(struct mlx4_dev *dev); void mlx4_unregister_device(struct mlx4_dev *dev); void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_dev_event type, int port); +void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port); struct mlx4_dev_cap; struct mlx4_init_hca_param; diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h index 0f82293..22bd8d3 100644 --- a/include/linux/mlx4/cmd.h +++ b/include/linux/mlx4/cmd.h @@ -140,6 +140,7 @@ enum { MLX4_SET_PORT_MAC_TABLE = 0x2, MLX4_SET_PORT_VLAN_TABLE = 0x3, MLX4_SET_PORT_PRIO_MAP = 0x4, + MLX4_SET_PORT_GID_TABLE = 0x5, }; struct mlx4_dev; diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h index 53c5fdb..0083256 100644 --- a/include/linux/mlx4/driver.h +++ b/include/linux/mlx4/driver.h @@ -44,15 +44,23 @@ enum mlx4_dev_event { MLX4_DEV_EVENT_PORT_REINIT, }; +enum mlx4_prot { + MLX4_PROT_IB, + MLX4_PROT_EN, +}; + struct mlx4_interface { - void * (*add) (struct mlx4_dev *dev); - void (*remove)(struct mlx4_dev *dev, 
void *context); - void (*event) (struct mlx4_dev *dev, void *context, - enum mlx4_dev_event event, int port); + void * (*add) (struct mlx4_dev *dev); + void (*remove)(struct mlx4_dev *dev, void *context); + void (*event) (struct mlx4_dev *dev, void *context, + enum mlx4_dev_event event, int port); + void * (*get_prot_dev) (struct mlx4_dev *dev, void *context, u8 port); + enum mlx4_prot protocol; struct list_head list; }; int mlx4_register_interface(struct mlx4_interface *intf); void mlx4_unregister_interface(struct mlx4_interface *intf); +void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port); #endif /* MLX4_DRIVER_H */ -- 1.6.3.3 From eli at mellanox.co.il Wed Aug 5 01:34:22 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:34:22 +0300 Subject: [ofa-general] [PATCHv4] libibverbs: Add RDMAoE support Message-ID: <20090805083422.GA6659@mtls03> Extend the ibv_query_port() verb to return a port transport protocol which can be one of RDMA_TRANSPORT_IB, RDMA_TRANSPORT_IWARP or RDMA_TRANSPORT_RDMAOE. This can be used by applications to know if they must use GRH as is the case in RDMAoE. Add a new system call to get the MAC address of the remote port that a UD address vector refers to. Update ibv_rc_pingpong and ibv_ud_pingpong to accept a remote GID so that they can be used with an RDMAoE port. Signed-off-by: Eli Cohen --- Changed the reference to a port from link type to protocol type. This patch is tagged v4 to create correspondence with the kernel patches. 
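As a rough illustration of the application-side decision described above, the sketch below parses a remote GID given as an IPv6 string and reports whether a GRH should be used. It is self-contained and does not call libibverbs: the enum sketch_transport, struct sketch_dest and sketch_prepare_dest() are hypothetical stand-ins, and the enum's numeric values are assumptions rather than the values defined by the patched verbs.h. Unlike the example programs in the patch, it also checks the return value of inet_pton().

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the transport types named in the description (values assumed). */
enum sketch_transport {
	SKETCH_TRANSPORT_IB,
	SKETCH_TRANSPORT_IWARP,
	SKETCH_TRANSPORT_RDMAOE
};

/* Hypothetical container for the pieces an address handle would need. */
struct sketch_dest {
	int     use_grh;   /* 1 if a GRH must be attached */
	uint8_t dgid[16];  /* remote GID in network byte order */
};

/*
 * Returns 0 on success, -1 if the GID string is malformed or if the port
 * is RDMAoE and no remote GID was supplied.
 */
static int sketch_prepare_dest(enum sketch_transport transport,
			       const char *gid_str, struct sketch_dest *dest)
{
	memset(dest, 0, sizeof *dest);

	if (!gid_str)
		return transport == SKETCH_TRANSPORT_RDMAOE ? -1 : 0;

	/* A GID has the same textual form as an IPv6 address. */
	if (inet_pton(AF_INET6, gid_str, dest->dgid) != 1)
		return -1;

	dest->use_grh = 1;
	return 0;
}
```

In a real program the transport value would come from the port attributes returned by ibv_query_port(), and dgid would be copied into the GRH part of the address handle attributes.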
examples/devinfo.c | 15 ++++++++++++ examples/pingpong.c | 9 +++++++ examples/pingpong.h | 2 + examples/rc_pingpong.c | 50 ++++++++++++++++++++++++++++++++-------- examples/ud_pingpong.c | 38 +++++++++++++++++++++++++++---- include/infiniband/driver.h | 1 + include/infiniband/kern-abi.h | 25 ++++++++++++++++++-- include/infiniband/verbs.h | 12 +++++++++ src/cmd.c | 20 ++++++++++++++++ src/libibverbs.map | 1 + 10 files changed, 155 insertions(+), 18 deletions(-) diff --git a/examples/devinfo.c b/examples/devinfo.c index caa5d5f..a42a6dc 100644 --- a/examples/devinfo.c +++ b/examples/devinfo.c @@ -175,6 +175,20 @@ static int print_all_port_gids(struct ibv_context *ctx, uint8_t port_num, int tb return rc; } +static const char *transport_type_str(enum rdma_transport_type type) +{ + switch (type) { + case RDMA_TRANSPORT_IB: + return "IB"; + case RDMA_TRANSPORT_IWARP: + return "IWARP"; + case RDMA_TRANSPORT_RDMAOE: + return "RDMAOE"; + default: + return "Unknown"; + } +} + static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port) { struct ibv_context *ctx; @@ -273,6 +287,7 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port) printf("\t\t\tsm_lid:\t\t\t%d\n", port_attr.sm_lid); printf("\t\t\tport_lid:\t\t%d\n", port_attr.lid); printf("\t\t\tport_lmc:\t\t0x%02x\n", port_attr.lmc); + printf("\t\t\ttrasnport_type:\t\t%s\n", transport_type_str(port_attr.transport)); if (verbose) { printf("\t\t\tmax_msg_sz:\t\t0x%x\n", port_attr.max_msg_sz); diff --git a/examples/pingpong.c b/examples/pingpong.c index b916f59..d4a46e4 100644 --- a/examples/pingpong.c +++ b/examples/pingpong.c @@ -31,6 +31,8 @@ */ #include "pingpong.h" +#include +#include enum ibv_mtu pp_mtu_to_enum(int mtu) { @@ -53,3 +55,10 @@ uint16_t pp_get_local_lid(struct ibv_context *context, int port) return attr.lid; } + +int pp_get_port_info(struct ibv_context *context, int port, + struct ibv_port_attr *attr) +{ + return ibv_query_port(context, port, attr); +} + diff --git 
a/examples/pingpong.h b/examples/pingpong.h index 71d7c3f..16d3466 100644 --- a/examples/pingpong.h +++ b/examples/pingpong.h @@ -37,5 +37,7 @@ enum ibv_mtu pp_mtu_to_enum(int mtu); uint16_t pp_get_local_lid(struct ibv_context *context, int port); +int pp_get_port_info(struct ibv_context *context, int port, + struct ibv_port_attr *attr); #endif /* IBV_PINGPONG_H */ diff --git a/examples/rc_pingpong.c b/examples/rc_pingpong.c index 26fa45c..4250cdf 100644 --- a/examples/rc_pingpong.c +++ b/examples/rc_pingpong.c @@ -67,6 +67,8 @@ struct pingpong_context { int size; int rx_depth; int pending; + struct ibv_port_attr portinfo; + union ibv_gid dgid; }; struct pingpong_dest { @@ -94,6 +96,12 @@ static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, .port_num = port } }; + + if (ctx->dgid.global.interface_id) { + attr.ah_attr.is_global = 1; + attr.ah_attr.grh.hop_limit = 1; + attr.ah_attr.grh.dgid = ctx->dgid; + } if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_AV | @@ -289,11 +297,11 @@ out: static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, int rx_depth, int port, - int use_event) + int use_event, int is_server) { struct pingpong_context *ctx; - ctx = malloc(sizeof *ctx); + ctx = calloc(1, sizeof *ctx); if (!ctx) return NULL; @@ -306,7 +314,7 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, return NULL; } - memset(ctx->buf, 0, size); + memset(ctx->buf, 0x7b + is_server, size); ctx->context = ibv_open_device(ib_dev); if (!ctx->context) { @@ -481,6 +489,7 @@ static void usage(const char *argv0) printf(" -n, --iters= number of exchanges (default 1000)\n"); printf(" -l, --sl= service level value\n"); printf(" -e, --events sleep on CQ events (default poll)\n"); + printf(" -g, --gid= gid of the other port\n"); } int main(int argc, char *argv[]) @@ -504,6 +513,7 @@ int main(int argc, char *argv[]) int rcnt, scnt; int num_cq_events = 0; int sl = 0; + char *grh = NULL; 
srand48(getpid() * time(NULL)); @@ -520,10 +530,11 @@ int main(int argc, char *argv[]) { .name = "iters", .has_arg = 1, .val = 'n' }, { .name = "sl", .has_arg = 1, .val = 'l' }, { .name = "events", .has_arg = 0, .val = 'e' }, + { .name = "gid", .has_arg = 1, .val = 'g' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:e", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:eg:", long_options, NULL); if (c == -1) break; @@ -575,6 +586,10 @@ int main(int argc, char *argv[]) ++use_event; break; + case 'g': + grh = strdupa(optarg); + break; + default: usage(argv[0]); return 1; @@ -614,7 +629,7 @@ int main(int argc, char *argv[]) } } - ctx = pp_init_ctx(ib_dev, size, rx_depth, ib_port, use_event); + ctx = pp_init_ctx(ib_dev, size, rx_depth, ib_port, use_event, !servername); if (!ctx) return 1; @@ -630,17 +645,31 @@ int main(int argc, char *argv[]) return 1; } - my_dest.lid = pp_get_local_lid(ctx->context, ib_port); - my_dest.qpn = ctx->qp->qp_num; - my_dest.psn = lrand48() & 0xffffff; - if (!my_dest.lid) { - fprintf(stderr, "Couldn't get local LID\n"); + + if (pp_get_port_info(ctx->context, ib_port, &ctx->portinfo)) { + fprintf(stderr, "Couldn't get port info\n"); return 1; } + my_dest.lid = ctx->portinfo.lid; + if (ctx->portinfo.transport == RDMA_TRANSPORT_RDMAOE) { + if (!grh) { + fprintf(stderr, "Couldn't get local LID\n"); + return 1; + } + inet_pton(AF_INET6, grh, &ctx->dgid); + } else { + if (!my_dest.lid) { + fprintf(stderr, "Couldn't get local LID\n"); + return 1; + } + } + my_dest.qpn = ctx->qp->qp_num; + my_dest.psn = lrand48() & 0xffffff; printf(" local address: LID 0x%04x, QPN 0x%06x, PSN 0x%06x\n", my_dest.lid, my_dest.qpn, my_dest.psn); + if (servername) rem_dest = pp_client_exch_dest(servername, port, &my_dest); else @@ -705,6 +734,7 @@ int main(int argc, char *argv[]) fprintf(stderr, "poll CQ failed %d\n", ne); return 1; } + } while (!use_event && ne < 1); for (i = 0; i < ne; ++i) { diff --git a/examples/ud_pingpong.c 
b/examples/ud_pingpong.c index 8f3d50b..b3aa55d 100644 --- a/examples/ud_pingpong.c +++ b/examples/ud_pingpong.c @@ -68,6 +68,8 @@ struct pingpong_context { int size; int rx_depth; int pending; + struct ibv_port_attr portinfo; + union ibv_gid dgid; }; struct pingpong_dest { @@ -105,6 +107,12 @@ static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, return 1; } + if (ctx->dgid.global.interface_id) { + ah_attr.is_global = 1; + ah_attr.grh.hop_limit = 1; + ah_attr.grh.dgid = ctx->dgid; + } + ctx->ah = ibv_create_ah(ctx->pd, &ah_attr); if (!ctx->ah) { fprintf(stderr, "Failed to create AH\n"); @@ -478,6 +486,7 @@ static void usage(const char *argv0) printf(" -r, --rx-depth= number of receives to post at a time (default 500)\n"); printf(" -n, --iters= number of exchanges (default 1000)\n"); printf(" -e, --events sleep on CQ events (default poll)\n"); + printf(" -g, --gid specify remote gid\n"); } int main(int argc, char *argv[]) @@ -500,6 +509,7 @@ int main(int argc, char *argv[]) int rcnt, scnt; int num_cq_events = 0; int sl = 0; + char *gid = NULL; srand48(getpid() * time(NULL)); @@ -515,10 +525,11 @@ int main(int argc, char *argv[]) { .name = "iters", .has_arg = 1, .val = 'n' }, { .name = "sl", .has_arg = 1, .val = 'l' }, { .name = "events", .has_arg = 0, .val = 'e' }, + { .name = "gid", .has_arg = 1, .val = 'g' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:r:n:l:e", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:s:r:n:l:eg:", long_options, NULL); if (c == -1) break; @@ -563,6 +574,10 @@ int main(int argc, char *argv[]) ++use_event; break; + case 'g': + gid = strdupa(optarg); + break; + default: usage(argv[0]); return 1; @@ -618,12 +633,25 @@ int main(int argc, char *argv[]) return 1; } - my_dest.lid = pp_get_local_lid(ctx->context, ib_port); + if (pp_get_port_info(ctx->context, ib_port, &ctx->portinfo)) { + fprintf(stderr, "Couldn't get port info\n"); + return 1; + } + my_dest.lid = ctx->portinfo.lid; + my_dest.qpn = 
ctx->qp->qp_num; my_dest.psn = lrand48() & 0xffffff; - if (!my_dest.lid) { - fprintf(stderr, "Couldn't get local LID\n"); - return 1; + if (ctx->portinfo.transport == RDMA_TRANSPORT_IB) { + if (!my_dest.lid) { + fprintf(stderr, "Couldn't get local LID\n"); + return 1; + } + } else { + if (!gid) { + fprintf(stderr, "must specify remote GID\n"); + return 1; + } + inet_pton(AF_INET6, gid, &ctx->dgid); } printf(" local address: LID 0x%04x, QPN 0x%06x, PSN 0x%06x\n", diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h index 67a3bf8..cbd261f 100644 --- a/include/infiniband/driver.h +++ b/include/infiniband/driver.h @@ -131,6 +131,7 @@ int ibv_cmd_create_ah(struct ibv_pd *pd, struct ibv_ah *ah, int ibv_cmd_destroy_ah(struct ibv_ah *ah); int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); +int ibv_cmd_get_mac(struct ibv_pd *pd, uint8_t port, uint8_t *gid, uint8_t *mac); int ibv_dontfork_range(void *base, size_t size); int ibv_dofork_range(void *base, size_t size); diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h index 0db083a..7823da8 100644 --- a/include/infiniband/kern-abi.h +++ b/include/infiniband/kern-abi.h @@ -46,7 +46,7 @@ * The minimum and maximum kernel ABI that we can handle. 
*/ #define IB_USER_VERBS_MIN_ABI_VERSION 1 -#define IB_USER_VERBS_MAX_ABI_VERSION 6 +#define IB_USER_VERBS_MAX_ABI_VERSION 7 enum { IB_USER_VERBS_CMD_GET_CONTEXT, @@ -85,7 +85,8 @@ enum { IB_USER_VERBS_CMD_MODIFY_SRQ, IB_USER_VERBS_CMD_QUERY_SRQ, IB_USER_VERBS_CMD_DESTROY_SRQ, - IB_USER_VERBS_CMD_POST_SRQ_RECV + IB_USER_VERBS_CMD_POST_SRQ_RECV, + IB_USER_VERBS_CMD_GET_MAC, }; /* @@ -223,7 +224,8 @@ struct ibv_query_port_resp { __u8 active_width; __u8 active_speed; __u8 phys_state; - __u8 reserved[3]; + __u8 transport; + __u8 reserved[2]; }; struct ibv_alloc_pd { @@ -798,6 +800,7 @@ enum { IB_USER_VERBS_CMD_QUERY_SRQ_V2, IB_USER_VERBS_CMD_DESTROY_SRQ_V2, IB_USER_VERBS_CMD_POST_SRQ_RECV_V2, + IB_USER_VERBS_CMD_GET_MAC_V2 = -1, /* * Set commands that didn't exist to -1 so our compile-time * trick opcodes in IBV_INIT_CMD() doesn't break. @@ -878,4 +881,20 @@ struct ibv_create_srq_resp_v5 { __u32 srq_handle; }; +struct ibv_get_mac { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 pd_handle; + __u8 port; + __u8 reserved[3]; + __u8 dgid[16]; +}; + +struct ibv_get_mac_resp { + __u8 mac[6]; + __u16 reserved; +}; + #endif /* KERN_ABI_H */ diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h index a04cc62..f81f17f 100644 --- a/include/infiniband/verbs.h +++ b/include/infiniband/verbs.h @@ -61,6 +61,7 @@ union ibv_gid { uint64_t subnet_prefix; uint64_t interface_id; } global; + uint32_t dwords[4]; }; enum ibv_node_type { @@ -161,6 +162,16 @@ enum ibv_port_state { IBV_PORT_ACTIVE_DEFER = 5 }; +enum rdma_transport_type { + RDMA_TRANSPORT_IB, + RDMA_TRANSPORT_IWARP, + RDMA_TRANSPORT_RDMAOE +}; +enum ibv_port_link_type { + PORT_LINK_IB, + PORT_LINK_ETH +}; + struct ibv_port_attr { enum ibv_port_state state; enum ibv_mtu max_mtu; @@ -181,6 +192,7 @@ struct ibv_port_attr { uint8_t active_width; uint8_t active_speed; uint8_t phys_state; + enum rdma_transport_type transport; }; enum ibv_event_type { diff --git a/src/cmd.c b/src/cmd.c 
index 66d7134..30754ac 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -162,6 +162,7 @@ int ibv_cmd_query_device(struct ibv_context *context, return 0; } +#include int ibv_cmd_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr, struct ibv_query_port *cmd, size_t cmd_size) @@ -196,6 +197,7 @@ int ibv_cmd_query_port(struct ibv_context *context, uint8_t port_num, port_attr->active_width = resp.active_width; port_attr->active_speed = resp.active_speed; port_attr->phys_state = resp.phys_state; + port_attr->transport = resp.transport; return 0; } @@ -1122,3 +1124,21 @@ int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) return 0; } + +int ibv_cmd_get_mac(struct ibv_pd *pd, uint8_t port, uint8_t *gid, uint8_t *mac) +{ + struct ibv_get_mac cmd; + struct ibv_get_mac_resp resp; + + IBV_INIT_CMD_RESP(&cmd, sizeof cmd, GET_MAC, &resp, sizeof resp); + memcpy(cmd.dgid, gid, sizeof cmd.dgid); + cmd.pd_handle = pd->handle; + cmd.port = port; + + if (write(pd->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) + return errno; + + memcpy(mac, resp.mac, 6); + + return 0; +} diff --git a/src/libibverbs.map b/src/libibverbs.map index 1827da0..1688e73 100644 --- a/src/libibverbs.map +++ b/src/libibverbs.map @@ -64,6 +64,7 @@ IBVERBS_1.0 { ibv_cmd_destroy_ah; ibv_cmd_attach_mcast; ibv_cmd_detach_mcast; + ibv_cmd_get_mac; ibv_copy_qp_attr_from_kern; ibv_copy_path_rec_from_kern; ibv_copy_path_rec_to_kern; -- 1.6.3.3 From eli at mellanox.co.il Wed Aug 5 01:36:48 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 11:36:48 +0300 Subject: [ofa-general] [PATCHv4] libmlx4: Add RDMAoE support Message-ID: <20090805083648.GA6696@mtls03> Modify mlx4_create_ah() to check the port's transport protocol, and for the case of RDMAoE ports, do a system call to retrieve the remote port's MAC address. Make modifications to address vector data structs and code to accommodate RDMAoE.
--- Changed the reference to a port from link type to protocol type. This patch is tagged v4 to create correspondence with the kernel patches. src/mlx4.h | 3 +++ src/qp.c | 2 ++ src/verbs.c | 29 +++++++++++++++++++++++++++++ src/wqe.h | 3 ++- 4 files changed, 36 insertions(+), 1 deletions(-) diff --git a/src/mlx4.h b/src/mlx4.h index 827a201..20d3fdd 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -236,11 +236,14 @@ struct mlx4_av { uint8_t hop_limit; uint32_t sl_tclass_flowlabel; uint8_t dgid[16]; + uint8_t mac[8]; }; struct mlx4_ah { struct ibv_ah ibv_ah; struct mlx4_av av; + uint16_t vlan; + uint8_t mac[6]; }; static inline unsigned long align(unsigned long val, unsigned long align) diff --git a/src/qp.c b/src/qp.c index d194ae3..cd8fab0 100644 --- a/src/qp.c +++ b/src/qp.c @@ -143,6 +143,8 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg *dseg, memcpy(dseg->av, &to_mah(wr->wr.ud.ah)->av, sizeof (struct mlx4_av)); dseg->dqpn = htonl(wr->wr.ud.remote_qpn); dseg->qkey = htonl(wr->wr.ud.remote_qkey); + dseg->vlan = htons(to_mah(wr->wr.ud.ah)->vlan); + memcpy(dseg->mac, to_mah(wr->wr.ud.ah)->mac, 6); } static void __set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ibv_sge *sg) diff --git a/src/verbs.c b/src/verbs.c index cc179a0..e60ab05 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -614,9 +614,21 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp) return 0; } +static int mcast_mac(uint8_t *mac) +{ + int i; + uint8_t val = 0xff; + + for (i = 0; i < 6; ++i) + val &= mac[i]; + + return val == 0xff; +} + struct ibv_ah *mlx4_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) { struct mlx4_ah *ah; + struct ibv_port_attr port_attr; ah = malloc(sizeof *ah); if (!ah) @@ -642,7 +654,24 @@ struct ibv_ah *mlx4_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) memcpy(ah->av.dgid, attr->grh.dgid.raw, 16); } + if (ibv_query_port(pd->context, attr->port_num, &port_attr)) + goto err; + + if (port_attr.transport == RDMA_TRANSPORT_RDMAOE) { + if (ibv_cmd_get_mac(pd, 
attr->port_num, ah->av.dgid, ah->mac)) + goto err; + + ah->vlan = 0; + if (mcast_mac(ah->mac)) + ah->av.dlid = htons(0xc000); + + } + + return &ah->ibv_ah; +err: + free(ah); + return NULL; } int mlx4_destroy_ah(struct ibv_ah *ah) diff --git a/src/wqe.h b/src/wqe.h index 6f7f309..ea6f27f 100644 --- a/src/wqe.h +++ b/src/wqe.h @@ -78,7 +78,8 @@ struct mlx4_wqe_datagram_seg { uint32_t av[8]; uint32_t dqpn; uint32_t qkey; - uint32_t reserved[2]; + __be16 vlan; + uint8_t mac[6]; }; struct mlx4_wqe_data_seg { -- 1.6.3.3 From kliteyn at dev.mellanox.co.il Wed Aug 5 01:55:43 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 05 Aug 2009 11:55:43 +0300 Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of opt.max_wire_smps In-Reply-To: <20090804153509.GH7993@me> References: <4A784698.10803@dev.mellanox.co.il> <20090804153509.GH7993@me> Message-ID: <4A79490F.4000704@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > On 17:32 Tue 04 Aug , Yevgeny Kliteynik wrote: >> opt.max_wire_smps is uint32, but then when it's propagated >> into the VL15 poller it's casted to int32. Fixing the >> parameter handling to protect it from wrong values. >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/opensm/main.c | 2 +- >> opensm/opensm/osm_subnet.c | 7 +++++++ >> 2 files changed, 8 insertions(+), 1 deletions(-) >> >> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c >> index 296d5d5..9cb9990 100644 >> --- a/opensm/opensm/main.c >> +++ b/opensm/opensm/main.c >> @@ -722,7 +722,7 @@ int main(int argc, char *argv[]) >> >> case 'n': >> opt.max_wire_smps = strtol(optarg, NULL, 0); > > Then you likely want to use strtoul(). Right >> - if (opt.max_wire_smps <= 0) >> + if (opt.max_wire_smps > 0x7FFFFFFF) >> opt.max_wire_smps = 0x7FFFFFFF; > > What about opt.max_wire_smps == 0? Good point. 
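
The two points raised here (use strtoul() for an unsigned option, and decide what 0 means) can be sketched as a standalone helper. This is only an illustration, not the opensm code: parse_max_wire_smps(), SKETCH_MAX_WIRE_SMPS and both policy parameters are hypothetical names, and the caller chooses what a zero or an invalid input maps to. The value is range-checked while still an unsigned long, so nothing is silently truncated when it is narrowed to 32 bits.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical name: largest value that survives a later cast to int32. */
#define SKETCH_MAX_WIRE_SMPS 0x7FFFFFFFu

/*
 * Parse an option string into a value in [1, 0x7FFFFFFF].
 * zero_value is returned when the input parses to 0;
 * fallback is returned when the input is malformed or out of range.
 * Both policies are assumptions left to the caller.
 */
uint32_t parse_max_wire_smps(const char *arg, uint32_t zero_value,
			     uint32_t fallback)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(arg, &end, 0);

	/* No digits consumed, trailing junk, or overflow of unsigned long. */
	if (errno || end == arg || *end != '\0')
		return fallback;

	if (v == 0)
		return zero_value;

	/* Also catches "-1", which strtoul() converts to ULONG_MAX. */
	if (v > SKETCH_MAX_WIRE_SMPS)
		return fallback;

	return (uint32_t)v;
}
```

For example, parse_max_wire_smps(optarg, 0x7FFFFFFF, some_default) would treat 0 as "unlimited" and send anything unparsable or too large to some_default.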
> Sasha > >> printf(" Max wire smp's = %d\n", opt.max_wire_smps); >> break; >> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c >> index ec15f8a..c07d823 100644 >> --- a/opensm/opensm/osm_subnet.c >> +++ b/opensm/opensm/osm_subnet.c >> @@ -1066,6 +1066,13 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts) >> p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK; >> } >> >> + if (p_opts->max_wire_smps > 0x7FFFFFFF) { >> + log_report(" Invalid Cached Option Value: max_wire_smps = %u," >> + " Using Default: %u\n", >> + p_opts->max_wire_smps, OSM_DEFAULT_SMP_MAX_ON_WIRE); >> + p_opts->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE; >> + } > > Ditto. Right again. And since we're on this, perhaps the right thing here would be not using OSM_DEFAULT_SMP_MAX_ON_WIRE, but the maximal valid value (0x7FFFFFFF)? -- Yevgeny > Sasha > From sashak at voltaire.com Wed Aug 5 02:05:13 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 12:05:13 +0300 Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of opt.max_wire_smps In-Reply-To: <4A79490F.4000704@dev.mellanox.co.il> References: <4A784698.10803@dev.mellanox.co.il> <20090804153509.GH7993@me> <4A79490F.4000704@dev.mellanox.co.il> Message-ID: <20090805090513.GO7993@me> On 11:55 Wed 05 Aug , Yevgeny Kliteynik wrote: > And since we're on this, perhaps the right thing here would > be not using OSM_DEFAULT_SMP_MAX_ON_WIRE, but the maximal > valid value (0x7FFFFFFF)? In which case? When provided max_wire_smps is 0 or invalid? 
Sasha From sashak at voltaire.com Wed Aug 5 02:32:28 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 12:32:28 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches for lash In-Reply-To: <20090722151615.GA24576@comcast.net> References: <20090722151615.GA24576@comcast.net> Message-ID: <20090805093228.GP7993@me> Hi Hal, On 11:16 Wed 22 Jul , Hal Rosenstock wrote: > > +/* > + * sort_switches - reorder switch array > + */ > +static void sort_switches(lash_t *p_lash, mesh_t *mesh) > +{ > + int i, j; > + int num_switches = p_lash->num_switches; > + sort_ctx_t sort_ctx; > + comp_t *index; > + int *reverse; > + switch_t *s; > + switch_t **switches; > + > + index = malloc(num_switches * sizeof(comp_t)); > + reverse = malloc(num_switches * sizeof(int)); > + switches = malloc(num_switches * sizeof(switch_t *)); > + if (!index || !reverse || !switches) { > + OSM_LOG(&p_lash->p_osm->log, OSM_LOG_ERROR, > + "Failed memory allocation - switches not sorted!\n"); > + goto Exit; > + } > + > + sort_ctx.mesh = mesh; > + sort_ctx.p_lash = p_lash; > + > + for (i = 0; i < num_switches; i++) { > + index[i].index = i; > + index[i].ctx = &sort_ctx; > + } > + > + qsort(index, num_switches, sizeof(comp_t), compare_switch); > + > + for (i = 0; i < num_switches; i++) > + reverse[index[i].index] = i; > + > + for (i = 0; i < num_switches; i++) { > + s = p_lash->switches[index[i].index]; > + switches[i] = s; > + s->id = i; > + for (j = 0; j < s->node->num_links; j++) > + s->node->links[j]->switch_id = > + reverse[s->node->links[j]->switch_id]; Isn't it the same as: s->node->links[j]->switch_id = index[s->node->links[j]->switch_id].index; (and then reverse array is obsolete)? 
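
Whether the two forms are interchangeable comes down to whether a permutation equals its own inverse. A minimal sketch, using plain int arrays and hypothetical helper names rather than the comp_t array from the patch: index[i] is read as "the old position of the element now at new position i", and reverse[old] as "the new position of the element that was at old".

```c
#include <stddef.h>

/*
 * index[i] = old position of the element that ends up at new position i.
 * Fills reverse[] so that reverse[old] = new position.
 */
void invert_permutation(const int *index, int *reverse, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		reverse[index[i]] = (int)i;
}

/* Returns 1 if index[] is its own inverse, 0 otherwise. */
int is_self_inverse(const int *index, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (index[index[i]] != (int)i)
			return 0;
	return 1;
}
```

With index = {2, 0, 1} the inverse is {1, 2, 0}, so looking up an old id in index[] and in reverse[] gives different answers. The two coincide only for self-inverse permutations such as a single swap.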
Sasha > + } > + > + for (i = 0; i < num_switches; i++) > + p_lash->switches[i] = switches[i]; > + > +Exit: > + if (switches) > + free(switches); > + if (index) > + free(index); > + if (reverse) > + free(reverse); > +} > + > +/* > * osm_mesh_delete - free per mesh resources > */ > static void mesh_delete(mesh_t *mesh) > @@ -1470,6 +1561,8 @@ int osm_do_mesh_analysis(lash_t *p_lash) > if (reorder_links(p_lash, mesh)) > goto err; > > + sort_switches(p_lash, mesh); > + > p = buf; > p += sprintf(p, "found "); > for (i = 0; i < mesh->dimension; i++) > From sashak at voltaire.com Wed Aug 5 02:44:33 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 12:44:33 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches for lash In-Reply-To: <20090722151615.GA24576@comcast.net> References: <20090722151615.GA24576@comcast.net> Message-ID: <20090805094433.GQ7993@me> On 11:16 Wed 22 Jul , Hal Rosenstock wrote: > > diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c > index 23fad87..dce2ea1 100644 > --- a/opensm/opensm/osm_mesh.c > +++ b/opensm/opensm/osm_mesh.c > @@ -185,6 +185,16 @@ typedef struct _mesh { > int dim_order[MAX_DIMENSION]; > } mesh_t; > > +typedef struct sort_ctx { > + lash_t *p_lash; > + mesh_t *mesh; > +} sort_ctx_t; > + > +typedef struct comp { > + int index; > + sort_ctx_t *ctx; > +} comp_t; And wouldn't it be simpler to use: struct comp { switch_t **s; sort_ctx_t ctx; }; ? So you will have already sorted switches and only will need to care about s->id and s->links fixing (and will not need switches[] array too). 
Sasha From vlad at lists.openfabrics.org Wed Aug 5 03:09:06 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 5 Aug 2009 03:09:06 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090805-0200 daily build status Message-ID: <20090805100906.9E06DE616F0@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one': /home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class' /home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of 
function 'dev_set_name' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 
---------------------------------------------------------------------------------- /home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090805-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From kliteyn at dev.mellanox.co.il Wed Aug 5 03:07:21 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 05 Aug 2009 13:07:21 +0300 Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of opt.max_wire_smps In-Reply-To: <20090805090513.GO7993@me> References: <4A784698.10803@dev.mellanox.co.il> <20090804153509.GH7993@me> <4A79490F.4000704@dev.mellanox.co.il> <20090805090513.GO7993@me> Message-ID: <4A7959D9.1080805@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 11:55 Wed 05 Aug , Yevgeny Kliteynik wrote: >> And since we're on this, perhaps the right thing here would >> be not using OSM_DEFAULT_SMP_MAX_ON_WIRE, but the maximal >> valid value 
(0x7FFFFFFF)? > > In which case? When provided max_wire_smps is 0 or invalid? Both. -- Yevgeny > Sasha > From sashak at voltaire.com Wed Aug 5 04:03:35 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 14:03:35 +0300 Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of opt.max_wire_smps In-Reply-To: <4A7959D9.1080805@dev.mellanox.co.il> References: <4A784698.10803@dev.mellanox.co.il> <20090804153509.GH7993@me> <4A79490F.4000704@dev.mellanox.co.il> <20090805090513.GO7993@me> <4A7959D9.1080805@dev.mellanox.co.il> Message-ID: <20090805110335.GR7993@me> On 13:07 Wed 05 Aug , Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: > > On 11:55 Wed 05 Aug , Yevgeny Kliteynik wrote: > >> And since we're on this, perhaps the right thing here would > >> be not using OSM_DEFAULT_SMP_MAX_ON_WIRE, but the maximal > >> valid value (0x7FFFFFFF)? > > > > In which case? When provided max_wire_smps is 0 or invalid? > > Both. I think that for case of providing invalid value fallback to the default is better. Of course we can discuss about what the default value could be, but it is different story. Sasha From kliteyn at dev.mellanox.co.il Wed Aug 5 04:10:42 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 05 Aug 2009 14:10:42 +0300 Subject: [ofa-general] Re: [PATCH] opensm: fixing handling of opt.max_wire_smps In-Reply-To: <20090805110335.GR7993@me> References: <4A784698.10803@dev.mellanox.co.il> <20090804153509.GH7993@me> <4A79490F.4000704@dev.mellanox.co.il> <20090805090513.GO7993@me> <4A7959D9.1080805@dev.mellanox.co.il> <20090805110335.GR7993@me> Message-ID: <4A7968B2.1010808@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 13:07 Wed 05 Aug , Yevgeny Kliteynik wrote: >> Sasha Khapyorsky wrote: >>> On 11:55 Wed 05 Aug , Yevgeny Kliteynik wrote: >>>> And since we're on this, perhaps the right thing here would >>>> be not using OSM_DEFAULT_SMP_MAX_ON_WIRE, but the maximal >>>> valid value (0x7FFFFFFF)? >>> In which case? 
When provided max_wire_smps is 0 or invalid? >> Both. > > I think that for case of providing invalid value fallback to the default > is better. Of course we can discuss about what the default value could > be, but it is different story. OK, so 0 will go to 0x7FFFFFFF, and invalid value will fall back to default. Patch in 3...2...1... -- Yevgeny > Sasha > From kliteyn at dev.mellanox.co.il Wed Aug 5 04:20:53 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 05 Aug 2009 14:20:53 +0300 Subject: [ofa-general] [PATCH v2] opensm: fixing handling of opt.max_wire_smps In-Reply-To: <4A784698.10803@dev.mellanox.co.il> References: <4A784698.10803@dev.mellanox.co.il> Message-ID: <4A796B15.7000802@dev.mellanox.co.il> Hi Sasha, V2 of this patch: opt.max_wire_smps is uint32, but then when it's propagated into the VL15 poller it's casted to int32. Fixing the parameter handling to protect it from wrong values. Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/main.c | 5 +++-- opensm/opensm/osm_subnet.c | 12 ++++++++++++ 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 296d5d5..ca20ff9 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -721,8 +721,9 @@ int main(int argc, char *argv[]) break; case 'n': - opt.max_wire_smps = strtol(optarg, NULL, 0); - if (opt.max_wire_smps <= 0) + opt.max_wire_smps = strtoul(optarg, NULL, 0); + if (opt.max_wire_smps == 0 || + opt.max_wire_smps > 0x7FFFFFFF) opt.max_wire_smps = 0x7FFFFFFF; printf(" Max wire smp's = %d\n", opt.max_wire_smps); break; diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index ec15f8a..c43bef7 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1066,6 +1066,18 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts) p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK; } + if (p_opts->max_wire_smps == 0) { + log_report(" Invalid Cached Option Value: max_wire_smps = 0," + 
" Using unlimited: 0x7FFFFFFF\n"); + p_opts->max_wire_smps = 0x7FFFFFFF; + } + else if (p_opts->max_wire_smps > 0x7FFFFFFF) { + log_report(" Invalid Cached Option Value: max_wire_smps = %u," + " Using Default: %u\n", + p_opts->max_wire_smps, OSM_DEFAULT_SMP_MAX_ON_WIRE); + p_opts->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE; + } + if (strcmp(p_opts->console, OSM_DISABLE_CONSOLE) && strcmp(p_opts->console, OSM_LOCAL_CONSOLE) #ifdef ENABLE_OSM_CONSOLE_SOCKET -- 1.5.1.4 From hal.rosenstock at gmail.com Wed Aug 5 04:24:45 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Aug 2009 07:24:45 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: <20090804201505.GI7993@me> References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <20090804201505.GI7993@me> Message-ID: On Tue, Aug 4, 2009 at 4:15 PM, Sasha Khapyorsky wrote: > You can setup new_lfts arrays in routing engines and at the end of cycle > call single osm_*setup*_lfts() which will do everything - setup TOPs and > start to run LFT blocks update. Are you saying to move the calls in the individual routing engines to osm_ucast_mgr_set_fwd_table() up into osm_ucast_mgr_process() (and doing so consolidates the changes I had made to the various routing engines in one place) ? Just wanted to be sure I understand what you mean. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hal.rosenstock at gmail.com Wed Aug 5 05:59:16 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Aug 2009 08:59:16 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches for lash In-Reply-To: <20090805093228.GP7993@me> References: <20090722151615.GA24576@comcast.net> <20090805093228.GP7993@me> Message-ID: Hi Sasha, On Wed, Aug 5, 2009 at 5:32 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 11:16 Wed 22 Jul , Hal Rosenstock wrote: > > > > +/* > > + * sort_switches - reorder switch array > > + */ > > +static void sort_switches(lash_t *p_lash, mesh_t *mesh) > > +{ > > + int i, j; > > + int num_switches = p_lash->num_switches; > > + sort_ctx_t sort_ctx; > > + comp_t *index; > > + int *reverse; > > + switch_t *s; > > + switch_t **switches; > > + > > + index = malloc(num_switches * sizeof(comp_t)); > > + reverse = malloc(num_switches * sizeof(int)); > > + switches = malloc(num_switches * sizeof(switch_t *)); > > + if (!index || !reverse || !switches) { > > + OSM_LOG(&p_lash->p_osm->log, OSM_LOG_ERROR, > > + "Failed memory allocation - switches not > sorted!\n"); > > + goto Exit; > > + } > > + > > + sort_ctx.mesh = mesh; > > + sort_ctx.p_lash = p_lash; > > + > > + for (i = 0; i < num_switches; i++) { > > + index[i].index = i; > > + index[i].ctx = &sort_ctx; > > + } > > + > > + qsort(index, num_switches, sizeof(comp_t), compare_switch); > > + > > + for (i = 0; i < num_switches; i++) > > + reverse[index[i].index] = i; > > + > > + for (i = 0; i < num_switches; i++) { > > + s = p_lash->switches[index[i].index]; > > + switches[i] = s; > > + s->id = i; > > + for (j = 0; j < s->node->num_links; j++) > > + s->node->links[j]->switch_id = > > + reverse[s->node->links[j]->switch_id]; > > Isn't it the same as: > > s->node->links[j]->switch_id = > index[s->node->links[j]->switch_id].index; No. -- Hal > > (and then reverse array is obsolete)? 
> > Sasha > > > + } > > + > > + for (i = 0; i < num_switches; i++) > > + p_lash->switches[i] = switches[i]; > > + > > +Exit: > > + if (switches) > > + free(switches); > > + if (index) > > + free(index); > > + if (reverse) > > + free(reverse); > > +} > > + > > +/* > > * osm_mesh_delete - free per mesh resources > > */ > > static void mesh_delete(mesh_t *mesh) > > @@ -1470,6 +1561,8 @@ int osm_do_mesh_analysis(lash_t *p_lash) > > if (reorder_links(p_lash, mesh)) > > goto err; > > > > + sort_switches(p_lash, mesh); > > + > > p = buf; > > p += sprintf(p, "found "); > > for (i = 0; i < mesh->dimension; i++) > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Wed Aug 5 06:43:52 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 16:43:52 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <20090804201505.GI7993@me> Message-ID: <20090805134352.GS7993@me> On 07:24 Wed 05 Aug , Hal Rosenstock wrote: > > Are you saying to move the calls in the individual routing engines to > osm_ucast_mgr_set_fwd_table() up into osm_ucast_mgr_process() (and doing > so consolidates the changes I had made to the various routing engines in one > place) ? Yes. 
Sasha

From slavas at Voltaire.COM Wed Aug 5 06:47:02 2009
From: slavas at Voltaire.COM (Slava Strebkov)
Date: Wed, 05 Aug 2009 16:47:02 +0300
Subject: [ofa-general] [PATCH 1/2 v3] opensm: Storage organization for multicast groups
Message-ID: <4A798D56.2020408@Voltaire.COM>

Subject: [PATCH 1/2] Storage organization for multicast groups

The main purpose is to prepare the infrastructure for compressing (many) mgids to one mlid. The following changes are proposed:

1. An element in the mlid array is now a multicast group holder.
2. mgrp_holder keeps a list of mgroups sharing the same mlid. With the introduction of compression, there will be many multicast groups per mlid. The current implementation keeps a one-mgid-to-one-mlid ratio.
3. mgrp_holder has a map of ports sharing the same mlid, sorted by port guid. The port map is necessary for building a spanning tree per mgroup_holder, not just for a single mgroup.
4. An element in the port map keeps a list of mgroups opened by this port. This allows quick deletion of mgroups when the port changes state to DOWN.
5. Multicast processing functions use the mgroup_holder object instead of mgroup.
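
The layout described in the list above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from the patch: it uses plain singly linked lists instead of the cl_qmap_t/cl_qlist_t containers, and every sketch_* name and the holder_join()/holder_port_down() helpers are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

struct sketch_mgrp {			/* one multicast group (one MGID) */
	uint8_t mgid[16];
	struct sketch_mgrp *next;	/* next group sharing the same MLID */
};

struct sketch_member {			/* one (port, group) membership */
	struct sketch_mgrp *grp;
	struct sketch_member *next;
};

struct sketch_port {			/* one member port of the holder */
	uint64_t port_guid;
	struct sketch_member *joined;	/* groups opened by this port */
	struct sketch_port *next;	/* list kept sorted by port_guid */
};

struct sketch_holder {			/* one element of the MLID array */
	uint16_t mlid;
	struct sketch_mgrp *groups;	/* all groups mapped to this MLID */
	struct sketch_port *ports;
	uint32_t last_change_id;	/* bumped on every join or leave */
};

/* Find the port entry for guid, inserting it in sorted order if absent. */
static struct sketch_port *find_or_add_port(struct sketch_holder *h,
					    uint64_t guid)
{
	struct sketch_port **pp = &h->ports;
	struct sketch_port *p;

	while (*pp && (*pp)->port_guid < guid)
		pp = &(*pp)->next;
	if (*pp && (*pp)->port_guid == guid)
		return *pp;

	p = calloc(1, sizeof *p);
	if (!p)
		return NULL;
	p->port_guid = guid;
	p->next = *pp;
	*pp = p;
	return p;
}

/* Record that port guid joined grp; returns 0 on success, -1 on failure. */
int holder_join(struct sketch_holder *h, uint64_t guid,
		struct sketch_mgrp *grp)
{
	struct sketch_port *p = find_or_add_port(h, guid);
	struct sketch_member *m;

	if (!p)
		return -1;
	m = malloc(sizeof *m);
	if (!m)
		return -1;
	m->grp = grp;
	m->next = p->joined;
	p->joined = m;
	h->last_change_id++;
	return 0;
}

/* Port went DOWN: drop all its memberships; returns how many were dropped. */
int holder_port_down(struct sketch_holder *h, uint64_t guid)
{
	struct sketch_port **pp = &h->ports;
	struct sketch_port *p;
	int dropped = 0;

	while (*pp && (*pp)->port_guid != guid)
		pp = &(*pp)->next;
	if (!*pp)
		return 0;

	p = *pp;
	*pp = p->next;
	while (p->joined) {
		struct sketch_member *m = p->joined;

		p->joined = m->next;
		free(m);
		dropped++;
	}
	free(p);
	h->last_change_id++;
	return dropped;
}
```

Keeping the per-port membership list means a port going DOWN is handled by one lookup in the port list followed by a walk of that port's own memberships, without scanning every group that shares the MLID.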
Signed-off-by: Slava Strebkov --- opensm/include/opensm/osm_multicast.h | 343 +++++++++++++++++++++++++++++--- opensm/include/opensm/osm_sm.h | 10 +- opensm/include/opensm/osm_subnet.h | 38 ++-- opensm/opensm/osm_drop_mgr.c | 14 +- opensm/opensm/osm_mcast_mgr.c | 228 +++++++++++++--------- opensm/opensm/osm_multicast.c | 198 +++++++++++++++++-- opensm/opensm/osm_qos_policy.c | 38 ++-- opensm/opensm/osm_sa.c | 31 +-- opensm/opensm/osm_sa_mcmember_record.c | 94 +++++---- opensm/opensm/osm_sa_path_record.c | 13 +- opensm/opensm/osm_sm.c | 81 +++++++- opensm/opensm/osm_subnet.c | 31 +++- 12 files changed, 855 insertions(+), 264 deletions(-) diff --git a/opensm/include/opensm/osm_multicast.h b/opensm/include/opensm/osm_multicast.h index 9a47de5..61d1ba6 100644 --- a/opensm/include/opensm/osm_multicast.h +++ b/opensm/include/opensm/osm_multicast.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -107,6 +107,82 @@ typedef struct osm_mcast_mgr_ctxt { * * SEE ALSO *********/ +/****s* OpenSM: Multicast Group Holder/osm_mgrp_holder_t +* NAME +* osm_mgrp_holder_t +* +* DESCRIPTION +* Holder for mgroups. +* +* The osm_mgrp_t object should be treated as opaque and should +* be manipulated only through the provided functions. 
+* +* SYNOPSIS +*/ + +typedef struct osm_mgrp_holder { + cl_qmap_t mgrp_port_map; + cl_qlist_t mgrp_list; + osm_mtree_node_t *p_root; + ib_net16_t mlid; + boolean_t to_be_deleted; + uint32_t last_tree_id; + uint32_t last_change_id; +} osm_mgrp_holder_t; + +/* +* FIELDS +* mgrp_port_map +* Map of all ports joined same mlid +* +* mgrp_list +* List of mgroups having same mlid +* +* p_root +* Pointer to the root "tree node" in the single spanning tree +* for this multicast group holder.The nodes of the tree represent +* switches. Member ports are not represented in the tree. +* +* mlid +* mlid of current group holder +* +* to_be_deleted +* Since holders are deleted when there are no mgroups in. +* +* last_change_id +* a counter for the number of changes applied to the group in this holder. +* This counter shuold be incremented on any modification +* to the group: joining or leaving of ports. +* +* last_tree_id +* the last change id used for building the current tree. +*/ + /****s* OpenSM: Multicast group Port /osm_mgrp_port _t +* NAME +* osm_mgrp_port _t +* +* DESCRIPTION +* Holder for pointers to mgroups and port guid. +* +* +* SYNOPSIS +*/ +typedef struct _osm_mgrp_port { + cl_map_item_t guid_item; + cl_qlist_t mgroups; + ib_net64_t port_guid; +} osm_mgrp_port_t; +/* +* FIELDS +* guid_item +* Map for ports. Must be first element +* +* mgroups +* List of mgroups opened by this port. 
+* +* portguid +* guid of port representing current structure +*/ /****s* OpenSM: Multicast Group/osm_mgrp_t * NAME @@ -122,14 +198,13 @@ typedef struct osm_mcast_mgr_ctxt { */ typedef struct osm_mgrp { cl_fmap_item_t map_item; + cl_list_item_t mlid_item; + cl_list_item_t port_item; ib_net16_t mlid; - osm_mtree_node_t *p_root; cl_qmap_t mcm_port_tbl; ib_member_rec_t mcmember_rec; boolean_t well_known; boolean_t to_be_deleted; - uint32_t last_change_id; - uint32_t last_tree_id; unsigned full_members; } osm_mgrp_t; /* @@ -141,10 +216,11 @@ typedef struct osm_mgrp { * The network ordered LID of this Multicast Group (must be * >= 0xC000). * -* p_root -* Pointer to the root "tree node" in the single spanning tree -* for this multicast group. The nodes of the tree represent -* switches. Member ports are not represented in the tree. +* mlid_item +* List item for groups with same MLID +* +* port_item +* List item for groups opened on same port * * mcm_port_tbl * Table (sorted by port GUID) of osm_mcm_port_t objects @@ -163,14 +239,6 @@ typedef struct osm_mgrp { * track the fact the group is about to be deleted so we can * track the fact a new join is actually a create request. * -* last_change_id -* a counter for the number of changes applied to the group. -* This counter shuold be incremented on any modification -* to the group: joining or leaving of ports. -* -* last_tree_id -* the last change id used for building the current tree. -* * SEE ALSO *********/ @@ -456,30 +524,111 @@ osm_mgrp_delete_port(IN osm_subn_t * const p_subn, int osm_mgrp_remove_port(osm_subn_t *subn, osm_log_t *log, osm_mgrp_t *mgrp, osm_mcm_port_t *mcm, uint8_t join_state); -/****f* OpenSM: Multicast Group/osm_mgrp_apply_func +/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_new * NAME -* osm_mgrp_apply_func +* osm_mgrp_holder_new * * DESCRIPTION -* Calls the specified function for each element in the tree. -* Elements are passed to the callback function in no particular order. 
+* Allocates and initializes a Multicast Group Holder for use. * * SYNOPSIS */ -void -osm_mgrp_apply_func(const osm_mgrp_t * const p_mgrp, - osm_mgrp_func_t p_func, void *context); +osm_mgrp_holder_t *osm_mgrp_holder_new(IN osm_subn_t * p_subn, + IN ib_net16_t mlid); +/* +* PARAMETERS +* p_subn +* (in) pointer to osm_subnet +* mlid +* [in] Multicast LID for this multicast group holder. +* +* RETURN VALUES +* pointer to initialized osm_mgrp_holder_t +* or NULL, if unsuccessful +* +* SEE ALSO +* Multicast Group Holder, osm_mgrp_holder_delete +*********/ +/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_delete +* NAME +* osm_mgrp_holder_delete +* +* DESCRIPTION +* Removes entry from array of holders +* Removes port from mgroup port list +* +* SYNOPSIS +*/ +void osm_mgrp_holder_delete(IN osm_subn_t * p_subn, + IN ib_net16_t mlid); + /* * PARAMETERS +* +* p_subn +* [in] Pointer to osm_subnet +* +* mlid +* [in] holder's mlid +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +* +*********/ +/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_add_mgrp_port +* NAME +* osm_mgrp_holder_port_add_mgrp +* +* DESCRIPTION +* Allocates osm_mgrp_port_t for new port joined to mgroup with mlid of this holder, +* and adds mgroup to mgroup map of existed osm_mgrp_port_t object. +* +* SYNOPSIS +*/ +ib_api_status_t osm_mgrp_holder_port_add_mgrp(IN osm_mgrp_holder_t * + p_mgrp_holder, + IN osm_mgrp_t * p_mgrp, + IN ib_net64_t port_guid); +/* +* PARAMETERS +* p_mgrp_holder +* (in) pointer to osm_mgrp_holder_t * p_mgrp -* [in] Pointer to an osm_mgrp_t object. +* (in) pointer to osm_mgrp_t * -* p_func -* [in] Pointer to the users callback function. +* RETURN VALUES +* IB_SUCCESS or +* IB_INSUFFICIENT_MEMORY * -* context -* [in] User context passed to the callback function. 
+* SEE ALSO +* Multicast Group Holder, osm_mgrp_holder_delete_mgrp_port +*********/ +/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_delete_mgrp_port +* NAME +* osm_mgrp_holder_port_delete_mgrp * +* DESCRIPTION +* Deletes osm_mgrp_port_t for specified port +* +* SYNOPSIS +*/ +void osm_mgrp_holder_port_delete_mgrp(IN osm_mgrp_holder_t * p_mgrp_holder, + IN osm_mgrp_t * p_mgrp, + IN ib_net64_t port_guid); +/* +* PARAMETERS +* p_mgrp_holder +* [in] Pointer to an osm_mgrp_holder_t object. +* +* p_mgrp +* (in) Pointer to osm_mgrp_t object +* +* port_guid +* [in] Port guid of the departing port. * * RETURN VALUES * None. @@ -487,8 +636,144 @@ osm_mgrp_apply_func(const osm_mgrp_t * const p_mgrp, * NOTES * * SEE ALSO -* Multicast Group +Multicast Group Holder,osm_holder_add_mgrp_port +*********/ +/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_add_mgrp +* NAME +* osm_mgrp_holder_add_mgrp +* +* DESCRIPTION +* Adds mgroup to holder according to its mgid +* +* +* SYNOPSIS +*/ +void osm_mgrp_holder_add_mgrp(IN osm_mgrp_holder_t * p_mgrp_holder, + IN osm_mgrp_t * p_mgrp, + IN osm_log_t * const p_log); +/* +* PARAMETERS +* +* p_mgrp_holder +* [in] Pointer to an osm_mgrp_holder_t object. +* +* p_mgrp +* [in] mgroup to add. +* +* RETURN VALUES +* None. +* +* NOTES +* Updates common_mgid when holder is being reused +* SEE ALSO +* Multicast Group Holder,osm_mgrp_holder_delete_mgrp +*********/ +/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_delete_mgrp +* NAME +* osm_mgrp_holder_delete_mgrp +* +* DESCRIPTION +* Deletes mgroup from holder according to its mgid +* +* +* SYNOPSIS +*/ +void osm_mgrp_holder_delete_mgrp(IN osm_mgrp_holder_t * p_mgrp_holder, + IN osm_mgrp_t * p_mgrp); +/* +* PARAMETERS +* +* p_mgrp_holder +* [in] Pointer to an osm_mgrp_holder_t object. +* +* p_mgrp +* [in] mgroup to delete. +* +* RETURN VALUES +* None. 
+* +* NOTES +* +* SEE ALSO +* Multicast Group Holder,osm_mgrp_holder_add_mgrp *********/ +/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_remove_port +* NAME +* osm_mgrp_holder_remove_port +* +* DESCRIPTION +* Removes osm_mgrp_port_t from mgrp_port_map of holder +* Removes port from mgroup port list +* +* SYNOPSIS +*/ +void osm_mgrp_holder_remove_port(IN osm_subn_t * const p_subn, + IN osm_log_t * const p_log, + IN osm_mgrp_holder_t * const p_mgrp_holder, + IN const ib_net64_t port_guid); +/* +* PARAMETERS +* +* p_subn +* [in] Pointer to the subnet object +* +* p_log +* [in] The log object pointer +* +* p_mgrp_holder +* [in] Pointer to an osm_mgrp_holder_t object. +* +* port_guid +* [in] Port guid of the departing port. +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +* +*********/ +/****f* OpenSM: Subnet/osm_get_mgrp_by_mlid +* NAME +* osm_get_mgrp_by_mlid +* +* DESCRIPTION +* The looks for the given multicast group in the subnet table by mlid. +* NOTE: this code is not thread safe. Need to grab the lock before +* calling it. +* +* SYNOPSIS +*/ +static inline struct osm_mgrp_holder *osm_get_mgrp_holder_by_mlid(osm_subn_t const + *p_subn, + ib_net16_t mlid) +{ + return p_subn->mgroup_holders[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO]; +} +/* +* PARAMETERS +* p_subn +* [in] Pointer to an osm_subn_t object +* +* mlid +* [in] The multicast group mlid in network order +* +* RETURN VALUES +* The multicast group structure pointer if found. NULL otherwise. 
+*********/ +static inline ib_net16_t osm_mgrp_holder_get_mlid(IN osm_mgrp_holder_t * + const p_mgrp_holder) +{ + return (p_mgrp_holder->mlid); +} + +static inline boolean_t osm_mgrp_holder_is_empty(IN const osm_mgrp_holder_t * + const p_mgrp_holder) +{ + return (cl_qmap_count(&p_mgrp_holder->mgrp_port_map) == 0); +} + END_C_DECLS #endif /* _OSM_MULTICAST_H_ */ diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h index cc8321d..7f898ad 100644 --- a/opensm/include/opensm/osm_sm.h +++ b/opensm/include/opensm/osm_sm.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -61,6 +61,7 @@ #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -539,7 +540,8 @@ osm_resp_send(IN osm_sm_t * sm, ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * const p_sm, IN const ib_net16_t mlid, - IN const ib_net64_t port_guid); + IN const ib_net64_t port_guid, + IN const ib_gid_t * p_mgid); /* * PARAMETERS * p_sm @@ -551,6 +553,8 @@ osm_sm_mcgrp_join(IN osm_sm_t * const p_sm, * port_guid * [in] Port GUID to add to the group. * +* p_mgid +* [in] MGID to add to the group holder. * RETURN VALUES * None * @@ -572,7 +576,7 @@ osm_sm_mcgrp_join(IN osm_sm_t * const p_sm, */ ib_api_status_t osm_sm_mcgrp_leave(IN osm_sm_t * const p_sm, - IN const ib_net16_t mlid, IN const ib_net64_t port_guid); + IN osm_mgrp_t * p_mgrp, IN ib_net64_t port_guid); /* * PARAMETERS * p_sm diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 6c20de8..fad8780 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. 
All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. @@ -513,7 +513,7 @@ typedef struct osm_subn { boolean_t coming_out_of_standby; unsigned need_update; cl_fmap_t mgrp_mgid_tbl; - void *mgroups[IB_LID_MCAST_END_HO - IB_LID_MCAST_START_HO + 1]; + void *mgroup_holders[IB_LID_MCAST_END_HO - IB_LID_MCAST_START_HO + 1]; } osm_subn_t; /* * FIELDS @@ -634,8 +634,8 @@ typedef struct osm_subn { * This flag should be on during first non-master heavy * (including pre-master discovery stage) * -* mgroups -* Array of pointers to all Multicast Group objects in the subnet. +* mgroup_holders +* Array of pointers to all Multicast Group Holder objects in the subnet. * Indexed by MLID offset from base MLID. * * SEE ALSO @@ -935,32 +935,34 @@ struct osm_port *osm_get_port_by_guid(IN osm_subn_t const *p_subn, * osm_port_t *********/ -/****f* OpenSM: Subnet/osm_get_mgrp_by_mlid +/****f* OpenSM: Multicast Group Holder /osm_mgrp_holder_get_mlid_by_mgid * NAME -* osm_get_mgrp_by_mlid +* osm_mgrp_holder_get_mlid_by_mgid * * DESCRIPTION -* The looks for the given multicast group in the subnet table by mlid. -* NOTE: this code is not thread safe. Need to grab the lock before -* calling it. +* Searches mgroup with given mgid +* Returns mlid of the found mgroup * * SYNOPSIS */ -static inline -struct osm_mgrp *osm_get_mgrp_by_mlid(osm_subn_t const *p_subn, ib_net16_t mlid) -{ - return p_subn->mgroups[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO]; -} +ib_net16_t osm_mgrp_holder_get_mlid_by_mgid(IN osm_subn_t const *p_subn, + IN const ib_gid_t * const p_mgid); /* * PARAMETERS +* * p_subn -* [in] Pointer to an osm_subn_t object +* [in] Pointer to osm_subn_t object * -* mlid -* [in] The multicast group mlid in network order +* p_mgid +* [in] pointer to mgid * * RETURN VALUES -* The multicast group structure pointer if found. 
NULL otherwise. +* mlid of found holder, or zero. +* +* NOTES +* +* SEE ALSO +* *********/ /****f* OpenSM: Helper/osm_get_physp_by_mad_addr diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c index c9a4f33..e1f2bd3 100644 --- a/opensm/opensm/osm_drop_mgr.c +++ b/opensm/opensm/osm_drop_mgr.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. @@ -158,7 +158,6 @@ static void drop_mgr_remove_port(osm_sm_t * sm, IN osm_port_t * p_port) osm_port_t *p_port_check; cl_qmap_t *p_sm_guid_tbl; osm_mcm_info_t *p_mcm; - osm_mgrp_t *p_mgrp; cl_ptr_vector_t *p_port_lid_tbl; uint16_t min_lid_ho; uint16_t max_lid_ho; @@ -168,6 +167,7 @@ static void drop_mgr_remove_port(osm_sm_t * sm, IN osm_port_t * p_port) ib_gid_t port_gid; ib_mad_notice_attr_t notice; ib_api_status_t status; + osm_mgrp_holder_t *p_mgrp_holder; OSM_LOG_ENTER(sm->p_log); @@ -212,10 +212,12 @@ static void drop_mgr_remove_port(osm_sm_t * sm, IN osm_port_t * p_port) p_mcm = (osm_mcm_info_t *) cl_qlist_remove_head(&p_port->mcm_list); while (p_mcm != (osm_mcm_info_t *) cl_qlist_end(&p_port->mcm_list)) { - p_mgrp = osm_get_mgrp_by_mlid(sm->p_subn, p_mcm->mlid); - if (p_mgrp) { - osm_mgrp_delete_port(sm->p_subn, sm->p_log, - p_mgrp, p_port->guid); + p_mgrp_holder = + osm_get_mgrp_holder_by_mlid(sm->p_subn, p_mcm->mlid); + if (p_mgrp_holder) { + osm_mgrp_holder_remove_port(sm->p_subn, sm->p_log, + p_mgrp_holder, + p_port->guid); osm_mcm_info_delete((osm_mcm_info_t *) p_mcm); } p_mcm = diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c index 4dbbaa0..f506393 100644 --- a/opensm/opensm/osm_mcast_mgr.c +++ b/opensm/opensm/osm_mcast_mgr.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 
Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. @@ -55,6 +55,7 @@ #include #include #include +#include /********************************************************************** **********************************************************************/ @@ -111,14 +112,15 @@ static void mcast_mgr_purge_tree_node(IN osm_mtree_node_t * p_mtn) /********************************************************************** **********************************************************************/ -static void mcast_mgr_purge_tree(osm_sm_t * sm, IN osm_mgrp_t * p_mgrp) +static void mcast_mgr_purge_tree(osm_sm_t * sm, + IN osm_mgrp_holder_t * p_mgrp_holder) { OSM_LOG_ENTER(sm->p_log); - if (p_mgrp->p_root) - mcast_mgr_purge_tree_node(p_mgrp->p_root); + if (p_mgrp_holder->p_root) + mcast_mgr_purge_tree_node(p_mgrp_holder->p_root); - p_mgrp->p_root = NULL; + p_mgrp_holder->p_root = NULL; OSM_LOG_EXIT(sm->p_log); } @@ -126,41 +128,40 @@ static void mcast_mgr_purge_tree(osm_sm_t * sm, IN osm_mgrp_t * p_mgrp) /********************************************************************** **********************************************************************/ static float osm_mcast_mgr_compute_avg_hops(osm_sm_t * sm, - const osm_mgrp_t * p_mgrp, + const osm_mgrp_holder_t * + p_mgrp_holder, const osm_switch_t * p_sw) { float avg_hops = 0; uint32_t hops = 0; uint32_t num_ports = 0; const osm_port_t *p_port; - const osm_mcm_port_t *p_mcm_port; - const cl_qmap_t *p_mcm_tbl; + const osm_mgrp_port_t *p_holder_port; OSM_LOG_ENTER(sm->p_log); - p_mcm_tbl = &p_mgrp->mcm_port_tbl; /* For each member of the multicast group, compute the number of hops to its base LID. 
*/ - for (p_mcm_port = (osm_mcm_port_t *) cl_qmap_head(p_mcm_tbl); - p_mcm_port != (osm_mcm_port_t *) cl_qmap_end(p_mcm_tbl); - p_mcm_port = - (osm_mcm_port_t *) cl_qmap_next(&p_mcm_port->map_item)) { + for (p_holder_port = + (osm_mgrp_port_t *) cl_qmap_head(&p_mgrp_holder->mgrp_port_map); + p_holder_port != + (osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map); + p_holder_port = + (osm_mgrp_port_t *) cl_qmap_next(&p_holder_port->guid_item)) { /* Acquire the port object for this port guid, then create the new worker object to build the list. */ p_port = osm_get_port_by_guid(sm->p_subn, - ib_gid_get_guid(&p_mcm_port-> - port_gid)); + p_holder_port->port_guid); if (!p_port) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A18: " "No port object for port 0x%016" PRIx64 "\n", - cl_ntoh64(ib_gid_get_guid - (&p_mcm_port->port_gid))); + cl_ntoh64(p_holder_port->port_guid)); continue; } @@ -185,40 +186,39 @@ static float osm_mcast_mgr_compute_avg_hops(osm_sm_t * sm, of the group HCAs **********************************************************************/ static float osm_mcast_mgr_compute_max_hops(osm_sm_t * sm, - const osm_mgrp_t * p_mgrp, + const osm_mgrp_holder_t * + p_mgrp_holder, const osm_switch_t * p_sw) { uint32_t max_hops = 0; uint32_t hops = 0; const osm_port_t *p_port; - const osm_mcm_port_t *p_mcm_port; - const cl_qmap_t *p_mcm_tbl; + const osm_mgrp_port_t *p_mgrp_holder_port; OSM_LOG_ENTER(sm->p_log); - p_mcm_tbl = &p_mgrp->mcm_port_tbl; /* For each member of the multicast group, compute the number of hops to its base LID. 
*/ - for (p_mcm_port = (osm_mcm_port_t *) cl_qmap_head(p_mcm_tbl); - p_mcm_port != (osm_mcm_port_t *) cl_qmap_end(p_mcm_tbl); - p_mcm_port = - (osm_mcm_port_t *) cl_qmap_next(&p_mcm_port->map_item)) { + for (p_mgrp_holder_port = + (osm_mgrp_port_t *) cl_qmap_head(&p_mgrp_holder->mgrp_port_map); + p_mgrp_holder_port != + (osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map); + p_mgrp_holder_port = + (osm_mgrp_port_t *) cl_qmap_next(&p_mgrp_holder_port->guid_item)) { /* Acquire the port object for this port guid, then create the new worker object to build the list. */ p_port = osm_get_port_by_guid(sm->p_subn, - ib_gid_get_guid(&p_mcm_port-> - port_gid)); + p_mgrp_holder_port->port_guid); if (!p_port) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A1A: " "No port object for port 0x%016" PRIx64 "\n", - cl_ntoh64(ib_gid_get_guid - (&p_mcm_port->port_gid))); + cl_ntoh64(p_mgrp_holder_port->port_guid)); continue; } @@ -244,7 +244,8 @@ static float osm_mcast_mgr_compute_max_hops(osm_sm_t * sm, of the multicast group. 
**********************************************************************/ static osm_switch_t *mcast_mgr_find_optimal_switch(osm_sm_t * sm, - const osm_mgrp_t * p_mgrp) + const osm_mgrp_holder_t * + p_mgrp_holder) { cl_qmap_t *p_sw_tbl; const osm_switch_t *p_sw; @@ -252,7 +253,7 @@ static osm_switch_t *mcast_mgr_find_optimal_switch(osm_sm_t * sm, float hops = 0; float best_hops = 10000; /* any big # will do */ #ifdef OSM_VENDOR_INTF_ANAFA - boolean_t use_avg_hops = TRUE; /* anafa2 - bug hca on switch *//* use max hops for root */ + boolean_t use_avg_hops = TRUE; /* anafa2 - bug hca on switch *//* use max hops for root */ #else boolean_t use_avg_hops = FALSE; /* use max hops for root */ #endif @@ -261,7 +262,7 @@ static osm_switch_t *mcast_mgr_find_optimal_switch(osm_sm_t * sm, p_sw_tbl = &sm->p_subn->sw_guid_tbl; - CL_ASSERT(!osm_mgrp_is_empty(p_mgrp)); + CL_ASSERT(!osm_mgrp_holder_is_empty(p_mgrp_holder)); for (p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl); @@ -270,9 +271,13 @@ static osm_switch_t *mcast_mgr_find_optimal_switch(osm_sm_t * sm, continue; if (use_avg_hops) - hops = osm_mcast_mgr_compute_avg_hops(sm, p_mgrp, p_sw); + hops = + osm_mcast_mgr_compute_avg_hops(sm, p_mgrp_holder, + p_sw); else - hops = osm_mcast_mgr_compute_max_hops(sm, p_mgrp, p_sw); + hops = + osm_mcast_mgr_compute_max_hops(sm, p_mgrp_holder, + p_sw); OSM_LOG(sm->p_log, OSM_LOG_DEBUG, "Switch 0x%016" PRIx64 ", hops = %f\n", @@ -301,7 +306,8 @@ static osm_switch_t *mcast_mgr_find_optimal_switch(osm_sm_t * sm, This function returns the existing or optimal root swtich for the tree. 
**********************************************************************/ static osm_switch_t *mcast_mgr_find_root_switch(osm_sm_t * sm, - const osm_mgrp_t * p_mgrp) + const osm_mgrp_holder_t * + p_mgrp_holder) { const osm_switch_t *p_sw = NULL; @@ -313,7 +319,7 @@ static osm_switch_t *mcast_mgr_find_root_switch(osm_sm_t * sm, the root will be always on the first switch attached to it. - Very bad ... */ - p_sw = mcast_mgr_find_optimal_switch(sm, p_mgrp); + p_sw = mcast_mgr_find_optimal_switch(sm, p_mgrp_holder); OSM_LOG_EXIT(sm->p_log); return (osm_switch_t *) p_sw; @@ -393,7 +399,8 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw) spanning tree that eminate from this switch. On input, the p_list contains the group members that must be routed from this switch. **********************************************************************/ -static void mcast_mgr_subdivide(osm_sm_t * sm, osm_mgrp_t * p_mgrp, +static void mcast_mgr_subdivide(osm_sm_t * sm, + osm_mgrp_holder_t * p_mgrp_holder, osm_switch_t * p_sw, cl_qlist_t * p_list, cl_qlist_t * list_array, uint8_t array_size) { @@ -404,7 +411,7 @@ static void mcast_mgr_subdivide(osm_sm_t * sm, osm_mgrp_t * p_mgrp, OSM_LOG_ENTER(sm->p_log); - mlid_ho = cl_ntoh16(osm_mgrp_get_mlid(p_mgrp)); + mlid_ho = cl_ntoh16(osm_mgrp_holder_get_mlid(p_mgrp_holder)); /* For Multicast Groups, we want not to count on previous @@ -494,7 +501,8 @@ static void mcast_mgr_purge_list(osm_sm_t * sm, cl_qlist_t * p_list) The function returns the newly created mtree node element. 
**********************************************************************/ -static osm_mtree_node_t *mcast_mgr_branch(osm_sm_t * sm, osm_mgrp_t * p_mgrp, +static osm_mtree_node_t *mcast_mgr_branch(osm_sm_t * sm, + osm_mgrp_holder_t * p_mgrp_holder, osm_switch_t * p_sw, cl_qlist_t * p_list, uint8_t depth, uint8_t upstream_port, @@ -520,7 +528,7 @@ static osm_mtree_node_t *mcast_mgr_branch(osm_sm_t * sm, osm_mgrp_t * p_mgrp, node_guid = osm_node_get_node_guid(p_sw->p_node); node_guid_ho = cl_ntoh64(node_guid); - mlid_ho = cl_ntoh16(osm_mgrp_get_mlid(p_mgrp)); + mlid_ho = cl_ntoh16(osm_mgrp_holder_get_mlid(p_mgrp_holder)); OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, "Routing MLID 0x%X through switch 0x%" PRIx64 @@ -597,7 +605,8 @@ static osm_mtree_node_t *mcast_mgr_branch(osm_sm_t * sm, osm_mgrp_t * p_mgrp, for (i = 0; i < max_children; i++) cl_qlist_init(&list_array[i]); - mcast_mgr_subdivide(sm, p_mgrp, p_sw, p_list, list_array, max_children); + mcast_mgr_subdivide(sm, p_mgrp_holder, p_sw, p_list, list_array, + max_children); p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); @@ -680,8 +689,9 @@ static osm_mtree_node_t *mcast_mgr_branch(osm_sm_t * sm, osm_mgrp_t * p_mgrp, CL_ASSERT(p_remote_physp); p_mtn->child_array[i] = - mcast_mgr_branch(sm, p_mgrp, p_remote_node->sw, - p_port_list, depth, + mcast_mgr_branch(sm, p_mgrp_holder, + p_remote_node->sw, p_port_list, + depth, osm_physp_get_port_num (p_remote_physp), p_max_depth); } else { @@ -716,11 +726,11 @@ Exit: /********************************************************************** **********************************************************************/ static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm, - osm_mgrp_t * p_mgrp) + osm_mgrp_holder_t * + p_mgrp_holder) { - const cl_qmap_t *p_mcm_tbl; const osm_port_t *p_port; - const osm_mcm_port_t *p_mcm_port; + const osm_mgrp_port_t *p_mgrp_port; uint32_t num_ports; cl_qlist_t port_list; osm_switch_t *p_sw; @@ -739,14 +749,13 @@ static ib_api_status_t 
mcast_mgr_build_spanning_tree(osm_sm_t * sm, on multicast forwarding table information if the user wants to preserve existing multicast routes. */ - mcast_mgr_purge_tree(sm, p_mgrp); + mcast_mgr_purge_tree(sm, p_mgrp_holder); - p_mcm_tbl = &p_mgrp->mcm_port_tbl; - num_ports = cl_qmap_count(p_mcm_tbl); + num_ports = cl_qmap_count(&p_mgrp_holder->mgrp_port_map); if (num_ports == 0) { OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, "MLID 0x%X has no members - nothing to do\n", - cl_ntoh16(osm_mgrp_get_mlid(p_mgrp))); + cl_ntoh16(osm_mgrp_holder_get_mlid(p_mgrp_holder))); goto Exit; } @@ -766,11 +775,11 @@ static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm, Locate the switch around which to create the spanning tree for this multicast group. */ - p_sw = mcast_mgr_find_root_switch(sm, p_mgrp); + p_sw = mcast_mgr_find_root_switch(sm, p_mgrp_holder); if (p_sw == NULL) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A08: " "Unable to locate a suitable switch for group 0x%X\n", - cl_ntoh16(osm_mgrp_get_mlid(p_mgrp))); + cl_ntoh16(osm_mgrp_holder_get_mlid(p_mgrp_holder))); status = IB_ERROR; goto Exit; } @@ -778,22 +787,22 @@ static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm, /* Build the first "subset" containing all member ports. */ - for (p_mcm_port = (osm_mcm_port_t *) cl_qmap_head(p_mcm_tbl); - p_mcm_port != (osm_mcm_port_t *) cl_qmap_end(p_mcm_tbl); - p_mcm_port = - (osm_mcm_port_t *) cl_qmap_next(&p_mcm_port->map_item)) { + for (p_mgrp_port = + (osm_mgrp_port_t *) cl_qmap_head(&p_mgrp_holder->mgrp_port_map); + p_mgrp_port != + (osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map); + p_mgrp_port = + (osm_mgrp_port_t *) cl_qmap_next(&p_mgrp_port->guid_item)) { /* Acquire the port object for this port guid, then create the new worker object to build the list. 
*/ - p_port = osm_get_port_by_guid(sm->p_subn, - ib_gid_get_guid(&p_mcm_port-> - port_gid)); + p_port = + osm_get_port_by_guid(sm->p_subn, p_mgrp_port->port_guid); if (!p_port) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A09: " "No port object for port 0x%016" PRIx64 "\n", - cl_ntoh64(ib_gid_get_guid - (&p_mcm_port->port_gid))); + cl_ntoh64(p_mgrp_port->port_guid)); continue; } @@ -801,8 +810,7 @@ static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm, if (p_wobj == NULL) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A10: " "Insufficient memory to route port 0x%016" - PRIx64 "\n", - cl_ntoh64(osm_port_get_guid(p_port))); + PRIx64 "\n", cl_ntoh64(p_mgrp_port->port_guid)); continue; } @@ -810,12 +818,14 @@ static ib_api_status_t mcast_mgr_build_spanning_tree(osm_sm_t * sm, } count = cl_qlist_count(&port_list); - p_mgrp->p_root = mcast_mgr_branch(sm, p_mgrp, p_sw, &port_list, 0, 0, - &max_depth); + p_mgrp_holder->p_root = + mcast_mgr_branch(sm, p_mgrp_holder, p_sw, &port_list, 0, 0, + &max_depth); OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, "Configured MLID 0x%X for %u ports, max tree depth = %u\n", - cl_ntoh16(osm_mgrp_get_mlid(p_mgrp)), count, max_depth); + cl_ntoh16(osm_mgrp_holder_get_mlid(p_mgrp_holder)), count, + max_depth); Exit: OSM_LOG_EXIT(sm->p_log); @@ -1023,17 +1033,20 @@ Exit: NOTE : The lock should be held externally! 
**********************************************************************/ static ib_api_status_t mcast_mgr_process_mgrp(osm_sm_t * sm, - IN osm_mgrp_t * p_mgrp) + IN osm_mgrp_holder_t * p_mgrp_holder) { ib_api_status_t status = IB_SUCCESS; ib_net16_t mlid; + osm_mgrp_t *p_mgrp; + cl_list_item_t *p_item; + unsigned has_full_members = 0; OSM_LOG_ENTER(sm->p_log); - mlid = osm_mgrp_get_mlid(p_mgrp); + mlid = osm_mgrp_holder_get_mlid(p_mgrp_holder); OSM_LOG(sm->p_log, OSM_LOG_DEBUG, - "Processing multicast group 0x%X\n", cl_ntoh16(mlid)); + "Processing multicast group_holder 0x%X\n", cl_ntoh16(mlid)); /* Clear the multicast tables to start clean, then build @@ -1042,27 +1055,52 @@ static ib_api_status_t mcast_mgr_process_mgrp(osm_sm_t * sm, */ mcast_mgr_clear(sm, cl_ntoh16(mlid)); - if (p_mgrp->full_members) { - status = mcast_mgr_build_spanning_tree(sm, p_mgrp); + p_item = cl_qlist_head(&p_mgrp_holder->mgrp_list); + while (p_item != cl_qlist_end(&p_mgrp_holder->mgrp_list)) { + char gid_str[INET6_ADDRSTRLEN]; + p_mgrp = (osm_mgrp_t *) + PARENT_STRUCT(p_item, osm_mgrp_t, mlid_item); + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, + "MLID 0x%x has mgrp %s\n",cl_ntoh16(p_mgrp->mlid), + inet_ntop(AF_INET6, + p_mgrp->mcmember_rec.mgid.raw, + gid_str, sizeof(gid_str))); + p_item = cl_qlist_next(p_item); + if (p_mgrp->to_be_deleted) { + osm_mcm_port_t *p_mcm_port; + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, + "Destroying mgrp %s with lid:0x%x\n", + inet_ntop(AF_INET6, + p_mgrp->mcmember_rec.mgid.raw, + gid_str, sizeof(gid_str)), + cl_ntoh16(p_mgrp->mlid)); + osm_mgrp_holder_delete_mgrp(p_mgrp_holder, p_mgrp); + p_mcm_port = (osm_mcm_port_t *) cl_qmap_head(&p_mgrp->mcm_port_tbl); + while (p_mcm_port != + (osm_mcm_port_t *) cl_qmap_end(&p_mgrp->mcm_port_tbl)) { + osm_mgrp_holder_port_delete_mgrp(p_mgrp_holder, p_mgrp, + p_mcm_port->port_gid.unicast.interface_id); + p_mcm_port = + (osm_mcm_port_t *) cl_qmap_next(&p_mcm_port->map_item); + } + cl_fmap_remove_item(&sm->p_subn->mgrp_mgid_tbl, + 
&p_mgrp->map_item); + osm_mgrp_delete(p_mgrp); + } + else if (!has_full_members) + has_full_members = p_mgrp->full_members; + } + if (has_full_members) { + status = mcast_mgr_build_spanning_tree(sm, p_mgrp_holder); if (status != IB_SUCCESS) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " "Unable to create spanning tree (%s)\n", ib_get_err_str(status)); goto Exit; } - } else if (p_mgrp->to_be_deleted) { - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, - "Destroying mgrp with lid:0x%x\n", - cl_ntoh16(p_mgrp->mlid)); - sm->p_subn->mgroups[cl_ntoh16(p_mgrp->mlid) - - IB_LID_MCAST_START_HO] = NULL; - cl_fmap_remove_item(&sm->p_subn->mgrp_mgid_tbl, - &p_mgrp->map_item); - osm_mgrp_delete(p_mgrp); - goto Exit; + p_mgrp_holder->last_tree_id = p_mgrp_holder->last_change_id; } - p_mgrp->last_tree_id = p_mgrp->last_change_id; Exit: OSM_LOG_EXIT(sm->p_log); @@ -1076,7 +1114,7 @@ int osm_mcast_mgr_process(osm_sm_t * sm) osm_switch_t *p_sw; cl_qmap_t *p_sw_tbl; cl_qlist_t *p_list = &sm->mgrp_list; - osm_mgrp_t *p_mgrp; + osm_mgrp_holder_t *p_mgrp_holder; int i, ret = 0; OSM_LOG_ENTER(sm->p_log); @@ -1104,9 +1142,10 @@ int osm_mcast_mgr_process(osm_sm_t * sm) of the subnet. Not due to a specific multicast request. So the request type is subnet_change and the port guid is 0. 
*/ - p_mgrp = sm->p_subn->mgroups[i]; - if (p_mgrp) - mcast_mgr_process_mgrp(sm, p_mgrp); + p_mgrp_holder = sm->p_subn->mgroup_holders[i]; + if (p_mgrp_holder) { + mcast_mgr_process_mgrp(sm, p_mgrp_holder); + } } /* @@ -1141,7 +1180,7 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm) cl_qlist_t *p_list = &sm->mgrp_list; osm_switch_t *p_sw; cl_qmap_t *p_sw_tbl; - osm_mgrp_t *p_mgrp; + osm_mgrp_holder_t *p_mgrp_holder; ib_net16_t mlid; osm_mcast_mgr_ctxt_t *ctx; int ret = 0; @@ -1169,24 +1208,25 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm) /* since we delayed the execution we prefer to pass the mlid as the mgrp identifier and then find it or abort */ - p_mgrp = osm_get_mgrp_by_mlid(sm->p_subn, mlid); - if (!p_mgrp) + p_mgrp_holder = osm_get_mgrp_holder_by_mlid(sm->p_subn, mlid); + if (!p_mgrp_holder) continue; /* if there was no change from the last time * we processed the group we can skip doing anything */ - if (p_mgrp->last_change_id == p_mgrp->last_tree_id) { + if (p_mgrp_holder->last_change_id == + p_mgrp_holder->last_tree_id) { OSM_LOG(sm->p_log, OSM_LOG_DEBUG, - "Skip processing mgrp with lid:0x%X change id:%u\n", - cl_ntoh16(mlid), p_mgrp->last_change_id); + "Skip processing p_mgrp_holder with lid:0x%X change id:%u\n", + cl_ntoh16(mlid), p_mgrp_holder->last_change_id); continue; } OSM_LOG(sm->p_log, OSM_LOG_DEBUG, "Processing mgrp with lid:0x%X change id:%u\n", - cl_ntoh16(mlid), p_mgrp->last_change_id); - mcast_mgr_process_mgrp(sm, p_mgrp); + cl_ntoh16(mlid), p_mgrp_holder->last_change_id); + mcast_mgr_process_mgrp(sm, p_mgrp_holder); } /* diff --git a/opensm/opensm/osm_multicast.c b/opensm/opensm/osm_multicast.c index d2733c4..072b591 100644 --- a/opensm/opensm/osm_multicast.c +++ b/opensm/opensm/osm_multicast.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. 
* Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -48,6 +48,7 @@ #include #include #include +#include /********************************************************************** **********************************************************************/ @@ -67,8 +68,6 @@ void osm_mgrp_delete(IN osm_mgrp_t * p_mgrp) (osm_mcm_port_t *) cl_qmap_next(&p_mcm_port->map_item); osm_mcm_port_delete(p_mcm_port); } - /* destroy the mtree_node structure */ - osm_mtree_destroy(p_mgrp->p_root); free(p_mgrp); } @@ -86,9 +85,6 @@ osm_mgrp_t *osm_mgrp_new(IN const ib_net16_t mlid) memset(p_mgrp, 0, sizeof(*p_mgrp)); cl_qmap_init(&p_mgrp->mcm_port_tbl); p_mgrp->mlid = mlid; - p_mgrp->last_change_id = 0; - p_mgrp->last_tree_id = 0; - p_mgrp->to_be_deleted = FALSE; return p_mgrp; } @@ -133,6 +129,7 @@ osm_mcm_port_t *osm_mgrp_add_port(IN osm_subn_t * subn, osm_log_t * log, ib_net64_t port_guid; osm_mcm_port_t *p_mcm_port; cl_map_item_t *prev_item; + osm_mgrp_holder_t *p_mgrp_holder; uint8_t prev_join_state = 0; uint8_t prev_scope; @@ -167,9 +164,18 @@ osm_mcm_port_t *osm_mgrp_add_port(IN osm_subn_t * subn, osm_log_t * log, p_mcm_port->scope_state = ib_member_set_scope_state(prev_scope, prev_join_state | join_state); - } else { - /* track the fact we modified the group ports */ - p_mgrp->last_change_id++; + } + + p_mgrp_holder = osm_get_mgrp_holder_by_mlid(subn, p_mgrp->mlid); + if (! p_mgrp_holder || + (IB_SUCCESS != osm_mgrp_holder_port_add_mgrp(p_mgrp_holder, + p_mgrp, port_guid)) ) { + /* if the above failed and added port is new one, remove port also from mcm_port_tbl */ + if (! 
prev_join_state) { + cl_qmap_remove_item(&p_mgrp->mcm_port_tbl, &p_mcm_port->map_item); + osm_mcm_port_delete(p_mcm_port); + } + return NULL; } if ((join_state & IB_JOIN_STATE_FULL) && @@ -212,7 +218,6 @@ int osm_mgrp_remove_port(osm_subn_t * subn, osm_log_t * log, osm_mgrp_t * mgrp, cl_ntoh64(mcm->port_gid.unicast.interface_id)); osm_mcm_port_delete(mcm); /* track the fact we modified the group */ - mgrp->last_change_id++; ret = 1; } @@ -285,16 +290,173 @@ static void mgrp_apply_func_sub(const osm_mgrp_t * p_mgrp, /********************************************************************** **********************************************************************/ -void osm_mgrp_apply_func(const osm_mgrp_t * p_mgrp, osm_mgrp_func_t p_func, - void *context) +static osm_mgrp_port_t *osm_mgrp_port_new(ib_net64_t port_guid) +{ + osm_mgrp_port_t *p_mgrp_port = + (osm_mgrp_port_t *) malloc(sizeof(osm_mgrp_port_t)); + if (!p_mgrp_port) { + return NULL; + } + memset(p_mgrp_port, 0, sizeof(*p_mgrp_port)); + p_mgrp_port->port_guid = port_guid; + cl_qlist_init(&p_mgrp_port->mgroups); + return p_mgrp_port; +} + +/********************************************************************** + **********************************************************************/ +osm_mgrp_holder_t *osm_mgrp_holder_new(IN osm_subn_t * p_subn, + ib_net16_t mlid) { - osm_mtree_node_t *p_mtn; + osm_mgrp_holder_t *p_mgrp_holder; + p_mgrp_holder = + p_subn->mgroup_holders[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO] = + (osm_mgrp_holder_t *) malloc(sizeof(*p_mgrp_holder)); + if (!p_mgrp_holder) + return NULL; - CL_ASSERT(p_mgrp); - CL_ASSERT(p_func); + memset(p_mgrp_holder, 0, sizeof(*p_mgrp_holder)); + p_mgrp_holder->mlid = mlid; + cl_qmap_init(&p_mgrp_holder->mgrp_port_map); + cl_qlist_init(&p_mgrp_holder->mgrp_list); + return p_mgrp_holder; +} + +/********************************************************************** + **********************************************************************/ +void 
osm_mgrp_holder_delete(IN osm_subn_t *p_subn, ib_net16_t mlid) +{ + osm_mgrp_port_t *p_osm_mgr_port; + cl_map_item_t *p_item; + + osm_mgrp_holder_t *p_mgrp_holder = + p_subn->mgroup_holders[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO]; + p_item = cl_qmap_head(&p_mgrp_holder->mgrp_port_map); + /* Delete ports shared same MLID */ + while (p_item != cl_qmap_end(&p_mgrp_holder->mgrp_port_map)) { + p_osm_mgr_port = (osm_mgrp_port_t *) p_item; + cl_qlist_remove_all(&p_osm_mgr_port->mgroups); + cl_qmap_remove_item(&p_mgrp_holder->mgrp_port_map, p_item); + p_item = cl_qmap_head(&p_mgrp_holder->mgrp_port_map); + free(p_osm_mgr_port); + } + /* Remove mgrp from this MLID */ + cl_qlist_remove_all(&p_mgrp_holder->mgrp_list); + /* Destroy the mtree_node structure */ + osm_mtree_destroy(p_mgrp_holder->p_root); + p_subn->mgroup_holders[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO] = NULL; + free(p_mgrp_holder); +} + +/********************************************************************** + **********************************************************************/ +void osm_mgrp_holder_remove_port(osm_subn_t * subn, osm_log_t * p_log, + osm_mgrp_holder_t * p_mgrp_holder, + ib_net64_t port_guid) +{ + osm_mgrp_t *p_mgrp; + cl_list_item_t *p_item; + + OSM_LOG_ENTER(p_log); + + osm_mgrp_port_t *p_mgrp_port = (osm_mgrp_port_t *) + cl_qmap_remove(&p_mgrp_holder->mgrp_port_map, port_guid); + if (p_mgrp_port != + (osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map)) { + char gid_str[INET6_ADDRSTRLEN]; + OSM_LOG(p_log, OSM_LOG_DEBUG, + "port 0x%" PRIx64 " removed from mlid 0x%X\n", + port_guid, cl_ntoh16(p_mgrp_holder->mlid)); + while ((p_item = + cl_qlist_remove_head(&p_mgrp_port->mgroups)) != + cl_qlist_end(&p_mgrp_port->mgroups)) { + p_mgrp = (osm_mgrp_t *) + PARENT_STRUCT(p_item, osm_mgrp_t,port_item); + OSM_LOG(p_log, OSM_LOG_DEBUG, + "removing mgrp mgid %s from port 0x%" PRIx64"\n", + inet_ntop(AF_INET6,p_mgrp->mcmember_rec.mgid.raw, + gid_str, sizeof(gid_str)), + cl_ntoh64(port_guid)); 
+ osm_mgrp_delete_port(subn, p_log, p_mgrp, port_guid); + } + free(p_mgrp_port); + } + OSM_LOG_EXIT(p_log); +} - p_mtn = p_mgrp->p_root; +/********************************************************************** + **********************************************************************/ +void osm_mgrp_holder_add_mgrp(osm_mgrp_holder_t * p_mgrp_holder, + osm_mgrp_t * p_mgrp, osm_log_t * p_log) +{ + char gid_str[INET6_ADDRSTRLEN]; + + OSM_LOG_ENTER(p_log); + p_mgrp_holder->to_be_deleted = 0; + cl_qlist_insert_tail(&p_mgrp_holder->mgrp_list, &p_mgrp->mlid_item); + OSM_LOG(p_log, OSM_LOG_DEBUG, + "mgrp with MGID:%s added to holder with mlid = 0x%X\n", + inet_ntop(AF_INET6, p_mgrp->mcmember_rec.mgid.raw, gid_str, + sizeof(gid_str)), cl_ntoh16(p_mgrp_holder->mlid)); + p_mgrp_holder->last_change_id++; + OSM_LOG_EXIT(p_log); +} - if (p_mtn) - mgrp_apply_func_sub(p_mgrp, p_mtn, p_func, context); +/********************************************************************** + **********************************************************************/ +void osm_mgrp_holder_delete_mgrp(osm_mgrp_holder_t * p_mgrp_holder, + osm_mgrp_t * p_mgrp) +{ + p_mgrp->to_be_deleted = 1; + cl_qlist_remove_item(&p_mgrp_holder->mgrp_list, &p_mgrp->mlid_item); + if (0 == cl_qlist_count(&p_mgrp_holder->mgrp_list)) { + /* No more mgroups on this mlid */ + p_mgrp_holder->to_be_deleted = 1; + p_mgrp_holder->last_tree_id = 0; + p_mgrp_holder->last_change_id = 0; + } +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t osm_mgrp_holder_port_add_mgrp(osm_mgrp_holder_t * p_mgrp_holder, + osm_mgrp_t * p_mgrp, + ib_net64_t port_guid) +{ + osm_mgrp_port_t *p_mgrp_port = (osm_mgrp_port_t *) + cl_qmap_get(&p_mgrp_holder->mgrp_port_map, port_guid); + if (p_mgrp_port == + (osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map)) { + /* new port to mlid */ + p_mgrp_port = osm_mgrp_port_new(port_guid); + 
if (!p_mgrp_port) { + return IB_INSUFFICIENT_MEMORY; + } + cl_qmap_insert(&p_mgrp_holder->mgrp_port_map, + p_mgrp_port->port_guid, &p_mgrp_port->guid_item); + } + cl_qlist_insert_tail(&p_mgrp_port->mgroups, &p_mgrp->port_item); + p_mgrp_holder->last_change_id++; + return IB_SUCCESS; +} + +/********************************************************************** + **********************************************************************/ +void osm_mgrp_holder_port_delete_mgrp(osm_mgrp_holder_t * p_mgrp_holder, + osm_mgrp_t * p_mgrp, + ib_net64_t port_guid) +{ + osm_mgrp_port_t *p_mgrp_port = (osm_mgrp_port_t *) + cl_qmap_get(&p_mgrp_holder->mgrp_port_map, port_guid); + if (p_mgrp_port != + (osm_mgrp_port_t *) cl_qmap_end(&p_mgrp_holder->mgrp_port_map)) { + cl_qlist_remove_item(&p_mgrp_port->mgroups, &p_mgrp->port_item); + if (0 == cl_qlist_count(&p_mgrp_port->mgroups)) { + /* No mgroups registered on this port for current mlid */ + cl_qmap_remove_item(&p_mgrp_holder->mgrp_port_map, + &p_mgrp_port->guid_item); + free(p_mgrp_port); + } + p_mgrp_holder->last_change_id++; + } } diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 7826578..041377f 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. 
@@ -785,7 +785,9 @@ static void __qos_policy_validate_pkey( uint8_t sl; uint32_t flow; uint8_t hop; + osm_mgrp_holder_t * p_mgrp_holder; osm_mgrp_t * p_mgrp; + cl_list_item_t *p_item; if (!p_qos_policy || !p_qos_match_rule || !p_prtn) return; @@ -809,31 +811,35 @@ static void __qos_policy_validate_pkey( if (!p_prtn->mlid) return; - p_mgrp = osm_get_mgrp_by_mlid(p_qos_policy->p_subn, p_prtn->mlid); - if (!p_mgrp) { + p_mgrp_holder = + osm_get_mgrp_holder_by_mlid(p_qos_policy->p_subn, p_prtn->mlid); + if (!p_mgrp_holder) { OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_ERROR, - "ERR AC16: MCast group for partition with " - "pkey 0x%04X not found\n", - cl_ntoh16(p_prtn->pkey)); + "ERR AC16: MCast mgrp_holder for partition with pkey 0x%04X not found\n", + cl_ntoh16(p_prtn->pkey)); return; } - CL_ASSERT((cl_ntoh16(p_mgrp->mcmember_rec.pkey) & 0x7fff) == - (cl_ntoh16(p_prtn->pkey) & 0x7fff)); - - ib_member_get_sl_flow_hop(p_mgrp->mcmember_rec.sl_flow_hop, - &sl, &flow, &hop); - if (sl != p_prtn->sl) { - OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + p_item = cl_qlist_head(&p_mgrp_holder->mgrp_list); + while (p_item != cl_qlist_end(&p_mgrp_holder->mgrp_list)) { + p_mgrp = (osm_mgrp_t *) PARENT_STRUCT(p_item, osm_mgrp_t, + mlid_item); + p_item = cl_qlist_next(p_item); + CL_ASSERT((cl_ntoh16(p_mgrp->mcmember_rec.pkey) & 0x7fff) == + (cl_ntoh16(p_prtn->pkey) & 0x7fff)); + ib_member_get_sl_flow_hop(p_mgrp->mcmember_rec.sl_flow_hop, + &sl, &flow, &hop); + if (sl != p_prtn->sl) { + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, "Updating MCGroup (MLID 0x%04x) SL to " "match partition SL (%u)\n", cl_hton16(p_mgrp->mcmember_rec.mlid), p_prtn->sl); - p_mgrp->mcmember_rec.sl_flow_hop = - ib_member_set_sl_flow_hop(p_prtn->sl, flow, hop); + p_mgrp->mcmember_rec.sl_flow_hop = + ib_member_set_sl_flow_hop(p_prtn->sl, flow, hop); + } } } - /*************************************************** ***************************************************/ diff --git 
a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c index fcc3f27..22dd495 100644 --- a/opensm/opensm/osm_sa.c +++ b/opensm/opensm/osm_sa.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. @@ -706,17 +706,15 @@ static void sa_dump_all_sa(osm_opensm_t * p_osm, FILE * file) { struct opensm_dump_context dump_context; osm_mgrp_t *p_mgrp; - int i; dump_context.p_osm = p_osm; dump_context.file = file; OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, "Dump multicast\n"); cl_plock_acquire(&p_osm->lock); - for (i = 0; i <= p_osm->subn.max_mcast_lid_ho - IB_LID_MCAST_START_HO; - i++) { - p_mgrp = p_osm->subn.mgroups[i]; - if (p_mgrp) - sa_dump_one_mgrp(p_mgrp, &dump_context); + p_mgrp = (osm_mgrp_t*)cl_fmap_head(&p_osm->subn.mgrp_mgid_tbl); + while (p_mgrp != (osm_mgrp_t*)cl_fmap_end(&p_osm->subn.mgrp_mgid_tbl)) { + sa_dump_one_mgrp(p_mgrp, &dump_context); + p_mgrp = (osm_mgrp_t*) cl_fmap_next(&p_mgrp->map_item); } OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, "Dump inform\n"); cl_qlist_apply_func(&p_osm->subn.sa_infr_list, @@ -740,23 +738,16 @@ static osm_mgrp_t *load_mcgroup(osm_opensm_t * p_osm, ib_net16_t mlid, unsigned well_known) { ib_net64_t comp_mask; - osm_mgrp_t *p_mgrp; + cl_fmap_item_t *p_fitem; + osm_mgrp_t *p_mgrp = NULL; cl_plock_excl_acquire(&p_osm->lock); - p_mgrp = osm_get_mgrp_by_mlid(&p_osm->subn, mlid); - if (p_mgrp) { - if (!memcmp(&p_mgrp->mcmember_rec.mgid, &p_mcm_rec->mgid, - sizeof(ib_gid_t))) { - OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, - "mgrp %04x is already here.", cl_ntoh16(mlid)); + p_fitem = cl_fmap_get(&p_osm->subn.mgrp_mgid_tbl, &p_mcm_rec->mgid); + if (p_fitem != cl_fmap_end(&p_osm->subn.mgrp_mgid_tbl)) { + OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, + "mgrp %04x is already here.", 
cl_ntoh16(mlid)); goto _out; - } - OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE, - "mlid %04x is already used by another MC group. Will " - "request clients reregistration.\n", cl_ntoh16(mlid)); - p_mgrp = NULL; - goto _out; } comp_mask = IB_MCR_COMPMASK_MTU | IB_MCR_COMPMASK_MTU_SEL diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c index a9e0a3b..3838a08 100644 --- a/opensm/opensm/osm_sa_mcmember_record.c +++ b/opensm/opensm/osm_sa_mcmember_record.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. @@ -121,14 +121,17 @@ static ib_net16_t get_new_mlid(osm_sa_t * sa, ib_net16_t requested_mlid) if (requested_mlid && cl_ntoh16(requested_mlid) >= IB_LID_MCAST_START_HO && cl_ntoh16(requested_mlid) <= p_subn->max_mcast_lid_ho - && !osm_get_mgrp_by_mlid(p_subn, requested_mlid)) + && !osm_get_mgrp_holder_by_mlid(p_subn, requested_mlid)) return requested_mlid; max = p_subn->max_mcast_lid_ho - IB_LID_MCAST_START_HO + 1; for (i = 0; i < max; i++) { - osm_mgrp_t *p_mgrp = sa->p_subn->mgroups[i]; - if (!p_mgrp || p_mgrp->to_be_deleted) - return cl_hton16(i + IB_LID_MCAST_START_HO); + osm_mgrp_holder_t *p_mgrp_holder = sa->p_subn->mgroup_holders[i]; + if (!p_mgrp_holder || p_mgrp_holder->to_be_deleted) { + OSM_LOG(sa->p_log, OSM_LOG_DEBUG, "returning mgrp_holder to_be_deleted =%d\n", + p_mgrp_holder ? 
p_mgrp_holder->to_be_deleted : 0); + return cl_hton16(i + IB_LID_MCAST_START_HO); + } } return 0; @@ -146,8 +149,9 @@ static void cleanup_mgrp(IN osm_sa_t * sa, osm_mgrp_t * mgrp) /* Remove MGRP only if osm_mcm_port_t count is 0 and not a well known group */ if (cl_is_qmap_empty(&mgrp->mcm_port_tbl) && !mgrp->well_known) { - sa->p_subn->mgroups[cl_ntoh16(mgrp->mlid) - - IB_LID_MCAST_START_HO] = NULL; + osm_mgrp_holder_t *p_mgrp_holder = + osm_get_mgrp_holder_by_mlid(sa->p_subn, mgrp->mlid); + osm_mgrp_holder_delete_mgrp(p_mgrp_holder, mgrp); cl_fmap_remove_item(&sa->p_subn->mgrp_mgid_tbl, &mgrp->map_item); osm_mgrp_delete(mgrp); @@ -802,19 +806,19 @@ static boolean_t mgrp_request_is_realizable(IN osm_sa_t * sa, Call this function to create a new mgrp. **********************************************************************/ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa, - IN ib_net64_t comp_mask, - IN const ib_member_rec_t * - const p_recvd_mcmember_rec, - IN const osm_physp_t * p_physp, - OUT osm_mgrp_t ** pp_mgrp) + IN ib_net64_t comp_mask, + IN const ib_member_rec_t * + const p_recvd_mcmember_rec, + IN const osm_physp_t * p_physp, + OUT osm_mgrp_t ** pp_mgrp) { - ib_net16_t mlid; + ib_net16_t mlid, existed_mlid; unsigned zero_mgid, i; uint8_t scope; ib_gid_t *p_mgid; - osm_mgrp_t *p_prev_mgrp; ib_api_status_t status = IB_SUCCESS; ib_member_rec_t mcm_rec = *p_recvd_mcmember_rec; /* copy for modifications */ + osm_mgrp_holder_t * p_mgrp_holder; OSM_LOG_ENTER(sa->p_log); @@ -890,6 +894,15 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa, goto Exit; } + if (0 != (existed_mlid = osm_mgrp_holder_get_mlid_by_mgid(sa->p_subn, p_mgid))) { + char gid_str[INET6_ADDRSTRLEN]; + mlid = existed_mlid; + OSM_LOG(sa->p_log, OSM_LOG_DEBUG, + "found existed mlid 0x%04x for mgid %s\n", + cl_ntoh16(mlid), inet_ntop(AF_INET6, p_mgid->raw, + gid_str, sizeof gid_str)); + } + /* create a new MC Group */ *pp_mgrp = osm_mgrp_new(mlid); if (*pp_mgrp == NULL) { 
@@ -914,25 +927,26 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa, /* Insert the new group in the data base */ - /* since we might have an old group by that mlid - one whose deletion was delayed for an idle time - we need to deallocate it first */ - p_prev_mgrp = osm_get_mgrp_by_mlid(sa->p_subn, mlid); - if (p_prev_mgrp) { + + p_mgrp_holder = osm_get_mgrp_holder_by_mlid(sa->p_subn, mlid); + if (!p_mgrp_holder) { OSM_LOG(sa->p_log, OSM_LOG_DEBUG, - "Found previous group for mlid:0x%04x - " - "Destroying it first\n", cl_ntoh16(mlid)); - sa->p_subn->mgroups[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO] = - NULL; - cl_fmap_remove_item(&sa->p_subn->mgrp_mgid_tbl, - &p_prev_mgrp->map_item); - osm_mgrp_delete(p_prev_mgrp); + "Creating new mgrp_holder for mlid:0x%04x\n", + cl_ntoh16(mlid)); + p_mgrp_holder = osm_mgrp_holder_new(sa->p_subn, mlid); } + if (!p_mgrp_holder) { + OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B08: " + "osm_mgrp_holder_new failed\n"); + free_mlid(sa, mlid); + status = IB_INSUFFICIENT_MEMORY; + goto Exit; + } cl_fmap_insert(&sa->p_subn->mgrp_mgid_tbl, &(*pp_mgrp)->mcmember_rec.mgid, &(*pp_mgrp)->map_item); - sa->p_subn->mgroups[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO] = *pp_mgrp; + osm_mgrp_holder_add_mgrp(p_mgrp_holder, *pp_mgrp, sa->p_log); Exit: OSM_LOG_EXIT(sa->p_log); @@ -1074,7 +1088,7 @@ static void mcmr_rcv_leave_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) CL_PLOCK_RELEASE(sa->p_lock); /* we can leave if port was deleted from MCG */ - if (removed && osm_sm_mcgrp_leave(sa->sm, mlid, portguid)) + if (removed && osm_sm_mcgrp_leave(sa->sm, p_mgrp, portguid)) OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B09: " "osm_sm_mcgrp_leave failed\n"); @@ -1102,6 +1116,7 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) osm_physp_t *p_request_physp; uint8_t is_new_group; /* TRUE = there is a need to create a group */ uint8_t join_state; + osm_mgrp_holder_t *p_mgrp_holder; OSM_LOG_ENTER(sa->p_log); @@ -1275,6 +1290,8 @@ static 
void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) goto Exit; } + p_mgrp_holder = osm_get_mgrp_holder_by_mlid(sa->p_subn, mlid); + CL_ASSERT(p_mgrp_holder); /* create or update existing port (join-state will be updated) */ status = add_new_mgrp_port(sa, p_mgrp, p_recvd_mcmember_rec, osm_madw_get_mad_addr_ptr(p_madw), @@ -1282,6 +1299,8 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) if (status != IB_SUCCESS) { /* we fail to add the port so we might need to delete the group */ + osm_mgrp_holder_port_delete_mgrp(p_mgrp_holder, p_mgrp, + p_recvd_mcmember_rec->port_gid.unicast.interface_id); cleanup_mgrp(sa, p_mgrp); CL_PLOCK_RELEASE(sa->p_lock); @@ -1304,7 +1323,7 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) /* do the actual routing (actually schedule the update) */ status = osm_sm_mcgrp_join(sa->sm, mlid, p_recvd_mcmember_rec->port_gid.unicast. - interface_id); + interface_id, &p_recvd_mcmember_rec->mgid); if (status != IB_SUCCESS) { OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B14: " @@ -1315,9 +1334,10 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) CL_PLOCK_EXCL_ACQUIRE(sa->p_lock); /* the request for routing failed so we need to remove the port */ + osm_mgrp_holder_port_delete_mgrp(p_mgrp_holder, p_mgrp, + p_recvd_mcmember_rec->port_gid.unicast.interface_id); osm_mgrp_delete_port(sa->p_subn, sa->p_log, p_mgrp, - p_recvd_mcmember_rec->port_gid. 
- unicast.interface_id); + p_recvd_mcmember_rec->port_gid.unicast.interface_id); cleanup_mgrp(sa, p_mgrp); CL_PLOCK_RELEASE(sa->p_lock); osm_sa_send_error(sa, p_madw, IB_SA_MAD_STATUS_NO_RESOURCES); @@ -1549,7 +1569,6 @@ static void mcmr_query_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) osm_physp_t *p_req_physp; boolean_t trusted_req; osm_mgrp_t *p_mgrp; - int i; OSM_LOG_ENTER(sa->p_log); @@ -1578,12 +1597,11 @@ static void mcmr_query_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) CL_PLOCK_ACQUIRE(sa->p_lock); /* simply go over all MCGs and match */ - for (i = 0; i <= sa->p_subn->max_mcast_lid_ho - IB_LID_MCAST_START_HO; - i++) { - p_mgrp = sa->p_subn->mgroups[i]; - if (p_mgrp) - mcmr_by_comp_mask(sa, p_rcvd_rec, comp_mask, p_mgrp, - p_req_physp, trusted_req, &rec_list); + p_mgrp = (osm_mgrp_t *) cl_fmap_head(&sa->p_subn->mgrp_mgid_tbl); + while (p_mgrp != (osm_mgrp_t *) cl_fmap_end(&sa->p_subn->mgrp_mgid_tbl)) { + mcmr_by_comp_mask(sa, p_rcvd_rec, comp_mask, p_mgrp, + p_req_physp, trusted_req, &rec_list); + p_mgrp = (osm_mgrp_t *) cl_fmap_next(&p_mgrp->map_item); } CL_PLOCK_RELEASE(sa->p_lock); diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c index 75d9516..aa63d78 100644 --- a/opensm/opensm/osm_sa_path_record.c +++ b/opensm/opensm/osm_sa_path_record.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. 
@@ -1468,11 +1468,14 @@ static osm_mgrp_t *pr_get_mgrp(IN osm_sa_t * sa, IN const osm_madw_t * p_madw) mgrp = NULL; goto Exit; } - } else - if (!(mgrp = osm_get_mgrp_by_mlid(sa->p_subn, p_pr->dlid))) - OSM_LOG(sa->p_log, OSM_LOG_ERROR, - "ERR 1F11: " "No MC group found for PathRecord " + } else { + mgrp = osm_get_mgrp_by_mgid(sa, &p_pr->dgid); + if (!mgrp) + OSM_LOG(sa->p_log, OSM_LOG_ERROR, + "ERR 1F11: " + "No MC group found for PathRecord " "destination LID 0x%x\n", p_pr->dlid); + } } Exit: diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c index b3ce69a..d990450 100644 --- a/opensm/opensm/osm_sm.c +++ b/opensm/opensm/osm_sm.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. @@ -47,6 +47,7 @@ #include #include +#include #include #include #include @@ -468,12 +469,15 @@ static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm, /********************************************************************** **********************************************************************/ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, - IN const ib_net64_t port_guid) + IN const ib_net64_t port_guid, + IN const ib_gid_t * p_mgid) { - osm_mgrp_t *p_mgrp; + osm_mgrp_t *p_mgrp = NULL; osm_port_t *p_port; ib_api_status_t status = IB_SUCCESS; osm_mcm_info_t *p_mcm; + cl_list_item_t *p_item; + osm_mgrp_holder_t *p_mgrp_holder; OSM_LOG_ENTER(p_sm->p_log); @@ -497,8 +501,44 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, /* * If this multicast group does not already exist, create it. 
*/ - p_mgrp = osm_get_mgrp_by_mlid(p_sm->p_subn, mlid); - if (!p_mgrp || !osm_mgrp_is_guid(p_mgrp, port_guid)) { + p_mgrp_holder = osm_get_mgrp_holder_by_mlid(p_sm->p_subn, mlid); + if (p_mgrp_holder) { + char gid_str[INET6_ADDRSTRLEN]; + if (TRUE) { + size_t gr_count = cl_qlist_count(&p_mgrp_holder->mgrp_list); + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, + "mlid 0x%X has %lu mgroups\n", cl_ntoh16(mlid), gr_count); + if (gr_count) { + p_item = + cl_qlist_head(&p_mgrp_holder->mgrp_list); + while (p_item != + cl_qlist_end(&p_mgrp_holder->mgrp_list)) { + p_mgrp = (osm_mgrp_t *) + PARENT_STRUCT(p_item, osm_mgrp_t, + mlid_item); + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, + "mlid 0x%X has mgrp with MGID: %s\n", + cl_ntoh16(mlid), + inet_ntop(AF_INET6, + p_mgrp->mcmember_rec. + mgid.raw, gid_str, + sizeof gid_str)); + p_item = cl_qlist_next(p_item); + } + } + } + p_mgrp = (osm_mgrp_t *)cl_fmap_get(&p_sm->p_subn->mgrp_mgid_tbl, p_mgid); + if (p_mgrp == (osm_mgrp_t *)cl_fmap_end(&p_sm->p_subn->mgrp_mgid_tbl)) { + p_mgrp = NULL; + OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, + "group with MGID: %s not found on mlid 0x%X\n", + inet_ntop(AF_INET6, + p_mgid->raw, + gid_str, sizeof gid_str), + cl_ntoh16(mlid)); + } + } + if (!p_mgrp_holder || !p_mgrp || !osm_mgrp_is_guid(p_mgrp, port_guid)) { /* * The group removed or the port is not a * member of the group, then fail immediately. 
@@ -513,6 +553,22 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, goto Exit; } + /* if there was no change from the last time + * we processed the group we can skip doing anything + */ + if (p_mgrp_holder->last_change_id == p_mgrp_holder->last_tree_id) { + OSM_LOG(p_sm->p_log, OSM_LOG_VERBOSE, + "Skip processing mgrp holder with lid:0x%X last change id:%u\n", + cl_ntoh16(mlid), p_mgrp_holder->last_change_id); + goto Exit; + } else { + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, + "processing mgrp holder with lid:0x%X port: 0x%016" + PRIx64 " last change id:%u tree id:%u\n", + cl_ntoh16(mlid), cl_ntoh64(port_guid), + p_mgrp_holder->last_change_id, + p_mgrp_holder->last_tree_id); + } /* * Check if the object (according to mlid) already exists on this port. * If it does - then no need to update it again, and no need to @@ -549,12 +605,13 @@ Exit: /********************************************************************** **********************************************************************/ -ib_api_status_t osm_sm_mcgrp_leave(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, +ib_api_status_t osm_sm_mcgrp_leave(IN osm_sm_t * p_sm, IN osm_mgrp_t * p_mgrp, IN const ib_net64_t port_guid) { - osm_mgrp_t *p_mgrp; osm_port_t *p_port; ib_api_status_t status; + osm_mgrp_holder_t *p_mgrp_holder; + ib_net16_t mlid = p_mgrp->mlid; OSM_LOG_ENTER(p_sm->p_log); @@ -577,21 +634,25 @@ ib_api_status_t osm_sm_mcgrp_leave(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, } /* - * Get the multicast group object for this group. + * Get the multicast group holder object for this group. 
*/ - p_mgrp = osm_get_mgrp_by_mlid(p_sm->p_subn, mlid); - if (!p_mgrp) { + p_mgrp_holder = osm_get_mgrp_holder_by_mlid(p_sm->p_subn, mlid); + if (!p_mgrp_holder) { OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 2E08: " "No multicast group for MLID 0x%X\n", cl_ntoh16(mlid)); status = IB_INVALID_PARAMETER; goto Exit; } + osm_mgrp_holder_port_delete_mgrp(p_mgrp_holder, p_mgrp, port_guid); /* * Walk the list of ports in the group, and remove the appropriate one. */ osm_port_remove_mgrp(p_port, mlid); + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, + " Calling sm_mgrp_process for mgrp with mlid = 0x%X\n", + cl_ntoh16(mlid)); status = sm_mgrp_process(p_sm, p_mgrp); Exit: CL_PLOCK_RELEASE(p_sm->p_lock); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 0d11811..6ed95d4 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. 
@@ -428,8 +428,9 @@ void osm_subn_destroy(IN osm_subn_t * const p_subn) osm_switch_t *p_sw, *p_next_sw; osm_remote_sm_t *p_rsm, *p_next_rsm; osm_prtn_t *p_prtn, *p_next_prtn; - osm_mgrp_t *p_mgrp; + osm_mgrp_holder_t *p_mgrp_holder; osm_infr_t *p_infr, *p_next_infr; + osm_mgrp_t *p_mgrp; /* it might be a good idea to de-allocate all known objects */ p_next_node = (osm_node_t *) cl_qmap_head(&p_subn->node_guid_tbl); @@ -471,14 +472,20 @@ void osm_subn_destroy(IN osm_subn_t * const p_subn) osm_prtn_delete(&p_prtn); } - cl_fmap_remove_all(&p_subn->mgrp_mgid_tbl); for (i = 0; i <= p_subn->max_mcast_lid_ho - IB_LID_MCAST_START_HO; i++) { - p_mgrp = p_subn->mgroups[i]; - p_subn->mgroups[i] = NULL; - if (p_mgrp) - osm_mgrp_delete(p_mgrp); + p_mgrp_holder = p_subn->mgroup_holders[i]; + if (p_mgrp_holder){ + osm_mgrp_holder_delete(p_subn, p_mgrp_holder->mlid); + } + } + + p_mgrp = (osm_mgrp_t*)cl_fmap_head(&p_subn->mgrp_mgid_tbl); + while (p_mgrp != (osm_mgrp_t*)cl_fmap_end(&p_subn->mgrp_mgid_tbl)) { + cl_fmap_remove_item(&p_subn->mgrp_mgid_tbl, (cl_fmap_item_t*)p_mgrp); + osm_mgrp_delete(p_mgrp); + p_mgrp = (osm_mgrp_t*)cl_fmap_head(&p_subn->mgrp_mgid_tbl); } p_next_infr = (osm_infr_t *) cl_qlist_head(&p_subn->sa_infr_list); @@ -1646,3 +1653,13 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts) return 0; } + +ib_net16_t osm_mgrp_holder_get_mlid_by_mgid(IN osm_subn_t const *p_subn, + IN const ib_gid_t * const p_mgid) +{ + osm_mgrp_t *p_mgrp = (osm_mgrp_t*)cl_fmap_get(&p_subn->mgrp_mgid_tbl, p_mgid); + if (p_mgrp != (osm_mgrp_t*)cl_fmap_end(&p_subn->mgrp_mgid_tbl)) { + return p_mgrp->mlid; + } + return 0; +} -- 1.6.3.3 From slavas at Voltaire.COM Wed Aug 5 06:48:35 2009 From: slavas at Voltaire.COM (Slava Strebkov) Date: Wed, 05 Aug 2009 16:48:35 +0300 Subject: [ofa-general] [PATCH 2/2 v3] opensm: Compression of multicast group according to pkey Message-ID: <4A798DB3.90604@Voltaire.COM> Subject: [PATCH 2/2] Compression of multicast group 
according to pkey

Additional data structures added:
1. Map of all partition keys opened in the fabric.
2. Map of all multicast group holders sharing the same pkey.

MLID assignment for multicast groups works in the usual manner, allocating a free entry for each newly created group. The proposed compression algorithm takes effect only when there are no more free entries in the mlid array. In that case the candidate MLIDs for the new multicast group are taken from the pkey-indexed map according to the requested pkey, and the MLID with the fewest ports is given to the newly created multicast group.

Signed-off-by: Slava Strebkov
---
 opensm/include/opensm/osm_multicast.h | 133 ++++++++++++++++++++++++++++++++
 opensm/include/opensm/osm_subnet.h | 36 +++++++++
 opensm/opensm/osm_mcast_mgr.c | 4 +
 opensm/opensm/osm_multicast.c | 109 +++++++++++++++++++++++++-
 opensm/opensm/osm_sa_mcmember_record.c | 38 +++++----
 opensm/opensm/osm_subnet.c | 8 ++
 6 files changed, 308 insertions(+), 20 deletions(-)

diff --git a/opensm/include/opensm/osm_multicast.h b/opensm/include/opensm/osm_multicast.h
index 61d1ba6..7bd2f81 100644
--- a/opensm/include/opensm/osm_multicast.h
+++ b/opensm/include/opensm/osm_multicast.h
@@ -128,6 +128,7 @@ typedef struct osm_mgrp_holder {
 boolean_t to_be_deleted;
 uint32_t last_tree_id;
 uint32_t last_change_id;
+ cl_map_item_t mlid_item;
 } osm_mgrp_holder_t;
 /*
@@ -156,6 +157,9 @@ typedef struct osm_mgrp_holder {
 *
 * last_tree_id
 * the last change id used for building the current tree.
+*
+* mlid_item
+* list item in list of holders shared same pkey.
 */
 /****s* OpenSM: Multicast group Port /osm_mgrp_port _t
 * NAME
@@ -775,5 +779,134 @@ static inline boolean_t osm_mgrp_holder_is_empty(IN const osm_mgrp_holder_t *
 return (cl_qmap_count(&p_mgrp_holder->mgrp_port_map) == 0);
 }
+/****f* OpenSM: Subnet/osm_mlid_pkey_delete
+* NAME
+* osm_mlid_pkey_delete
+*
+* DESCRIPTION
+* Frees the objects.
+* +* SYNOPSIS +*/ +void osm_mlid_pkey_delete(osm_mlid_pkey_t *p_mlid_pkey); +/* +* PARAMETERS +* p_mlid_pkey +* [in] Pointer to an osm_mlid_pkey_t object +* +* RETURN VALUES +* None. +* +* +* SEE ALSO +* osm_mlid_pkey_new +*********/ + +/****f* OpenSM: Subnet/osm_mlid_pkey_new +* NAME +* osm_mlid_pkey_new +* +* DESCRIPTION +* Creates new object of osm_mlid_pkey_t. +* +* SYNOPSIS +*/ +osm_mlid_pkey_t *osm_mlid_pkey_new(IN ib_net16_t pkey); +/* +* PARAMETERS +* pkey +* [in] Partition key for the object +* +* RETURN VALUES +* Pointer to osm_mlid_pkey_t, or NULL. +* +* SEE ALSO +* osm_mlid_pkey_delete +*********/ + +/****f* OpenSM: Subnet/osm_mlid_pkey_add_holder +* NAME +* osm_mlid_pkey_add_holder +* +* DESCRIPTION +* Adds osm_mlid_pkey_t object to map +* +* SYNOPSIS +*/ +void osm_mlid_pkey_add_holder(osm_mgrp_holder_t *p_mgrp_holder, + ib_net16_t pkey, osm_subn_t *p_subn); +/* +* PARAMETERS +* p_mgrp_holder +* [in] Pointer to osm_mgrp_holder_t +* +* pkey +* [in] Partition key for the object +* +* p_subn +* [in] Pointer to an osm_subn_t object +* +* RETURN VALUES +* None. +* +* SEE ALSO +* osm_mlid_pkey_remove_holder +*********/ + +/****f* OpenSM: Subnet/osm_mlid_pkey_remove_holder +* NAME +* osm_mlid_pkey_remove_holder +* +* DESCRIPTION +* removes osm_mlid_pkey_t object from map +* +* SYNOPSIS +*/ +void osm_mlid_pkey_remove_holder(osm_mgrp_holder_t *p_mgrp_holder, + ib_net16_t pkey, osm_subn_t *p_subn); +/* +* PARAMETERS +* p_mgrp_holder +* [in] Pointer to osm_mgrp_holder_t +* +* pkey +* [in] Partition key for the object +* +* p_subn +* [in] Pointer to an osm_subn_t object +* +* RETURN VALUES +* None. 
+* +* SEE ALSO +* osm_mlid_pkey_add_holder +*********/ + +/****f* OpenSM: Subnet/osm_mlid_pkey_get_existed_mlid +* NAME +* osm_mlid_pkey_get_existed_mlid +* +* DESCRIPTION +* return used mlid with minimum ports, matched by pkey +* +* SYNOPSIS +*/ +ib_net16_t osm_mlid_pkey_get_existed_mlid(IN osm_subn_t *p_subn, IN ib_net16_t pkey); +/* +* PARAMETERS +* +* p_subn +* [in] Pointer to an osm_subn_t object +* +* pkey +* [in] Partition key for the object +* +* RETURN VALUES +* matched mlid or 0 if not found +* +* SEE ALSO +* osm_mlid_pkey_add_holder +*********/ + END_C_DECLS #endif /* _OSM_MULTICAST_H_ */ diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index fad8780..aea6c45 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -469,6 +469,37 @@ typedef struct osm_subn_opt { * Subnet object *********/ +/****s* OpenSM: Subnet/osm_mlid_pkey_t +* NAME +* osm_mlid_pkey_t +* +* DESCRIPTION +* Structure combines all MLIDs opened on same pkey value. +* Used for mgid to mlid compression +* +* SYNOPSIS +*/ +typedef struct osm_mlid_pkey { + cl_map_item_t pkey_item; + ib_net16_t pkey; + cl_qmap_t mlid_holder_map; +} osm_mlid_pkey_t; +/* +* FIELDS +* pkey_item +* Map Item for qmap linkage. Must be first element!! +* Indexed by pkey. +* +* pkey +* Partition key (P_Key) for multicast group(s). +* +* mlid_holder_map +* Map of osm_mgrp_holder_t objects. Indexed by mlid +* +* SEE ALSO +* osm_mgrp_holder_t +*********/ + /****s* OpenSM: Subnet/osm_subn_t * NAME * osm_subn_t @@ -514,6 +545,7 @@ typedef struct osm_subn { unsigned need_update; cl_fmap_t mgrp_mgid_tbl; void *mgroup_holders[IB_LID_MCAST_END_HO - IB_LID_MCAST_START_HO + 1]; + cl_qmap_t mlid_pkey_tbl; } osm_subn_t; /* * FIELDS @@ -638,6 +670,10 @@ typedef struct osm_subn { * Array of pointers to all Multicast Group Holder objects in the subnet. * Indexed by MLID offset from base MLID. * +* mlid_pkey_tbl; +* Map of osm_mlid_pkey_t objects.
Arranged by mgrp pkey value. +* Contains MLIDs for mgroups with same pkey. +* * SEE ALSO * Subnet object *********/ diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c index f506393..ec3dec6 100644 --- a/opensm/opensm/osm_mcast_mgr.c +++ b/opensm/opensm/osm_mcast_mgr.c @@ -1075,6 +1075,10 @@ static ib_api_status_t mcast_mgr_process_mgrp(osm_sm_t * sm, gid_str, sizeof(gid_str)), cl_ntoh16(p_mgrp->mlid)); osm_mgrp_holder_delete_mgrp(p_mgrp_holder, p_mgrp); + if (p_mgrp_holder->to_be_deleted) { + osm_mlid_pkey_remove_holder(p_mgrp_holder, + p_mgrp->mcmember_rec.pkey,sm->p_subn); + } p_mcm_port = (osm_mcm_port_t *) cl_qmap_head(&p_mgrp->mcm_port_tbl); while (p_mcm_port != (osm_mcm_port_t *) cl_qmap_end(&p_mgrp->mcm_port_tbl)) { diff --git a/opensm/opensm/osm_multicast.c b/opensm/opensm/osm_multicast.c index 072b591..4724bd3 100644 --- a/opensm/opensm/osm_multicast.c +++ b/opensm/opensm/osm_multicast.c @@ -366,10 +366,9 @@ void osm_mgrp_holder_remove_port(osm_subn_t * subn, osm_log_t * p_log, char gid_str[INET6_ADDRSTRLEN]; OSM_LOG(p_log, OSM_LOG_DEBUG, "port 0x%" PRIx64 " removed from mlid 0x%X\n", - port_guid, cl_ntoh16(p_mgrp_holder->mlid)); - while ((p_item = - cl_qlist_remove_head(&p_mgrp_port->mgroups)) != - cl_qlist_end(&p_mgrp_port->mgroups)) { + cl_ntoh64(port_guid), cl_ntoh16(p_mgrp_holder->mlid)); + while (!cl_is_qlist_empty(&p_mgrp_port->mgroups)) { + p_item = cl_qlist_remove_head(&p_mgrp_port->mgroups); p_mgrp = (osm_mgrp_t *) PARENT_STRUCT(p_item, osm_mgrp_t,port_item); OSM_LOG(p_log, OSM_LOG_DEBUG, @@ -460,3 +459,105 @@ void osm_mgrp_holder_port_delete_mgrp(osm_mgrp_holder_t * p_mgrp_holder, p_mgrp_holder->last_change_id++; } } + +/********************************************************************** + **********************************************************************/ +void osm_mlid_pkey_delete(osm_mlid_pkey_t *p_mlid_pkey) +{ + cl_qmap_remove_all(&p_mlid_pkey->mlid_holder_map); + free(p_mlid_pkey); +} + 
+/********************************************************************** + **********************************************************************/ +osm_mlid_pkey_t *osm_mlid_pkey_new(ib_net16_t pkey) +{ + osm_mlid_pkey_t *p_mlid_pkey = malloc(sizeof(osm_mlid_pkey_t)); + if (!p_mlid_pkey) { + return NULL; + } + memset(p_mlid_pkey, 0, sizeof(*p_mlid_pkey)); + cl_qmap_init(&p_mlid_pkey->mlid_holder_map); + p_mlid_pkey->pkey = pkey; + return p_mlid_pkey; +} + +/********************************************************************** + **********************************************************************/ +void osm_mlid_pkey_add_holder(osm_mgrp_holder_t *p_mgrp_holder, + ib_net16_t pkey, osm_subn_t *p_subn) +{ + osm_mlid_pkey_t *p_mlid_pkey = (osm_mlid_pkey_t*)cl_qmap_get(&p_subn->mlid_pkey_tbl, + 0x7fff & pkey); + if (p_mlid_pkey != (osm_mlid_pkey_t*)cl_qmap_end(&p_subn->mlid_pkey_tbl)) { + cl_qmap_insert(&p_mlid_pkey->mlid_holder_map, p_mgrp_holder->mlid,&p_mgrp_holder->mlid_item); + } + else { + p_mlid_pkey = osm_mlid_pkey_new(pkey); + if (p_mlid_pkey) { + cl_qmap_insert(&p_mlid_pkey->mlid_holder_map, p_mgrp_holder->mlid, + &p_mgrp_holder->mlid_item); + cl_qmap_insert(&p_subn->mlid_pkey_tbl, 0x7fff & pkey,&p_mlid_pkey->pkey_item); + } + } +} + +/********************************************************************** + **********************************************************************/ +void osm_mlid_pkey_remove_holder(osm_mgrp_holder_t *p_mgrp_holder, + ib_net16_t pkey, osm_subn_t *p_subn) +{ + osm_mlid_pkey_t *p_mlid_pkey = (osm_mlid_pkey_t*) + cl_qmap_get(&p_subn->mlid_pkey_tbl, 0x7fff & pkey); + if (p_mlid_pkey != (osm_mlid_pkey_t*)cl_qmap_end(&p_subn->mlid_pkey_tbl)) { + cl_qmap_remove_item(&p_mlid_pkey->mlid_holder_map, &p_mgrp_holder->mlid_item); + if (!cl_qmap_count(&p_mlid_pkey->mlid_holder_map)) { + /* no more groups with given pkey exist */ + osm_mlid_pkey_delete(p_mlid_pkey); + } + } +} + 
+/********************************************************************** + **********************************************************************/ +static ib_net16_t osm_mlid_pkey_get_mlid(IN osm_mlid_pkey_t *p_mlid_pkey) +{ + cl_map_item_t *p_item; + osm_mgrp_holder_t *p_mgrp_holder; + osm_mgrp_holder_t *p_matched_holder = NULL; + size_t port_count = 0; + for (p_item = cl_qmap_head(&p_mlid_pkey->mlid_holder_map); + p_item != cl_qmap_end(&p_mlid_pkey->mlid_holder_map); + p_item = cl_qmap_next(p_item)) { + p_mgrp_holder = (osm_mgrp_holder_t*) + PARENT_STRUCT(p_item, osm_mgrp_holder_t,mlid_item); + if (!port_count) { + /* init p_matched_holder and count */ + port_count = cl_qmap_count(&p_mgrp_holder->mgrp_port_map); + p_matched_holder = p_mgrp_holder; + } + else { + if (port_count > cl_qmap_count(&p_mgrp_holder->mgrp_port_map)) { + port_count = cl_qmap_count(&p_mgrp_holder->mgrp_port_map); + p_matched_holder = p_mgrp_holder; + } + } + } + if (p_matched_holder) { + return p_matched_holder->mlid; + } + return 0; +} + +/********************************************************************** + **********************************************************************/ +ib_net16_t osm_mlid_pkey_get_existed_mlid(IN osm_subn_t *p_subn, IN ib_net16_t pkey) +{ + osm_mlid_pkey_t *p_mlid_pkey = + (osm_mlid_pkey_t*)cl_qmap_get(&p_subn->mlid_pkey_tbl, 0x7fff & pkey); + if (p_mlid_pkey != (osm_mlid_pkey_t*)cl_qmap_end(&p_subn->mlid_pkey_tbl)) { + /* found obect with mgroups matched requested pkey */ + return osm_mlid_pkey_get_mlid(p_mlid_pkey); + } + return 0; +} diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c index 3838a08..3c34592 100644 --- a/opensm/opensm/osm_sa_mcmember_record.c +++ b/opensm/opensm/osm_sa_mcmember_record.c @@ -152,6 +152,10 @@ static void cleanup_mgrp(IN osm_sa_t * sa, osm_mgrp_t * mgrp) osm_mgrp_holder_t *p_mgrp_holder = osm_get_mgrp_holder_by_mlid(sa->p_subn, mgrp->mlid); osm_mgrp_holder_delete_mgrp(p_mgrp_holder, 
mgrp); + if (p_mgrp_holder->to_be_deleted) { + osm_mlid_pkey_remove_holder(p_mgrp_holder, + mgrp->mcmember_rec.pkey,sa->p_subn); + } cl_fmap_remove_item(&sa->p_subn->mgrp_mgid_tbl, &mgrp->map_item); osm_mgrp_delete(mgrp); @@ -812,13 +816,14 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa, IN const osm_physp_t * p_physp, OUT osm_mgrp_t ** pp_mgrp) { - ib_net16_t mlid, existed_mlid; + ib_net16_t mlid; unsigned zero_mgid, i; uint8_t scope; ib_gid_t *p_mgid; ib_api_status_t status = IB_SUCCESS; ib_member_rec_t mcm_rec = *p_recvd_mcmember_rec; /* copy for modifications */ osm_mgrp_holder_t * p_mgrp_holder; + boolean_t new_mlid = TRUE; OSM_LOG_ENTER(sa->p_log); @@ -836,15 +841,22 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa, */ mlid = get_new_mlid(sa, mcm_rec.mlid); if (mlid == 0) { - OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B19: " - "get_new_mlid failed request mlid 0x%04x\n", - cl_ntoh16(mcm_rec.mlid)); - status = IB_SA_MAD_STATUS_NO_RESOURCES; - goto Exit; + /* try to add mcgroup to existed mlid */ + mlid = osm_mlid_pkey_get_existed_mlid(sa->p_subn, mcm_rec.pkey); + if (mlid == 0) { + OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B19: " + "get_new_mlid failed request mlid 0x%04x\n", + cl_ntoh16(mcm_rec.mlid)); + status = IB_SA_MAD_STATUS_NO_RESOURCES; + goto Exit; + } + new_mlid = FALSE; + OSM_LOG(sa->p_log, OSM_LOG_DEBUG, + "Found existed mlid 0x%X\n", cl_ntoh16(mlid)); } OSM_LOG(sa->p_log, OSM_LOG_DEBUG, - "Obtained new mlid 0x%X\n", cl_ntoh16(mlid)); + "Obtained mlid 0x%X\n", cl_ntoh16(mlid)); /* we need to create the new MGID if it was not defined */ if (zero_mgid) { @@ -894,15 +906,6 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa, goto Exit; } - if (0 != (existed_mlid = osm_mgrp_holder_get_mlid_by_mgid(sa->p_subn, p_mgid))) { - char gid_str[INET6_ADDRSTRLEN]; - mlid = existed_mlid; - OSM_LOG(sa->p_log, OSM_LOG_DEBUG, - "found existed mlid 0x%04x for mgid %s\n", - cl_ntoh16(mlid), inet_ntop(AF_INET6, p_mgid->raw, - 
gid_str, sizeof gid_str)); - } - /* create a new MC Group */ *pp_mgrp = osm_mgrp_new(mlid); if (*pp_mgrp == NULL) { @@ -947,6 +950,9 @@ ib_api_status_t osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa, &(*pp_mgrp)->mcmember_rec.mgid, &(*pp_mgrp)->map_item); osm_mgrp_holder_add_mgrp(p_mgrp_holder, *pp_mgrp, sa->p_log); + if (new_mlid) + osm_mlid_pkey_add_holder(p_mgrp_holder, + (*pp_mgrp)->mcmember_rec.pkey, sa->p_subn); Exit: OSM_LOG_EXIT(sa->p_log); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 6ed95d4..1826219 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -416,6 +416,7 @@ void osm_subn_construct(IN osm_subn_t * const p_subn) cl_qmap_init(&p_subn->rtr_guid_tbl); cl_qmap_init(&p_subn->prtn_pkey_tbl); cl_fmap_init(&p_subn->mgrp_mgid_tbl, compar_mgids); + cl_qmap_init(&p_subn->mlid_pkey_tbl); } /********************************************************************** @@ -431,6 +432,7 @@ void osm_subn_destroy(IN osm_subn_t * const p_subn) osm_mgrp_holder_t *p_mgrp_holder; osm_infr_t *p_infr, *p_next_infr; osm_mgrp_t *p_mgrp; + osm_mlid_pkey_t *p_mlid_pkey; /* it might be a good idea to de-allocate all known objects */ p_next_node = (osm_node_t *) cl_qmap_head(&p_subn->node_guid_tbl); @@ -472,6 +474,12 @@ void osm_subn_destroy(IN osm_subn_t * const p_subn) osm_prtn_delete(&p_prtn); } + p_mlid_pkey = (osm_mlid_pkey_t*)cl_qmap_head(&p_subn->mlid_pkey_tbl); + while (p_mlid_pkey != (osm_mlid_pkey_t*)cl_qmap_end(&p_subn->mlid_pkey_tbl)) { + cl_qmap_remove_item(&p_subn->mlid_pkey_tbl, (cl_map_item_t*)p_mlid_pkey); + osm_mlid_pkey_delete(p_mlid_pkey); + p_mlid_pkey = (osm_mlid_pkey_t*)cl_qmap_head(&p_subn->mlid_pkey_tbl); + } for (i = 0; i <= p_subn->max_mcast_lid_ho - IB_LID_MCAST_START_HO; i++) { -- 1.6.3.3 From sashak at voltaire.com Wed Aug 5 07:12:56 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 17:12:56 +0300 Subject: [ofa-general] Re: [PATCH v2] opensm: fixing handling of 
opt.max_wire_smps In-Reply-To: <4A796B15.7000802@dev.mellanox.co.il> References: <4A784698.10803@dev.mellanox.co.il> <4A796B15.7000802@dev.mellanox.co.il> Message-ID: <20090805141256.GT7993@me> On 14:20 Wed 05 Aug , Yevgeny Kliteynik wrote: > Hi Sasha, > > V2 of this patch: > > opt.max_wire_smps is uint32, but then when it's propagated > into the VL15 poller it's casted to int32. Fixing the > parameter handling to protect it from wrong values. > > Signed-off-by: Yevgeny Kliteynik Applied with change noted below. Thanks. > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index ec15f8a..c43bef7 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -1066,6 +1066,18 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts) > p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK; > } > > + if (p_opts->max_wire_smps == 0) { > + log_report(" Invalid Cached Option Value: max_wire_smps = 0," > + " Using unlimited: 0x7FFFFFFF\n"); > + p_opts->max_wire_smps = 0x7FFFFFFF; > + } '0' is not an invalid value, it is means "unlimited", so I'm removing this error message. 
Sasha From hal.rosenstock at gmail.com Wed Aug 5 07:21:15 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Aug 2009 10:21:15 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches for lash In-Reply-To: <20090805094433.GQ7993@me> References: <20090722151615.GA24576@comcast.net> <20090805094433.GQ7993@me> Message-ID: On Wed, Aug 5, 2009 at 5:44 AM, Sasha Khapyorsky wrote: > On 11:16 Wed 22 Jul , Hal Rosenstock wrote: > > > > diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c > > index 23fad87..dce2ea1 100644 > > --- a/opensm/opensm/osm_mesh.c > > +++ b/opensm/opensm/osm_mesh.c > > @@ -185,6 +185,16 @@ typedef struct _mesh { > > int dim_order[MAX_DIMENSION]; > > } mesh_t; > > > > +typedef struct sort_ctx { > > + lash_t *p_lash; > > + mesh_t *mesh; > > +} sort_ctx_t; > > + > > +typedef struct comp { > > + int index; > > + sort_ctx_t *ctx; > > +} comp_t; > > And wouldn't it be simpler to use: > > struct comp { > switch_t **s; Are you thinking this is: s = &p_lash->switches[i]; > > sort_ctx_t ctx; > }; > ? So you will have already sorted switches and only will need to care > about s->id and s->links fixing (and will not need switches[] array too). Then comp would contain an ordered list of p_lash->switches array pointers which would need to be walked through for actually reordering that array. If so, it's the cost of the new switches array v. the cost of reordering the original lash switches array. I haven't thought that through yet. Is this what you mean or am I missing your idea on how the p_lash->switches array is to be reordered ? -- Hal > > > Sasha
From hal.rosenstock at gmail.com Wed Aug 5 07:43:55 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Aug 2009 10:43:55 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: <20090805134352.GS7993@me> References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <20090804201505.GI7993@me> <20090805134352.GS7993@me> Message-ID: On Wed, Aug 5, 2009 at 9:43 AM, Sasha Khapyorsky wrote: > On 07:24 Wed 05 Aug , Hal Rosenstock wrote: > > > > Are you saying to move the calls in the individual routing engines to > > osm_ucast_mgr_set_fwd_table() up into osm_ucast_mgr_process() (and doing > > so consolidates the changes I had made to the various routing engines in > one > > place) ? > > Yes. Should this be done as a separate step on the way to the LFT parallelization across switches ? -- Hal > > > Sasha From bart.vanassche at gmail.com Wed Aug 5 08:00:03 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Wed, 5 Aug 2009 17:00:03 +0200 Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed In-Reply-To: References: Message-ID: On Tue, Aug 4, 2009 at 11:39 PM, Roland Dreier wrote: > > > By the way, Vladislav Bolkhovitin was so kind to inform me that this > > issue is not specific to the SRP initiator. For more information, see > > also http://thread.gmane.org/gmane.linux.scsi/26166. > > I'm not sure I follow this exactly -- the idea is that sg_reset > generates SCSI commands that are somehow different? What does the LLD > have to do to handle them? > > Is the problem that we get a command with bogus host_scribble (since SRP > never saw it before) and so srp_find_req() gets confused?
A grep through the kernel sources for '->eh_device_reset_handler' showed me that this handler can be invoked from the following two functions: * scsi_try_bus_device_reset() in drivers/scsi/scsi_error.c; * try_to_reset_cmd_device() in drivers/scsi/libsas/sas_scsi_host.c. So if the function srp_reset_device() is called, it is called from scsi_try_bus_device_reset(). That function can in turn be invoked by scsi_abort_eh_cmnd(), by scsi_eh_bus_device_reset() or by scsi_reset_provider(). The last of these, scsi_reset_provider(), is invoked when the sg_reset command issues an SG_SCSI_RESET ioctl. The NULL pointer dereference happens when srp_reset_device() calls srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with req->scmnd->device == NULL. When the sg_reset command issues an SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked, allocates an scmnd structure and sets scmnd->device to NULL. It is this scmnd structure that is passed to srp_reset_device(). What I'm not sure about is whether scsi_reset_provider() should set req->scmnd->device to a non-NULL value or whether srp_send_tsk_mgmt() should be able to handle the condition req->scmnd->device == NULL. Bart.
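One way to handle the condition req->scmnd->device == NULL on the initiator side is to reject the reset before anything dereferences the device pointer. The following is only a userspace sketch of that ordering, not the kernel code: struct model_device, struct model_cmnd, struct model_req, model_send_tsk_mgmt() and model_reset_device() are hypothetical stand-ins, and the real srp_reset_device() checks more state than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes standing in for the SCSI SUCCESS/FAILED values. */
enum { RESET_SUCCESS = 0, RESET_FAILED = 1 };

/* Hypothetical stand-ins for the structures named in the thread. */
struct model_device { unsigned int lun; };
struct model_cmnd { struct model_device *device; };
struct model_req { struct model_cmnd *scmnd; };

/*
 * Stand-in for the task management send: like the call described above,
 * it reads scmnd->device, so it must never see a NULL device.
 */
static int model_send_tsk_mgmt(const struct model_req *req,
			       unsigned int *lun_out)
{
	*lun_out = req->scmnd->device->lun;
	return 0;
}

/*
 * Reset handler sketch: the NULL checks come before the send, because
 * the send is where the device pointer is dereferenced.
 */
static int model_reset_device(const struct model_req *req,
			      unsigned int *lun_out)
{
	assert(lun_out != NULL);

	if (req == NULL || req->scmnd == NULL || req->scmnd->device == NULL)
		return RESET_FAILED;

	if (model_send_tsk_mgmt(req, lun_out))
		return RESET_FAILED;

	return RESET_SUCCESS;
}
```

With a request whose command has no device attached, model_reset_device() returns RESET_FAILED without touching the output LUN; with a device attached it reports that device's LUN. Whether failing the reset is the right policy, as opposed to having the caller always supply a device, is the open question in the message above.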
From hnrose at comcast.net Wed Aug 5 08:22:58 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 5 Aug 2009 11:22:58 -0400 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash.c: Handle calloc failure in generate_cdg_for_sp Message-ID: <20090805152258.GA16417@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 168a758..b3107f0 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -323,8 +323,8 @@ static int generate_routing_func_for_mst(lash_t * p_lash, int sw_id, return 0; } -static void generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch, - int lane) +static int generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch, + int lane) { unsigned num_switches = p_lash->num_switches; switch_t **switches = p_lash->switches; @@ -339,6 +339,8 @@ static void generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch, if (cdg_vertex_matrix[lane][sw][next_switch] == NULL) { v = calloc(1, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0])); + if (!v) + return -1; v->from = sw; v->to = next_switch; v->temp = 1; @@ -380,6 +382,7 @@ static void generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch, prev = v; } + return 0; } static void set_temp_depend_to_permanent_for_sp(lash_t * p_lash, int sw, @@ -448,7 +451,7 @@ static void remove_temp_depend_for_sp(lash_t * p_lash, int sw, int dest_switch, } } -static void balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed) +static int balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed) { unsigned num_switches = p_lash->num_switches; cdg_vertex_t ****cdg_vertex_matrix = p_lash->cdg_vertex_matrix; @@ -499,8 +502,9 @@ static void balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed) } } - generate_cdg_for_sp(p_lash, src, dest, min_filled_lane); - generate_cdg_for_sp(p_lash, dest, src, min_filled_lane); + if (generate_cdg_for_sp(p_lash, src, dest, min_filled_lane) || + 
generate_cdg_for_sp(p_lash, dest, src, min_filled_lane)) + return -1; output_link = p_lash->switches[src]->routing_table[dest].out_link; next_switch = get_next_switch(p_lash, src, output_link); @@ -596,6 +600,7 @@ static void balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed) virtual_location[i][j][old_max_filled_lane] = 1; } } + return 0; } static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw) @@ -837,8 +842,12 @@ static int lash_core(lash_t * p_lash) v_lane = 0; stop = 0; while (v_lane < lanes_needed && stop == 0) { - generate_cdg_for_sp(p_lash, i, dest_switch, v_lane); - generate_cdg_for_sp(p_lash, dest_switch, i, v_lane); + if (generate_cdg_for_sp(p_lash, i, dest_switch, v_lane) || + generate_cdg_for_sp(p_lash, dest_switch, i, v_lane)) { + OSM_LOG(p_log, OSM_LOG_ERROR, + "ERR 4D07: generate_cdg_for_sp failed\n"); + goto Exit; + } output_link = switches[i]->routing_table[dest_switch].out_link; @@ -903,8 +912,12 @@ static int lash_core(lash_t * p_lash) if (++lanes_needed > p_lash->vl_min) goto Error_Not_Enough_Lanes; - generate_cdg_for_sp(p_lash, i, dest_switch, v_lane); - generate_cdg_for_sp(p_lash, dest_switch, i, v_lane); + if (generate_cdg_for_sp(p_lash, i, dest_switch, v_lane) || + generate_cdg_for_sp(p_lash, dest_switch, i, v_lane)) { + OSM_LOG(p_log, OSM_LOG_ERROR, + "ERR 4D08: generate_cdg_for_sp failed\n"); + goto Exit; + } set_temp_depend_to_permanent_for_sp(p_lash, i, dest_switch, v_lane); @@ -929,7 +942,10 @@ static int lash_core(lash_t * p_lash) OSM_LOG(p_log, OSM_LOG_INFO, "Lanes needed: %d, Balancing\n", lanes_needed); - balance_virtual_lanes(p_lash, lanes_needed); + if (balance_virtual_lanes(p_lash, lanes_needed)) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D09: Balancing failed\n"); + goto Exit; + } for (i = 0; i < lanes_needed; i++) OSM_LOG(p_log, OSM_LOG_INFO, "Lanes in layer %d: %d\n", From sean.hefty at intel.com Wed Aug 5 08:46:43 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Aug 2009 
08:46:43 -0700 Subject: [ofa-general] [PATCH] cma: fix access to freed memory In-Reply-To: <20090803092528.GA25528@mtls03> References: <20090803092528.GA25528@mtls03> Message-ID: <04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com> >rdma_join_multicast() allocates struct cma_multicast and then proceeds to join >to a multicast address. However, the join operation completes in another >context and the allocated struct could be released if the user destroys either >the rdma_id object or decides to leave the multicast group while the join is in >progress. This patch uses reference counting to to avoid such situation. It >also protects removal from id_priv->mc_list in cma_leave_mc_groups(). rdma_destroy_id and rdma_leave_multicast call ib_sa_free_multicast. This call will block until the join callback completes or is canceled. Can you describe the race with cma_ib_mc_handler in more detail? Also, cma_leave_mc_groups is only called from rdma_destroy_id. Locking around the mc->list shouldn't be required, since calls to join/leave aren't allowed. - Sean From eli at dev.mellanox.co.il Wed Aug 5 09:16:10 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 5 Aug 2009 19:16:10 +0300 Subject: [ofa-general] [PATCH] cma: fix access to freed memory In-Reply-To: <04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com> References: <20090803092528.GA25528@mtls03> <04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com> Message-ID: <20090805161610.GA13892@mtls03> On Wed, Aug 05, 2009 at 08:46:43AM -0700, Sean Hefty wrote: > rdma_destroy_id and rdma_leave_multicast call ib_sa_free_multicast. This call > will block until the join callback completes or is canceled. Can you describe > the race with cma_ib_mc_handler in more detail? That explains it. I was using a different "join" implementation for RDMAoE without a "leave" operation so I had to use this kref solution. So no need to this patch. I will provide a distinct solution for the RDMAoE case. Thanks. 
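On the question raised in the osm_mesh.c thread of how the p_lash->switches array could be reordered without a second array: one option is an in-place permutation by cycle following. The sketch below is not taken from any posted patch. struct model_switch and reorder_by_id() are hypothetical names, and it assumes every element already carries its target index in id and that those ids form a permutation of 0..n-1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the LASH switch structure; only id is modelled. */
struct model_switch {
	unsigned id;	/* target index of this element after reordering */
};

/*
 * Move every element to the slot named by its id, in place.
 * Assumes the ids are a permutation of 0..n-1; otherwise the inner
 * loop is not guaranteed to terminate.
 */
static void reorder_by_id(struct model_switch **sw, unsigned n)
{
	unsigned i;

	assert(sw != NULL || n == 0);

	for (i = 0; i < n; i++) {
		struct model_switch *s = sw[i];

		/* Follow the cycle starting at slot i. */
		while (s->id != i) {
			struct model_switch *next = sw[s->id];

			sw[s->id] = s;	/* s is now in its final slot */
			s = next;	/* continue with the displaced element */
		}
		/* The element that closes the cycle belongs in slot i. */
		sw[i] = s;
	}
}
```

The final store into sw[i] is the step that is easy to leave out: without it, slot i keeps the pointer it started with even though that element has already been copied to its own slot. Each element is written at most twice, so the pass stays linear in the number of switches.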
From sashak at voltaire.com Wed Aug 5 09:22:36 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 19:22:36 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches for lash In-Reply-To: References: <20090722151615.GA24576@comcast.net> <20090805094433.GQ7993@me> Message-ID: <20090805162236.GU7993@me> On 10:21 Wed 05 Aug , Hal Rosenstock wrote: > > Is this what you mean or am I missing your idea on how the p_lash->switches > array is to be reordered ? Thinking more about this I suppose that an original structure is good enough for doing what you need without intermediate buffers. It could be something like this: qsort(index....); for (i = 0; i < num_switches; i++) lash->switches[index[i].index]->id = i; for (i = 0; i < num_switches; i++) { s = lash->switches[i]; for (j = 0; j < s->num_links; j++) s->links[j]->switch_id = lash->switches[s->links[j]->switch_id]->id; } for (i = 0; i < num_switches; i++) { s = lash->switches[i]; while (s->id != i) { s1 = lash->switches[s->id]; lash->switches[s->id] = s; s = s1; } } Would it work? Sasha From sashak at voltaire.com Wed Aug 5 09:25:52 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 19:25:52 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_lash.c: Handle calloc failure in generate_cdg_for_sp In-Reply-To: <20090805152258.GA16417@comcast.net> References: <20090805152258.GA16417@comcast.net> Message-ID: <20090805162552.GV7993@me> On 11:22 Wed 05 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Wed Aug 5 09:31:40 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 19:31:40 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <20090804201505.GI7993@me> <20090805134352.GS7993@me> Message-ID: <20090805163140.GW7993@me> On 10:43 Wed 05 Aug , Hal Rosenstock wrote: > > Should this be done as a separate step on the way to the LFT parallelization > across switches ? What do you mean by "separate step" (separate from what)? I'm trying to replay the idea again: each routing engine calculates LFTs and fill sw->new_lfts array accordingly, after all it calls a procedure for sending switches' LFT blocks (and TOPs). So routing engine itself should not care about how exactly LFT blocks update MADs submission is actually implemented. Sasha From hal.rosenstock at gmail.com Wed Aug 5 10:03:55 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Aug 2009 13:03:55 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches for lash In-Reply-To: <20090805162236.GU7993@me> References: <20090722151615.GA24576@comcast.net> <20090805094433.GQ7993@me> <20090805162236.GU7993@me> Message-ID: On Wed, Aug 5, 2009 at 12:22 PM, Sasha Khapyorsky wrote: > On 10:21 Wed 05 Aug , Hal Rosenstock wrote: > > > > Is this what you mean or am I missing your idea on how the > p_lash->switches > > array is to be reordered ? > > Thinking more about this I suppose that an original structure is good > enough for doing what you need without intermediate buffers. 
It could be > something like this: > > qsort(index....); > > for (i = 0; i < num_switches; i++) > lash->switches[index[i].index]->id = i; > > for (i = 0; i < num_switches; i++) { > s = lash->switches[i]; > for (j = 0; j < s->num_links; j++) > s->links[j]->switch_id = > lash->switches[s->links[j]->switch_id]->id; > } > > for (i = 0; i < num_switches; i++) { > s = lash->switches[i]; > while (s->id != i) { > s1 = lash->switches[s->id]; > lash->switches[s->id] = s; > s = s1; > } > } > > Would it work? Even if something like this works (haven't played with it yet), is it worth iterating over lash->switches array to save a memory allocation ? That seems to be what is being optimized to me. Also, couldn't this be a subsequent step in the evolution of this code ? -- Hal > > > Sasha From hal.rosenstock at gmail.com Wed Aug 5 10:07:10 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Aug 2009 13:07:10 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: <20090805163140.GW7993@me> References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <20090804201505.GI7993@me> <20090805134352.GS7993@me> <20090805163140.GW7993@me> Message-ID: On Wed, Aug 5, 2009 at 12:31 PM, Sasha Khapyorsky wrote: > On 10:43 Wed 05 Aug , Hal Rosenstock wrote: > > > > Should this be done as a separate step on the way to the LFT > parallelization > > across switches ? > > What do you mean by "separate step" (separate from what)? Separate patches: first to move the osm_ucast_mgr_set_fwd_table call up a level and a second one to the implement the LFT parallelization across switches underneath that. > > > I'm trying to replay the idea again: each routing engine calculates LFTs > and fill sw->new_lfts array accordingly, after all it calls a procedure > for sending switches' LFT blocks (and TOPs).
So routing engine itself > should not care about how exactly LFT blocks update MADs submission is > actually implemented. > Yes, understood. -- Hal > > Sasha From rdreier at cisco.com Wed Aug 5 10:32:02 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Aug 2009 10:32:02 -0700 Subject: [ofa-general] Re: [PATCH 5/5] RDMA/nes: Rework the disconn routine for terminate and flushing In-Reply-To: <20090723220051.GA5304@dewood-MOBL> (Don Wood's message of "Thu, 23 Jul 2009 17:00:51 -0500") References: <20090723220051.GA5304@dewood-MOBL> Message-ID: thanks, applied all 10 pending patches. From rdreier at cisco.com Wed Aug 5 10:40:38 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Aug 2009 10:40:38 -0700 Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed In-Reply-To: (Bart Van Assche's message of "Mon, 3 Aug 2009 15:21:21 +0200") References: Message-ID: Now I'm confused about this patch for another reason: > @@ -1429,6 +1431,8 @@ static int srp_reset_device(struct scsi_ > return FAILED; > if (req->tsk_status) > return FAILED; > + if (!req->scmnd->device) > + return FAILED; > > spin_lock_irq(target->scsi_host->host_lock); This adds the check *after* the call to srp_send_tsk_mgmt() -- which is where scmnd->device will be dereferenced. So how does this fix the bug? - R.
From sashak at voltaire.com Wed Aug 5 10:45:30 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 20:45:30 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches
In-Reply-To:
References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <20090804201505.GI7993@me> <20090805134352.GS7993@me> <20090805163140.GW7993@me>
Message-ID: <20090805174530.GX7993@me>

On 13:07 Wed 05 Aug , Hal Rosenstock wrote:
>
> Separate patches: first to move the osm_ucast_mgr_set_fwd_table call up a
> level and a second one to implement the LFT parallelization across
> switches underneath that.

Basically I'm fine with single patch too. And yes, it could be done as
you are proposing - it is up to you.

Sasha

From rdreier at cisco.com Wed Aug 5 10:44:23 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 10:44:23 -0700
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed
In-Reply-To: (Bart Van Assche's message of "Wed, 5 Aug 2009 17:00:03 +0200")
References:
Message-ID:

 > The NULL pointer dereference happens when srp_reset_device() calls
 > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with
 > req->scmnd->device == NULL. When the sg_reset command issues an
 > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an
 > scmnd structure and sets scmnd->device to NULL. It is this scmnd
 > structure that is passed to srp_reset_device(). What I'm not sure
 > about is whether scsi_reset_provider() should set req->scmnd->device
 > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to
 > handle the condition req->scmnd->device == NULL.

Well, I don't see how the reset ioctl can do anything useful unless it
passes a device in with the scsi command -- otherwise for example
srp_reset_device() has no idea what LUN to try and reset.

 - R.
From bart.vanassche at gmail.com Wed Aug 5 10:48:40 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Wed, 5 Aug 2009 19:48:40 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed
In-Reply-To:
References:
Message-ID:

On Wed, Aug 5, 2009 at 7:40 PM, Roland Dreier wrote:
> Now I'm confused about this patch for another reason:
>
>  > @@ -1429,6 +1431,8 @@ static int srp_reset_device(struct scsi_
>  >              return FAILED;
>  >      if (req->tsk_status)
>  >              return FAILED;
>  > +    if (!req->scmnd->device)
>  > +            return FAILED;
>  >
>  >      spin_lock_irq(target->scsi_host->host_lock);
>
> This adds the check *after* the call to srp_send_tsk_mgmt() -- which is
> where scmnd->device will be dereferenced.  So how does this fix the bug?

I made a mistake while preparing and posting the patch. The check
should have been inserted before the call to srp_send_tsk_mgmt() of
course.

Bart.

From sashak at voltaire.com Wed Aug 5 10:50:00 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Aug 2009 20:50:00 +0300
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches for lash
In-Reply-To:
References: <20090722151615.GA24576@comcast.net> <20090805094433.GQ7993@me> <20090805162236.GU7993@me>
Message-ID: <20090805175000.GY7993@me>

On 13:03 Wed 05 Aug , Hal Rosenstock wrote:
> >
> > Thinking more about this I suppose that an original structure is good
> > enough for doing what you need without intermediate buffers.
It could be
> > something like this:
> >
> >	qsort(index....);
> >
> >	for (i = 0; i < num_switches; i++)
> >		lash->switches[index[i].index]->id = i;
> >
> >	for (i = 0; i < num_switches; i++) {
> >		s = lash->switches[i];
> >		for (j = 0; j < s->num_links; j++)
> >			s->links[j]->switch_id =
> >				lash->switches[s->links[j]->switch_id]->id;
> >	}
> >
> >	for (i = 0; i < num_switches; i++) {
> >		s = lash->switches[i];
> >		while (s->id != i) {
> >			s1 = lash->switches[s->id];
> >			lash->switches[s->id] = s;
> >			s = s1;
> >		}
> >	}
> >
> > Would it work?
>
>
> Even if something like this works (haven't played with it yet), is it worth
> iterating over lash->switches array to save a memory allocation ?

It is single pass finally - just put everything in the places. I don't
think that this introduces more calculations than the original code did.

> Also, couldn't this be a subsequent
> step in the evolution of this code ?

Yes, I think it could.

Sasha

From bart.vanassche at gmail.com Wed Aug 5 10:48:47 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Wed, 5 Aug 2009 19:48:47 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed
In-Reply-To:
References:
Message-ID:

On Wed, Aug 5, 2009 at 7:44 PM, Roland Dreier wrote:
>
>  > The NULL pointer dereference happens when srp_reset_device() calls
>  > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with
>  > req->scmnd->device == NULL. When the sg_reset command issues an
>  > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an
>  > scmnd structure and sets scmnd->device to NULL. It is this scmnd
>  > structure that is passed to srp_reset_device(). What I'm not sure
>  > about is whether scsi_reset_provider() should set req->scmnd->device
>  > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to
>  > handle the condition req->scmnd->device == NULL.
>
> Well, I don't see how the reset ioctl can do anything useful unless it
> passes a device in with the scsi command -- otherwise for example
> srp_reset_device() has no idea what LUN to try and reset.

(added linux-scsi in CC)

I hope one of the SCSI people can tell us why scsi_reset_provider()
passes the value NULL in req->scmnd->device to

From rdreier at cisco.com Wed Aug 5 10:48:53 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Aug 2009 10:48:53 -0700
Subject: [ofa-general] [PATCH linux-next 2/5] RDMA/cxgb3: Don't free the endpoint early.
In-Reply-To: <20090731193230.2550.42865.stgit@build.ogc.int> (Steve Wise's message of "Fri, 31 Jul 2009 14:32:30 -0500")
References: <20090731193225.2550.35448.stgit@build.ogc.int> <20090731193230.2550.42865.stgit@build.ogc.int>
Message-ID:

 > - Endpoint flags now need to be set via atomic bitops because they can
 >   be set on both the iw_cxgb3 workqueue thread and user disconnect threads.

 > +	if (!test_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags)) {
 > +		set_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags);

for atomicity, should all the places that do test_bit then set_bit
really be using test_and_set_bit()?

it would be cleaner anyway.

 - R.

From swise at opengridcomputing.com Wed Aug 5 10:54:41 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 05 Aug 2009 12:54:41 -0500
Subject: [ofa-general] [PATCH linux-next 2/5] RDMA/cxgb3: Don't free the endpoint early.
In-Reply-To:
References: <20090731193225.2550.35448.stgit@build.ogc.int> <20090731193230.2550.42865.stgit@build.ogc.int>
Message-ID: <4A79C761.6010105@opengridcomputing.com>

Roland Dreier wrote:
>  > - Endpoint flags now need to be set via atomic bitops because they can
>  >   be set on both the iw_cxgb3 workqueue thread and user disconnect threads.
>
>  > +	if (!test_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags)) {
>  > +		set_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags);
>
> for atomicity, should all the places that do test_bit then set_bit
> really be using test_and_set_bit()?
>
> it would be cleaner anyway.
>
>  - R.
>

This particular bit is only set/read on the workq thread. But I agree I
should be using test_and_set_bit(). I'll resend.

Steve.

From bart.vanassche at gmail.com Wed Aug 5 10:54:17 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Wed, 5 Aug 2009 19:54:17 +0200
Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed
In-Reply-To:
References:
Message-ID:

On Wed, Aug 5, 2009 at 7:44 PM, Roland Dreier wrote:
>
>  > The NULL pointer dereference happens when srp_reset_device() calls
>  > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with
>  > req->scmnd->device == NULL. When the sg_reset command issues an
>  > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an
>  > scmnd structure and sets scmnd->device to NULL. It is this scmnd
>  > structure that is passed to srp_reset_device(). What I'm not sure
>  > about is whether scsi_reset_provider() should set req->scmnd->device
>  > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to
>  > handle the condition req->scmnd->device == NULL.
>
> Well, I don't see how the reset ioctl can do anything useful unless it
> passes a device in with the scsi command -- otherwise for example
> srp_reset_device() has no idea what LUN to try and reset.

(added linux-scsi in CC)

I hope one of the SCSI people can tell us whether the behavior that
scsi_reset_provider() passes the value NULL in req->scmnd->device to
scsi_try_bus_device_reset() is correct ?

Bart.
From rdreier at cisco.com Wed Aug 5 11:31:34 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Aug 2009 11:31:34 -0700 Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected In-Reply-To: (Bart Van Assche's message of "Thu, 23 Jul 2009 08:35:59 +0200") References: Message-ID: So I queued up the patch below for 2.6.32... this is almost the same as the patch I proposed before except that I fixed two places where I dropped the lock *after* calling ipoib_send() -- which missed the whole point of what I was trying to do. So this patch has a much better chance of actually working! [PATCH] IPoIB: Drop priv->lock before calling ipoib_send() IPoIB currently must use irqsave locking for priv->lock, since it is taken from interrupt context in one path. However, ipoib_send() does skb_orphan(), and the network stack locking is not IRQ-safe. Therefore we need to make sure we don't hold priv->lock when calling ipoib_send() to avoid lockdep warnings (the code was almost certainly safe in practice, since the only code path that takes priv->lock from interrupt context would never call into the network stack). 
Addresses: http://bugzilla.kernel.org/show_bug.cgi?id=13757 Reported-by: Bart Van Assche Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 ++++++- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 2 ++ 2 files changed, 8 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index e319d91..2bf5116 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -604,8 +604,11 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) skb_queue_len(&neigh->queue)); goto err_drop; } - } else + } else { + spin_unlock_irqrestore(&priv->lock, flags); ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb_dst(skb)->neighbour->ha)); + return; + } } else { neigh->ah = NULL; @@ -688,7 +691,9 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); + spin_unlock_irqrestore(&priv->lock, flags); ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr)); + return; } else if ((path->query || !path_rec_start(dev, path)) && skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { /* put pseudoheader back on for next time */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index a0e9753..a0825fe 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -720,7 +720,9 @@ out: } } + spin_unlock_irqrestore(&priv->lock, flags); ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + return; } unlock: -- 1.6.3.3 From hal.rosenstock at gmail.com Wed Aug 5 11:49:56 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Aug 2009 14:49:56 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Reorder switches for lash In-Reply-To: <20090805175000.GY7993@me> References: <20090722151615.GA24576@comcast.net> 
<20090805094433.GQ7993@me> <20090805162236.GU7993@me> <20090805175000.GY7993@me>
Message-ID:

On Wed, Aug 5, 2009 at 1:50 PM, Sasha Khapyorsky wrote:
> On 13:03 Wed 05 Aug , Hal Rosenstock wrote:
> > >
> > > Thinking more about this I suppose that an original structure is good
> > > enough for doing what you need without intermediate buffers. It could be
> > > something like this:
> > >
> > >	qsort(index....);
> > >
> > >	for (i = 0; i < num_switches; i++)
> > >		lash->switches[index[i].index]->id = i;
> > >
> > >	for (i = 0; i < num_switches; i++) {
> > >		s = lash->switches[i];
> > >		for (j = 0; j < s->num_links; j++)
> > >			s->links[j]->switch_id =
> > >				lash->switches[s->links[j]->switch_id]->id;
> > >	}
> > >
> > >	for (i = 0; i < num_switches; i++) {
> > >		s = lash->switches[i];
> > >		while (s->id != i) {
> > >			s1 = lash->switches[s->id];
> > >			lash->switches[s->id] = s;
> > >			s = s1;
> > >		}
> > >	}
> > >
> > > Would it work?
> >
> >
> > Even if something like this works (haven't played with it yet), is it worth
> > iterating over lash->switches array to save a memory allocation ?
>
> It is single pass finally - just put everything in the places. I don't
> think that this introduces more calculations than the original code did.

I'll work on it in the background (and note this in the updated patch
description).

>
> > Also, couldn't this be a subsequent
> > step in the evolution of this code ?
>
> Yes, I think it could.

Good; I'll resubmit a slightly updated version shortly.

-- Hal

>
> Sasha
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From hnrose at comcast.net Wed Aug 5 11:48:22 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 5 Aug 2009 14:48:22 -0400 Subject: [ofa-general] [PATCHv3] opensm/osm_mesh.c: Reorder switches for lash Message-ID: <20090805184822.GA21614@comcast.net> The goal of this patch is to change the order of the switches in the array kept in the lash context from the original order to one in which the switches are presented in 'odometer order'. When the main routine in lash is called the switches are in an order that is likely based on the order that the switches were originally visited by SM topology discovery which is some sort of tree walk. All of the analysis up to this point is independent of the actual order of the switches, but lash will use that order to enumerate the paths in the fabric and add them to the VL bins. Odometer order means that the switches are labelled s[X0, ..., Xn-1] and ordered s[0, ..., 0], s[0, ..., 1], s[0, ..., Ln-1], s[0, .. 1, 0] etc. The dimensions are also reordered so that the dimension changing the fastest has the largest length, i.e. Ln >= Ln-1 >= ... >= L1. [All this is modulo possible end to end reversal but the basic idea is that the longest axis changes fastest.] TO INVESTIGATE: Rather than using an additional switches array in sort_switches whether it can be done in place using p_lash->switches. 
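
A minimal sketch of how the in-place variant might look, assuming each element already carries its target position in `id` and that those values form a permutation of 0..n-1. The names `struct sw` and `permute_in_place()` are hypothetical and much simpler than the real `switch_t` and `sort_switches()`. It follows each permutation cycle, as the loop discussed earlier in this thread does, and adds what looks like a needed final store into slot `i` once the cycle closes.

```c
#include <stddef.h>

/* Hypothetical, simplified record; the real switch_t has more fields. */
struct sw {
	int id;		/* target position after sorting */
	int tag;	/* payload, for illustration only */
};

/*
 * Move every element to slot arr[k]->id without a second array.
 * Assumes the id values form a permutation of 0..n-1.
 */
void permute_in_place(struct sw **arr, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		struct sw *s = arr[i];

		/* Walk the cycle that starts at slot i: drop the element
		 * in hand into its final slot and pick up the one it
		 * displaces, until we hold the element that belongs at i. */
		while (s->id != i) {
			struct sw *next = arr[s->id];

			arr[s->id] = s;
			s = next;
		}
		arr[i] = s;	/* close the cycle */
	}
}
```

Each element is stored exactly once, so the work is linear in the number of switches. Remapping the link `switch_id` values would still have to happen before the array is permuted, while the old indices are valid.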
Signed-off-by: Robert Pearson Signed-off-by: Hal Rosenstock --- Changes since v2: Made completion struct contain context rather than context pointer Renamed variable from index to comp in sort_switches for better code clarity Changes since v1: Made change reentrant FWIW Added more to patch description Added memory allocation failure handling diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c index 23fad87..72a9aa9 100644 --- a/opensm/opensm/osm_mesh.c +++ b/opensm/opensm/osm_mesh.c @@ -185,6 +185,16 @@ typedef struct _mesh { int dim_order[MAX_DIMENSION]; } mesh_t; +typedef struct sort_ctx { + lash_t *p_lash; + mesh_t *mesh; +} sort_ctx_t; + +typedef struct comp { + int index; + sort_ctx_t ctx; +} comp_t; + /* * poly_alloc * @@ -1272,6 +1282,84 @@ static int reorder_links(lash_t *p_lash, mesh_t *mesh) } /* + * compare two switches in a sort + */ +static int compare_switches(const void *p1, const void *p2) +{ + int i, j, d; + const comp_t *cp1 = p1, *cp2 = p2; + const sort_ctx_t *ctx = &cp1->ctx; + switch_t *s1 = ctx->p_lash->switches[cp1->index]; + switch_t *s2 = ctx->p_lash->switches[cp2->index]; + + for (i = 0; i < ctx->mesh->dimension; i++) { + j = ctx->mesh->dim_order[i]; + d = s1->node->coord[j] - s2->node->coord[j]; + + if (d > 0) + return 1; + + if (d < 0) + return -1; + } + + return 0; +} + +/* + * sort_switches - reorder switch array + */ +static void sort_switches(lash_t *p_lash, mesh_t *mesh) +{ + int i, j; + int num_switches = p_lash->num_switches; + comp_t *comp; + int *reverse; + switch_t *s; + switch_t **switches; + + comp = malloc(num_switches * sizeof(comp_t)); + reverse = malloc(num_switches * sizeof(int)); + switches = malloc(num_switches * sizeof(switch_t *)); + if (!comp || !reverse || !switches) { + OSM_LOG(&p_lash->p_osm->log, OSM_LOG_ERROR, + "Failed memory allocation - switches not sorted!\n"); + goto Exit; + } + + for (i = 0; i < num_switches; i++) { + comp[i].index = i; + comp[i].ctx.mesh = mesh; + comp[i].ctx.p_lash = 
p_lash; + } + + qsort(comp, num_switches, sizeof(comp_t), compare_switches); + + for (i = 0; i < num_switches; i++) + reverse[comp[i].index] = i; + + for (i = 0; i < num_switches; i++) { + s = p_lash->switches[comp[i].index]; + switches[i] = s; + s->id = i; + for (j = 0; j < s->node->num_links; j++) + s->node->links[j]->switch_id = + reverse[s->node->links[j]->switch_id]; + } + + for (i = 0; i < num_switches; i++) + p_lash->switches[i] = switches[i]; + +Exit: + if (switches) + free(switches); + if (comp) + free(comp); + if (reverse) + free(reverse); +} + +/* * osm_mesh_delete - free per mesh resources */ static void mesh_delete(mesh_t *mesh) @@ -1470,6 +1558,8 @@ int osm_do_mesh_analysis(lash_t *p_lash) if (reorder_links(p_lash, mesh)) goto err; + sort_switches(p_lash, mesh); + p = buf; p += sprintf(p, "found "); for (i = 0; i < mesh->dimension; i++) From sashak at voltaire.com Wed Aug 5 12:04:42 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 22:04:42 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c: Validate trap is 144 before checking for NodeDescription changed In-Reply-To: <20090804124717.GA12236@comcast.net> References: <20090804124717.GA12236@comcast.net> Message-ID: <20090805190442.GZ7993@me> On 08:47 Tue 04 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock > --- > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c > index bf39926..925cb27 100644 > --- a/opensm/opensm/osm_trap_rcv.c > +++ b/opensm/opensm/osm_trap_rcv.c > @@ -2,6 +2,7 @@ > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -546,42 +547,47 @@ trap_rcv_process_request(IN osm_sm_t * sm, > } > } > > - /* Check for node description update. IB Spec v1.2.1 pg 823 */ > - if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES && > - p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { > - OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description update\n"); > - > - if (p_physp) { > - CL_PLOCK_ACQUIRE(sm->p_lock); > - osm_req_get_node_desc(sm, p_physp); > - CL_PLOCK_RELEASE(sm->p_lock); > - } else { > - OSM_LOG(sm->p_log, OSM_LOG_ERROR, > - "ERR 3812: No physical port found for " > - "trap 144: \"node description update\"\n"); > + if (ib_notice_is_generic(p_ntci)) { > + /* Check for node description update. IB Spec v1.2.1 pg 823 */ > + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) { > + if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES && > + p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { > + OSM_LOG(sm->p_log, OSM_LOG_INFO, > + "Trap 144 Node description update\n"); > + > + if (p_physp) { > + CL_PLOCK_ACQUIRE(sm->p_lock); > + osm_req_get_node_desc(sm, p_physp); > + CL_PLOCK_RELEASE(sm->p_lock); > + } else > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, > + "ERR 3812: No physical port found for " > + "trap 144: \"node description update\"\n"); > + } > } > - } > > - /* do a sweep if we received a trap */ > - if (sm->p_subn->opt.sweep_on_trap) { > - /* if this is trap number 128 or run_heavy_sweep is TRUE - > - update the force_heavy_sweep flag of the subnet. > - Sweep also on traps 144/145 - these traps signal a change of > - certain port capabilities/system image guid. > - TODO: In the future this can be changed to just getting > - PortInfo on this port instead of sweeping the entire subnet. 
*/ > - if (ib_notice_is_generic(p_ntci) && > - (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || > - run_heavy_sweep)) { > - OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > - "Forcing heavy sweep. Received trap:%u\n", > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); > + /* do a sweep if we received a trap */ > + if (sm->p_subn->opt.sweep_on_trap) { > + /* if this is trap number 128 or run_heavy_sweep is > + TRUE - update the force_heavy_sweep flag of the > + subnet. Also, sweep also on traps 144/145 - > + these traps signal a change of certain port > + capabilities/system image guid. > + TODO: In the future this can be changed to just > + getting PortInfo on this port instead of sweeping > + the entire subnet. */ > + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || > + run_heavy_sweep) { > + OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > + "Forcing heavy sweep. Received trap:%u\n", > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); > > - sm->p_subn->force_heavy_sweep = TRUE; > + sm->p_subn->force_heavy_sweep = TRUE; > + } > + osm_sm_signal(sm, OSM_SIGNAL_SWEEP); > } > - osm_sm_signal(sm, OSM_SIGNAL_SWEEP); Actually this disables sweep (light) on non generic traps. Was it desired change? Could you see any potential issues with it? 
Sasha > } > > /* If we reached here due to trap 129/130/131 - do not need to do > From hnrose at comcast.net Wed Aug 5 12:03:44 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 5 Aug 2009 15:03:44 -0400 Subject: [ofa-general] [PATCH] opensm/osm_mesh.h: Fix SFW copyright Message-ID: <20090805190344.GA28221@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_mesh.h b/opensm/include/opensm/osm_mesh.h index 173fa86..3800372 100644 --- a/opensm/include/opensm/osm_mesh.h +++ b/opensm/include/opensm/osm_mesh.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2088 System Fabric Works, Inc. + * Copyright (c) 2008,2009 System Fabric Works, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU From hal.rosenstock at gmail.com Wed Aug 5 12:10:59 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Aug 2009 15:10:59 -0400 Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c: Validate trap is 144 before checking for NodeDescription changed In-Reply-To: <20090805190442.GZ7993@me> References: <20090804124717.GA12236@comcast.net> <20090805190442.GZ7993@me> Message-ID: On Wed, Aug 5, 2009 at 3:04 PM, Sasha Khapyorsky wrote: > On 08:47 Tue 04 Aug , Hal Rosenstock wrote: > > > > Signed-off-by: Hal Rosenstock > > --- > > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c > > index bf39926..925cb27 100644 > > --- a/opensm/opensm/osm_trap_rcv.c > > +++ b/opensm/opensm/osm_trap_rcv.c > > @@ -2,6 +2,7 @@ > > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > > * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > * licenses. 
You may choose to be licensed under the terms of the GNU > > @@ -546,42 +547,47 @@ trap_rcv_process_request(IN osm_sm_t * sm, > > } > > } > > > > - /* Check for node description update. IB Spec v1.2.1 pg 823 */ > > - if (p_ntci->data_details.ntc_144.local_changes & > TRAP_144_MASK_OTHER_LOCAL_CHANGES && > > - p_ntci->data_details.ntc_144.change_flgs & > TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { > > - OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description > update\n"); > > - > > - if (p_physp) { > > - CL_PLOCK_ACQUIRE(sm->p_lock); > > - osm_req_get_node_desc(sm, p_physp); > > - CL_PLOCK_RELEASE(sm->p_lock); > > - } else { > > - OSM_LOG(sm->p_log, OSM_LOG_ERROR, > > - "ERR 3812: No physical port found for " > > - "trap 144: \"node description update\"\n"); > > + if (ib_notice_is_generic(p_ntci)) { > > + /* Check for node description update. IB Spec v1.2.1 pg 823 > */ > > + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) { > > + if (p_ntci->data_details.ntc_144.local_changes & > TRAP_144_MASK_OTHER_LOCAL_CHANGES && > > + p_ntci->data_details.ntc_144.change_flgs & > TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { > > + OSM_LOG(sm->p_log, OSM_LOG_INFO, > > + "Trap 144 Node description > update\n"); > > + > > + if (p_physp) { > > + CL_PLOCK_ACQUIRE(sm->p_lock); > > + osm_req_get_node_desc(sm, p_physp); > > + CL_PLOCK_RELEASE(sm->p_lock); > > + } else > > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, > > + "ERR 3812: No physical port > found for " > > + "trap 144: \"node > description update\"\n"); > > + } > > } > > - } > > > > - /* do a sweep if we received a trap */ > > - if (sm->p_subn->opt.sweep_on_trap) { > > - /* if this is trap number 128 or run_heavy_sweep is TRUE - > > - update the force_heavy_sweep flag of the subnet. > > - Sweep also on traps 144/145 - these traps signal a > change of > > - certain port capabilities/system image guid. > > - TODO: In the future this can be changed to just getting > > - PortInfo on this port instead of sweeping the entire > subnet. 
*/ > > - if (ib_notice_is_generic(p_ntci) && > > - (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || > > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || > > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || > > - run_heavy_sweep)) { > > - OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > > - "Forcing heavy sweep. Received trap:%u\n", > > - > cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); > > + /* do a sweep if we received a trap */ > > + if (sm->p_subn->opt.sweep_on_trap) { > > + /* if this is trap number 128 or run_heavy_sweep is > > + TRUE - update the force_heavy_sweep flag of the > > + subnet. Also, sweep also on traps 144/145 - > > + these traps signal a change of certain port > > + capabilities/system image guid. > > + TODO: In the future this can be changed to just > > + getting PortInfo on this port instead of > sweeping > > + the entire subnet. */ > > + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == > 128 || > > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == > 144 || > > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == > 145 || > > + run_heavy_sweep) { > > + OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > > + "Forcing heavy sweep. Received > trap:%u\n", > > + > cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); > > > > - sm->p_subn->force_heavy_sweep = TRUE; > > + sm->p_subn->force_heavy_sweep = TRUE; > > + } > > + osm_sm_signal(sm, OSM_SIGNAL_SWEEP); > > } > > - osm_sm_signal(sm, OSM_SIGNAL_SWEEP); > > Actually this disables sweep (light) on non generic traps. Was it desired > change? It was unintended; I'll resubmit adding that back. -- Hal > Could you see any potential issues with it? 
> > Sasha > > > } > > > > /* If we reached here due to trap 129/130/131 - do not need to do > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at comcast.net Wed Aug 5 12:19:41 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 5 Aug 2009 15:19:41 -0400 Subject: [ofa-general] [PATCHv2] opensm/osm_trap_rcv.c: Validate trap is 144 before checking for NodeDescription changed Message-ID: <20090805191941.GA29886@comcast.net> Signed-off-by: Hal Rosenstock --- Changes since v1: Add back in light sweep on non generic traps which was inadvertently removed in original version of patch diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index bf39926..26a052e 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -546,43 +547,49 @@ trap_rcv_process_request(IN osm_sm_t * sm, } } - /* Check for node description update. 
IB Spec v1.2.1 pg 823 */ - if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES && - p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { - OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description update\n"); - - if (p_physp) { - CL_PLOCK_ACQUIRE(sm->p_lock); - osm_req_get_node_desc(sm, p_physp); - CL_PLOCK_RELEASE(sm->p_lock); - } else { - OSM_LOG(sm->p_log, OSM_LOG_ERROR, - "ERR 3812: No physical port found for " - "trap 144: \"node description update\"\n"); + if (ib_notice_is_generic(p_ntci)) { + /* Check for node description update. IB Spec v1.2.1 pg 823 */ + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) { + if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES && + p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { + OSM_LOG(sm->p_log, OSM_LOG_INFO, + "Trap 144 Node description update\n"); + + if (p_physp) { + CL_PLOCK_ACQUIRE(sm->p_lock); + osm_req_get_node_desc(sm, p_physp); + CL_PLOCK_RELEASE(sm->p_lock); + } else + OSM_LOG(sm->p_log, OSM_LOG_ERROR, + "ERR 3812: No physical port found for " + "trap 144: \"node description update\"\n"); + } } - } - /* do a sweep if we received a trap */ - if (sm->p_subn->opt.sweep_on_trap) { - /* if this is trap number 128 or run_heavy_sweep is TRUE - - update the force_heavy_sweep flag of the subnet. - Sweep also on traps 144/145 - these traps signal a change of - certain port capabilities/system image guid. - TODO: In the future this can be changed to just getting - PortInfo on this port instead of sweeping the entire subnet. */ - if (ib_notice_is_generic(p_ntci) && - (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || - run_heavy_sweep)) { - OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, - "Forcing heavy sweep. 
Received trap:%u\n", - cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); + /* do a sweep if we received a trap */ + if (sm->p_subn->opt.sweep_on_trap) { + /* if this is trap number 128 or run_heavy_sweep is + TRUE - update the force_heavy_sweep flag of the + subnet. Also, sweep also on traps 144/145 - + these traps signal a change of certain port + capabilities/system image guid. + TODO: In the future this can be changed to just + getting PortInfo on this port instead of sweeping + the entire subnet. */ + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || + run_heavy_sweep) { + OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, + "Forcing heavy sweep. Received trap:%u\n", + cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); - sm->p_subn->force_heavy_sweep = TRUE; + sm->p_subn->force_heavy_sweep = TRUE; + } + osm_sm_signal(sm, OSM_SIGNAL_SWEEP); } + } else if (sm->p_subn->opt.sweep_on_trap) osm_sm_signal(sm, OSM_SIGNAL_SWEEP); - } /* If we reached here due to trap 129/130/131 - do not need to do the notice report. Just goto exit. We know this is the case From sashak at voltaire.com Wed Aug 5 12:46:11 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 22:46:11 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_trap_rcv.c: Validate trap is 144 before checking for NodeDescription changed In-Reply-To: <20090805191941.GA29886@comcast.net> References: <20090805191941.GA29886@comcast.net> Message-ID: <20090805194611.GA7993@me> On 15:19 Wed 05 Aug , Hal Rosenstock wrote: > > - /* do a sweep if we received a trap */ > - if (sm->p_subn->opt.sweep_on_trap) { > - /* if this is trap number 128 or run_heavy_sweep is TRUE - > - update the force_heavy_sweep flag of the subnet. > - Sweep also on traps 144/145 - these traps signal a change of > - certain port capabilities/system image guid. 
> - TODO: In the future this can be changed to just getting > - PortInfo on this port instead of sweeping the entire subnet. */ > - if (ib_notice_is_generic(p_ntci) && > - (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || > - run_heavy_sweep)) { > - OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > - "Forcing heavy sweep. Received trap:%u\n", > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); > + /* do a sweep if we received a trap */ > + if (sm->p_subn->opt.sweep_on_trap) { > + /* if this is trap number 128 or run_heavy_sweep is > + TRUE - update the force_heavy_sweep flag of the > + subnet. Also, sweep also on traps 144/145 - > + these traps signal a change of certain port > + capabilities/system image guid. > + TODO: In the future this can be changed to just > + getting PortInfo on this port instead of sweeping > + the entire subnet. */ > + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || > + run_heavy_sweep) { > + OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > + "Forcing heavy sweep. Received trap:%u\n", > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); > > - sm->p_subn->force_heavy_sweep = TRUE; > + sm->p_subn->force_heavy_sweep = TRUE; > + } > + osm_sm_signal(sm, OSM_SIGNAL_SWEEP); > } > + } else if (sm->p_subn->opt.sweep_on_trap) > osm_sm_signal(sm, OSM_SIGNAL_SWEEP); > - } For me this part seems simpler in the original code, so I applied this patch as: diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index bf39926..d2e4202 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
+ * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -547,7 +548,9 @@ trap_rcv_process_request(IN osm_sm_t * sm, } /* Check for node description update. IB Spec v1.2.1 pg 823 */ - if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES && + if (ib_notice_is_generic(p_ntci) && + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 && + p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES && p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description update\n"); @@ -555,11 +558,10 @@ trap_rcv_process_request(IN osm_sm_t * sm, CL_PLOCK_ACQUIRE(sm->p_lock); osm_req_get_node_desc(sm, p_physp); CL_PLOCK_RELEASE(sm->p_lock); - } else { + } else OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 3812: No physical port found for " "trap 144: \"node description update\"\n"); - } } /* do a sweep if we received a trap */ Hope it is fine for you. Sasha From hal.rosenstock at gmail.com Wed Aug 5 12:48:51 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Aug 2009 15:48:51 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_trap_rcv.c: Validate trap is 144 before checking for NodeDescription changed In-Reply-To: <20090805194611.GA7993@me> References: <20090805191941.GA29886@comcast.net> <20090805194611.GA7993@me> Message-ID: On Wed, Aug 5, 2009 at 3:46 PM, Sasha Khapyorsky wrote: > On 15:19 Wed 05 Aug , Hal Rosenstock wrote: > > > > - /* do a sweep if we received a trap */ > > - if (sm->p_subn->opt.sweep_on_trap) { > > - /* if this is trap number 128 or run_heavy_sweep is TRUE - > > - update the force_heavy_sweep flag of the subnet. > > - Sweep also on traps 144/145 - these traps signal a > change of > > - certain port capabilities/system image guid. 
> > - TODO: In the future this can be changed to just getting > > - PortInfo on this port instead of sweeping the entire > subnet. */ > > - if (ib_notice_is_generic(p_ntci) && > > - (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || > > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || > > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || > > - run_heavy_sweep)) { > > - OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > > - "Forcing heavy sweep. Received trap:%u\n", > > - > cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); > > + /* do a sweep if we received a trap */ > > + if (sm->p_subn->opt.sweep_on_trap) { > > + /* if this is trap number 128 or run_heavy_sweep is > > + TRUE - update the force_heavy_sweep flag of the > > + subnet. Also, sweep also on traps 144/145 - > > + these traps signal a change of certain port > > + capabilities/system image guid. > > + TODO: In the future this can be changed to just > > + getting PortInfo on this port instead of > sweeping > > + the entire subnet. */ > > + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == > 128 || > > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == > 144 || > > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == > 145 || > > + run_heavy_sweep) { > > + OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > > + "Forcing heavy sweep. Received > trap:%u\n", > > + > cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); > > > > - sm->p_subn->force_heavy_sweep = TRUE; > > + sm->p_subn->force_heavy_sweep = TRUE; > > + } > > + osm_sm_signal(sm, OSM_SIGNAL_SWEEP); > > } > > + } else if (sm->p_subn->opt.sweep_on_trap) > > osm_sm_signal(sm, OSM_SIGNAL_SWEEP); > > - } > > For me this part seems simpler in the original code, so I applied this > patch as: > > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c > index bf39926..d2e4202 100644 > --- a/opensm/opensm/osm_trap_rcv.c > +++ b/opensm/opensm/osm_trap_rcv.c > @@ -2,6 +2,7 @@ > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. 
> * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -547,7 +548,9 @@ trap_rcv_process_request(IN osm_sm_t * sm, > } > > /* Check for node description update. IB Spec v1.2.1 pg 823 */ > - if (p_ntci->data_details.ntc_144.local_changes & > TRAP_144_MASK_OTHER_LOCAL_CHANGES && > + if (ib_notice_is_generic(p_ntci) && > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 && > + p_ntci->data_details.ntc_144.local_changes & > TRAP_144_MASK_OTHER_LOCAL_CHANGES && > p_ntci->data_details.ntc_144.change_flgs & > TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { > OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description > update\n"); > > @@ -555,11 +558,10 @@ trap_rcv_process_request(IN osm_sm_t * sm, > CL_PLOCK_ACQUIRE(sm->p_lock); > osm_req_get_node_desc(sm, p_physp); > CL_PLOCK_RELEASE(sm->p_lock); > - } else { > + } else > OSM_LOG(sm->p_log, OSM_LOG_ERROR, > "ERR 3812: No physical port found for " > "trap 144: \"node description update\"\n"); > - } > } > > /* do a sweep if we received a trap */ > > > Hope it is fine for you. Sure (that's a smaller/simpler change). I'll retest to be sure when it's in the tree. There's more coming which may head more towards where I was trying to go but we'll see what happens with the next steps with this... -- Hal > > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sashak at voltaire.com Wed Aug 5 12:50:18 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 22:50:18 +0300 Subject: [ofa-general] Re: [PATCHv3] opensm/osm_mesh.c: Reorder switches for lash In-Reply-To: <20090805184822.GA21614@comcast.net> References: <20090805184822.GA21614@comcast.net> Message-ID: <20090805195018.GB7993@me> On 14:48 Wed 05 Aug , Hal Rosenstock wrote: > > The goal of this patch is to change the order of the switches in the array kept > in the lash context from the original order to one in which the switches are > presented in 'odometer order'. > > When the main routine in lash is called the switches are in an order that is > likely based on the order that the switches were originally visited by SM > topology discovery which is some sort of tree walk. All of the analysis up to > this point is independent of the actual order of the switches, but lash will > use that order to enumerate the paths in the fabric and add them to the VL bins. > > Odometer order means that the switches are labelled s[X0, ..., Xn-1] and > ordered s[0, ..., 0], s[0, ..., 1], s[0, ..., Ln-1], s[0, .. 1, 0] etc. > The dimensions are also reordered so that the dimension changing the fastest > has the largest length, i.e. Ln >= Ln-1 >= ... >= L1. [All this is modulo > possible end to end reversal but the basic idea is that the longest axis > changes fastest.] > > TO INVESTIGATE: Rather than using an additional switches array in > sort_switches whether it can be done in place using p_lash->switches. > > Signed-off-by: Robert Pearson > Signed-off-by: Hal Rosenstock Applied. Thanks. 
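As an aside on the "odometer order" described in the quoted text: a minimal sketch, assuming each switch already carries its mesh coordinates X0 .. Xn-1 and the dimensions are already arranged so that the longest axis comes last. The names struct sw_coord, MAX_DIM, odometer_cmp() and sort_odometer() are hypothetical and are not taken from osm_mesh.c. Like the patch description, this sorts a separate array of records rather than reordering p_lash->switches in place.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_DIM 8 /* hypothetical bound, for illustration only */

/* Hypothetical stand-in for a switch; not the opensm/lash structure. */
struct sw_coord {
	int id;         /* position in the original (discovery) order */
	int x[MAX_DIM]; /* mesh coordinates X0 .. Xn-1 */
};

/* Number of dimensions the comparator looks at (qsort has no context arg). */
static int n_dims;

/*
 * Lexicographic compare: X0 is the most significant digit and Xn-1 the
 * least significant, so the last coordinate changes fastest, as on an
 * odometer.
 */
static int odometer_cmp(const void *a, const void *b)
{
	const struct sw_coord *sa = a, *sb = b;
	int d;

	for (d = 0; d < n_dims; d++) {
		if (sa->x[d] != sb->x[d])
			return sa->x[d] < sb->x[d] ? -1 : 1;
	}
	return 0;
}

/* Sort an array of switch records into odometer order. */
static void sort_odometer(struct sw_coord *sw, size_t count, int dims)
{
	assert(dims > 0 && dims <= MAX_DIM);
	n_dims = dims;
	qsort(sw, count, sizeof(*sw), odometer_cmp);
}
```

For a 2 x 3 mesh this yields s[0,0], s[0,1], s[0,2], s[1,0], s[1,1], s[1,2], i.e. the axis of length 3 changes fastest.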
Sasha From sashak at voltaire.com Wed Aug 5 12:50:37 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Aug 2009 22:50:37 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_mesh.h: Fix SFW copyright In-Reply-To: <20090805190344.GA28221@comcast.net> References: <20090805190344.GA28221@comcast.net> Message-ID: <20090805195037.GC7993@me> On 15:03 Wed 05 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From rdreier at cisco.com Wed Aug 5 13:04:44 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Aug 2009 13:04:44 -0700 Subject: [ofa-general] Re: [PATCH linux-next 1/5] RDMA/cxgb3: unregister leaks memory. In-Reply-To: <20090731193225.2550.35448.stgit@build.ogc.int> (Steve Wise's message of "Fri, 31 Jul 2009 14:32:25 -0500") References: <20090731193225.2550.35448.stgit@build.ogc.int> Message-ID: thanks, applied From rdreier at cisco.com Wed Aug 5 13:06:22 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Aug 2009 13:06:22 -0700 Subject: [ofa-general] Re: [PATCH linux-next 3/5] RDMA/cxgb3: wake up any waiters on peer close/abort. In-Reply-To: <20090731193235.2550.20835.stgit@build.ogc.int> (Steve Wise's message of "Fri, 31 Jul 2009 14:32:35 -0500") References: <20090731193225.2550.35448.stgit@build.ogc.int> <20090731193235.2550.20835.stgit@build.ogc.int> Message-ID: this one won't apply without 2/5, so I'll wait for you to resend both patches... From rdreier at cisco.com Wed Aug 5 13:06:28 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Aug 2009 13:06:28 -0700 Subject: [ofa-general] Re: [PATCH linux-next 4/5] RDMA/cxgb3: Set the appropriate IO channel in rdma_init work requests. In-Reply-To: <20090731193241.2550.43016.stgit@build.ogc.int> (Steve Wise's message of "Fri, 31 Jul 2009 14:32:41 -0500") References: <20090731193225.2550.35448.stgit@build.ogc.int> <20090731193241.2550.43016.stgit@build.ogc.int> Message-ID: thanks, applied this and 5/5. 
From rdreier at cisco.com Wed Aug 5 13:38:30 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Aug 2009 13:38:30 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/nes: map MTU to IB_MTU_* and correctly report link state In-Reply-To: <20090710204506.GA5060@ctung-MOBL> (Chien Tung's message of "Fri, 10 Jul 2009 15:45:06 -0500") References: <20090710204506.GA5060@ctung-MOBL> Message-ID: thanks, applied From rdreier at cisco.com Wed Aug 5 13:39:58 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Aug 2009 13:39:58 -0700 Subject: [ofa-general] Re: [PATCH] ipath: strncpy does not null terminate string In-Reply-To: <4A608922.7060900@gmail.com> (Roel Kluin's message of "Fri, 17 Jul 2009 16:22:26 +0200") References: <4A607754.4040204@gmail.com> <4A608922.7060900@gmail.com> Message-ID: > --- a/drivers/infiniband/hw/ipath/ipath_mad.c > +++ b/drivers/infiniband/hw/ipath/ipath_mad.c > @@ -60,7 +60,7 @@ static int recv_subn_get_nodedescription(struct ib_smp *smp, > if (smp->attr_mod) > smp->status |= IB_SMP_INVALID_FIELD; > > - strncpy(smp->data, ibdev->node_desc, sizeof(smp->data)); > + strlcpy(smp->data, ibdev->node_desc, sizeof(smp->data)); > > return reply(smp); > } node_desc isn't really a string, is it? It seems that we should be using memcpy() here (since I think it is perfectly valid according to the IB architecture to have NULs in the node description). - R. From jgunthorpe at obsidianresearch.com Wed Aug 5 13:42:59 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 5 Aug 2009 14:42:59 -0600 Subject: [ofa-general] [PATCHv4 10/10] mlx4: Add RDMAoE support - allow interfaces to correspond to each other In-Reply-To: <20090805083023.GK5599@mtls03> References: <20090805083023.GK5599@mtls03> Message-ID: <20090805204259.GB16677@obsidianresearch.com> On Wed, Aug 05, 2009 at 11:30:23AM +0300, Eli Cohen wrote: > for setting the GID table of a port has been added.
Currently, each > IB port has a single GID entry in its table and that GID entry > equals the link-local IPv6 address. FWIW, I like this approach, and mapping to/from this GID to the MAC without an ND operation nicely divorces the RDMAoE stuff from the IPv6 stack. What about multicast though? Switches are going to have trouble with group membership lists for non-IP packets. Even just sending an ICMPv6 packet (with an IPv6 ethertype) isn't guaranteed to fix it. Jason From sean.hefty at intel.com Wed Aug 5 13:43:12 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Aug 2009 13:43:12 -0700 Subject: [ofa-general] [PATCHv4 01/10] ib_core: Refine device personality from node type to port type In-Reply-To: <20090805082808.GB5599@mtls03> References: <20090805082808.GB5599@mtls03> Message-ID: <73235A80972A43A0A54C09DBA44CA41C@amr.corp.intel.com> >As a preparation to devices that, in general, support different transport >protocol for each port, specifically RDMAoE, this patch defines transport type >for each of a device's ports. As a result rdma_node_get_transport() has been >unexported and is used internally by the implementation of the new API, >rdma_port_get_transport() which gives the transport protocol of the queried >port. All references to rdma_node_get_transport() are changed to use >rdma_port_get_transport(). Also, ib_port_attr is extended to contain enum >rdma_transport_type. Can resources (PDs, CQs, MRs, etc.) be shared between the different transports? Does QP failover between transports work?
>diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c >index 5130fc5..f930f1d 100644 >--- a/drivers/infiniband/core/cm.c >+++ b/drivers/infiniband/core/cm.c >@@ -3678,9 +3678,7 @@ static void cm_add_one(struct ib_device *ib_device) > unsigned long flags; > int ret; > u8 i; >- >- if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB) >- return; Did you consider modifying rdma_node_get_transport_s_() and returning a bitmask of the supported transports available on the device? I'm wondering if something like this makes sense, to allow skipping devices that are not of interest to a particular module. This would be in addition to the rdma_port_get_transport call. There's just a lot of new checks to handle the transport on a port by port basis. - Sean From rdreier at cisco.com Wed Aug 5 14:10:38 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Aug 2009 14:10:38 -0700 Subject: [ofa-general] [PATCH] cma: fix access to freed memory In-Reply-To: <04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com> (Sean Hefty's message of "Wed, 5 Aug 2009 08:46:43 -0700") References: <20090803092528.GA25528@mtls03> <04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com> Message-ID: > rdma_destroy_id and rdma_leave_multicast call ib_sa_free_multicast. This call > will block until the join callback completes or is canceled. Can you describe > the race with cma_ib_mc_handler in more detail? > > Also, cma_leave_mc_groups is only called from rdma_destroy_id. Locking around > the mc->list shouldn't be required, since calls to join/leave aren't allowed. So where does this leave things? Is any part of Eli's patch needed? - R. 
From sean.hefty at intel.com Wed Aug 5 14:17:10 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Aug 2009 14:17:10 -0700 Subject: [ofa-general] [PATCH] cma: fix access to freed memory In-Reply-To: References: <20090803092528.GA25528@mtls03> <04A426654441482FBE6FAB6C8234B672@amr.corp.intel.com> Message-ID: >So where does this leave things? Is any part of Eli's patch needed? I don't believe the patch is needed, and Eli agreed with this. - Sean From hal.rosenstock at gmail.com Wed Aug 5 14:48:06 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Aug 2009 17:48:06 -0400 Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c: Validate trap is 144 before checking for NodeDescription changed In-Reply-To: References: <20090804124717.GA12236@comcast.net> <20090805190442.GZ7993@me> Message-ID: On Wed, Aug 5, 2009 at 3:10 PM, Hal Rosenstock wrote: > > > On Wed, Aug 5, 2009 at 3:04 PM, Sasha Khapyorsky wrote: > >> On 08:47 Tue 04 Aug , Hal Rosenstock wrote: >> > >> > Signed-off-by: Hal Rosenstock >> > --- >> > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c >> > index bf39926..925cb27 100644 >> > --- a/opensm/opensm/osm_trap_rcv.c >> > +++ b/opensm/opensm/osm_trap_rcv.c >> > @@ -2,6 +2,7 @@ >> > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. >> > * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights >> reserved. >> > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. >> > + * Copyright (c) 2009 HNR Consulting. All rights reserved. >> > * >> > * This software is available to you under a choice of one of two >> > * licenses. You may choose to be licensed under the terms of the GNU >> > @@ -546,42 +547,47 @@ trap_rcv_process_request(IN osm_sm_t * sm, >> > } >> > } >> > >> > - /* Check for node description update. 
IB Spec v1.2.1 pg 823 */ >> > - if (p_ntci->data_details.ntc_144.local_changes & >> TRAP_144_MASK_OTHER_LOCAL_CHANGES && >> > - p_ntci->data_details.ntc_144.change_flgs & >> TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { >> > - OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node >> description update\n"); >> > - >> > - if (p_physp) { >> > - CL_PLOCK_ACQUIRE(sm->p_lock); >> > - osm_req_get_node_desc(sm, p_physp); >> > - CL_PLOCK_RELEASE(sm->p_lock); >> > - } else { >> > - OSM_LOG(sm->p_log, OSM_LOG_ERROR, >> > - "ERR 3812: No physical port found for " >> > - "trap 144: \"node description >> update\"\n"); >> > + if (ib_notice_is_generic(p_ntci)) { >> > + /* Check for node description update. IB Spec v1.2.1 pg >> 823 */ >> > + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) { >> > + if (p_ntci->data_details.ntc_144.local_changes & >> TRAP_144_MASK_OTHER_LOCAL_CHANGES && >> > + p_ntci->data_details.ntc_144.change_flgs & >> TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { >> > + OSM_LOG(sm->p_log, OSM_LOG_INFO, >> > + "Trap 144 Node description >> update\n"); >> > + >> > + if (p_physp) { >> > + CL_PLOCK_ACQUIRE(sm->p_lock); >> > + osm_req_get_node_desc(sm, >> p_physp); >> > + CL_PLOCK_RELEASE(sm->p_lock); >> > + } else >> > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, >> > + "ERR 3812: No physical >> port found for " >> > + "trap 144: \"node >> description update\"\n"); >> > + } >> > } >> > - } >> > >> > - /* do a sweep if we received a trap */ >> > - if (sm->p_subn->opt.sweep_on_trap) { >> > - /* if this is trap number 128 or run_heavy_sweep is TRUE - >> > - update the force_heavy_sweep flag of the subnet. >> > - Sweep also on traps 144/145 - these traps signal a >> change of >> > - certain port capabilities/system image guid. >> > - TODO: In the future this can be changed to just getting >> > - PortInfo on this port instead of sweeping the entire >> subnet. 
*/ >> > - if (ib_notice_is_generic(p_ntci) && >> > - (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || >> > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || >> > - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || >> > - run_heavy_sweep)) { >> > - OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, >> > - "Forcing heavy sweep. Received trap:%u\n", >> > - >> cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); >> > + /* do a sweep if we received a trap */ >> > + if (sm->p_subn->opt.sweep_on_trap) { >> > + /* if this is trap number 128 or run_heavy_sweep >> is >> > + TRUE - update the force_heavy_sweep flag of the >> > + subnet. Also, sweep also on traps 144/145 - >> > + these traps signal a change of certain port >> > + capabilities/system image guid. >> > + TODO: In the future this can be changed to just >> > + getting PortInfo on this port instead of >> sweeping >> > + the entire subnet. */ >> > + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == >> 128 || >> > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == >> 144 || >> > + cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == >> 145 || >> > + run_heavy_sweep) { >> > + OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, >> > + "Forcing heavy sweep. Received >> trap:%u\n", >> > + >> cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); >> > >> > - sm->p_subn->force_heavy_sweep = TRUE; >> > + sm->p_subn->force_heavy_sweep = TRUE; >> > + } >> > + osm_sm_signal(sm, OSM_SIGNAL_SWEEP); >> > } >> > - osm_sm_signal(sm, OSM_SIGNAL_SWEEP); >> >> Actually this disables sweep (light) on non generic traps. Was it desired >> change? > > > It was unintended; I'll resubmit adding that back. > > -- Hal > > >> Could you see any potential issues with it? > > In thinking about it, I'm not sure what light sweep on non generic trap accomplishes anyhow. 
-- Hal > >> >> Sasha >> >> > } >> > >> > /* If we reached here due to trap 129/130/131 - do not need to do >> > From hnrose at comcast.net Wed Aug 5 15:06:13 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 5 Aug 2009 18:06:13 -0400 Subject: [ofa-general] [PATCH] opensm/osm_mesh.c: Remove edges in lash matrix Message-ID: <20090805220613.GA7155@comcast.net> The intent of this change to remove edge nodes is to *not* count them. The point of this heuristic is to deal with the case of small lattices which can easily have more surface than interior leading to choosing a non representative seed. This causes impossible counts to get reported.
Signed-off-by: Robert Pearson Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c index 72a9aa9..b5d141d 100644 --- a/opensm/opensm/osm_mesh.c +++ b/opensm/opensm/osm_mesh.c @@ -170,6 +170,11 @@ static const struct mesh_info { {8, {2, 2, 2, 2, 2, 2, 2, 2}, 8, {-1792, -6144, -8960, -7168, -3360, -896, -112, 0, 1}, }, + /* + * mesh errors + */ + {2, {6, 6}, 4, {-192, -256, -80, 0, 1}, }, + {-1, {0,}, 0, {0, }, }, }; @@ -727,6 +732,36 @@ done: } /* + * remove_edges + * + * remove type from nodes that have fewer links + * than adjacent nodes + */ +static void remove_edges(lash_t *p_lash) +{ + int sw; + mesh_node_t *n, *nn; + int i; + + for (sw = 0; sw < p_lash->num_switches; sw++) { + n = p_lash->switches[sw]->node; + if (!n->type) + continue; + + for (i = 0; i < n->num_links; i++) { + nn = p_lash->switches[n->links[i]->switch_id]->node; + + if (nn->num_links > n->num_links) { + printf("removed edge switch %s\n", + p_lash->switches[sw]->p_sw->p_node->print_desc); + n->type = -1; + break; + } + } + } +} + +/* * get_local_geometry * * analyze the local geometry around each switch @@ -735,6 +770,7 @@ static int get_local_geometry(lash_t *p_lash, mesh_t *mesh) { osm_log_t *p_log = &p_lash->p_osm->log; int sw; + int status = 0; OSM_LOG_ENTER(p_log); @@ -747,15 +783,38 @@ static int get_local_geometry(lash_t *p_lash, mesh_t *mesh) continue; if (get_switch_metric(p_lash, sw)) { - OSM_LOG_EXIT(p_log); - return -1; + status = -1; + goto Exit; } - classify_switch(p_lash, mesh, sw); classify_mesh_type(p_lash, sw); } + remove_edges(p_lash); + + for (sw = 0; sw < p_lash->num_switches; sw++) { + if (p_lash->switches[sw]->node->type < 0) + continue; + classify_switch(p_lash, mesh, sw); + } + +Exit: OSM_LOG_EXIT(p_log); - return 0; + return status; +} + +static void print_axis(lash_t *p_lash, int sw, int port) +{ + mesh_node_t *node = p_lash->switches[sw]->node; + char *name = p_lash->switches[sw]->p_sw->p_node->print_desc; + int c = 
node->axes[port]; + + printf("%s[%d] = ", name, port); + if (c) + printf("%s%c -> ", ((c - 1) & 1) ? "-" : "+", 'X' + (c - 1)/2); + else + printf("N/A -> "); + printf("%s\n", + p_lash->switches[node->links[port]->switch_id]->p_sw->p_node->print_desc); } /* @@ -805,6 +864,11 @@ static void seed_axes(lash_t *p_lash, int sw) } } + for (i = 0; i < n; i++) { + printf("seed: "); + print_axis(p_lash, sw, i); + } + done: OSM_LOG_EXIT(p_log); } @@ -878,6 +942,12 @@ static void make_geometry(lash_t *p_lash, int sw) n = s1->node->num_links; /* + * ignore chain fragments + */ + if (n < seed->node->num_links && n <= 2) + continue; + + /* * only process 'mesh' switches */ if (!s1->node->matrix) @@ -908,7 +978,8 @@ static void make_geometry(lash_t *p_lash, int sw) if (j == i) continue; - if (s1->node->matrix[i][j] != 2) { + if (s1->node->matrix[i][j] != 2 && + s1->node->matrix[i][j] <= 4) { if (s1->node->axes[j]) { if (s1->node->axes[j] != opposite(seed, s1->node->axes[i])) { OSM_LOG(p_log, OSM_LOG_DEBUG, "phase 1 mismatch\n"); From hnrose at comcast.net Wed Aug 5 15:27:37 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 5 Aug 2009 18:27:37 -0400 Subject: [ofa-general] [PATCH] opensm/osm_trap_rcv.c: In trap_rcv_process_request, no need to sweep on trap 145 and certain trap 144s Message-ID: <20090805222737.GA8523@comcast.net> NodeDescription changed trap only needs to query the new NodeDescription and not cause sweep Similarly for SystemImageGUID changed (trap 145) LinkWidth/SpeedEnabled changed traps (at least right now) and SM priority changed traps do need to sweep. In the future, LinkWidth/SpeedEnabled changed trap handling can query PortInfo (may also need to bounce port too). Also, as noted in related email thread, it's unclear what sweeping on non generic traps accomplishes but this behavior is preserved. 
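The rules in the description above can be read as one predicate over the trap number and the trap 144 change flags. The following is only a sketch of that reading, assuming sweep_on_trap is enabled; wants_heavy_sweep() and the CHG_* values are hypothetical stand-ins (the real TRAP_144_MASK_* macros and their values live in the opensm headers and are not reproduced here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define CHG_NODE_DESC      (1u << 0)
#define CHG_LINK_WIDTH_EN  (1u << 1)
#define CHG_LINK_SPEED_EN  (1u << 2)
#define CHG_SM_PRIORITY    (1u << 3)

/*
 * Decide whether a received trap should force a heavy sweep.
 * Non generic traps never force one; they only get the plain sweep.
 */
static bool wants_heavy_sweep(bool is_generic, uint16_t trap_num,
			      unsigned change_flags, bool run_heavy_sweep)
{
	if (!is_generic)
		return false;
	if (trap_num == 128 || run_heavy_sweep)
		return true;
	if (trap_num == 144)
		return (change_flags & (CHG_LINK_WIDTH_EN |
					CHG_LINK_SPEED_EN |
					CHG_SM_PRIORITY)) != 0;
	/* Trap 145 and NodeDescription-only trap 144 end up here. */
	return false;
}
```

A NodeDescription-only trap 144 and a trap 145 therefore return false, while an SM priority change on trap 144 returns true.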
Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index d2e4202..e5bd529 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -291,8 +291,9 @@ trap_rcv_process_request(IN osm_sm_t * sm, osm_physp_t *p_physp; cl_ptr_vector_t *p_tbl; osm_port_t *p_port; + osm_node_t *p_node; ib_net16_t source_lid = 0; - boolean_t is_gsi = TRUE; + boolean_t is_gsi = TRUE, is_trap144_sweep = FALSE; uint8_t port_num = 0; boolean_t physp_change_trap = FALSE; uint64_t event_wheel_timeout = OSM_DEFAULT_TRAP_SUPRESSION_TIMEOUT; @@ -547,44 +548,59 @@ trap_rcv_process_request(IN osm_sm_t * sm, } } - /* Check for node description update. IB Spec v1.2.1 pg 823 */ - if (ib_notice_is_generic(p_ntci) && - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 && - p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES && - p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { - OSM_LOG(sm->p_log, OSM_LOG_INFO, "Trap 144 Node description update\n"); - - if (p_physp) { - CL_PLOCK_ACQUIRE(sm->p_lock); - osm_req_get_node_desc(sm, p_physp); - CL_PLOCK_RELEASE(sm->p_lock); - } else - OSM_LOG(sm->p_log, OSM_LOG_ERROR, - "ERR 3812: No physical port found for " - "trap 144: \"node description update\"\n"); - } + if (ib_notice_is_generic(p_ntci)) { + /* Check for node description update. 
IB Spec v1.2.1 pg 823 */ + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144) { + /* update port's capability mask (in PortInfo) */ + p_physp->port_info.capability_mask = p_ntci->data_details.ntc_144.new_cap_mask; + if (p_ntci->data_details.ntc_144.local_changes & TRAP_144_MASK_OTHER_LOCAL_CHANGES) { + if (p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_NODE_DESCRIPTION_CHANGE) { + OSM_LOG(sm->p_log, OSM_LOG_INFO, + "Trap 144 Node description update\n"); + + if (p_physp) { + CL_PLOCK_ACQUIRE(sm->p_lock); + osm_req_get_node_desc(sm, p_physp); + CL_PLOCK_RELEASE(sm->p_lock); + } else + OSM_LOG(sm->p_log, OSM_LOG_ERROR, + "ERR 3812: No physical port found for " + "trap 144: \"node description update\"\n"); + } + } + if (p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_LINK_WIDTH_ENABLE_CHANGE || + p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_LINK_SPEED_ENABLE_CHANGE || + p_ntci->data_details.ntc_144.change_flgs & TRAP_144_MASK_SM_PRIORITY_CHANGE) + is_trap144_sweep = TRUE; + } - /* do a sweep if we received a trap */ - if (sm->p_subn->opt.sweep_on_trap) { - /* if this is trap number 128 or run_heavy_sweep is TRUE - - update the force_heavy_sweep flag of the subnet. - Sweep also on traps 144/145 - these traps signal a change of - certain port capabilities/system image guid. - TODO: In the future this can be changed to just getting - PortInfo on this port instead of sweeping the entire subnet. */ - if (ib_notice_is_generic(p_ntci) && - (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 144 || - cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145 || - run_heavy_sweep)) { - OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, - "Forcing heavy sweep. 
Received trap:%u\n", - cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145) { + /* update system image guid (in NodeInfo) */ + p_node = osm_physp_get_node_ptr(p_physp); + if (p_node) + p_node->node_info.node_guid = p_ntci->data_details.ntc_145.new_sys_guid; + } + + /* do a sweep if we received a trap */ + if (sm->p_subn->opt.sweep_on_trap) { + /* if this is trap number 128 or run_heavy_sweep is + TRUE - update the force_heavy_sweep flag of the + subnet. Also, sweep on certain types of trap 144. + TODO: In the future this can be changed to just + getting PortInfo on this port instead of sweeping + the entire subnet. */ + if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 128 || + is_trap144_sweep || run_heavy_sweep) { + OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, + "Forcing heavy sweep. Received trap:%u\n", + cl_ntoh16(p_ntci->g_or_v.generic.trap_num)); - sm->p_subn->force_heavy_sweep = TRUE; + sm->p_subn->force_heavy_sweep = TRUE; + } + osm_sm_signal(sm, OSM_SIGNAL_SWEEP); } + } else if (sm->p_subn->opt.sweep_on_trap) osm_sm_signal(sm, OSM_SIGNAL_SWEEP); - } /* If we reached here due to trap 129/130/131 - do not need to do the notice report. Just goto exit. We know this is the case From nashwath at gmail.com Wed Aug 5 17:03:04 2009 From: nashwath at gmail.com (Ashwath Narasimhan) Date: Wed, 5 Aug 2009 20:03:04 -0400 Subject: [ofa-general] Setting the rate in Infiniband. Message-ID: Hello, Problem:-- I am trying to set the "Rate" and "MTU" in infiniband using the config file (opensm QoS) but I realize that I cannot set Rates like 500Mbps or 100Mbps. The infiniband rates start from 2.5Gbps. (IBT specification vol 1 -> page 917, Table 207 "PathRecord" ) Background:-- The reason why I need such small rates is because I interface the Infiniband HCA to an FPGA via an Infiniband physical link. Imagine the FPGA as a simple repeater that simply forwards the infiniband signals to the Target HCA. 
The FPGA cannot handle such a high data rate and neither do I have as much memory as required to buffer it on the FPGA (I might drop packets if the buffer becomes full). Hence I wish to limit the rate to say 100Mbps instead of 2.5Gbps. Question:- How can I set rates less than 2.5Gbps? Can this be changed at all? regards, Ashwath From rdreier at cisco.com Wed Aug 5 17:20:11 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Aug 2009 17:20:11 -0700 Subject: [ofa-general] Setting the rate in Infiniband. In-Reply-To: (Ashwath Narasimhan's message of "Wed, 5 Aug 2009 20:03:04 -0400") References: Message-ID: > The reason why I need such small rates is because I interface the Infiniband > HCA to an FPGA via an Infiniband physical link. Imagine the FPGA as a > simple repeater that simply forwards the infiniband signals to the Target > HCA. The FPGA cannot handle such a high data rate and neither do I have as > much memory as required to buffer it on the FPGA (I might drop packets if > the buffer becomes full). Hence I wish to limit the rate to say 100Mbps > instead of 2.5Gbps. > > Question:- > How can I set rates less than 2.5Gbps? Can this be changed at all? The IB physical layer does not define any signaling slower than 2.5Gbps, i.e. 1 lane of single data rate. There is nothing slower that any real IB HCA or switch will be able to do. Therefore for your FPGA to be able to talk IB at all, you will need to be able to handle a 1X SDR link (and do 8b/10b encoding, etc). Note that the 8b/10b encoding means there is really only 2 Gbps of data on a 1X SDR link. However, it is OK (in theory) for your FPGA to handle only the minimal IB MTU (256 bytes), and it is also OK for your FPGA to give only a small number of link-layer credits (whatever it has buffering for). This should limit the data you have to buffer to what you can handle, and lets you throttle the traffic at the link level.
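To make the 8b/10b arithmetic concrete, here is a small sketch. The helper name data_rate_bps() is made up for illustration; the only facts it relies on are the ones stated above (2.5 Gbps signaling per SDR lane, 8 data bits carried in every 10 line bits).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Usable data rate on an 8b/10b coded link: every 10 bits on the wire
 * carry 8 bits of data, and lanes add up linearly.
 */
static uint64_t data_rate_bps(uint64_t lane_signal_rate_bps, unsigned lanes)
{
	assert(lanes > 0);
	return lane_signal_rate_bps * lanes / 10 * 8;
}
```

With a lane rate of 2500000000 this gives 2000000000 for 1X SDR and 8000000000 for 4X SDR.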
You will still need to be able to handle link packets, idles, etc at the full data rate. With that said, I would expect any FPGA fancy enough to have a SERDES capable of doing IB signaling to be able to handle 2 Gbps of real traffic, since I've seen designs doing fairly complex processing using Virtex II (i.e. 5+ year old FPGAs) able to handle full 4X SDR IB links. I guess it depends on the sophistication of your RTL. - R. From bart.vanassche at gmail.com Thu Aug 6 00:39:45 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Thu, 6 Aug 2009 09:39:45 +0200 Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in the SRP initiator Message-ID: On Wed, Aug 5, 2009 at 10:37 PM, James Bottomley wrote: > On Wed, 2009-08-05 at 19:54 +0200, Bart Van Assche wrote: >> On Wed, Aug 5, 2009 at 7:44 PM, Roland Dreier wrote: >> > >> > > The NULL pointer dereference happens when srp_reset_device() calls >> > > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with >> > > req->scmnd->device == NULL. When the sg_reset command issues an >> > > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an >> > > scmnd structure and sets scmnd->device to NULL. It is this scmnd >> > > structure that is passed to srp_reset_device(). What I'm not sure >> > > about is whether scsi_reset_provider() should set req->scmnd->device >> > > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to >> > > handle the condition req->scmnd->device == NULL. >> > >> > Well, I don't see how the reset ioctl can do anything useful unless it >> > passes a device in with the scsi command -- otherwise for example >> > srp_reset_device() has no idea what LUN to try and reset. >> >> (added linux-scsi in CC) >> >> I hope one of the SCSI people can tell us whether the behavior that >> scsi_reset_provider() >> passes the value NULL in req->scmnd->device to >> scsi_try_bus_device_reset() is correct ? > > Need more information.
> > cmd->device is supposed to be initialised in scsi_get_command(), which > scsi_reset_provider() calls ... why do you think it got set to null? This thread started with the observation that it is easy to trigger a NULL pointer dereference in the SRP initiator (http://bugzilla.kernel.org/show_bug.cgi?id=13893). The following sequence is sufficient: * Remove the ib_srp kernel module (doing so closes all active SRP sessions). * Insert the ib_srp kernel module. * Create a new SRP connection. * Issue the sg_reset -d ${srp_device} command in a shell. The sg_reset command issues an SG_SCSI_RESET ioctl. This ioctl is processed by invoking scsi_reset_provider(), which in turn invokes the eh_device_reset_handler method of the SRP initiator. Further analysis showed that scsi_reset_provider() passes a non-NULL cmd->device pointer to the SRP initiator, but that the SRP initiator does not use this value. Instead srp_find_req() looks up a struct srp_request pointer based on the struct scsi_cmnd * argument and continues with the struct scsi_cmnd pointer contained in the struct srp_request. While I'm not sure that the patch below makes any sense, it makes the NULL pointer dereference disappear. This made me wonder which assumptions srp_find_req() is based on?
--- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c 2009-08-03 12:13:11.000000000 +0200 +++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c 2009-08-06 08:50:30.000000000 +0200 @@ -1325,16 +1325,19 @@ static int srp_cm_handler(struct ib_cm_i } static int srp_send_tsk_mgmt(struct srp_target_port *target, + struct scsi_cmnd *scmnd, struct srp_request *req, u8 func) { struct srp_iu *iu; struct srp_tsk_mgmt *tsk_mgmt; + BUG_ON(!scmnd->device); + spin_lock_irq(target->scsi_host->host_lock); if (target->state == SRP_TARGET_DEAD || target->state == SRP_TARGET_REMOVED) { - req->scmnd->result = DID_BAD_TARGET << 16; + scmnd->result = DID_BAD_TARGET << 16; goto out; } @@ -1348,7 +1351,7 @@ static int srp_send_tsk_mgmt(struct srp_ memset(tsk_mgmt, 0, sizeof *tsk_mgmt); tsk_mgmt->opcode = SRP_TSK_MGMT; - tsk_mgmt->lun = cpu_to_be64((u64) req->scmnd->device->lun << 48); + tsk_mgmt->lun = cpu_to_be64((u64) scmnd->device->lun << 48); tsk_mgmt->tag = req->index | SRP_TAG_TSK_MGMT; tsk_mgmt->tsk_mgmt_func = func; tsk_mgmt->task_tag = req->index; @@ -1395,7 +1398,7 @@ static int srp_abort(struct scsi_cmnd *s return FAILED; if (srp_find_req(target, scmnd, &req)) return FAILED; - if (srp_send_tsk_mgmt(target, req, SRP_TSK_ABORT_TASK)) + if (srp_send_tsk_mgmt(target, scmnd, req, SRP_TSK_ABORT_TASK)) return FAILED; spin_lock_irq(target->scsi_host->host_lock); @@ -1425,7 +1428,9 @@ static int srp_reset_device(struct scsi_ return FAILED; if (srp_find_req(target, scmnd, &req)) return FAILED; - if (srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET)) + if (WARN_ON(!scmnd->device)) + return FAILED; + if (srp_send_tsk_mgmt(target, scmnd, req, SRP_TSK_LUN_RESET)) return FAILED; if (req->tsk_status) return FAILED; Bart. 
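[Editorial sketch, not part of the original message.] The patch above keeps the expression cpu_to_be64((u64) scmnd->device->lun << 48) and only changes which scsi_cmnd the LUN is read from. As a rough userspace illustration of what that expression puts on the wire, the following C fragment stores the shifted value byte by byte in big-endian order. The helper name encode_lun() is hypothetical, and the explicit byte loop stands in for the kernel's cpu_to_be64(); this is a model under those assumptions, not driver code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from ib_srp.c): place a LUN in the top 16 bits
 * of a 64-bit field and write that field in big-endian byte order, which
 * is what cpu_to_be64((u64) lun << 48) yields in memory.
 */
static void encode_lun(unsigned int lun, uint8_t out[8])
{
	uint64_t v = (uint64_t) lun << 48;

	for (int i = 0; i < 8; i++)
		out[i] = (uint8_t) (v >> (56 - 8 * i));
}
```

With this model, LUN 1 ends up as the bytes 00 01 00 00 00 00 00 00, so the LUN occupies the first two bytes of the 8-byte field. If scmnd->device is NULL there is no LUN to encode at all, which is why the patch checks the pointer before reaching this step.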
From bart.vanassche at gmail.com Thu Aug 6 02:58:50 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Thu, 6 Aug 2009 11:58:50 +0200 Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected In-Reply-To: References: Message-ID: On Wed, Aug 5, 2009 at 8:31 PM, Roland Dreier wrote: > So I queued up the patch below for 2.6.32... this is almost the same as > the patch I proposed before except that I fixed two places where I > dropped the lock *after* calling ipoib_send() -- which missed the whole > point of what I was trying to do.  So this patch has a much better > chance of actually working! After having applied this patch it took somewhat longer before a locking inversion report was generated, but unfortunately there still was a locking inversion report generated (see also http://bugzilla.kernel.org/show_bug.cgi?id=13757 for the details): ========================================================= [ INFO: possible irq lock inversion dependency detected ] 2.6.30.4-scst-debug #1 --------------------------------------------------------- swapper/0 just changed the state of lock: (&priv->lock){-.-...}, at: [] ipoib_cm_rx_event_handler+0x4f/0xa0 [ib_ipoib] but this lock took another, HARDIRQ-unsafe lock in the past: (&(&mad_agent_priv->timed_work)->timer){+.-...} and interrupts could create inverse lock ordering between them. [ ... ] stack backtrace: Pid: 0, comm: swapper Not tainted 2.6.30.4-scst-debug #1 Call Trace: [] print_irq_inversion_bug+0x14c/0x1c0 [] check_usage_forwards+0x7d/0xc0 [] mark_lock+0x20f/0x6a0 [] ? check_usage_forwards+0x0/0xc0 [] __lock_acquire+0xce4/0x1c80 [] ? check_usage_forwards+0x0/0xc0 [] lock_acquire+0x108/0x150 [] ? ipoib_cm_rx_event_handler+0x4f/0xa0 [ib_ipoib] [] _spin_lock_irqsave+0x41/0x60 [] ? 
ipoib_cm_rx_event_handler+0x4f/0xa0 [ib_ipoib] [] ipoib_cm_rx_event_handler+0x4f/0xa0 [ib_ipoib] [] mlx4_ib_qp_event+0x7a/0xf0 [mlx4_ib] [] mlx4_qp_event+0x6f/0xe0 [mlx4_core] [] mlx4_eq_int+0x289/0x2e0 [mlx4_core] [] mlx4_msi_x_interrupt+0xf/0x20 [mlx4_core] [] handle_IRQ_event+0x95/0x200 [] handle_edge_irq+0xc8/0x170 [] handle_irq+0x1f/0x30 [] do_IRQ+0x6e/0xf0 [] ret_from_intr+0x0/0xf [] ? acpi_idle_enter_bm+0x27d/0x2ad [processor] [] ? acpi_idle_enter_bm+0x273/0x2ad [processor] [] ? cpuidle_idle_call+0xa5/0x100 [] ? cpu_idle+0x64/0xd0 [] ? start_secondary+0x188/0x1e7 Bart. From vlad at lists.openfabrics.org Thu Aug 6 03:18:14 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 6 Aug 2009 03:18:14 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090806-0200 daily build status Message-ID: <20090806101814.5BB66E300A1@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 
Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one': /home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class' /home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090806-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From vlad at dev.mellanox.co.il Thu Aug 6 
07:35:57 2009 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 06 Aug 2009 17:35:57 +0300 Subject: [ofa-general] OFED-1.4.2 GA is available Message-ID: <4A7AEA4D.6050103@dev.mellanox.co.il> I am pleased to announce that OFED-1.4.2 GA release is done The tarball is available on: http://www.openfabrics.org/downloads/OFED/ofed-1.4.2/OFED-1.4.2.tgz To get BUILD_ID run ofed_info Please report any issues in bugzilla https://bugs.openfabrics.org/ for OFED 1.4.2 Vladimir & Tziporet ======================================================================== Release information: ------------------------------ Linux Operating Systems: - RedHat EL4 up4: 2.6.9-42.ELsmp * - RedHat EL4 up5: 2.6.9-55.ELsmp - RedHat EL4 up6: 2.6.9-67.ELsmp - RedHat EL4 up7: 2.6.9-78.ELsmp - RedHat EL5: 2.6.18-8.el5 - RedHat EL5 up1: 2.6.18-53.el5 - RedHat EL5 up2: 2.6.18-92.el5 - RedHat EL5 up3: 2.6.18-128.el5 - OEL 4.5: 2.6.9-55.ELsmp - OEL 5.2: 2.6.18-92.el5 - CentOS 5.2: 2.6.18-92.el5 - Fedora C9: 2.6.25-14.fc9 * - SLES10: 2.6.16.21-0.8-smp - SLES10 SP1: 2.6.16.46-0.12-smp - SLES10 SP1 up1: 2.6.16.53-0.16-smp - SLES10 SP2: 2.6.16.60-0.21-smp - SLES11 GA: 2.6.27.13-1-default - OpenSuSE 10.3: 2.6.22.5-31 * - kernel.org: 2.6.26 and 2.6.27 * Minimal QA for these versions Systems: * x86_64 * x86 * ia64 * ppc64 Main Changes from OFED-1.4.1 ============================ - NFSRDMA Fix NULL pointer dereference when calling locks_release_private due to the fl_lmops pointer never being set. crypto_alloc_hash calls are failing due to larval returning unexpected results. Reverting nfs4_make_rec_clidname to use crypto_alloc_tfm. kref safety checks were removed in previous versions due to kref behaving differently in 2.6.18 (and older). 
- RDS Refactor end of __conn_create for readability Fix completion notifications on blocking sockets fix for double-def of assert_spin_locked in RHEL4_U4/5/6/7 RDS/IW: Remove dead code Remove page_shift variable from iwarp transport RDS/IB: Always use PAGE_SIZE for FMR page size - MLX4 map sufficient ICM memory for EQs Failing probe function if not primary physical function Add new device ID 0x6764 Fix post send of local invalidate and fast registration packets. - MLX4_EN Fix vlan flag endianess in LRO code. - NES Make LRO as default feature fix qp refcount during disconnect backport for LRO as default feature - SDP Fix BUG1672 - Data integrity error Fix memory leak in bzcopy Fix bad credits advertised when connection initiated Fix compilation on i386 with gcc 3.4 - BACKPORTS 2.6.16_sles10_sp2: fix clear-dirty-page accounting. - Bug fixes See each component release notes for details on enhancements and bug fixes From bart.vanassche at gmail.com Thu Aug 6 08:38:18 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Thu, 6 Aug 2009 17:38:18 +0200 Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in the SRP initiator In-Reply-To: <4A7A949B.60408@panasas.com> References: <4A7A949B.60408@panasas.com> Message-ID: On Thu, Aug 6, 2009 at 10:30 AM, Boaz Harrosh wrote: > [Just out of memory, I've not inspected the code for a long time] > > It looks like an srp_request was never allocated for the reset > command. (since it never went through .queuecommand) > > static int srp_find_req(struct srp_target_port *target, >                        struct scsi_cmnd *scmnd, >                        struct srp_request **req) > { >        if (scmnd->host_scribble == (void *) -1L) >                return -1; > >        *req = &target->req_ring[(long) scmnd->host_scribble]; > >        return 0; > } > > Specifically scmnd->host_scribble can just be Zero. 
> When queues are active that does not matter and a device is found > since the reset does not really need the scsi_cmnd. But in above > scenario the queues were never used and the array entry is empty. Hello Boaz, Thanks for the info. Do you know by heart which SCSI drivers process the SG_SCSI_RESET ioctl correctly and that could be used as an example for fixing the SRP initiator ? Bart. From bart.vanassche at gmail.com Thu Aug 6 09:43:56 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Thu, 6 Aug 2009 18:43:56 +0200 Subject: [ofa-general] IB kernel modules and the kobject release() method Message-ID: Hello, After having enabled CONFIG_DEBUG_KOBJECT=y in the kernel config I noticed that messages appeared in the kernel log about the IB modules missing a kobject release() method. This happens both with a vanilla 2.6.30.4 kernel and with a 2.6.27.29 kernel + OFED 1.4.1. Has anyone noticed this before ? An example of the messages logged in /var/log/messages: ... kobject: 'ib_cm' (ffffffffa067a810): does not have a release() function, it is broken and must be fixed. kobject: 'iw_cm' (ffffffffa06a58d0): does not have a release() function, it is broken and must be fixed. ... See also https://bugs.openfabrics.org/show_bug.cgi?id=1702. Bart. From ssufficool at sbcounty.gov Wed Aug 5 08:42:19 2009 From: ssufficool at sbcounty.gov (Sufficool, Stanley) Date: Wed, 5 Aug 2009 08:42:19 -0700 Subject: [ofa-general] SRP and Multiple Port HCA Message-ID: When I attempt to connect WinOF SRP to an OFED SRP Target, WinOF SRP immediately disconnects after establishing the connection. This is occurring on a 2 port HCA initiator to a 2 port HCA target with redundant paths through 2 switches. IIRC the past solution was to unplug the redundant path because WinOF SRP does not have multi path DSM to handle the visibility of the disk on multiple paths. Unfortunately, we have IPoIB apps that use the redundancy, so this is not an option. 
I can mask off the redundant path at the target using SCST group to name assignments, but the naming of the SRP is by NODE not PORT so I do not have a unique name per port to close one of the paths. The only fix I see for Windows SRP and multi path is to provide an option to use port based naming on the SRP Target for establishing sessions. Or is this something that can be taken care of at the WinOF side with a registry entry or discover / connect tool? Does anyone have other ideas that will work with our current config? From James.Bottomley at HansenPartnership.com Wed Aug 5 13:37:44 2009 From: James.Bottomley at HansenPartnership.com (James Bottomley) Date: Wed, 05 Aug 2009 15:37:44 -0500 Subject: [ofa-general] Re: [PATCH 2.6.30.4] Fix for NULL pointer dereference by SRP initiator triggered by a SCSI reset after the SRP connection has been closed In-Reply-To: References: Message-ID: <1249504664.4183.45.camel@mulgrave.site> On Wed, 2009-08-05 at 19:54 +0200, Bart Van Assche wrote: > On Wed, Aug 5, 2009 at 7:44 PM, Roland Dreier wrote: > > > > > The NULL pointer dereference happens when srp_reset_device() calls > > > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with > > > req->scmnd->device == NULL. When the sg_reset command issues an > > > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an > > > scmnd structure and sets scmnd->device to NULL. It is this scmnd > > > structure that is passed to srp_reset_device(). What I'm not sure > > > about is whether scsi_reset_provider() should set req->scmnd->device > > > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to > > > handle the condition req->scmnd->device == NULL. > > > > Well, I don't see how the reset ioctl can do anything useful unless it > > passes a device in with the scsi command -- otherwise for example > > srp_reset_device() has no idea what LUN to try and reset.
> > (added linux-scsi in CC) > > I hope one of the SCSI people can tell us whether the behavior that > scsi_reset_provider() > passes the value NULL in req->scmnd->device to > scsi_try_bus_device_reset() is correct ? Need more information. cmd->device is supposed to be initialised in scsi_get_command(), which scsi_reset_provider() calls ... why do you think it got set to null? James From bharrosh at panasas.com Thu Aug 6 01:30:19 2009 From: bharrosh at panasas.com (Boaz Harrosh) Date: Thu, 06 Aug 2009 11:30:19 +0300 Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in the SRP initiator In-Reply-To: References: Message-ID: <4A7A949B.60408@panasas.com> On 08/06/2009 10:39 AM, Bart Van Assche wrote: > On Wed, Aug 5, 2009 at 10:37 PM, James > Bottomley wrote: >> On Wed, 2009-08-05 at 19:54 +0200, Bart Van Assche wrote: >>> On Wed, Aug 5, 2009 at 7:44 PM, Roland Dreier wrote: >>>> >>>> > The NULL pointer dereference happens when srp_reset_device() calls >>>> > srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET) with >>>> > req->scmnd->device == NULL. When the sg_reset command issues an >>>> > SG_SCSI_RESET ioctl, scsi_reset_provider() is invoked and allocates an >>>> > scmnd structure and sets scmnd->device to NULL. It is this scmnd >>>> > structure that is passed to srp_reset_device(). What I'm not sure >>>> > about is whether scsi_reset_provider() should set req->scmnd->device >>>> > to a non-NULL value or whether srp_send_tsk_mgmt() should be able to >>>> > handle the condition req->scmnd->device == NULL. >>>> >>>> Well, I don't see how the reset ioctl can do anything useful unless it >>>> passes a device in with the scsi command -- otherwise for example >>>> srp_reset_device() has no idea what LUN to try and reset. >>> >>> (added linux-scsi in CC) >>> >>> I hope one of the SCSI people can tell us whether the behavior that >>> scsi_reset_provider() >>> passes the value NULL in req->scmnd->device to >>> scsi_try_bus_device_reset() is correct ? 
>> >> Need more information. >> >> cmd->device is supposed to be initialised in scsi_get_command(), which >> scsi_reset_provider() calls ... why do you think it got set to null? > > This thread started with the observation that it is easy to trigger a > NULL pointer dereference in the SRP initiator > (http://bugzilla.kernel.org/show_bug.cgi?id=13893). The following > sequence is sufficient: > * Remove the ib_srp kernel module (doing so closes all active SRP sessions). > * Insert the ib_srp kernel module. > * Create a new SRP connection. > * Issue the sg_reset -d ${srp_device} command in a shell. > The sg_reset command issues an SG_SCSI_RESET ioctl. This ioctl is > processed by invoking scsi_reset_provider(), which in turns invokes > the eh_device_reset_handler method of the SRP initiator. Further > analysis showed that scsi_reset_provider() passes a non-NULL > cmd->device pointer to the SRP initiator, but that the SRP initiator > does not use this value. Instead srp_find_req() looks up a struct > srp_request pointer based on the struct scsi_cmnd * argument and > continues with the struct scsi_cmnd pointer contained in the struct > srp_request. > > While I'm not sure that the patch below makes any sense, it makes the > NULL pointer dereference disappear. This made me wonder which > assumptions srp_find_req() is based on ? > [Just out of memory, I've not inspected the code for a long time] It looks like an srp_request was never allocated for the reset command. (since it never went through .queuecommand) static int srp_find_req(struct srp_target_port *target, struct scsi_cmnd *scmnd, struct srp_request **req) { if (scmnd->host_scribble == (void *) -1L) return -1; *req = &target->req_ring[(long) scmnd->host_scribble]; return 0; } Specifically scmnd->host_scribble can just be Zero. When queues are active that does not matter and a device is found since the reset does not really need the scsi_cmnd. 
But in above scenario the queues were never used and the array entry is empty. Boaz > --- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c > 2009-08-03 12:13:11.000000000 +0200 > +++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c 2009-08-06 > 08:50:30.000000000 +0200 > @@ -1325,16 +1325,19 @@ static int srp_cm_handler(struct ib_cm_i > } > > static int srp_send_tsk_mgmt(struct srp_target_port *target, > + struct scsi_cmnd *scmnd, > struct srp_request *req, u8 func) > { > struct srp_iu *iu; > struct srp_tsk_mgmt *tsk_mgmt; > > + BUG_ON(!scmnd->device); > + > spin_lock_irq(target->scsi_host->host_lock); > > if (target->state == SRP_TARGET_DEAD || > target->state == SRP_TARGET_REMOVED) { > - req->scmnd->result = DID_BAD_TARGET << 16; > + scmnd->result = DID_BAD_TARGET << 16; > goto out; > } > > @@ -1348,7 +1351,7 @@ static int srp_send_tsk_mgmt(struct srp_ > memset(tsk_mgmt, 0, sizeof *tsk_mgmt); > > tsk_mgmt->opcode = SRP_TSK_MGMT; > - tsk_mgmt->lun = cpu_to_be64((u64) > req->scmnd->device->lun << 48); > + tsk_mgmt->lun = cpu_to_be64((u64) scmnd->device->lun << 48); > tsk_mgmt->tag = req->index | SRP_TAG_TSK_MGMT; > tsk_mgmt->tsk_mgmt_func = func; > tsk_mgmt->task_tag = req->index; > @@ -1395,7 +1398,7 @@ static int srp_abort(struct scsi_cmnd *s > return FAILED; > if (srp_find_req(target, scmnd, &req)) > return FAILED; > - if (srp_send_tsk_mgmt(target, req, SRP_TSK_ABORT_TASK)) > + if (srp_send_tsk_mgmt(target, scmnd, req, SRP_TSK_ABORT_TASK)) > return FAILED; > > spin_lock_irq(target->scsi_host->host_lock); > @@ -1425,7 +1428,9 @@ static int srp_reset_device(struct scsi_ > return FAILED; > if (srp_find_req(target, scmnd, &req)) > return FAILED; > - if (srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET)) > + if (WARN_ON(!scmnd->device)) > + return FAILED; > + if (srp_send_tsk_mgmt(target, scmnd, req, SRP_TSK_LUN_RESET)) > return FAILED; > if (req->tsk_status) > return FAILED; > > Bart. 
From James.Bottomley at HansenPartnership.com Thu Aug 6 08:43:45 2009 From: James.Bottomley at HansenPartnership.com (James Bottomley) Date: Thu, 06 Aug 2009 15:43:45 +0000 Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in the SRP initiator In-Reply-To: References: <4A7A949B.60408@panasas.com> Message-ID: <1249573425.7073.16.camel@mulgrave.site> On Thu, 2009-08-06 at 17:38 +0200, Bart Van Assche wrote: > On Thu, Aug 6, 2009 at 10:30 AM, Boaz Harrosh wrote: > > [Just out of memory, I've not inspected the code for a long time] > > > > It looks like an srp_request was never allocated for the reset > > command. (since it never went through .queuecommand) > > > > static int srp_find_req(struct srp_target_port *target, > > struct scsi_cmnd *scmnd, > > struct srp_request **req) > > { > > if (scmnd->host_scribble == (void *) -1L) > > return -1; > > > > *req = &target->req_ring[(long) scmnd->host_scribble]; > > > > return 0; > > } > > > > Specifically scmnd->host_scribble can just be Zero. > > When queues are active that does not matter and a device is found > > since the reset does not really need the scsi_cmnd. But in above > > scenario the queues were never used and the array entry is empty. > > Hello Boaz, > > Thanks for the info. Do you know by heart which SCSI drivers process > the SG_SCSI_RESET ioctl correctly and that could be used as an example > for fixing the SRP initiator ? Basically all of them which are in regular use for clustering; so SAN: qla2xxx; lpfc.
And for legacy SPI clusters: aic7xxx;mptspi James From eli at dev.mellanox.co.il Thu Aug 6 10:18:40 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 6 Aug 2009 20:18:40 +0300 Subject: [ofa-general] [PATCHv4 10/10] mlx4: Add RDMAoE support - allow interfaces to correspond to each other In-Reply-To: <20090805204259.GB16677@obsidianresearch.com> References: <20090805083023.GK5599@mtls03> <20090805204259.GB16677@obsidianresearch.com> Message-ID: <20090806171840.GA32301@mtls03> On Wed, Aug 05, 2009 at 02:42:59PM -0600, Jason Gunthorpe wrote: > > What about multicast though? Switches are going to have trouble with > group membership lists for non IP packets.. Even just sending a ICMPv6 > packet (with an IPv6 ethertype) isn't guaranteed to fix it. > In this patch set, all multicast packets use the broadcast mac. We will address this issue at a future time. From eli at dev.mellanox.co.il Thu Aug 6 10:20:35 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 6 Aug 2009 20:20:35 +0300 Subject: [ofa-general] [PATCHv4 01/10] ib_core: Refine device personality from node type to port type In-Reply-To: <73235A80972A43A0A54C09DBA44CA41C@amr.corp.intel.com> References: <20090805082808.GB5599@mtls03> <73235A80972A43A0A54C09DBA44CA41C@amr.corp.intel.com> Message-ID: <20090806172035.GB32301@mtls03> On Wed, Aug 05, 2009 at 01:43:12PM -0700, Sean Hefty wrote: > > Can resources (PDs, CQs, MRs, etc.) between the different transports be shared? > Does QP failover between transports work? There is nothing in the architecture that precludes this; we are not currently focusing on this. > > Did you consider modifying rdma_node_get_transport_s_() and returning a bitmask > of the supported transports available on the device? I'm wondering if something > like this makes sense, to allow skipping devices that are not of interest to a > particular module. This would be in addition to the rdma_port_get_transport > call. 
> > There's just a lot of new checks to handle the transport on a port by port > basis. > We can use a function: rdma_is_transport_supported(ibdev, transport), which will return true if at least one port runs the given transport. Thus, as long as we have only a few transports, these checks will amount to 1-2 lines of code in each module. From sean.hefty at intel.com Thu Aug 6 10:34:19 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Aug 2009 10:34:19 -0700 Subject: [ofa-general] [PATCHv4 01/10] ib_core: Refine device personality from node type to port type In-Reply-To: <20090806172035.GB32301@mtls03> References: <20090805082808.GB5599@mtls03> <73235A80972A43A0A54C09DBA44CA41C@amr.corp.intel.com> <20090806172035.GB32301@mtls03> Message-ID: <69AD3F21660945D4B2D30109F8FC7A55@amr.corp.intel.com> >> Can resources (PDs, CQs, MRs, etc.) between the different transports be >shared? >> Does QP failover between transports work? > >There is nothing in the architecture that precludes this; we are not >currently focusing on this. Does the implementation allow this? Right now PDs, CQs, etc are allocated per device, not per port. I'm not immediately concerned about QP failover. However, I believe there needs to be some level of coordination between the Infiniband side of the CM and the Ethernet side of the CM, since QPs are associated with CA GUIDs. I'm just trying to understand the impact of this coordination. - Sean From rdreier at cisco.com Thu Aug 6 10:37:19 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Aug 2009 10:37:19 -0700 Subject: [ofa-general] IB kernel modules and the kobject release() method In-Reply-To: (Bart Van Assche's message of "Thu, 6 Aug 2009 18:43:56 +0200") References: Message-ID: > > After having enabled CONFIG_DEBUG_KOBJECT=y in the kernel config I > noticed that messages appeared in the kernel log about the IB modules > missing a kobject release() method. 
This happens both with a vanilla > 2.6.30.4 kernel and with a 2.6.27.29 kernel + OFED 1.4.1. Has anyone > noticed this before ? > > An example of the messages logged in /var/log/messages: > > ... > kobject: 'ib_cm' (ffffffffa067a810): does not have a release() > function, it is broken and must be fixed. I don't see anything similar with CONFIG_DEBUG_KOBJECT enabled on 2.6.31-rc5 (without adding in any OFED confusion). It seems as if you get this message for every module being loaded; do you see it for any non-RDMA-related modules? (Do you have any such modules in your config?) I can imagine the OFED build system messing things up, but if you're just building the modules as part of the normal kernel build (ie your vanilla 2.6.30 kernel) then I don't see anything that would make ib_cm or iw_cm any different from any other module. For example, if I load ib_cm on my kernel, the only log messages I see from "dmesg|grep ib_cm" are: kobject: 'ib_cm' (ffffffffa024c8f0): kobject_add_internal: parent: 'module', set: 'module' kobject: 'holders' (ffff88022c1c9df8): kobject_add_internal: parent: 'ib_cm', set: '' kobject: 'ib_cm' (ffffffffa024c8f0): kobject_uevent_env kobject: 'ib_cm' (ffffffffa024c8f0): fill_kobj_path: path = '/module/ib_cm' kobject: 'notes' (ffff88022c1c9be8): kobject_add_internal: parent: 'ib_cm', set: '' - R. From rdreier at cisco.com Thu Aug 6 10:38:20 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Aug 2009 10:38:20 -0700 Subject: [ofa-general] [PATCHv4 10/10] mlx4: Add RDMAoE support - allow interfaces to correspond to each other In-Reply-To: <20090806171840.GA32301@mtls03> (Eli Cohen's message of "Thu, 6 Aug 2009 20:18:40 +0300") References: <20090805083023.GK5599@mtls03> <20090805204259.GB16677@obsidianresearch.com> <20090806171840.GA32301@mtls03> Message-ID: > > What about multicast though? Switches are going to have trouble with > > group membership lists for non IP packets.. 
Even just sending a ICMPv6 > > packet (with an IPv6 ethertype) isn't guaranteed to fix it. > In this patch set, all multicast packets use the broadcast mac. We > will address this issue at a future time. I don't see how you can address it in the future -- if later on things are changed to use multicast addresses, then systems running this code will silently fail to receive multicasts. - R. From rdreier at cisco.com Thu Aug 6 10:41:03 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Aug 2009 10:41:03 -0700 Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in the SRP initiator In-Reply-To: <4A7A949B.60408@panasas.com> (Boaz Harrosh's message of "Thu, 06 Aug 2009 11:30:19 +0300") References: <4A7A949B.60408@panasas.com> Message-ID: > Specifically scmnd->host_scribble can just be Zero. I see at last, thanks! The issue is that SRP is using host_scribble to hold an index, and index 0 is valid for us. I guess the fix is a bit complex, but basically we should use host_scribble to point to the request, and if we don't find a request in reset_device we should allocate one. It's a bit unfortunate that the SCSI midlayer bypasses queueing for the device reset command because it means we may not have a slot in our queue for the reset request etc but I suppose that's even more involved to fix. - R. 
From sean.hefty at intel.com Thu Aug 6 10:52:34 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Aug 2009 10:52:34 -0700 Subject: [ofa-general] [PATCHv4 03/10] ib_core: RDMAoE support only QP1 In-Reply-To: <20090805082854.GD5599@mtls03> References: <20090805082854.GD5599@mtls03> Message-ID: <9E17A35942B547BDBAE2E56101287CBC@amr.corp.intel.com> >diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c >index 7b737c4..de83c71 100644 >--- a/drivers/infiniband/core/mad.c >+++ b/drivers/infiniband/core/mad.c >@@ -199,6 +199,16 @@ struct ib_mad_agent *ib_register_mad_agent(struct >ib_device *device, > unsigned long flags; > u8 mgmt_class, vclass; > >+ /* Validate device and port */ >+ port_priv = ib_get_mad_port(device, port_num); >+ if (!port_priv) { >+ ret = ERR_PTR(-ENODEV); >+ goto error1; >+ } >+ >+ if (!port_priv->qp_info[qp_type].qp) >+ return NULL; It seems odd that the first if has 'goto error1', but the second if simply returns NULL. >+ > /* Validate parameters */ > qpn = get_spl_qp_index(qp_type); > if (qpn == -1) >@@ -260,13 +270,6 @@ struct ib_mad_agent *ib_register_mad_agent(struct >ib_device *device, > goto error1; > } > >- /* Validate device and port */ >- port_priv = ib_get_mad_port(device, port_num); >- if (!port_priv) { >- ret = ERR_PTR(-ENODEV); >- goto error1; >- } >- > /* Allocate structures */ > mad_agent_priv = kzalloc(sizeof *mad_agent_priv, GFP_KERNEL); > if (!mad_agent_priv) { >@@ -556,6 +559,9 @@ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) > struct ib_mad_agent_private *mad_agent_priv; > struct ib_mad_snoop_private *mad_snoop_priv; > >+ if (!mad_agent) >+ return 0; Why would a kernel client call ib_unregister_mad_agent with a NULL pointer? >+ > /* If the TID is zero, the agent can only snoop. 
*/ > if (mad_agent->hi_tid) { > mad_agent_priv = container_of(mad_agent, >@@ -2602,6 +2608,9 @@ static void cleanup_recv_queue(struct ib_mad_qp_info >*qp_info) > struct ib_mad_private *recv; > struct ib_mad_list_head *mad_list; > >+ if (!qp_info->qp) >+ return; >+ > while (!list_empty(&qp_info->recv_queue.list)) { > > mad_list = list_entry(qp_info->recv_queue.list.next, >@@ -2643,6 +2652,9 @@ static int ib_mad_port_start(struct ib_mad_port_private >*port_priv) > > for (i = 0; i < IB_MAD_QPS_CORE; i++) { > qp = port_priv->qp_info[i].qp; >+ if (!qp) >+ continue; >+ > /* > * PKey index for QP1 is irrelevant but > * one is needed for the Reset to Init transition >@@ -2684,6 +2696,9 @@ static int ib_mad_port_start(struct ib_mad_port_private >*port_priv) > } > > for (i = 0; i < IB_MAD_QPS_CORE; i++) { >+ if (!port_priv->qp_info[i].qp) >+ continue; >+ > ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); > if (ret) { > printk(KERN_ERR PFX "Couldn't post receive WRs\n"); >@@ -2762,6 +2777,9 @@ error: > > static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) > { >+ if (!qp_info->qp) >+ return; >+ > ib_destroy_qp(qp_info->qp); > kfree(qp_info->snoop_table); > } >@@ -2777,6 +2795,7 @@ static int ib_mad_port_open(struct ib_device *device, > struct ib_mad_port_private *port_priv; > unsigned long flags; > char name[sizeof "ib_mad123"]; >+ int has_smi; > > /* Create new device info */ > port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL); >@@ -2793,6 +2812,10 @@ static int ib_mad_port_open(struct ib_device *device, > init_mad_qp(port_priv, &port_priv->qp_info[1]); > > cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; >+ has_smi = rdma_port_get_transport(device, port_num) == RDMA_TRANSPORT_IB; >+ if (has_smi) >+ cq_size *= 2; cq_size is doubled twice I really wish there were a cleaner way to add this support that didn't involve adding so many checks throughout the code. It's hard to know if checks were added in all the places that were needed. 
I can't think of a clever way to handle QP 0. - Sean From rdreier at cisco.com Thu Aug 6 10:56:26 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Aug 2009 10:56:26 -0700 Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected In-Reply-To: (Bart Van Assche's message of "Thu, 6 Aug 2009 11:58:50 +0200") References: Message-ID: > After having applied this patch it took somewhat longer before a > locking inversion report was generated, but unfortunately there still > was a locking inversion report generated (see also > http://bugzilla.kernel.org/show_bug.cgi?id=13757 for the details): ummm, yikes... can you apply the hack patch I sent originally to take priv->lock from an interrupt ASAP and try that along with the fix patch to drop priv->lock before calling ipoib_send()? That might make the lockdep trace understandable. - R. From sean.hefty at intel.com Thu Aug 6 11:05:47 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Aug 2009 11:05:47 -0700 Subject: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for RDMAoE ports In-Reply-To: <20090805082910.GE5599@mtls03> References: <20090805082910.GE5599@mtls03> Message-ID: <376E5C8569F4456FBDD942F907DF919A@amr.corp.intel.com> >Initialize umad context for devices that have any of their ports either IB or >RDMAoE so as to allow user space apps to send and receive MADs on QP1. Is there a need to expose QP1 to user space? The CM is in the kernel, and there's not an SA. 
- Sean From sean.hefty at intel.com Thu Aug 6 11:12:49 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Aug 2009 11:12:49 -0700 Subject: [ofa-general] [PATCHv4 05/10] ib/cm: Enable CM support for RDMAoE In-Reply-To: <20090805082919.GF5599@mtls03> References: <20090805082919.GF5599@mtls03> Message-ID: <397FD7F95179400BB728A5807176EF61@amr.corp.intel.com> >diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c >index f930f1d..63d6de3 100644 >--- a/drivers/infiniband/core/cm.c >+++ b/drivers/infiniband/core/cm.c >@@ -3699,7 +3699,7 @@ static void cm_add_one(struct ib_device *ib_device) > set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); > for (i = 1; i <= ib_device->phys_port_cnt; i++) { > tt = rdma_port_get_transport(ib_device, i); >- if (tt != RDMA_TRANSPORT_IB) >+ if (tt != RDMA_TRANSPORT_IB && tt != RDMA_TRANSPORT_RDMAOE) > continue; > > port = kzalloc(sizeof *port, GFP_KERNEL); >diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c >index 4f5096d..21c78f5 100644 >--- a/drivers/infiniband/core/ucm.c >+++ b/drivers/infiniband/core/ucm.c >@@ -1240,13 +1240,19 @@ static void ib_ucm_add_one(struct ib_device *device) > { > struct ib_ucm_device *ucm_dev; > int i; >+ enum rdma_transport_type tt; > > if (!device->alloc_ucontext || device->node_type == RDMA_NODE_IB_SWITCH) > return; > >- for (i = 1; i <= device->phys_port_cnt; ++i) >- if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB) >- return; >+ for (i = 1; i <= device->phys_port_cnt; ++i) { >+ tt = rdma_port_get_transport(device, i); >+ if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) >+ break; >+ } >+ >+ if (i > device->phys_port_cnt) >+ return; > > ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); > if (!ucm_dev) nit: There's a slight change in logic here. Previously, the cm/ucm added a device only if all ports were the correct type. Now, they add a device if any port is the correct type. 
In practice, this shouldn't be an issue, but other code in the cm/ucm is assuming that all ports on the device are usable. From hnrose at comcast.net Thu Aug 6 11:19:28 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Thu, 6 Aug 2009 14:19:28 -0400 Subject: [ofa-general] [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided Message-ID: <20090806181928.GA21698@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c index 2505c46..a22b936 100644 --- a/opensm/opensm/osm_ucast_file.c +++ b/opensm/opensm/osm_ucast_file.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -136,7 +137,7 @@ static int do_ucast_file_load(void *context) OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE, "LFTs file name is not given; " "using default routing algorithm\n"); - return 1; + return -1; } file = fopen(file_name, "r"); From hnrose at comcast.net Thu Aug 6 11:23:15 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Thu, 6 Aug 2009 14:23:15 -0400 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash.c: In lash_core, return status -1 for all errors Message-ID: <20090806182315.GB21698@comcast.net> In lash_process, rename variable from return_status to status Also, status is not really IB_SUCCESS or not (although that works) Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index b3107f0..96bfebb 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -795,7 +795,7 @@ static int lash_core(lash_t * p_lash) int stop = 0, output_link, i_next_switch; int output_link2, i_next_switch2; int 
cycle_found2 = 0; - int status = 0; + int status = -1; int *switch_bitmap = NULL; /* Bitmap to check if we have processed this pair */ unsigned start_vl = p_lash->p_osm->subn.opt.lash_start_vl; @@ -810,7 +810,6 @@ static int lash_core(lash_t * p_lash) shortest_path(p_lash, i); if (generate_routing_func_for_mst(p_lash, i, &dests)) { - status = -1; OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D06: " "generate_routing_func_for_mst failed\n"); goto Exit; @@ -951,10 +950,10 @@ static int lash_core(lash_t * p_lash) OSM_LOG(p_log, OSM_LOG_INFO, "Lanes in layer %d: %d\n", i, p_lash->num_mst_in_lane[i]); + status = 0; goto Exit; Error_Not_Enough_Lanes: - status = -1; OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D02: " "Lane requirements (%d) exceed available lanes (%d)" " with starting lane (%d)\n", @@ -1222,7 +1221,7 @@ static int lash_process(void *context) { lash_t *p_lash = context; osm_log_t *p_log = &p_lash->p_osm->log; - int return_status = IB_SUCCESS; + int status = 0; OSM_LOG_ENTER(p_log); @@ -1231,18 +1230,18 @@ static int lash_process(void *context) /* everything starts here */ lash_cleanup(p_lash); - return_status = discover_network_properties(p_lash); - if (return_status != IB_SUCCESS) + status = discover_network_properties(p_lash); + if (status) goto Exit; - return_status = init_lash_structures(p_lash); - if (return_status != IB_SUCCESS) + status = init_lash_structures(p_lash); + if (status) goto Exit; process_switches(p_lash); - return_status = lash_core(p_lash); - if (return_status != IB_SUCCESS) + status = lash_core(p_lash); + if (status) goto Exit; populate_fwd_tbls(p_lash); @@ -1252,7 +1251,7 @@ Exit: free_lash_structures(p_lash); OSM_LOG_EXIT(p_log); - return return_status; + return status; } static lash_t *lash_create(osm_opensm_t * p_osm) From arlin.r.davis at intel.com Thu Aug 6 11:39:40 2009 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 6 Aug 2009 11:39:40 -0700 Subject: [ofa-general] [ANNOUNCE] uDAPL v2.0 - dapl-2.0.21 release Message-ID: 
<1039212EEA944CE5A17E8C5ACFC9276E@amr.corp.intel.com>

New release for uDAPL 2.0 available on the OFA download page and in my git tree.

md5sum: 7874571e984c9d8ab315dcd90bfd7c44 dapl-2.0.21.tar.gz

Summary of changes:

v2 - scm: Fix disconnect. QP's need to move to ERROR state in
v2 - dtest: modify dtest.c to cleanup CNO wait code and consolidate into
v2 - common: CNO events, once triggered will not be returned during the cno wait.
v2 - scm, cma: CNO support broken in both CMA and SCM providers.
v2 - common osd: include winsock2.h for IPv6 definitions.
v2 - common osd: include w2tcpip.h for sockaddr_in6 definitions.
v2 - common: direct_wait objects pushed down to provider layer
v2 - dapltest: Implement a malloc() threshold for the completion reaping.
v2 - scm: handle connected state when freeing CM objects
v2 - scm, dtest: changes for winof gettimeofday and FD_SETSIZE settings.
v2 - scm: set TCP_NODELAY sockopt on the server side for sends.
v2 - windows: remove obsolete files in dapl/udapl source tree
v2 - dtestcm: add UD type QP option to test
v2 - scm: destroy QP called before disconnect
v2 - cma: add support for rdma_cm TIME_WAIT event.
v2 - scm: remove old udapl_scm code replaced by openib_scm.
v2 - winof: fix build issues after consolidating cma, scm code base.
v2 - cma: lock held when exiting as a result of a rdma_create_event_channel failure
v2 - windows: all dlist functions have been moved to the header file.
v2 - dtestcm windows: add build infrastructure for new dtestcm test suite
v2 - openib_common: reorganize provider code base to share common mem, cq, qp, dto
v2 - scm: fixes and optimizations for connection scaling
v2 - scm: double the default fd_set_size
v2 - scm: EP reference in CR should be cleared during ep_destroy
v2 - dtestx: fix conn establishment event checking
v2 - dtestcm: new test to measure dapl connection rates.

Vlad, please pull new v2 package into OFED 1.5 and install the following:

NOTE: the reorder...
v2 first and then v1 dapl-2.0.21-1 dapl-utils-2.0.21-1 dapl-devel-2.0.21-1 dapl-debuginfo-2.0.21-1 compat-dapl-1.2.14-1 compat-dapl-devel-1.2.14-1 See http://www.openfabrics.org/downloads/dapl/ more details. -arlin From bart.vanassche at gmail.com Thu Aug 6 11:46:25 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Thu, 6 Aug 2009 20:46:25 +0200 Subject: [ofa-general] IB kernel modules and the kobject release() method In-Reply-To: References: Message-ID: On Thu, Aug 6, 2009 at 7:37 PM, Roland Dreier wrote: > >  > >  > After having enabled CONFIG_DEBUG_KOBJECT=y in the kernel config I >  > noticed that messages appeared in the kernel log about the IB modules >  > missing a kobject release() method. This happens both with a vanilla >  > 2.6.30.4 kernel and with a 2.6.27.29 kernel + OFED 1.4.1. Has anyone >  > noticed this before ? >  > >  > An example of the messages logged in /var/log/messages: >  > >  > ... >  > kobject: 'ib_cm' (ffffffffa067a810): does not have a release() >  > function, it is broken and must be fixed. > > I don't see anything similar with CONFIG_DEBUG_KOBJECT enabled on > 2.6.31-rc5 (without adding in any OFED confusion). > > It seems as if you get this message for every module being loaded; do > you see it for any non-RDMA-related modules?  (Do you have any such > modules in your config?)  I can imagine the OFED build system messing > things up, but if you're just building the modules as part of the normal > kernel build (ie your vanilla 2.6.30 kernel) then I don't see anything > that would make ib_cm or iw_cm any different from any other module. 
>
> For example, if I load ib_cm on my kernel, the only log messages I see
> from "dmesg|grep ib_cm" are:
>
>    kobject: 'ib_cm' (ffffffffa024c8f0): kobject_add_internal: parent: 'module', set: 'module'
>    kobject: 'holders' (ffff88022c1c9df8): kobject_add_internal: parent: 'ib_cm', set: ''
>    kobject: 'ib_cm' (ffffffffa024c8f0): kobject_uevent_env
>    kobject: 'ib_cm' (ffffffffa024c8f0): fill_kobj_path: path = '/module/ib_cm'
>    kobject: 'notes' (ffff88022c1c9be8): kobject_add_internal: parent: 'ib_cm', set: ''

Just to be sure that I'm working with the vanilla 2.6.30.4 kernel
drivers and not with the OFED drivers, I ran the following commands
before any IB modules were loaded:

rm -rf /lib/modules/$(uname -r)
cd /usr/src/linux-2.6.30.4
make modules_install

Next I started (/etc/init.d/openibd start; /etc/init.d/opensmd start)
and then stopped (/etc/init.d/opensmd stop; /etc/init.d/openibd stop)
the IB subsystem. The "broken" message was logged during module unload
only, not during module load. The "broken" message was also logged for
the following non-IB kernel modules: snd_seq_dummy, snd_pcm_oss,
snd_mixer_oss, snd_seq, snd_seq_device and scsi_tgt.

Bart.

From rdreier at cisco.com Thu Aug 6 12:22:02 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Aug 2009 12:22:02 -0700
Subject: [ofa-general] IB kernel modules and the kobject release() method
In-Reply-To: (Bart Van Assche's message of "Thu, 6 Aug 2009 20:46:25 +0200")
References:
Message-ID:

> Next I started (/etc/init.d/openibd start; /etc/init.d/opensmd start)
> and then stopped (/etc/init.d/opensmd stop; /etc/init.d/openibd stop)
> the IB subsystem. The "broken" message was logged during module unload
> only, not during module load.

Oh I see... yes I get it on unload too, for any module. Seems like a
shortcoming in the kobject debugging code.

- R.
From bart.vanassche at gmail.com Thu Aug 6 12:29:43 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Thu, 6 Aug 2009 21:29:43 +0200 Subject: [ofa-general] IB kernel modules and the kobject release() method In-Reply-To: References: Message-ID: On Thu, Aug 6, 2009 at 9:22 PM, Roland Dreier wrote: > >  > Next I started (/etc/init.d/openibd start; /etc/init.d/opensmd start) >  > and then stopped (/etc/init.d/opensmd stop; /etc/init.d/openibd stop) >  > the IB subsystem. The "broken" message was logged during module unload >  > only, not during module load. > > Oh I see... yes I get it on unload too, for any module.  Seems like a > shortcoming in the kobject debugging code. Are you sure that this indicates a shortcoming in the kobject debugging code ? The most recent messages related to the message "does not have a release() function, it is broken and must be fixed" I could find on the LKML date from July 16, 2009 (http://lkml.org/lkml/2009/7/16/306 and http://lkml.org/lkml/2009/7/16/391). As you can see Greg KH acknowledges that if this message is logged that this indicates a problem that should be fixed. Bart. From ralph.campbell at qlogic.com Thu Aug 6 12:48:10 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 6 Aug 2009 12:48:10 -0700 Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries Message-ID: <1249588090.23162.940.camel@chromite.mv.qlogic.com> When ib_send_bw is run in bi-directional mode (-b), it doesn't create enough completion queue entries for both the send *and* the receive completions. Thus, CQ entries are lost due to the queue being full and the test can hang. 
Signed-off-by: Ralph Campbell diff --git a/send_bw.c b/send_bw.c index f842fb9..d5c4e63 100755 --- a/send_bw.c +++ b/send_bw.c @@ -489,7 +489,8 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, } } - ctx->cq = ibv_create_cq(ctx->context, ctx->rx_depth, NULL, ctx->channel, 0); + ctx->cq = ibv_create_cq(ctx->context, ctx->tx_depth + ctx->rx_depth, + NULL, ctx->channel, 0); if (!ctx->cq) { fprintf(stderr, "Couldn't create CQ\n"); return NULL; From rdreier at cisco.com Thu Aug 6 12:58:40 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Aug 2009 12:58:40 -0700 Subject: [ofa-general] IB kernel modules and the kobject release() method In-Reply-To: (Bart Van Assche's message of "Thu, 6 Aug 2009 21:29:43 +0200") References: Message-ID: > Are you sure that this indicates a shortcoming in the kobject > debugging code ? The most recent messages related to the message "does > not have a release() function, it is broken and must be fixed" I could > find on the LKML date from July 16, 2009 > (http://lkml.org/lkml/2009/7/16/306 and > http://lkml.org/lkml/2009/7/16/391). As you can see Greg KH > acknowledges that if this message is logged that this indicates a > problem that should be fixed. I'm not sure -- I just assume that the core module unloading code is working OK, since it is so heavily tested. If there were really a "must be fixed" problem with module unloading then someone would surely have hit more than a warning message. - R. 
From hal.rosenstock at gmail.com Thu Aug 6 13:45:25 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 6 Aug 2009 16:45:25 -0400 Subject: [ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches In-Reply-To: References: <20090730232848.GA22660@comcast.net> <20090804152700.GF7993@me> <20090804201505.GI7993@me> <20090805134352.GS7993@me> <20090805163140.GW7993@me> Message-ID: On Wed, Aug 5, 2009 at 1:07 PM, Hal Rosenstock wrote: > > > On Wed, Aug 5, 2009 at 12:31 PM, Sasha Khapyorsky wrote: > >> On 10:43 Wed 05 Aug , Hal Rosenstock wrote: >> > >> > Should this be done as a separate step on the way to the LFT >> parallelization >> > across switches ? >> >> What do you mean by "separate step" (separate from what)? > > > Separate patches: first to move the osm_ucast_mgr_set_fwd_table call up a > level and a second one to the implement the LFT parallelization across > switches underneath that. > > >> >> >> I'm trying to replay the idea again: each routing engine calculates LFTs >> and fill sw->new_lfts array accordingly, after all it calls a procedure >> for sending switches' LFT blocks (and TOPs). So routing engine itself >> should not care about how exactly LFT blocks update MADs submission is >> actually implemented. >> > > > Yes, understood. > The one issue which gets in the way a bit here is the port order list (only applicable to certain engines and not others). Due to this, there are two places where the FT MAD pushing occurs. It'll be clearer when I submit the patch for this. One other thing I ran into (and related to the osm_ucast_file.c patch I sent a little while ago is the significance of > 0 returns from build_fwd_tables. Is there a reason that a routing engine would want to run its build_fwd_tables and then run the default one ? That seems to be what it does. It might be useful to document the status returns from build_lid_matrices and build_fwd_tables. 
-- Hal

>
> -- Hal
>
>
>>
>> Sasha
>>
>
>

From liranl at mellanox.co.il Thu Aug 6 14:04:49 2009
From: liranl at mellanox.co.il (Liran Liss)
Date: Fri, 7 Aug 2009 00:04:49 +0300
Subject: [ofa-general] [PATCHv4 10/10] mlx4: Add RDMAoE support - allow interfaces to correspond to each other
In-Reply-To:
References: <20090805083023.GK5599@mtls03> <20090805204259.GB16677@obsidianresearch.com> <20090806171840.GA32301@mtls03>
Message-ID: <2ED289D4E09FBD4D92D911E869B97FDD50DF98@mtlexch01.mtl.com>

> > What about multicast though? Switches are going to have trouble with
> > group membership lists for non IP packets.. Even just sending a ICMPv6
> > packet (with an IPv6 ethertype) isn't guaranteed to fix it.

> In this patch set, all multicast packets use the broadcast mac. We
> will address this issue at a future time.

I don't see how you can address it in the future -- if later on things
are changed to use multicast addresses, then systems running this code
will silently fail to receive multicasts.

- R.

We initially intended to defer this to a separate patch set for brevity,
but I understand your point. We will work out a solution and resend.

Thanks,
--Liran

From sashak at voltaire.com Thu Aug 6 14:06:01 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 7 Aug 2009 00:06:01 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_mesh.c: Remove edges in lash matrix
In-Reply-To: <20090805220613.GA7155@comcast.net>
References: <20090805220613.GA7155@comcast.net>
Message-ID: <20090806210601.GD7993@me>

Hi Hal,

On 18:06 Wed 05 Aug , Hal Rosenstock wrote:
>
> +
> +static void print_axis(lash_t *p_lash, int sw, int port)
> +{
> +        mesh_node_t *node = p_lash->switches[sw]->node;
> +        char *name = p_lash->switches[sw]->p_sw->p_node->print_desc;
> +        int c = node->axes[port];
> +
> +        printf("%s[%d] = ", name, port);
> +        if (c)
> +                printf("%s%c -> ", ((c - 1) & 1) ?
"-" : "+", 'X' + (c - 1)/2); > + else > + printf("N/A -> "); > + printf("%s\n", > + p_lash->switches[node->links[port]->switch_id]->p_sw->p_node->print_desc); > } > > /* > @@ -805,6 +864,11 @@ static void seed_axes(lash_t *p_lash, int sw) > } > } > > + for (i = 0; i < n; i++) { > + printf("seed: "); > + print_axis(p_lash, sw, i); > + } Please remove debug prints or move it to use osm_log(). Sasha From jgunthorpe at obsidianresearch.com Thu Aug 6 14:12:31 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 6 Aug 2009 15:12:31 -0600 Subject: [ofa-general] Setting the rate in Infiniband. In-Reply-To: References: Message-ID: <20090806211231.GG16677@obsidianresearch.com> On Wed, Aug 05, 2009 at 08:03:04PM -0400, Ashwath Narasimhan wrote: > The reason why I need such small rates is because I interface the > Infiniband HCA to an FPGA via an Infiniband physical link. Imagine > the FPGA as a simple repeater that simply forwards the infiniband > signals to the Target HCA. The FPGA cannot handle such a high data > rate and neither do I have as much memory as required to buffer it > on the FPGA (I might drop packets if the buffer becomes full). Hence > I wish to limit the rate to say 100Mbps instead of 2.5Gbps. The correct thing to do is manage the flow control credits you are giving to the IB network so you don't loose packets. 
Jason From jgunthorpe at obsidianresearch.com Thu Aug 6 14:19:29 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 6 Aug 2009 15:19:29 -0600 Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries In-Reply-To: <1249588090.23162.940.camel@chromite.mv.qlogic.com> References: <1249588090.23162.940.camel@chromite.mv.qlogic.com> Message-ID: <20090806211929.GH16677@obsidianresearch.com> On Thu, Aug 06, 2009 at 12:48:10PM -0700, Ralph Campbell wrote: > When ib_send_bw is run in bi-directional mode (-b), it doesn't > create enough completion queue entries for both the send *and* > the receive completions. Thus, CQ entries are lost due to the > queue being full and the test can hang. Is this on IB? I thought the required behavior on CQ exhaustion was for the sendq to halt and incoming recvs to return RNR (same as recvq exhaustion) - it shouldn't just hang, and CQ entries should never be lost. Clearly the patch is right, but the consequences you describe seem wrong.. Jason From sean.hefty at intel.com Thu Aug 6 14:37:19 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Aug 2009 14:37:19 -0700 Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries In-Reply-To: <1249588090.23162.940.camel@chromite.mv.qlogic.com> References: <1249588090.23162.940.camel@chromite.mv.qlogic.com> Message-ID: <6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com> >- ctx->cq = ibv_create_cq(ctx->context, ctx->rx_depth, NULL, ctx->channel, >0); >+ ctx->cq = ibv_create_cq(ctx->context, ctx->tx_depth + ctx->rx_depth, >+ NULL, ctx->channel, 0); I'm looking at a windows port of this test, but at least there, rx_depth is set to rx_depth + tx_depth. 
From ralph.campbell at qlogic.com Thu Aug 6 14:46:44 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 06 Aug 2009 14:46:44 -0700 Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries In-Reply-To: <6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com> References: <1249588090.23162.940.camel@chromite.mv.qlogic.com> <6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com> Message-ID: <1249595204.23162.951.camel@chromite.mv.qlogic.com> On Thu, 2009-08-06 at 14:37 -0700, Sean Hefty wrote: > >- ctx->cq = ibv_create_cq(ctx->context, ctx->rx_depth, NULL, ctx->channel, > >0); > >+ ctx->cq = ibv_create_cq(ctx->context, ctx->tx_depth + ctx->rx_depth, > >+ NULL, ctx->channel, 0); > > I'm looking at a windows port of this test, but at least there, rx_depth is set > to rx_depth + tx_depth. Sure. Just above the call to ibv_create_cq(), ctx->rx_depth is set to ctx->rx_depth = rx_depth + tx_depth but the rest of the code does ibv_post_send() and ibv_post_recv() based on ctx->tx_depth and ctx->rx_depth which means the CQ needs to be ctx->tx_depth + ctx->rx_depth big. From sean.hefty at intel.com Thu Aug 6 14:56:08 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Aug 2009 14:56:08 -0700 Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries In-Reply-To: <1249595204.23162.951.camel@chromite.mv.qlogic.com> References: <1249588090.23162.940.camel@chromite.mv.qlogic.com> <6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com> <1249595204.23162.951.camel@chromite.mv.qlogic.com> Message-ID: >Sure. Just above the call to ibv_create_cq(), ctx->rx_depth is set to > ctx->rx_depth = rx_depth + tx_depth >but the rest of the code does ibv_post_send() and ibv_post_recv() >based on ctx->tx_depth and ctx->rx_depth which means the CQ needs >to be ctx->tx_depth + ctx->rx_depth big. If the tx_depth is the same on both sides, why would there ever be more than the initial tx_depth and rx_depth completions on the CQ? 
How many receive completions can there be on the CQ, and what throttles
the sender?

- Sean

From ralph.campbell at qlogic.com Thu Aug 6 15:04:27 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 06 Aug 2009 15:04:27 -0700
Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries
In-Reply-To:
References: <1249588090.23162.940.camel@chromite.mv.qlogic.com> <6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com> <1249595204.23162.951.camel@chromite.mv.qlogic.com>
Message-ID: <1249596267.23162.956.camel@chromite.mv.qlogic.com>

On Thu, 2009-08-06 at 14:56 -0700, Sean Hefty wrote:
> >Sure. Just above the call to ibv_create_cq(), ctx->rx_depth is set to
> > ctx->rx_depth = rx_depth + tx_depth
> >but the rest of the code does ibv_post_send() and ibv_post_recv()
> >based on ctx->tx_depth and ctx->rx_depth which means the CQ needs
> >to be ctx->tx_depth + ctx->rx_depth big.
>
> If the tx_depth is the same on both sides, why would there ever be more than the
> initial tx_depth and rx_depth completions on the CQ? How many receive
> completions can there be on the CQ, and what throttles the sender?
>
> - Sean

Remember that this fix only affects the bi-directional test.
Both client and server are going to post ctx->rx_depth receives
and ctx->tx_depth sends and then check for completions.
It won't post more sends or receives until the completions are seen.

From hnrose at comcast.net Thu Aug 6 15:34:17 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 6 Aug 2009 18:34:17 -0400
Subject: [ofa-general] [PATCHv2] opensm/osm_mesh.c: Remove edges in lash matrix
Message-ID: <20090806223417.GA2997@comcast.net>

The intent of this change is to remove edge nodes (by "not counting" them).
The point of this heuristic is to deal with the case of small lattices
which can easily have more surface than interior, which leads to choosing
a non-representative seed. This causes impossible counts to get reported.
Signed-off-by: Robert Pearson Signed-off-by: Hal Rosenstock --- Changes since v1: Replaced printfs with OSM_LOG calls diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c index 72a9aa9..174bd7e 100644 --- a/opensm/opensm/osm_mesh.c +++ b/opensm/opensm/osm_mesh.c @@ -170,6 +170,11 @@ static const struct mesh_info { {8, {2, 2, 2, 2, 2, 2, 2, 2}, 8, {-1792, -6144, -8960, -7168, -3360, -896, -112, 0, 1}, }, + /* + * mesh errors + */ + {2, {6, 6}, 4, {-192, -256, -80, 0, 1}, }, + {-1, {0,}, 0, {0, }, }, }; @@ -727,6 +732,42 @@ done: } /* + * remove_edges + * + * remove type from nodes that have fewer links + * than adjacent nodes + */ +static void remove_edges(lash_t *p_lash) +{ + osm_log_t *p_log = &p_lash->p_osm->log; + int sw; + mesh_node_t *n, *nn; + int i; + + OSM_LOG_ENTER(p_log); + + for (sw = 0; sw < p_lash->num_switches; sw++) { + n = p_lash->switches[sw]->node; + if (!n->type) + continue; + + for (i = 0; i < n->num_links; i++) { + nn = p_lash->switches[n->links[i]->switch_id]->node; + + if (nn->num_links > n->num_links) { + OSM_LOG(p_log, OSM_LOG_DEBUG, + "removed edge switch %s\n", + p_lash->switches[sw]->p_sw->p_node->print_desc); + n->type = -1; + break; + } + } + } + + OSM_LOG_EXIT(p_log); +} + +/* * get_local_geometry * * analyze the local geometry around each switch @@ -735,6 +776,7 @@ static int get_local_geometry(lash_t *p_lash, mesh_t *mesh) { osm_log_t *p_log = &p_lash->p_osm->log; int sw; + int status = 0; OSM_LOG_ENTER(p_log); @@ -747,15 +789,38 @@ static int get_local_geometry(lash_t *p_lash, mesh_t *mesh) continue; if (get_switch_metric(p_lash, sw)) { - OSM_LOG_EXIT(p_log); - return -1; + status = -1; + goto Exit; } - classify_switch(p_lash, mesh, sw); classify_mesh_type(p_lash, sw); } + remove_edges(p_lash); + + for (sw = 0; sw < p_lash->num_switches; sw++) { + if (p_lash->switches[sw]->node->type < 0) + continue; + classify_switch(p_lash, mesh, sw); + } + +Exit: OSM_LOG_EXIT(p_log); - return 0; + return status; +} + +static void 
print_axis(lash_t *p_lash, char *p, int sw, int port) +{ + mesh_node_t *node = p_lash->switches[sw]->node; + char *name = p_lash->switches[sw]->p_sw->p_node->print_desc; + int c = node->axes[port]; + + p += sprintf(p, "%s[%d] = ", name, port); + if (c) + p += sprintf(p, "%s%c -> ", ((c - 1) & 1) ? "-" : "+", 'X' + (c - 1)/2); + else + p += sprintf(p, "N/A -> "); + p += sprintf(p, "%s\n", + p_lash->switches[node->links[port]->switch_id]->p_sw->p_node->print_desc); } /* @@ -773,6 +838,7 @@ static void seed_axes(lash_t *p_lash, int sw) mesh_node_t *node = p_lash->switches[sw]->node; int n = node->num_links; int i, j, c; + char buf[256], *p; OSM_LOG_ENTER(p_log); if (!node->matrix || !node->dimension) @@ -805,6 +871,12 @@ static void seed_axes(lash_t *p_lash, int sw) } } + for (i = 0; i < n; i++) { + p = buf; + print_axis(p_lash, p, sw, i); + OSM_LOG(p_log, OSM_LOG_INFO, "%s", buf); + } + done: OSM_LOG_EXIT(p_log); } @@ -878,6 +950,12 @@ static void make_geometry(lash_t *p_lash, int sw) n = s1->node->num_links; /* + * ignore chain fragments + */ + if (n < seed->node->num_links && n <= 2) + continue; + + /* * only process 'mesh' switches */ if (!s1->node->matrix) @@ -908,7 +986,8 @@ static void make_geometry(lash_t *p_lash, int sw) if (j == i) continue; - if (s1->node->matrix[i][j] != 2) { + if (s1->node->matrix[i][j] != 2 && + s1->node->matrix[i][j] <= 4) { if (s1->node->axes[j]) { if (s1->node->axes[j] != opposite(seed, s1->node->axes[i])) { OSM_LOG(p_log, OSM_LOG_DEBUG, "phase 1 mismatch\n"); From sean.hefty at intel.com Thu Aug 6 15:40:21 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Aug 2009 15:40:21 -0700 Subject: [ofa-general] [PATCH] ib_send_bw -b can hang due to too few CQ entries In-Reply-To: <1249596267.23162.956.camel@chromite.mv.qlogic.com> References: <1249588090.23162.940.camel@chromite.mv.qlogic.com> <6A54579325764F8189F00C9E2E95BBEC@amr.corp.intel.com> <1249595204.23162.951.camel@chromite.mv.qlogic.com> 
<1249596267.23162.956.camel@chromite.mv.qlogic.com> Message-ID: <8103A2A3D9FB46AC85A3520AECC8897B@amr.corp.intel.com> >Remember that this fix only affects the bi-directional test. >Both client and server are going to post ctx->rx_depth receives >and ctx->tx_depth sends and then check for completions. >It won't post more sends or receives until the completions are >seen. Okay - I think I understand what's happening. The maximum number of outstanding sends is limited to tx_depth / 2. After posting that many sends, the code waits for completions. Once some sends complete, additional sends may be posted, up to the iteration count. There's nothing that coordinates posting the sends with completing receives on the remote side. (This is what I was missing.) Eventually, all posted receives could be complete and generate CQ entries. The send side is basically throttled by RNR NACKs. Now I don't understand the purpose behind doubling the rx_depth... - Sean From weiny2 at llnl.gov Thu Aug 6 16:01:07 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 6 Aug 2009 16:01:07 -0700 Subject: [ofa-general] [PATCH] libibmad: make accessors function for retry values used in libibmad Message-ID: <20090806160107.83193923.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 6 Aug 2009 15:27:30 -0700 Subject: [PATCH] libibmad: make accessors function for retry values used in libibmad In addition use this function to determine the retries used throughout the library.
Signed-off-by: Ira Weiny --- libibmad/include/infiniband/mad.h | 1 + libibmad/src/libibmad.map | 1 + libibmad/src/mad.c | 5 +++++ libibmad/src/mad_internal.h | 1 + libibmad/src/rpc.c | 12 +++--------- 5 files changed, 11 insertions(+), 9 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index c5d73d5..0d0dcf1 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -804,6 +804,7 @@ MAD_EXPORT void mad_rpc_set_timeout(struct ibmad_port *port, int timeout); MAD_EXPORT int mad_rpc_class_agent(struct ibmad_port *srcport, int cls); MAD_EXPORT int mad_get_timeout(struct ibmad_port *srcport, int override_ms); +MAD_EXPORT int mad_get_retries(struct ibmad_port *srcport); /* register.c */ diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index a8605b5..b9a890c 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -71,6 +71,7 @@ IBMAD_1.3 { mad_rpc_set_retries; mad_rpc_set_timeout; mad_get_timeout; + mad_get_retries; madrpc; madrpc_def_timeout; madrpc_init; diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c index bc64a0f..7192dd6 100644 --- a/libibmad/src/mad.c +++ b/libibmad/src/mad.c @@ -70,6 +70,11 @@ int mad_get_timeout(struct ibmad_port *srcport, int override_ms) srcport->timeout ? srcport->timeout : madrpc_timeout); } +int mad_get_retries(struct ibmad_port *srcport) +{ + return (srcport->retries ? 
srcport->retries : madrpc_retries); +} + void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, void *data) { int is_resp = rpc->method & IB_MAD_RESPONSE; diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h index 7a16a46..475adfc 100644 --- a/libibmad/src/mad_internal.h +++ b/libibmad/src/mad_internal.h @@ -44,5 +44,6 @@ struct ibmad_port { extern struct ibmad_port *ibmp; extern int madrpc_timeout; +extern int madrpc_retries; #endif /* _MAD_INTERNAL_H_ */ diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index bb83114..b5e4441 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -53,9 +53,9 @@ struct ibmad_port *ibmp = &mad_port; static int iberrs; +int madrpc_retries = MAD_DEF_RETRIES; int madrpc_timeout = MAD_DEF_TIMEOUT_MS; -static int madrpc_retries = MAD_DEF_RETRIES; static void *save_mad; static int save_mad_len = 256; @@ -211,7 +211,6 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, { int status, len; uint8_t sndbuf[1024], rcvbuf[1024], *mad; - int retries; int redirect = 1; while (redirect) { @@ -221,12 +220,10 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) return NULL; - retries = port->retries ? port->retries : madrpc_retries; - if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, port->class_agents[rpc->mgtclass], len, mad_get_timeout(port, rpc->timeout), - retries)) < 0) { + mad_get_retries(port))) < 0) { IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); return NULL; } @@ -267,7 +264,6 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, { int status, len; uint8_t sndbuf[1024], rcvbuf[1024], *mad; - int retries; memset(sndbuf, 0, umad_size() + IB_MAD_SIZE); @@ -276,12 +272,10 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) return NULL; - retries = port->retries ? 
port->retries : madrpc_retries; - if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, port->class_agents[rpc->mgtclass], len, mad_get_timeout(port, rpc->timeout), - retries)) < 0) { + mad_get_retries(port))) < 0) { IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); return NULL; } -- 1.5.4.5 From weiny2 at llnl.gov Thu Aug 6 16:01:06 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 6 Aug 2009 16:01:06 -0700 Subject: [ofa-general] [PATCH] libibmad: make accessors function for timeout values used in libibmad Message-ID: <20090806160106.4725041e.weiny2@llnl.gov> Sasha, In using mad_send_via and mad_receive_via I have found getting the timeout and retry values from the mad layer to be beneficial. This and the patch that follows export functions to get those values as well as standardize the use of them internally. Ira From: Ira Weiny Date: Mon, 27 Jul 2009 13:48:17 -0700 Subject: [PATCH] libibmad: make accessors function for timeout values used in libibmad In addition use this function to determine the timeout to be used throughout the library. 
Signed-off-by: Ira Weiny --- libibmad/include/infiniband/mad.h | 3 +++ libibmad/src/libibmad.map | 1 + libibmad/src/mad.c | 8 ++++++++ libibmad/src/mad_internal.h | 1 + libibmad/src/rpc.c | 17 ++++++++--------- libibmad/src/serv.c | 8 +++++--- 6 files changed, 26 insertions(+), 12 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index ee004a9..c5d73d5 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -803,6 +803,9 @@ MAD_EXPORT void mad_rpc_set_retries(struct ibmad_port *port, int retries); MAD_EXPORT void mad_rpc_set_timeout(struct ibmad_port *port, int timeout); MAD_EXPORT int mad_rpc_class_agent(struct ibmad_port *srcport, int cls); +MAD_EXPORT int mad_get_timeout(struct ibmad_port *srcport, int override_ms); + + /* register.c */ MAD_EXPORT int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 1462064..a8605b5 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -70,6 +70,7 @@ IBMAD_1.3 { mad_rpc_class_agent; mad_rpc_set_retries; mad_rpc_set_timeout; + mad_get_timeout; madrpc; madrpc_def_timeout; madrpc_init; diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c index 8defabd..bc64a0f 100644 --- a/libibmad/src/mad.c +++ b/libibmad/src/mad.c @@ -44,6 +44,8 @@ #include #include +#include "mad_internal.h" + #undef DEBUG #define DEBUG if (ibdebug) IBWARN @@ -62,6 +64,12 @@ uint64_t mad_trid(void) return next; } +int mad_get_timeout(struct ibmad_port *srcport, int override_ms) +{ + return (override_ms ? override_ms : + srcport->timeout ? 
srcport->timeout : madrpc_timeout); +} + void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, void *data) { int is_resp = rpc->method & IB_MAD_RESPONSE; diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h index 24418cc..7a16a46 100644 --- a/libibmad/src/mad_internal.h +++ b/libibmad/src/mad_internal.h @@ -43,5 +43,6 @@ struct ibmad_port { }; extern struct ibmad_port *ibmp; +extern int madrpc_timeout; #endif /* _MAD_INTERNAL_H_ */ diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index c6fd392..bb83114 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -53,8 +53,9 @@ struct ibmad_port *ibmp = &mad_port; static int iberrs; +int madrpc_timeout = MAD_DEF_TIMEOUT_MS; + static int madrpc_retries = MAD_DEF_RETRIES; -static int madrpc_timeout = MAD_DEF_TIMEOUT_MS; static void *save_mad; static int save_mad_len = 256; @@ -210,7 +211,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, { int status, len; uint8_t sndbuf[1024], rcvbuf[1024], *mad; - int timeout, retries; + int retries; int redirect = 1; while (redirect) { @@ -220,13 +221,12 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, if ((len = mad_build_pkt(sndbuf, rpc, dport, 0, payload)) < 0) return NULL; - timeout = rpc->timeout ? rpc->timeout : - port->timeout ? port->timeout : madrpc_timeout; retries = port->retries ? 
port->retries : madrpc_retries; if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, port->class_agents[rpc->mgtclass], - len, timeout, retries)) < 0) { + len, mad_get_timeout(port, rpc->timeout), + retries)) < 0) { IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); return NULL; } @@ -267,7 +267,7 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, { int status, len; uint8_t sndbuf[1024], rcvbuf[1024], *mad; - int timeout, retries; + int retries; memset(sndbuf, 0, umad_size() + IB_MAD_SIZE); @@ -276,13 +276,12 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, if ((len = mad_build_pkt(sndbuf, rpc, dport, rmpp, data)) < 0) return NULL; - timeout = rpc->timeout ? rpc->timeout : - port->timeout ? port->timeout : madrpc_timeout; retries = port->retries ? port->retries : madrpc_retries; if ((len = _do_madrpc(port->port_id, sndbuf, rcvbuf, port->class_agents[rpc->mgtclass], - len, timeout, retries)) < 0) { + len, mad_get_timeout(port, rpc->timeout), + retries)) < 0) { IBWARN("_do_madrpc failed; dport (%s)", portid2str(dport)); return NULL; } diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c index c9a093a..fad1e5b 100644 --- a/libibmad/src/serv.c +++ b/libibmad/src/serv.c @@ -73,7 +73,8 @@ int mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, } if (umad_send(srcport->port_id, srcport->class_agents[rpc->mgtclass], - umad, IB_MAD_SIZE, rpc->timeout, 0) < 0) { + umad, IB_MAD_SIZE, mad_get_timeout(srcport, rpc->timeout), + 0) < 0) { IBWARN("send failed; %m"); return -1; } @@ -155,7 +156,7 @@ int mad_respond_via(void *umad, ib_portid_t * portid, uint32_t rstatus, if (umad_send (srcport->port_id, srcport->class_agents[rpc.mgtclass], umad, - IB_MAD_SIZE, rpc.timeout, 0) < 0) { + IB_MAD_SIZE, mad_get_timeout(srcport, rpc.timeout), 0) < 0) { DEBUG("send failed; %m"); return -1; } @@ -174,7 +175,8 @@ void *mad_receive_via(void *umad, int timeout, struct ibmad_port *srcport) int agent; int length = IB_MAD_SIZE; - 
if ((agent = umad_recv(srcport->port_id, mad, &length, timeout)) < 0) { + if ((agent = umad_recv(srcport->port_id, mad, &length, + mad_get_timeout(srcport, timeout))) < 0) { if (!umad) umad_free(mad); DEBUG("recv failed: %m"); -- 1.5.4.5 From weiny2 at llnl.gov Thu Aug 6 18:37:16 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 6 Aug 2009 18:37:16 -0700 Subject: [ofa-general] [PATCH] opensm/complib: account for nsec overflow in timeout values Message-ID: <20090806183716.c08bbea3.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 6 Aug 2009 18:31:46 -0700 Subject: [PATCH] opensm/complib: account for nsec overflow in timeout values Signed-off-by: Ira Weiny --- opensm/complib/cl_event.c | 8 +++++--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/opensm/complib/cl_event.c b/opensm/complib/cl_event.c index d14b2f4..4bc8d37 100644 --- a/opensm/complib/cl_event.c +++ b/opensm/complib/cl_event.c @@ -148,9 +148,11 @@ cl_event_wait_on(IN cl_event_t * const p_event, } else { /* Get the current time */ if (gettimeofday(&curtime, NULL) == 0) { - timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000); - timeout.tv_nsec = - (curtime.tv_usec + (wait_us % 1000000)) * 1000; + uint32_t n_sec = (curtime.tv_usec + (wait_us % 1000000)) + * 1000; + timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000) + + (n_sec / 1000000000); + timeout.tv_nsec = n_sec % 1000000000; wait_ret = pthread_cond_timedwait(&p_event->condvar, &p_event->mutex, -- 1.5.4.5 From eli at dev.mellanox.co.il Thu Aug 6 20:26:29 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Fri, 7 Aug 2009 06:26:29 +0300 Subject: [ofa-general] [PATCHv4 01/10] ib_core: Refine device personality from node type to port type In-Reply-To: <69AD3F21660945D4B2D30109F8FC7A55@amr.corp.intel.com> References: <20090805082808.GB5599@mtls03> <73235A80972A43A0A54C09DBA44CA41C@amr.corp.intel.com> <20090806172035.GB32301@mtls03> <69AD3F21660945D4B2D30109F8FC7A55@amr.corp.intel.com> Message-ID: <20090807032629.GA20589@mtls03> On Thu,
Aug 06, 2009 at 10:34:19AM -0700, Sean Hefty wrote: > > Does the implementation allow this? Right now PDs, CQs, etc are allocated per > device, not per port. I'm not immediately concerned about QP failover. > However, I believe there needs to be some level of coordination between the > Infiniband side of the CM and the Ethernet side of the CM, since QPs are > associated with CA GUIDs. I'm just trying to understand the impact of this > coordination. > There is nothing in the implementation to prevent it. We did not see a reason to. The ports share a common node GUID but each one has its own GIDs. From eli at dev.mellanox.co.il Thu Aug 6 20:29:01 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Fri, 7 Aug 2009 06:29:01 +0300 Subject: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for RDMAoE ports In-Reply-To: <376E5C8569F4456FBDD942F907DF919A@amr.corp.intel.com> References: <20090805082910.GE5599@mtls03> <376E5C8569F4456FBDD942F907DF919A@amr.corp.intel.com> Message-ID: <20090807032901.GB20589@mtls03> On Thu, Aug 06, 2009 at 11:05:47AM -0700, Sean Hefty wrote: > > Is there a need to expose QP1 to user space? The CM is in the kernel, and > there's not an SA. > Good point. There seems to be no reason to expose it. Will fix. 
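Regarding the "[PATCH] opensm/complib: account for nsec overflow in timeout values" posting above: what follows is only a minimal user-space sketch, not the complib code, of how an absolute timeout can be normalized so that tv_nsec stays below one second. The helper name abs_timeout() is hypothetical, and the sketch assumes wait_us is a 32-bit microsecond count as in the patch. The carry into tv_sec is the integer quotient of the nanosecond sum by 1000000000, and tv_nsec keeps the remainder.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/*
 * Hypothetical helper (not part of complib): convert "now + wait_us"
 * into an absolute struct timespec whose tv_nsec is below one second.
 */
static struct timespec abs_timeout(const struct timeval *now, uint32_t wait_us)
{
	struct timespec ts;
	uint32_t nsec;

	/* gettimeofday() is expected to hand back a normalized tv_usec. */
	assert(now->tv_usec >= 0 && now->tv_usec < 1000000);

	/* At most (999999 + 999999) * 1000, which fits in 32 unsigned bits. */
	nsec = ((uint32_t)now->tv_usec + (wait_us % 1000000)) * 1000;

	/* Whole seconds of wait_us plus the carry out of the nanosecond sum. */
	ts.tv_sec = now->tv_sec + (wait_us / 1000000) + (nsec / 1000000000);
	/* The remainder is always in the range 0..999999999. */
	ts.tv_nsec = nsec % 1000000000;
	return ts;
}
```

The normalization matters because POSIX lists EINVAL for pthread_cond_timedwait() when the nanosecond field of the absolute time is negative or not less than 1000 million.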
From eli at dev.mellanox.co.il Thu Aug 6 20:36:05 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Fri, 7 Aug 2009 06:36:05 +0300 Subject: [ofa-general] [PATCHv4 03/10] ib_core: RDMAoE support only QP1 In-Reply-To: <9E17A35942B547BDBAE2E56101287CBC@amr.corp.intel.com> References: <20090805082854.GD5599@mtls03> <9E17A35942B547BDBAE2E56101287CBC@amr.corp.intel.com> Message-ID: <20090807033605.GC20589@mtls03> On Thu, Aug 06, 2009 at 10:52:34AM -0700, Sean Hefty wrote: > >+ /* Validate device and port */ > >+ port_priv = ib_get_mad_port(device, port_num); > >+ if (!port_priv) { > >+ ret = ERR_PTR(-ENODEV); > >+ goto error1; > >+ } > >+ > >+ if (!port_priv->qp_info[qp_type].qp) > >+ return NULL; > > It seems odd that the first if has 'goto error1', but the second if simply > returns NULL. > The original intention was to release the caller from the need to decide whether to register the mad agent or not and so the NULL returned would not be treated as error. Thinking it over I realize that it would be better to let the caller decide (according to the port protocol) whether or not to register the mad agent. Will fix. > >@@ -556,6 +559,9 @@ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) > > struct ib_mad_agent_private *mad_agent_priv; > > struct ib_mad_snoop_private *mad_snoop_priv; > > > >+ if (!mad_agent) > >+ return 0; > > Why would a kernel client call ib_unregister_mad_agent with a NULL pointer? > Same as above. Goes away after the fix. > > > > cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; > >+ has_smi = rdma_port_get_transport(device, port_num) == > RDMA_TRANSPORT_IB; > >+ if (has_smi) > >+ cq_size *= 2; > > cq_size is doubled twice > This is a bug - I'll fix it - thanks. > I really wish there were a cleaner way to add this support that didn't involve > adding so many checks throughout the code. It's hard to know if checks were > added in all the places that were needed. I can't think of a clever way to > handle QP 0. 
The fix discussed above will eliminate a good portion of these checks. From bart.vanassche at gmail.com Fri Aug 7 00:26:33 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Fri, 7 Aug 2009 09:26:33 +0200 Subject: [ofa-general] IB kernel modules and the kobject release() method In-Reply-To: References: Message-ID: On Thu, Aug 6, 2009 at 9:58 PM, Roland Dreier wrote: > >  > Are you sure that this indicates a shortcoming in the kobject >  > debugging code ? The most recent messages related to the message "does >  > not have a release() function, it is broken and must be fixed" I could >  > find on the LKML date from July 16, 2009 >  > (http://lkml.org/lkml/2009/7/16/306 and >  > http://lkml.org/lkml/2009/7/16/391). As you can see Greg KH >  > acknowledges that if this message is logged that this indicates a >  > problem that should be fixed. > > I'm not sure -- I just assume that the core module unloading code is > working OK, since it is so heavily tested.  If there were really a "must > be fixed" problem with module unloading then someone would surely have > hit more than a warning message. (added Greg KH and the LKML in CC) I tried to look up more information about kobjects. The comment of commit 7a6a41615bfb2f03ce797bc24104c50b42c935e5 suggests that in the past the function kobject_cleanup() did not free the memory allocated for static kobject names but that this was the responsibility of the release() function. This should have been fixed in the current version of kobject_cleanup(). So I'm wondering whether the message that kobjects that do not have a release() function are broken still makes sense ? See also * http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a6a41615bfb2f03ce797bc24104c50b42c935e5. * http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.30.y.git;a=blob;f=lib/kobject.c Bart. 
From bart.vanassche at gmail.com Fri Aug 7 01:31:18 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Fri, 7 Aug 2009 10:31:18 +0200 Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in the SRP initiator In-Reply-To: References: <4A7A949B.60408@panasas.com> Message-ID: On Thu, Aug 6, 2009 at 7:41 PM, Roland Dreier wrote: > >  > Specifically scmnd->host_scribble can just be Zero. > > I see at last, thanks! > > The issue is that SRP is using host_scribble to hold an index, and index > 0 is valid for us. > > I guess the fix is a bit complex, but basically we should use > host_scribble to point to the request, and if we don't find a request in > reset_device we should allocate one. A fix like the one below ? --- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c 2009-08-03 12:13:11.000000000 +0200 +++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c 2009-08-07 10:23:27.000000000 +0200 @@ -1371,16 +1371,27 @@ out: return -1; } +/** + * Look up the struct srp_request that has been associated with the specified + * SCSI command by srp_queuecommand(). + * + * Returns 0 upon success and -1 upon failure. + */ static int srp_find_req(struct srp_target_port *target, struct scsi_cmnd *scmnd, struct srp_request **req) { - if (scmnd->host_scribble == (void *) -1L) - return -1; + /* + * The code below will only work if SRP_RQ_SIZE is a power of two, + * so check this first. + */ + BUILD_BUG_ON((SRP_RQ_SIZE ^ (SRP_RQ_SIZE - 1)) + != (SRP_RQ_SIZE | (SRP_RQ_SIZE - 1))); - *req = &target->req_ring[(long) scmnd->host_scribble]; + *req = &target->req_ring[(long)scmnd->host_scribble + & (SRP_RQ_SIZE - 1)]; - return 0; + return (*req)->scmnd == scmnd ? 
0 : -1; } static int srp_abort(struct scsi_cmnd *scmnd) @@ -1423,8 +1434,15 @@ static int srp_reset_device(struct scsi_ if (target->qp_in_error) return FAILED; - if (srp_find_req(target, scmnd, &req)) - return FAILED; + if (srp_find_req(target, scmnd, &req)) { + /* + * scmnd has not yet been queued -- queue it now. This can + * happen e.g. when a SG_SCSI_RESET ioctl has been issued. + */ + if (srp_queuecommand(scmnd, scmnd->scsi_done) + || srp_find_req(target, scmnd, &req)) + return FAILED; + } if (srp_send_tsk_mgmt(target, req, SRP_TSK_LUN_RESET)) return FAILED; if (req->tsk_status) From bart.vanassche at gmail.com Fri Aug 7 02:58:11 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Fri, 7 Aug 2009 11:58:11 +0200 Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected In-Reply-To: References: Message-ID: On Thu, Aug 6, 2009 at 7:56 PM, Roland Dreier wrote: > >  > After having applied this patch it took somewhat longer before a >  > locking inversion report was generated, but unfortunately there still >  > was a locking inversion report generated (see also >  > http://bugzilla.kernel.org/show_bug.cgi?id=13757 for the details): > > ummm, yikes... > > can you apply the hack patch I sent originally to take priv->lock from > an interrupt ASAP and try that along with the fix patch to drop > priv->lock before calling ipoib_send()?  That might make the lockdep > trace understandable. The lockdep report I obtained this morning with a 2.6.30.4 kernel and the two patches applied has been attached to the kernel bugzilla entry. This lockdep report was generated while testing the SRPT target software. I have double checked that the SRPT target implementation does not hold any spinlocks or mutexes while calling functions in the IB core. This means that the SRPT target code cannot have caused any of the reported lock cycles. 
By the way, I noticed that while many subsystems in the Linux kernel use event queues to report information to higher software layers, that the IB core makes extensive use of callback functions. The combination of nested locking and callback functions can easily lead to lock inversion. This effect is well known in the operating system world -- see e.g. the talk by John Ousterhout about multithreaded versus event-driven software (http://home.pacbell.net/ouster/threads.pdf, 1996). ========================================================= [ INFO: possible irq lock inversion dependency detected ] 2.6.30.4-scst-debug #2 --------------------------------------------------------- [ ... ] stack backtrace: Pid: 26040, comm: cc1 Not tainted 2.6.30.4-scst-debug #2 Call Trace: [] print_irq_inversion_bug+0x14c/0x1c0 [] check_usage_forwards+0x7d/0xc0 [] mark_lock+0x20f/0x6a0 [] ? check_usage_forwards+0x0/0xc0 [] __lock_acquire+0xce4/0x1c80 [] ? trace_hardirqs_off+0xd/0x10 [] ? release_console_sem+0x1e5/0x230 [] ? vprintk+0x2e9/0x480 [] lock_acquire+0x108/0x150 [] ? ib_cm_notify+0x102/0x2c0 [ib_cm] [] _spin_lock_irqsave+0x41/0x60 [] ? ib_cm_notify+0x102/0x2c0 [ib_cm] [] ib_cm_notify+0x102/0x2c0 [ib_cm] [] srpt_qp_event+0x4e/0x140 [ib_srpt] [] mlx4_ib_qp_event+0x7a/0xf0 [mlx4_ib] [] mlx4_qp_event+0x6f/0xe0 [mlx4_core] [] mlx4_eq_int+0x289/0x2e0 [mlx4_core] [] mlx4_msi_x_interrupt+0x6a/0x90 [mlx4_core] [] handle_IRQ_event+0x95/0x200 [] handle_edge_irq+0xc8/0x170 [] handle_irq+0x1f/0x30 [] do_IRQ+0x6e/0xf0 [] ret_from_intr+0x0/0xf <6> Bart. 
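To make the callback-versus-event-queue remark above concrete, here is a small single-threaded sketch. It is an illustration under stated assumptions, not IB core code: struct core, toy_lock and all function names are hypothetical, and the toy lock merely stands in for a non-recursive spinlock by aborting on re-acquisition. The producer only records events while holding the lock; handlers run after the lock has been dropped, so a handler is free to take locks of its own, including the producer's.

```c
#include <assert.h>
#include <stddef.h>

/* Toy non-recursive lock (assumption: stand-in for a spinlock). */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = 0; }

#define MAX_EVENTS 8

struct core {
	struct toy_lock lock;
	int pending[MAX_EVENTS];
	size_t n_pending;
	int handled;	/* sum of event values seen by the handler */
};

/* Upper-layer handler: takes the core lock itself. */
static void handle_event(struct core *c, int ev)
{
	toy_lock_acquire(&c->lock);
	c->handled += ev;
	toy_lock_release(&c->lock);
}

/* Producer: only queues the event while holding the lock. */
static int core_post_event(struct core *c, int ev)
{
	int ok = 0;

	toy_lock_acquire(&c->lock);
	if (c->n_pending < MAX_EVENTS) {
		c->pending[c->n_pending++] = ev;
		ok = 1;
	}
	toy_lock_release(&c->lock);
	return ok;
}

/* Copy the queued events out under the lock, run handlers after dropping it. */
static size_t core_dispatch(struct core *c)
{
	int batch[MAX_EVENTS];
	size_t i, n;

	toy_lock_acquire(&c->lock);
	n = c->n_pending;
	for (i = 0; i < n; i++)
		batch[i] = c->pending[i];
	c->n_pending = 0;
	toy_lock_release(&c->lock);

	for (i = 0; i < n; i++)
		handle_event(c, batch[i]);
	return n;
}
```

Had core_post_event() called handle_event() directly while still holding the lock, the toy lock's assertion would fire; with real locks the equivalent outcome is a deadlock or a lock-ordering report. The sketch does not reproduce the specific dependency chain in the lockdep output above.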
From vlad at lists.openfabrics.org Fri Aug 7 03:07:20 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 7 Aug 2009 03:07:20 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090807-0200 daily build status Message-ID: <20090807100720.50569E61CFA@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one': /home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class' /home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of 
function 'dev_set_name' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory 
`/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090807-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hnrose at comcast.net Fri Aug 7 04:08:11 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 7 Aug 2009 07:08:11 -0400 Subject: [ofa-general] [PATCHv3] opensm: Parallelize (Stripe) LFT sets across switches Message-ID: <20090807110811.GA23431@comcast.net> Currently, MADs are pipelined to a single switch at a time which effectively serializes these requests due to processing at the SMA. 
This patch pipelines (stripes) them across the switches first before proceeding with successive blocks. As a result of this striping, multiple switches can process the set and respond concurrently, which improves the subnet initialization time. All unicast routing protocols are updated for this. A similar subsequent change will do this for MFTs.

Yevgeny Kliteynik wrote:

With a small cluster of 17 IS4 switches and 11 HCAs, and to artificially increase the cluster size, LMC of 7 was used, including EnhancedSwitchPort 0 LMC. With the new code, LFT configuration is more than twice as fast as with the old code :)

The current ucast manager ran on average for ~250 msec; with the new code it ran for 110-120 msec. The routing calculation phase of the ucast manager took ~1200 usec; the rest was sending the blocks and waiting for no more pending transactions.

Here are some detailed results of different executions (the number on the left is the timer value in usec):

Current ucast manager (w/o the optimization):
000000 [LFT]: osm_ucast_mgr_process() - START
001131 [LFT]: ucast_mgr_process_tbl() - START
032251 [LFT]: ucast_mgr_process_tbl() - END
032263 [LFT]: osm_ucast_mgr_process() - END
253416 [LFT]: Done wait_for_pending_transactions()

New algorithm:
001417 [LFT]: osm_ucast_mgr_process() - START
002690 [LFT]: ucast_mgr_process_tbl() - START
032946 [LFT]: ucast_mgr_process_tbl() - END
032948 [LFT]: osm_ucast_pipeline_tbl() - START
033846 [LFT]: osm_ucast_pipeline_tbl() - END
033858 [LFT]: osm_ucast_mgr_process() - END
108203 [LFT]: Done wait_for_pending_transactions()

With IS3 based Qlogic switches, which do not handle DR packet forwarding in HW, with a fabric of ~1100 HCAs and ~280 switches:
Current OSM configures LFTs in ~2 seconds.
New algorithm does the same job in 1.4-1.6 seconds (a 20%-30% speedup).
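
The striping order described above can be sketched in isolation. This is only an illustrative model, not the patch itself: `toy_switch`, `stripe_blocks`, `record_send` and `send_log` are hypothetical names, each switch is reduced to a count of blocks it needs, and the real code additionally skips blocks that are already up to date.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a switch: only the block count matters here. */
struct toy_switch {
	unsigned num_blocks;	/* LFT blocks this switch needs written */
};

/* Hypothetical callback invoked for every (switch, block) send. */
typedef void (*send_fn)(size_t sw_index, unsigned block_id, void *ctx);

/*
 * Striped order: block 0 goes to every switch that has it, then block 1,
 * and so on, until a full pass over the switches sends nothing.
 * Returns the total number of sends.
 */
unsigned stripe_blocks(const struct toy_switch *sws, size_t n,
		       send_fn send, void *ctx)
{
	unsigned block_id = 0, sent = 0;

	for (;;) {
		unsigned sws_notdone = 0;

		for (size_t i = 0; i < n; i++) {
			if (block_id >= sws[i].num_blocks)
				continue;	/* this switch is finished */
			if (send)
				send(i, block_id, ctx);
			sent++;
			sws_notdone++;
		}
		if (!sws_notdone)
			break;
		block_id++;
	}
	return sent;
}

/* Small recorder so the resulting order can be inspected. */
#define MAX_LOG 64
struct send_log {
	size_t sw[MAX_LOG];
	unsigned block[MAX_LOG];
	unsigned count;
};

void record_send(size_t sw_index, unsigned block_id, void *ctx)
{
	struct send_log *rec = ctx;

	if (rec->count < MAX_LOG) {
		rec->sw[rec->count] = sw_index;
		rec->block[rec->count] = block_id;
		rec->count++;
	}
}
```

With three toy switches needing 2, 1 and 3 blocks, the sends come out as sw0/b0, sw1/b0, sw2/b0, sw0/b1, sw2/b1, sw2/b2, so no switch is handed its second block before every other switch has been handed its first.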
Signed-off-by: Hal Rosenstock --- Changes since v2: Eliminated max_smps_per_node Moved LFTs pushing up to ucast_mgr_route level from the individual routing engines Changes since v1: Added Yevgeny's performance data No change to actual patch diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index a040476..4ef045c 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -242,16 +242,12 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN struct osm_sm * sm); * * SYNOPSIS */ -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, - IN osm_switch_t * const p_sw); +void osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr); /* * PARAMETERS * p_mgr * [in] Pointer to an osm_ucast_mgr_t object. * -* p_mgr -* [in] Pointer to an osm_switch_t object. -* * SEE ALSO * Unicast Manager *********/ diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c index 216b496..30a3c1d 100644 --- a/opensm/opensm/osm_ucast_cache.c +++ b/opensm/opensm/osm_ucast_cache.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -1085,9 +1085,10 @@ int osm_ucast_cache_process(osm_ucast_mgr_t * p_mgr) memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); } - osm_ucast_mgr_set_fwd_table(p_mgr, p_sw); } + osm_ucast_mgr_set_fwd_table(p_mgr); + return 0; } diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c index 2505c46..5b73ca5 100644 --- a/opensm/opensm/osm_ucast_file.c +++ b/opensm/opensm/osm_ucast_file.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -167,9 +167,6 @@ static int do_ucast_file_load(void *context) "skipping parsing. Using default " "routing algorithm\n"); } else if (!strncmp(p, "Unicast lids", 12)) { - if (p_sw) - osm_ucast_mgr_set_fwd_table(&p_osm->sm. - ucast_mgr, p_sw); q = strstr(p, " guid 0x"); if (!q) { OSM_LOG(&p_osm->log, OSM_LOG_ERROR, @@ -220,7 +217,7 @@ static int do_ucast_file_load(void *context) return -1; } p = q; - /* additionally try to exract guid */ + /* additionally try to extract guid */ q = strstr(p, " portguid 0x"); if (!q) { OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE, @@ -246,9 +243,6 @@ static int do_ucast_file_load(void *context) } } - if (p_sw) - osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw); - fclose(file); return 0; } diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index bde6dbd..6ec6bc7 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -2,7 +2,7 @@ * Copyright (c) 2009 Simula Research Laboratory. All rights reserved. * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. 
- * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -1905,8 +1905,6 @@ static void set_sw_fwd_table(IN cl_map_item_t * const p_map_item, ftree_fabric_t *p_ftree = (ftree_fabric_t *) context; p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid; - osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, - p_sw->p_osm_sw); } /*************************************************** diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index b3107f0..0a567b3 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2007 Simula Research Laboratory. All rights reserved. * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved. @@ -990,7 +990,6 @@ static void populate_fwd_tbls(lash_t * p_lash) { osm_log_t *p_log = &p_lash->p_osm->log; osm_subn_t *p_subn = &p_lash->p_osm->subn; - osm_opensm_t *p_osm = p_lash->p_osm; osm_switch_t *p_sw, *p_next_sw, *p_dst_sw; osm_port_t *port; uint16_t max_lid_ho, lid; @@ -1054,7 +1053,6 @@ static void populate_fwd_tbls(lash_t * p_lash) physical_egress_port); } } /* for */ - osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw); } OSM_LOG_EXIT(p_log); } diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 78a7031..e28752a 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. 
- * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -315,16 +315,13 @@ Exit: /********************************************************************** **********************************************************************/ -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr, - IN osm_switch_t * p_sw) +static int set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr, IN osm_switch_t * p_sw) { osm_node_t *p_node; osm_dr_path_t *p_path; osm_madw_context_t context; ib_api_status_t status; ib_switch_info_t si; - uint16_t block_id_ho = 0; - uint8_t block[IB_SMP_DATA_SIZE]; boolean_t set_swinfo_require = FALSE; uint16_t lin_top; uint8_t life_state; @@ -382,48 +379,6 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr, ib_get_err_str(status)); } - /* - Send linear forwarding table blocks to the switch - as long as the switch indicates it has blocks needing - configuration. 
- */ - - context.lft_context.node_guid = osm_node_get_node_guid(p_node); - context.lft_context.set_method = TRUE; - - if (!p_sw->new_lft) { - /* any routing should provide the new_lft */ - CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && - p_mgr->cache_valid && !p_sw->need_update); - goto Exit; - } - - for (block_id_ho = 0; - osm_switch_get_lft_block(p_sw, block_id_ho, block); - block_id_ho++) { - if (!p_sw->need_update && !p_mgr->p_subn->need_update && - !memcmp(block, - p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, - IB_SMP_DATA_SIZE)) - continue; - - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, - "Writing FT block %u\n", block_id_ho); - - status = osm_req_set(p_mgr->sm, p_path, - p_sw->new_lft + - block_id_ho * IB_SMP_DATA_SIZE, - sizeof(block), IB_MAD_ATTR_LIN_FWD_TBL, - cl_hton32(block_id_ho), CL_DISP_MSGID_NONE, - &context); - - if (status != IB_SUCCESS) - OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: " - "Sending linear fwd. tbl. block failed (%s)\n", - ib_get_err_str(status)); - } - -Exit: OSM_LOG_EXIT(p_mgr->p_log); return 0; } @@ -508,7 +463,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item, } } - osm_ucast_mgr_set_fwd_table(p_mgr, p_sw); + set_fwd_tbl_top(p_mgr, p_sw); if (p_mgr->p_subn->opt.lmc) free_ports_priv(p_mgr); @@ -516,6 +471,101 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item, OSM_LOG_EXIT(p_mgr->p_log); } +static void ucast_mgr_process_top(IN cl_map_item_t * p_map_item, + IN void *context) +{ + osm_ucast_mgr_t *p_mgr = context; + osm_switch_t *const p_sw = (osm_switch_t *) p_map_item; + + set_fwd_tbl_top(p_mgr, p_sw); +} + +static boolean_t set_next_lft_block(IN osm_switch_t * p_sw, IN osm_sm_t * p_sm, + IN uint8_t * p_block, + IN osm_dr_path_t * p_path, + IN uint16_t block_id_ho, + IN osm_madw_context_t * p_context) +{ + ib_api_status_t status; + boolean_t sts; + + OSM_LOG_ENTER(p_sm->p_log); + + for (; + (sts = osm_switch_get_lft_block(p_sw, block_id_ho, p_block)); + block_id_ho++) { + if (!p_sw->need_update 
&& !p_sm->p_subn->need_update && + !memcmp(p_block, + p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, + IB_SMP_DATA_SIZE)) + continue; + + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, + "Writing FT block %u to switch 0x%" PRIx64 "\n", + block_id_ho, + cl_ntoh64(p_context->lft_context.node_guid)); + + status = osm_req_set(p_sm, p_path, + p_sw->new_lft + + block_id_ho * IB_SMP_DATA_SIZE, + IB_SMP_DATA_SIZE, IB_MAD_ATTR_LIN_FWD_TBL, + cl_hton32(block_id_ho), + CL_DISP_MSGID_NONE, p_context); + + if (status != IB_SUCCESS) + OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 3A05: " + "Sending linear fwd. tbl. block failed (%s)\n", + ib_get_err_str(status)); + break; + } + + OSM_LOG_EXIT(p_sm->p_log); + return sts; +} + +static boolean_t pipeline_next_lft_block(IN osm_switch_t *p_sw, + IN osm_ucast_mgr_t *p_mgr, + IN uint16_t block_id_ho) +{ + osm_dr_path_t *p_path; + osm_madw_context_t context; + uint8_t block[IB_SMP_DATA_SIZE]; + boolean_t status; + + OSM_LOG_ENTER(p_mgr->p_log); + + CL_ASSERT(p_sw && p_sw->p_node); + + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, + "Processing switch 0x%" PRIx64 "\n", + cl_ntoh64(osm_node_get_node_guid(p_sw->p_node))); + + /* + Send linear forwarding table blocks to the switch + as long as the switch indicates it has blocks needing + configuration. 
+ */ + if (!p_sw->new_lft) { + /* any routing should provide the new_lft */ + CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && + p_mgr->cache_valid && !p_sw->need_update); + status = FALSE; + goto Exit; + } + + p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0)); + + context.lft_context.node_guid = osm_node_get_node_guid(p_sw->p_node); + context.lft_context.set_method = TRUE; + + status = set_next_lft_block(p_sw, p_mgr->sm, &block[0], p_path, + block_id_ho, &context); + +Exit: + OSM_LOG_EXIT(p_mgr->p_log); + return status; +} + /********************************************************************** **********************************************************************/ static void ucast_mgr_process_neighbors(IN cl_map_item_t * p_map_item, @@ -731,7 +781,6 @@ static int ucast_mgr_setup_all_switches(osm_subn_t * p_subn) /********************************************************************** **********************************************************************/ - static int add_guid_to_order_list(void *ctx, uint64_t guid, char *p) { osm_ucast_mgr_t *m = ctx; @@ -870,6 +919,30 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t * m) add_sw_endports_to_order_list(s[i], m); } +static void ucast_mgr_pipeline_fwd_tbl(osm_ucast_mgr_t * p_mgr) +{ + cl_qmap_t *p_sw_tbl; + osm_switch_t *p_sw; + uint16_t block_id_ho = 0; + int sws_notdone; + boolean_t sts; + + p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl; + while (1) { + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); + sws_notdone = 0; + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { + sts = pipeline_next_lft_block(p_sw, p_mgr, block_id_ho); + if (sts) + sws_notdone++; + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); + } + if (!sws_notdone) + break; + block_id_ho++; + } +} + static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr) { cl_qlist_init(&p_mgr->port_order_list); @@ -904,6 +977,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr) 
cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, ucast_mgr_process_tbl, p_mgr); + ucast_mgr_pipeline_fwd_tbl(p_mgr); + cl_qlist_remove_all(&p_mgr->port_order_list); return 0; @@ -911,6 +986,16 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr) /********************************************************************** **********************************************************************/ +void osm_ucast_mgr_set_fwd_table(osm_ucast_mgr_t * p_mgr) +{ + cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, + ucast_mgr_process_top, p_mgr); + + ucast_mgr_pipeline_fwd_tbl(p_mgr); +} + +/********************************************************************** + **********************************************************************/ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t * osm) { int ret; @@ -940,6 +1025,9 @@ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t * osm) osm->routing_engine_used = osm_routing_engine_type(r->name); + if (r->ucast_build_fwd_tables) + osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr); + return 0; }

From dzieko at wcss.pl Fri Aug 7 04:25:26 2009
From: dzieko at wcss.pl (Pawel Dziekonski)
Date: Fri, 7 Aug 2009 13:25:26 +0200
Subject: [ofa-general] ib0: multicast join failed
Message-ID: <20090807112526.GD21691@cefeid.wcss.wroc.pl>

Hi,

today I got the following:

ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11

and connection to Lustre was lost.

I can ping IPoIB address of local iface, but I can't ping any remote IPoIB
address.
There is plenty of free mem so this is not the oom-killer case.
There are no other noticeable problems with this host.

Is it a hardware problem with IB iface?
regards, P # ibv_devinfo hca_id: mthca0 fw_ver: 1.2.0 node_guid: 0030:487e:0c06:0000 sys_image_guid: 0030:487e:0c06:0003 vendor_id: 0x02c9 vendor_part_id: 25204 hw_ver: 0xA0 board_id: SM_0000000003 phys_port_cnt: 1 port: 1 state: PORT_INIT (2) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 1 port_lid: 448 port_lmc: 0x00 # ofed_info OFED-1.3.1 libibverbs: git://git.openfabrics.org/ofed_1_3/libibverbs.git ofed_1_3 commit 40b771aa6a9c0ad092b2e20775b4723d3b173792 libmthca: git://git.openfabrics.org/ofed_1_3/libmthca.git ofed_1_3 commit 9501e698d257949acfab2edc90812602966dbcc9 libmlx4: git://git.openfabrics.org/ofed_1_3/libmlx4.git ofed_1_3 commit 3869d6dab7e12fe452270ca641f7dd7082b42482 libehca: git://git.openfabrics.org/ofed_1_3/libehca.git ofed_1_3 commit fd898180cfa3b737f893f432a80b91bac3396325 libipathverbs: git://git.openfabrics.org/ofed_1_3/libipathverbs.git ofed_1_3 commit 82be4d81859d1fd2edf830220fe65a9923b80a46 libcxgb3: git://git.openfabrics.org/ofed_1_3/libcxgb3.git ofed_1_3 commit 6f7485feb244d8571fcab2292ef92c97bea48df0 libnes: git://git.openfabrics.org/ofed_1_3/libnes.git ofed_1_3 commit 471fa2e5a7bb2f8946119396358c31adcc6c2fb3 libibcm: git://git.openfabrics.org/ofed_1_3/libibcm.git ofed_1_3 commit 53ec35f544bbc1838bbadc2210909c25a954a5e2 librdmacm: git://git.openfabrics.org/ofed_1_3/librdmacm.git ofed_1_3 commit a0ef80a1e0d5debdae48a844fbc8d09aec5b24b1 dapl1: git://git.openfabrics.org/ofed_1_3/dapl1.git ofed_1_3 commit 7a9b58d6c50fc0a357de540ec3eb2ab2e07f8779 dapl2: git://git.openfabrics.org/ofed_1_3/dapl2.git ofed_1_3 commit 2583f07d9d0f55eee14e0b0e6074bc6fd0712177 libsdp: git://git.openfabrics.org/ofed_1_3/libsdp.git ofed_1_3 commit c8102dccc502930442b23de658674d386456b350 sdpnetstat: git://git.openfabrics.org/ofed_1_3/sdpnetstat.git ofed_1_3 commit 3341620a7259c4f7bdd4180864b98e260c3dc223 srptools: git://git.openfabrics.org/ofed_1_3/srptools.git ofed_1_3 commit e0ce2d42eeb25f8e89b8f6daaa32a630c9b64f0d perftest: 
git://git.openfabrics.org/ofed_1_3/perftest.git ofed_1_3 commit 6321b5468f7293088cc003809049c02b176130d8 qlvnictools: git://git.openfabrics.org/ofed_1_3/qlvnictools.git ofed_1_3 commit 086f9cb80ee790d61bddaf201ecbae32a2ff21dd tvflash: git://git.openfabrics.org/ofed_1_3/tvflash.git ofed_1_3 commit f5e7407a7f2058448df5e5320d9843f944427429 mstflint: git://git.openfabrics.org/ofed_1_3/mstflint.git ofed_1_3 commit 78bbd3d521a9078553a991111ffb6f76665b9ee9 qperf: git://git.openfabrics.org/ofed_1_3/qperf.git ofed_1_3 commit 6221aabd038df0b7033e035378ca190641ed2295 management: git://git.openfabrics.org/ofed_1_3/management.git ofed_1_3 commit d9c852406dae14e8284f9cfb1c7f495bbb55fddf ibutils: git://git.openfabrics.org/ofed_1_3/ibutils.git ofed_1_3 commit 7daf94fab6eaf307316326f3f49704e6080a1508 ibsim: git://git.openfabrics.org/ofed_1_3/ibsim.git ofed_1_3 commit 55113d9f919709c7c97ea41d29991941b9c8be70 ofa_kernel-1.3.1: Git: git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel commit 39e1dc833f98e5134f91fcf7f33df402adf4bc0c # MPI mvapich-1.0.1-2533.src.rpm mvapich2-1.0.3-1.src.rpm openmpi-1.2.6-1.src.rpm mpitests-3.0-773.src.rpm -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. 
D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl

From yosefe at voltaire.com Fri Aug 7 05:04:25 2009
From: yosefe at voltaire.com (Yossi Etigin)
Date: Fri, 07 Aug 2009 15:04:25 +0300
Subject: [ofa-general] ib0: multicast join failed
In-Reply-To: <20090807112526.GD21691@cefeid.wcss.wroc.pl>
References: <20090807112526.GD21691@cefeid.wcss.wroc.pl>
Message-ID: <4A7C1849.6030500@voltaire.com>

On 07/08/09 14:25, Pawel Dziekonski wrote:
> Hi,
>
> today I got the following:
>
> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
>
> and connection to Lustre was lost.
>
> I can ping IPoIB address of local iface, but I can't ping any remote IPoIB
> address.
> There is plenty of free mem so this is not the oom-killer case.
> There are no other noticeable problems with this host.
>
> Is it a hardware problem with IB iface?
> regards, P
>

Is your SM alive?

From dzieko at wcss.pl Fri Aug 7 05:12:51 2009
From: dzieko at wcss.pl (Pawel Dziekonski)
Date: Fri, 7 Aug 2009 14:12:51 +0200
Subject: [ofa-general] ib0: multicast join failed
In-Reply-To: <4A7C1849.6030500@voltaire.com>
References: <20090807112526.GD21691@cefeid.wcss.wroc.pl> <4A7C1849.6030500@voltaire.com>
Message-ID: <20090807121251.GE21691@cefeid.wcss.wroc.pl>

On Fri, 07 Aug 2009 at 03:04:25PM +0300, Yossi Etigin wrote:
> On 07/08/09 14:25, Pawel Dziekonski wrote:
> > Hi,
> >
> > today I got the following:
> >
> > ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> > ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> >
> > and connection to Lustre was lost.
> >
> > I can ping IPoIB address of local iface, but I can't ping any remote IPoIB
> > address.
> > There is plenty of free mem so this is not the oom-killer case.
> > There are no other noticeable problems with this host.
> > > > Is it a hardware problem with IB iface? > > Is your SM alive? Well, this is a good question. My SM is on Voltaire ISR2012 switch. Today I lost contact with its web interface - I don't know why. CLI works fine. Net itself works too. So I assume that SM works. L:ISR2012-0004(utilities)# sminfo -m -e [1249654331:530476][32061] => _do_madrpc: timeout after 3 retries, 600 ms sm_lid:..........................1 sm_guid:.........................0x8f10500000007 sm_key:..........................0x0 sm_activity:.....................574044927 sm_priority:.....................14 sm_state:........................SMINFO_MASTER nodeip:.......................... nodename:........................ node_guid:.......................0x8f10500000007 devid:...........................0x5a37 vendor:..........................0x8f1 node_desc:.......................ISR2012 Voltaire sFB-2012 node_type:.......................Switch localport:.......................0 L:ISR2012-0004(utilities)# port-verify -b [1249653810:282614][26657] => _do_madrpc: timeout after 3 retries, 600 ms [1249653810:283115][26657] => madrpc: failed class 129 method 1 attr 17 DR Path: 0,18,24,13 [1249653810:283585][26657] => discover: Nodeinfo on 0,18,24,13 port 13 failed, skipping port # # Topology file: generated on Fri Aug 7 14:03:34 2009 # Printing Chassis 1 (chassis guid 0x0008f10500000004) devid=0x5a38 switchguids=0x8f104003f680a Chassis ISR2012 1 Line 9 Chip 1 Switch 24 "S-0008f104003f680a" # "ISR2012/ISR2004 Voltaire sLB-2024" smalid 192 [13][ext 13] "S-0008f10400413b08"[11] width 4X speed 5.0 Gbs errs.remphysrcv:.................6 <- Alert !!! devid=0x5a30 switchguids=0x8f104004136c0 Switch 24 "S-0008f104004136c0" # "ISR9024D Voltaire" smalid 209 [22] "S-0008f104003f680a"[22] width 4X speed 5.0 Gbs errs.remphysrcv:.................6 <- Alert !!! 
devid=0x5a30 switchguids=0x8f104004136b0 Switch 24 "S-0008f104004136b0" # "ISR9024D Voltaire" smalid 204 [13] "S-000b8cffff002cc7"[12] width 4X speed 5.0 Gbs errs.sym:........................752 <- Alert !!! [24] "S-0008f104003f680a"[19] width 4X speed 5.0 Gbs errs.sym:........................1 <- Alert !!! errs.rcv:........................1 <- Alert !!! devid=0x5a30 switchguids=0x8f10400413b08 Switch 24 "S-0008f10400413b08" # "ISR9024D Voltaire" smalid 224 [13] Alert -> Could not access this port Remote Peer. devid=0xb924 switchguids=0xb8cffff002cc7 Switch 24 "S-000b8cffff002cc7" # "MT47396 Infiniscale-III Mellanox Technologies" smalid 246 [12] "S-0008f104004136b0"[13] width 4X speed 5.0 Gbs errs.sym:........................4 <- Alert !!! devid=0x6732 hcaguids=0x2c90300031878 Hca 2 "H-0002c90300031878" # "oss1 HCA-1" [1] "S-0008f104004136c0"[9] # lid 169 lmc 0 width 4X speed 5.0 Gbs errs.remphysrcv:.................6 <- Alert !!! SUMMARY: ALARM [found - 5 bad_nodes and 6 bad_ports]. (VL15Dropped errors masked out) -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. 
D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl

From yosefe at voltaire.com Fri Aug 7 06:34:20 2009
From: yosefe at voltaire.com (Yossi Etigin)
Date: Fri, 07 Aug 2009 16:34:20 +0300
Subject: [ofa-general] ib0: multicast join failed
In-Reply-To: <20090807121251.GE21691@cefeid.wcss.wroc.pl>
References: <20090807112526.GD21691@cefeid.wcss.wroc.pl> <4A7C1849.6030500@voltaire.com> <20090807121251.GE21691@cefeid.wcss.wroc.pl>
Message-ID: <4A7C2D5C.1030005@voltaire.com>

On 07/08/09 15:12, Pawel Dziekonski wrote:
> On Fri, 07 Aug 2009 at 03:04:25PM +0300, Yossi Etigin wrote:
>> On 07/08/09 14:25, Pawel Dziekonski wrote:
>>> Hi,
>>>
>>> today I got the following:
>>>
>>> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
>>> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
>>>
>>> and connection to Lustre was lost.
>>>
>>> I can ping IPoIB address of local iface, but I can't ping any remote IPoIB
>>> address.
>>> There is plenty of free mem so this is not the oom-killer case.
>>> There are no other noticeable problems with this host.
>>>
>>> Is it a hardware problem with IB iface?
>> Is your SM alive?
>
> Well, this is a good question.
>
> My SM is on Voltaire ISR2012 switch. Today I lost contact with its web
> interface - I don't know why. CLI works fine. Net itself works too. So
> I assume that SM works.
>

I guess there is some physical problem in the fabric (cables?) because the host cannot reach the SM - the port is in INIT state.

From hnrose at comcast.net Fri Aug 7 06:43:05 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 7 Aug 2009 09:43:05 -0400
Subject: [ofa-general] [PATCH] opensm/osm_mcast_tbl.c: osm_mcast_tbl_get_block returns boolean
Message-ID: <20090807134305.GA30766@comcast.net>

so use TRUE/FALSE rather than IB_INVALID_PARAMETER

Signed-off-by: Hal Rosenstock
---
diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c
index 82850be..38c06c1 100644
--- a/opensm/opensm/osm_mcast_tbl.c
+++ b/opensm/opensm/osm_mcast_tbl.c
@@ -273,7 +273,7 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl,
 	mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
 
 	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
-		return (IB_INVALID_PARAMETER);
+		return (TRUE);
 
 	for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
 		p_block[i] = (*p_tbl->p_mask_tbl)[mlid_start_ho + i][position];

From weiny2 at llnl.gov Fri Aug 7 09:07:03 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 7 Aug 2009 09:07:03 -0700
Subject: [ofa-general] [PATCH] libibnetdisc: fix potential memory leak of port object
Message-ID: <20090807090703.2b857dea.weiny2@llnl.gov>

From: Ira Weiny
Date: Fri, 7 Aug 2009 09:05:44 -0700
Subject: [PATCH] libibnetdisc: fix potential memory leak of port object

NOTE: This moves the port allocation below the port array allocation failure check, rather than freeing the port allocation after the port array allocation fails.
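
The reordering in the note above can be illustrated with a reduced example. This is a sketch under stated assumptions, not the libibnetdisc code: `toy_node`, `toy_port` and `toy_add_port` are hypothetical names, and the range and duplicate checks are additions made only to keep the toy self-consistent.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced versions of the node and port objects. */
struct toy_port {
	int portnum;
};

struct toy_node {
	struct toy_port **ports;	/* lazily allocated, numports + 1 slots */
	int numports;
};

/*
 * Allocate the port array first and the port second. If the array
 * allocation fails, nothing else has been allocated yet, so the early
 * return cannot leak a port object.
 */
struct toy_port *toy_add_port(struct toy_node *node,
			      const struct toy_port *temp)
{
	struct toy_port *port;

	/* Assumption for this sketch: reject out-of-range port numbers. */
	if (temp->portnum < 0 || temp->portnum > node->numports)
		return NULL;

	if (node->ports == NULL) {
		node->ports = calloc(node->numports + 1,
				     sizeof(*node->ports));
		if (!node->ports)
			return NULL;	/* no port allocated yet: no leak */
	}

	/* Assumption for this sketch: do not overwrite an occupied slot. */
	if (node->ports[temp->portnum])
		return NULL;

	port = malloc(sizeof(*port));
	if (!port)
		return NULL;

	memcpy(port, temp, sizeof(*port));
	node->ports[temp->portnum] = port;
	return port;
}
```

The alternative the note mentions, keeping the original order and calling `free(port)` on the array-allocation failure path, would also close the leak; moving the allocation simply leaves one fewer cleanup path to maintain.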
Signed-off-by: Ira Weiny --- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 16 ++++++++-------- 1 files changed, 8 insertions(+), 8 deletions(-) diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index b9e89d9..27ae9f3 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -446,14 +446,6 @@ add_port_to_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd { struct ibnd_port *port; - port = malloc(sizeof(*port)); - if (!port) - return NULL; - - memcpy(port, temp, sizeof(*port)); - port->port.node = (ibnd_node_t *)node; - port->port.ext_portnum = 0; - if (node->node.ports == NULL) { node->node.ports = calloc(sizeof(*node->node.ports), node->node.numports + 1); if (!node->node.ports) { @@ -462,6 +454,14 @@ add_port_to_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd } } + port = malloc(sizeof(*port)); + if (!port) + return NULL; + + memcpy(port, temp, sizeof(*port)); + port->port.node = (ibnd_node_t *)node; + port->port.ext_portnum = 0; + node->node.ports[temp->port.portnum] = (ibnd_port_t *)port; add_to_portguid_hash(port, fabric->portstbl); -- 1.5.4.5 From hnrose at comcast.net Fri Aug 7 09:41:27 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 7 Aug 2009 12:41:27 -0400 Subject: [ofa-general] [PATCH] opensm: Parallelize (Stripe) MFT sets across switches Message-ID: <20090807164127.GA795@comcast.net> Similar to previous patch to "Parallelize (Stripe) LFT sets across switches". Currently, MADs are pipelined to a single switch first which effectively serializes these requests. This patch pipelines the MFT set MADs across switches first (before cycling to the next MFT block) so that multiple switches can be responding concurrently. Speedup is dependent on number of MFT blocks in use (number of MLIDs) which is dependent on the number of multicast groups. 
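
As a rough illustration of the per-switch state this relies on, the sketch below models the block/position cursor and the attribute modifier value. The names `mft_cursor`, `mft_cursor_advance` and `mft_attr_mod` are hypothetical; the arithmetic follows the `block_num + (position << 28)` expression and the position-then-block advance used in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-switch cursor: which MFT block and position come next. */
struct mft_cursor {
	uint32_t block_num;
	uint32_t position;
};

/* Host-order attribute modifier: position in the top 4 bits, block below. */
uint32_t mft_attr_mod(uint32_t block_num, uint32_t position)
{
	return block_num + (position << 28);
}

/*
 * Step to the next (block, position) pair for one switch: walk all
 * positions of a block first, then move on to the next block.
 */
void mft_cursor_advance(struct mft_cursor *c, uint32_t max_position)
{
	if (++c->position > max_position) {
		c->position = 0;
		c->block_num++;
	}
}
```

Keeping one such cursor per switch is what lets the outer loop visit every switch once per pass, sending a single MFT block each time, instead of draining one switch completely before touching the next.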
Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 7ce28c5..e281842 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -103,6 +103,8 @@ typedef struct osm_switch { uint8_t *lft; uint8_t *new_lft; osm_mcast_tbl_t mcast_tbl; + uint32_t mft_block_num; + uint32_t mft_position; unsigned endport_links; unsigned need_update; void *priv; diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c index 4dbbaa0..f91c6b6 100644 --- a/opensm/opensm/osm_mcast_mgr.c +++ b/opensm/opensm/osm_mcast_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. * @@ -325,15 +325,12 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw) { osm_node_t *p_node; osm_dr_path_t *p_path; - osm_madw_context_t mad_context; + osm_madw_context_t context; ib_api_status_t status; - uint32_t block_id_ho = 0; - int16_t block_num = 0; - uint32_t position = 0; - uint32_t max_position; + uint32_t block_id_ho; osm_mcast_tbl_t *p_tbl; ib_net16_t block[IB_MCAST_BLOCK_SIZE]; - int ret = 0; + int ret = -1; CL_ASSERT(sm); @@ -353,36 +350,34 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw) configuration. 
*/ - mad_context.mft_context.node_guid = osm_node_get_node_guid(p_node); - mad_context.mft_context.set_method = TRUE; + context.mft_context.node_guid = osm_node_get_node_guid(p_node); + context.mft_context.set_method = TRUE; p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); - max_position = p_tbl->max_position; - while (osm_mcast_tbl_get_block(p_tbl, block_num, - (uint8_t) position, block)) { - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, - "Writing MFT block 0x%X\n", block_id_ho); + if (p_sw->mft_position <= p_tbl->max_position && + osm_mcast_tbl_get_block(p_tbl, p_sw->mft_block_num, + (uint8_t) p_sw->mft_position, block)) { + + block_id_ho = p_sw->mft_block_num + (p_sw->mft_position << 28); - block_id_ho = block_num + (position << 28); + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, + "Writing MFT block %u position %u to switch 0x%" PRIx64 "\n", + p_sw->mft_block_num, p_sw->mft_position, + cl_ntoh64(context.lft_context.node_guid)); status = osm_req_set(sm, p_path, (void *)block, sizeof(block), IB_MAD_ATTR_MCAST_FWD_TBL, cl_hton32(block_id_ho), CL_DISP_MSGID_NONE, - &mad_context); + &context); - if (status != IB_SUCCESS) { + if (status != IB_SUCCESS) OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A02: " - "Sending multicast fwd. tbl. 
block failed (%s)\n", + "Sending MFT block failed (%s)\n", ib_get_err_str(status)); - ret = -1; - } - if (++position > max_position) { - position = 0; - block_num++; - } - } + } else + ret = 0; OSM_LOG_EXIT(sm->p_log); return ret; @@ -1077,7 +1072,8 @@ int osm_mcast_mgr_process(osm_sm_t * sm) cl_qmap_t *p_sw_tbl; cl_qlist_t *p_list = &sm->mgrp_list; osm_mgrp_t *p_mgrp; - int i, ret = 0; + osm_mcast_tbl_t *p_tbl; + int sws_notdone, i, ret = 0; OSM_LOG_ENTER(sm->p_log); @@ -1114,11 +1110,30 @@ int osm_mcast_mgr_process(osm_sm_t * sm) */ p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { - if (mcast_mgr_set_tbl(sm, p_sw)) - ret = -1; + p_sw->mft_block_num = 0; + p_sw->mft_position = 0; p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); } + while (1) { + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); + sws_notdone = 0; + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { + if (mcast_mgr_set_tbl(sm, p_sw)) + sws_notdone++; + p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); + if (++p_sw->mft_position > p_tbl->max_position) { + p_sw->mft_position = 0; + p_sw->mft_block_num++; + } + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); + } + if (!sws_notdone) { + ret = -1; + break; + } + } + while (!cl_is_qlist_empty(p_list)) { cl_list_item_t *p = cl_qlist_remove_head(p_list); free(p); @@ -1142,9 +1157,10 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm) osm_switch_t *p_sw; cl_qmap_t *p_sw_tbl; osm_mgrp_t *p_mgrp; + osm_mcast_tbl_t *p_tbl; ib_net16_t mlid; osm_mcast_mgr_ctxt_t *ctx; - int ret = 0; + int sws_notdone, ret = 0; OSM_LOG_ENTER(sm->p_log); @@ -1195,11 +1211,30 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm) p_sw_tbl = &sm->p_subn->sw_guid_tbl; p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { - if (mcast_mgr_set_tbl(sm, p_sw)) - ret = -1; + p_sw->mft_block_num = 0; + p_sw->mft_position = 0; p_sw = (osm_switch_t *) 
cl_qmap_next(&p_sw->map_item); } + while (1) { + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); + sws_notdone = 0; + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { + if (mcast_mgr_set_tbl(sm, p_sw)) + sws_notdone++; + p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); + if (++p_sw->mft_position > p_tbl->max_position) { + p_sw->mft_position = 0; + p_sw->mft_block_num++; + } + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); + } + if (!sws_notdone) { + ret = -1; + break; + } + } + osm_dump_mcast_routes(sm->p_subn->p_osm); exit: From rdreier at cisco.com Fri Aug 7 11:22:10 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Aug 2009 11:22:10 -0700 Subject: [ofa-general] Re: [PATCH] mlx4_core: map sufficient ICM memory for EQs In-Reply-To: <20090730130434.GA21428@mtls03> (Eli Cohen's message of "Thu, 30 Jul 2009 16:04:34 +0300") References: <20090730130434.GA21428@mtls03> Message-ID: Thanks, applied with a few cleanups: ilog2(roundup_pow_of_two()) -> order_base_2() xxx * (1 << yy) -> xxx << yy From hal.rosenstock at gmail.com Fri Aug 7 11:38:26 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 7 Aug 2009 14:38:26 -0400 Subject: [ofa-general] osm_link_mgr.c:link_mgr_get_smsl question Message-ID: Hi Sasha, osm_link_mgr.c:link_mgr_get_smsl has the following: /* Find osm_port of the source = p_physp */ slid = osm_physp_get_base_lid(p_physp); p_src_port = cl_ptr_vector_get(&sm->p_subn->port_lid_tbl, cl_ntoh16(slid)); /* Call lash to find proper SL */ sl = osm_get_lash_sl(p_osm, p_src_port, p_sm_port); It may be that this code is invoked prior to the LID being assigned so getting the p_src_port based on the LID yields NULL and then calling osm_get_lash_sl causes a seg fault. I can see two ways to fix this: 1. Replace with port GUID search 2. Have osm_get_lash_sl handle NULL for p_src_port Maybe you see other ways to deal with this. Do you have a preferred approach ? 
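
The second option above (letting the callee tolerate a NULL source port) can be sketched in plain userspace C. This is only an illustration under stated assumptions: mock_port_t, port_lid_tbl, mock_get_sl(), mock_get_smsl() and the DEFAULT_SL fallback are hypothetical stand-ins invented for this note, not OpenSM types or functions, and whether falling back to a default SL is the right policy is exactly the open question in the mail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; these are NOT OpenSM types. */
#define MAX_LID    8
#define DEFAULT_SL 0	/* assumed fallback SL, chosen only for this sketch */

typedef struct mock_port {
	uint16_t base_lid;	/* 0 means no LID has been assigned yet */
	uint8_t sl;		/* SL a real routing engine would compute */
} mock_port_t;

/* LID-indexed table; a slot stays NULL until a LID is assigned to a port. */
static mock_port_t *port_lid_tbl[MAX_LID];

/* Callee tolerates a NULL port instead of dereferencing it. */
static uint8_t mock_get_sl(const mock_port_t *p_src_port,
			   const mock_port_t *p_dst_port)
{
	if (p_src_port == NULL || p_dst_port == NULL)
		return DEFAULT_SL;
	return p_src_port->sl;
}

/* Caller: bounds-checked lookup by LID, which may legitimately yield NULL. */
static uint8_t mock_get_smsl(uint16_t slid, const mock_port_t *p_sm_port)
{
	const mock_port_t *p_src_port =
		slid < MAX_LID ? port_lid_tbl[slid] : NULL;

	return mock_get_sl(p_src_port, p_sm_port);
}
```

With this shape the lookup-by-LID path can run before LID assignment without crashing; the first option (searching by port GUID) would instead avoid the NULL lookup result altogether.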
-- Hal From swise at opengridcomputing.com Fri Aug 7 12:28:11 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 07 Aug 2009 14:28:11 -0500 Subject: [ofa-general] [PATCH v2 1/2] RDMA/cxgb3: Don't free the endpoint early. Message-ID: <20090807192811.14821.11554.stgit@build.ogc.int> - Keep ref on connection request endpoints until either accepted or rejected so it doesn't get freed early. - Endpoint flags now need to be set via atomic bitops because they can be set on both the iw_cxgb3 workqueue thread and user disconnect threads. - Don't move out of CLOSING too early due to multiple calls to iwch_ep_disconnect. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 52 ++++++++++++++++++--------------- drivers/infiniband/hw/cxgb3/iwch_cm.h | 9 +++--- 2 files changed, 33 insertions(+), 28 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 52d7bb0..7f22f17 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -286,7 +286,7 @@ void __free_ep(struct kref *kref) ep = container_of(container_of(kref, struct iwch_ep_common, kref), struct iwch_ep, com); PDBG("%s ep %p state %s\n", __func__, ep, states[state_read(&ep->com)]); - if (ep->com.flags & RELEASE_RESOURCES) { + if (test_bit(RELEASE_RESOURCES, &ep->com.flags)) { cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid); dst_release(ep->dst); l2t_release(L2DATA(ep->com.tdev), ep->l2t); @@ -297,7 +297,7 @@ void __free_ep(struct kref *kref) static void release_ep_resources(struct iwch_ep *ep) { PDBG("%s ep %p tid %d\n", __func__, ep, ep->hwtid); - ep->com.flags |= RELEASE_RESOURCES; + set_bit(RELEASE_RESOURCES, &ep->com.flags); put_ep(&ep->com); } @@ -786,10 +786,12 @@ static void connect_request_upcall(struct iwch_ep *ep) event.private_data_len = ep->plen; event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); event.provider_data = ep; - if (state_read(&ep->parent_ep->com) != DEAD) + if 
(state_read(&ep->parent_ep->com) != DEAD) { + get_ep(&ep->com); ep->parent_ep->com.cm_id->event_handler( ep->parent_ep->com.cm_id, &event); + } put_ep(&ep->parent_ep->com); ep->parent_ep = NULL; } @@ -1156,8 +1158,7 @@ static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) * We get 2 abort replies from the HW. The first one must * be ignored except for scribbling that we need one more. */ - if (!(ep->com.flags & ABORT_REQ_IN_PROGRESS)) { - ep->com.flags |= ABORT_REQ_IN_PROGRESS; + if (!test_and_set_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags)) { return CPL_RET_BUF_DONE; } @@ -1480,7 +1481,6 @@ static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) * rejects the CR. */ __state_set(&ep->com, CLOSING); - get_ep(&ep->com); break; case MPA_REP_SENT: __state_set(&ep->com, CLOSING); @@ -1561,8 +1561,7 @@ static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) * We get 2 peer aborts from the HW. The first one must * be ignored except for scribbling that we need one more. */ - if (!(ep->com.flags & PEER_ABORT_IN_PROGRESS)) { - ep->com.flags |= PEER_ABORT_IN_PROGRESS; + if (!test_and_set_bit(PEER_ABORT_IN_PROGRESS, &ep->com.flags)) { return CPL_RET_BUF_DONE; } @@ -1591,7 +1590,6 @@ static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) * the reference on it until the ULP accepts or * rejects the CR. 
*/ - get_ep(&ep->com); break; case MORIBUND: case CLOSING: @@ -1797,6 +1795,7 @@ int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) err = send_mpa_reject(ep, pdata, pdata_len); err = iwch_ep_disconnect(ep, 0, GFP_KERNEL); } + put_ep(&ep->com); return 0; } @@ -1810,8 +1809,10 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) struct iwch_qp *qp = get_qhp(h, conn_param->qpn); PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid); - if (state_read(&ep->com) == DEAD) - return -ECONNRESET; + if (state_read(&ep->com) == DEAD) { + err = -ECONNRESET; + goto err; + } BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); BUG_ON(!qp); @@ -1819,7 +1820,8 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) || (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) { abort_connection(ep, NULL, GFP_KERNEL); - return -EINVAL; + err = -EINVAL; + goto err; } cm_id->add_ref(cm_id); @@ -1836,8 +1838,6 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) PDBG("%s %d ird %d ord %d\n", __func__, __LINE__, ep->ird, ep->ord); - get_ep(&ep->com); - /* bind QP to EP and move to RTS */ attrs.mpa_attr = ep->mpa_attr; attrs.max_ird = ep->ird; @@ -1855,30 +1855,31 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) err = iwch_modify_qp(ep->com.qp->rhp, ep->com.qp, mask, &attrs, 1); if (err) - goto err; + goto err1; /* if needed, wait for wr_ack */ if (iwch_rqes_posted(qp)) { wait_event(ep->com.waitq, ep->com.rpl_done); err = ep->com.rpl_err; if (err) - goto err; + goto err1; } err = send_mpa_reply(ep, conn_param->private_data, conn_param->private_data_len); if (err) - goto err; + goto err1; state_set(&ep->com, FPDU_MODE); established_upcall(ep); put_ep(&ep->com); return 0; -err: +err1: ep->com.cm_id = NULL; ep->com.qp = NULL; cm_id->rem_ref(cm_id); +err: put_ep(&ep->com); return err; } @@ -2097,14 
+2098,17 @@ int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) ep->com.state = CLOSING; start_ep_timer(ep); } + set_bit(CLOSE_SENT, &ep->com.flags); break; case CLOSING: - close = 1; - if (abrupt) { - stop_ep_timer(ep); - ep->com.state = ABORTING; - } else - ep->com.state = MORIBUND; + if (!test_and_set_bit(CLOSE_SENT, &ep->com.flags)) { + close = 1; + if (abrupt) { + stop_ep_timer(ep); + ep->com.state = ABORTING; + } else + ep->com.state = MORIBUND; + } break; case MORIBUND: case ABORTING: diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h index 43c0aea..b9efadf 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -145,9 +145,10 @@ enum iwch_ep_state { }; enum iwch_ep_flags { - PEER_ABORT_IN_PROGRESS = (1 << 0), - ABORT_REQ_IN_PROGRESS = (1 << 1), - RELEASE_RESOURCES = (1 << 2), + PEER_ABORT_IN_PROGRESS = 0, + ABORT_REQ_IN_PROGRESS = 1, + RELEASE_RESOURCES = 2, + CLOSE_SENT = 3, }; struct iwch_ep_common { @@ -162,7 +163,7 @@ struct iwch_ep_common { wait_queue_head_t waitq; int rpl_done; int rpl_err; - u32 flags; + unsigned long flags; }; struct iwch_listen_ep { From swise at opengridcomputing.com Fri Aug 7 12:28:17 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 07 Aug 2009 14:28:17 -0500 Subject: [ofa-general] [PATCH v2 2/2] RDMA/cxgb3: wake up any waiters on peer close/abort. In-Reply-To: <20090807192811.14821.11554.stgit@build.ogc.int> References: <20090807192811.14821.11554.stgit@build.ogc.int> Message-ID: <20090807192817.14821.70876.stgit@build.ogc.int> A close/abort while waiting for a wr_ack during connection migration can cause a hung process in iwch_accept_cr/iwch_reject_cr. The fix is to set rpl_error/rpl_done and wake up the waiters when we get a close/abort while in MPA_REQ_RCVD state. 
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 16 ++++++++++++---- 1 files changed, 12 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 7f22f17..66b4135 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1478,9 +1478,14 @@ static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) /* * We're gonna mark this puppy DEAD, but keep * the reference on it until the ULP accepts or - * rejects the CR. + * rejects the CR. Also wake up anyone waiting + * in rdma connection migration (see iwch_accept_cr()). */ __state_set(&ep->com, CLOSING); + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); break; case MPA_REP_SENT: __state_set(&ep->com, CLOSING); @@ -1588,8 +1593,13 @@ static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) /* * We're gonna mark this puppy DEAD, but keep * the reference on it until the ULP accepts or - * rejects the CR. + * rejects the CR. Also wake up anyone waiting + * in rdma connection migration (see iwch_accept_cr()). */ + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); break; case MORIBUND: case CLOSING: @@ -1828,8 +1838,6 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) ep->com.cm_id = cm_id; ep->com.qp = qp; - ep->com.rpl_done = 0; - ep->com.rpl_err = 0; ep->ird = conn_param->ird; ep->ord = conn_param->ord; From rdreier at cisco.com Fri Aug 7 13:58:40 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Aug 2009 13:58:40 -0700 Subject: [ofa-general] [PATCH v2 2/2] RDMA/cxgb3: wake up any waiters on peer close/abort. 
In-Reply-To: <20090807192817.14821.70876.stgit@build.ogc.int> (Steve Wise's message of "Fri, 07 Aug 2009 14:28:17 -0500")
References: <20090807192811.14821.11554.stgit@build.ogc.int> <20090807192817.14821.70876.stgit@build.ogc.int>
Message-ID:

thanks for respinning, got em both.

From roel.kluin at gmail.com Fri Aug 7 14:02:34 2009
From: roel.kluin at gmail.com (Roel Kluin)
Date: Fri, 07 Aug 2009 23:02:34 +0200
Subject: [ofa-general] [PATCH] IB/mthca: Read buffer overflow
Message-ID: <4A7C966A.3040005@gmail.com>

If the QP was found in MGM in the first iteration, and we break out of
the loop, i == 0 and we read and write mgm->qp[-1].

Signed-off-by: Roel Kluin
---
Not entirely sure whether it can happen

diff --git a/drivers/infiniband/hw/mthca/mthca_mcg.c b/drivers/infiniband/hw/mthca/mthca_mcg.c
index d4c8105..fd72665 100644
--- a/drivers/infiniband/hw/mthca/mthca_mcg.c
+++ b/drivers/infiniband/hw/mthca/mthca_mcg.c
@@ -272,8 +272,10 @@ int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
                 goto out;
         }
 
-        mgm->qp[loc] = mgm->qp[i - 1];
-        mgm->qp[i - 1] = 0;
+        if (i != 0) {
+                mgm->qp[loc] = mgm->qp[i - 1];
+                mgm->qp[i - 1] = 0;
+        }
 
         err = mthca_WRITE_MGM(dev, index, mailbox, &status);
         if (err)

From rdreier at cisco.com Fri Aug 7 14:08:51 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 07 Aug 2009 14:08:51 -0700
Subject: [ofa-general] Re: [PATCH] IB/mthca: Read buffer overflow
In-Reply-To: <4A7C966A.3040005@gmail.com> (Roel Kluin's message of "Fri, 07 Aug 2009 23:02:34 +0200")
References: <4A7C966A.3040005@gmail.com>
Message-ID:

> If the QP was found in MGM in the first iteration, and we break out of
> the loop, i == 0 and we read and write mgm->qp[-1].
>
> Signed-off-by: Roel Kluin
> ---
> Not entirely sure whether it can happen

I don't think it can happen.
The loop and following code is:

        for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) {
                if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31)))
                        loc = i;
                if (!(mgm->qp[i] & cpu_to_be32(1 << 31)))
                        break;
        }

        if (loc == -1) {
                mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num);
                err = -EINVAL;
                goto out;
        }

        mgm->qp[loc] = mgm->qp[i - 1];

and you're worried that i == 0 at that last bit.

For i == 0 there, we need to break out of the loop on the first
iteration, i.e. hit

                if (!(mgm->qp[i] & cpu_to_be32(1 << 31)))
                        break;

with i == 0, meaning (mgm->qp[0] & cpu_to_be32(1 << 31)) == 0.

But to get past the loc == -1 test that returns from the function, we
must also hit the loc = i assignment on the first iteration of the
loop, so we must have

        (mgm->qp[0] == cpu_to_be32(ibqp->qp_num | (1 << 31)))

be true, which would mean in particular that mgm->qp[0] would have to
have that high order bit set. Which contradicts the conclusion we just
reached.

So the bad case of accessing index -1 can never happen just from the
structure of the code.

 - R.

From rdreier at cisco.com Fri Aug 7 14:14:03 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 07 Aug 2009 14:14:03 -0700
Subject: [ofa-general] Re: sg_reset can trigger a NULL pointer dereference in the SRP initiator
In-Reply-To: (Bart Van Assche's message of "Fri, 7 Aug 2009 10:31:18 +0200")
References: <4A7A949B.60408@panasas.com>
Message-ID:

> A fix like the one below ?

I think this gets us part of the way, but not quite.

> --- linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp-orig.c 2009-08-03
> 12:13:11.000000000 +0200
> +++ linux-2.6.30.4/drivers/infiniband/ulp/srp/ib_srp.c 2009-08-07
> 10:23:27.000000000 +0200
> @@ -1371,16 +1371,27 @@ out:
>          return -1;
>  }
>
> +/**
> + * Look up the struct srp_request that has been associated with the specified
> + * SCSI command by srp_queuecommand().
> + *
> + * Returns 0 upon success and -1 upon failure.
> + */ > static int srp_find_req(struct srp_target_port *target, > struct scsi_cmnd *scmnd, > struct srp_request **req) > { > - if (scmnd->host_scribble == (void *) -1L) > - return -1; > + /* > + * The code below will only work if SRP_RQ_SIZE is a power of two, > + * so check this first. > + */ > + BUILD_BUG_ON((SRP_RQ_SIZE ^ (SRP_RQ_SIZE - 1)) > + != (SRP_RQ_SIZE | (SRP_RQ_SIZE - 1))); could this be BUILD_BUG_ON(!is_power_of_2(SRP_RQ_SIZE)) ? > > - *req = &target->req_ring[(long) scmnd->host_scribble]; > + *req = &target->req_ring[(long)scmnd->host_scribble > + & (SRP_RQ_SIZE - 1)]; > > - return 0; > + return (*req)->scmnd == scmnd ? 0 : -1; > } > > static int srp_abort(struct scsi_cmnd *scmnd) > @@ -1423,8 +1434,15 @@ static int srp_reset_device(struct scsi_ > > if (target->qp_in_error) > return FAILED; > - if (srp_find_req(target, scmnd, &req)) > - return FAILED; > + if (srp_find_req(target, scmnd, &req)) { > + /* > + * scmnd has not yet been queued -- queue it now. This can > + * happen e.g. when a SG_SCSI_RESET ioctl has been issued. > + */ > + if (srp_queuecommand(scmnd, scmnd->scsi_done) > + || srp_find_req(target, scmnd, &req)) > + return FAILED; I don't think we can just pass the command to srp_queuecommand() here. For one thing queuecommand requires some locking, and second, we don't actually want to queue the command -- in fact I'm not sure it is set up properly with an opcode etc to execute the command. What I think needs to happen is we need to allocate a request for the command the same way srp_queuecommand() does, and in fact maybe that code could be factored out to avoid duplication. -R . From mdidomenico4 at gmail.com Fri Aug 7 14:37:55 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 7 Aug 2009 17:37:55 -0400 Subject: [ofa-general] sun x4100 with IB Message-ID: I have several Sun x4100 with Infiniband servers which appear to be running at 400MB/sec instead of 800MB/sec. 
It's a freshly reformatted cluster converting from solaris to linux.
We also reset the bios settings with "load optimal defaults". Does
anyone know which bios setting I changed to dump the BW?

x4100
mellanox ib
ofed-1.4.1-rc6 w/ openmpi

From robertacummins at gmail.com Fri Aug 7 15:37:51 2009
From: robertacummins at gmail.com (Robert Cummins)
Date: Fri, 07 Aug 2009 16:37:51 -0600
Subject: [ofa-general] sun x4100 with IB
In-Reply-To:
References:
Message-ID: <1249684671.13945.68.camel@rockymtn.cumminsconsultants.com>

Can you send the output from lspci -vvv? What card are you using? Is
it an Infinihost III SDR card? What does ibdiagnet -lw 4x -ls 5 return?

On Fri, 2009-08-07 at 17:37 -0400, Michael Di Domenico wrote:
> I have several Sun x4100 with Infiniband servers which appear to be
> running at 400MB/sec instead of 800MB/sec. It's a freshly reformatted
> cluster converting from solaris to linux. We also reset the bios
> settings with "load optimal defaults". Does anyone know which bios
> setting I changed to dump the BW?
>
> x4100
> mellanox ib
> ofed-1.4.1-rc6 w/ openmpi
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From gregkh at suse.de Fri Aug 7 20:48:17 2009
From: gregkh at suse.de (Greg KH)
Date: Fri, 7 Aug 2009 20:48:17 -0700
Subject: [ofa-general] IB kernel modules and the kobject release() method
In-Reply-To:
References:
Message-ID: <20090808034817.GA30697@suse.de>

On Fri, Aug 07, 2009 at 09:26:33AM +0200, Bart Van Assche wrote:
> On Thu, Aug 6, 2009 at 9:58 PM, Roland Dreier wrote:
> >
> >  > Are you sure that this indicates a shortcoming in the kobject
> >  > debugging code ?
The most recent messages related to the message "does
> >  > not have a release() function, it is broken and must be fixed" I could
> >  > find on the LKML date from July 16, 2009
> >  > (http://lkml.org/lkml/2009/7/16/306 and
> >  > http://lkml.org/lkml/2009/7/16/391). As you can see Greg KH
> >  > acknowledges that if this message is logged that this indicates a
> >  > problem that should be fixed.
> >
> > I'm not sure -- I just assume that the core module unloading code is
> > working OK, since it is so heavily tested.  If there were really a "must
> > be fixed" problem with module unloading then someone would surely have
> > hit more than a warning message.
>
> (added Greg KH and the LKML in CC)
>
> I tried to look up more information about kobjects. The comment of
> commit 7a6a41615bfb2f03ce797bc24104c50b42c935e5 suggests that in the
> past the function kobject_cleanup() did not free the memory allocated
> for static kobject names but that this was the responsibility of the
> release() function. This should have been fixed in the current version
> of kobject_cleanup(). So I'm wondering whether the message that
> kobjects that do not have a release() function are broken still makes
> sense ?
No, it still makes sense :) thanks, greg k-h From vlad at lists.openfabrics.org Sat Aug 8 03:00:39 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 8 Aug 2009 03:00:39 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090808-0200 daily build status Message-ID: <20090808100039.D9115E28264@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one': /home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class' 
/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090808-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From mdidomenico4 at gmail.com Sat Aug 8 07:22:20 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Sat, 8 Aug 2009 10:22:20 -0400 Subject: [ofa-general] sun x4100 with IB In-Reply-To: <1249684671.13945.68.camel@rockymtn.cumminsconsultants.com> References: <1249684671.13945.68.camel@rockymtn.cumminsconsultants.com> 
Message-ID:

Yes, it's an InfiniHost III. I believe it's MT23208, but don't quote me
on that; I'm not at the machine currently.

Is there something specific you want to see in lspci -vvv? I can't
easily cut and paste from the machine.

On Fri, Aug 7, 2009 at 6:37 PM, Robert Cummins wrote:
> Can you send the output from lspci -vvv?   What card are you using?  Is
> it an Infinihost III SDR card?  What does ibdiagnet -lw 4x -ls 5 return?
>
> On Fri, 2009-08-07 at 17:37 -0400, Michael Di Domenico wrote:
>> I have several Sun x4100 with Infiniband servers which appear to be
>> running at 400MB/sec instead of 800MB/sec.  It's a freshly reformatted
>> cluster converting from solaris to linux.  We also reset the bios
>> settings with "load optimal defaults". Does anyone know which bios
>> setting I changed to dump the BW?
>>
>> x4100
>> mellanox ib
>> ofed-1.4.1-rc6 w/ openmpi

From bart.vanassche at gmail.com Sat Aug 8 10:49:22 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Sat, 8 Aug 2009 19:49:22 +0200
Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has not allocated
Message-ID:

Hello,

Has anyone ever encountered a message like the one below ? This message
was generated while booting a 2.6.30.4 kernel with
CONFIG_DMA_API_DEBUG=y and before any out-of-tree kernel modules were
loaded.
------------[ cut here ]------------ WARNING: at lib/dma-debug.c:635 check_sync+0x47c/0x4b0() Hardware name: P5Q DELUXE mlx4_core 0000:01:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x0000000139482000] [size=4096 bytes] Modules linked in: snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd rtc_cmos soundcore i2c_i801 rtc_core hid_belkin mlx4_core( +) rtc_lib sr_mod sg snd_page_alloc pcspkr button intel_agp i2c_core joydev serio_raw cdrom usbhid hid raid456 raid6_pq async_xor async_memcpy async_tx xor raid0 sd_mod crc_t10dif ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix pata_marvell ahci libata scsi_mod thermal processor thermal_sys hwmon Pid: 1325, comm: work_for_cpu Not tainted 2.6.30.4-scst-debug #6 Call Trace: [] ? check_sync+0x47c/0x4b0 [] warn_slowpath_common+0x78/0xd0 [] warn_slowpath_fmt+0x3c/0x40 [] ? _spin_lock_irqsave+0x49/0x60 [] ? check_sync+0xab/0x4b0 [] check_sync+0x47c/0x4b0 [] ? mark_held_locks+0x6c/0x90 [] debug_dma_sync_single_for_cpu+0x1d/0x20 [] mlx4_write_mtt+0x159/0x1e0 [mlx4_core] [] mlx4_create_eq+0x222/0x650 [mlx4_core] [] ? trace_hardirqs_on+0xd/0x10 [] mlx4_init_eq_table+0x1c5/0x4a0 [mlx4_core] [] mlx4_setup_hca+0x98/0x550 [mlx4_core] [] ? __mlx4_init_one+0x8d1/0x920 [mlx4_core] [] __mlx4_init_one+0x371/0x920 [mlx4_core] [] mlx4_init_one+0x22/0x44 [mlx4_core] [] ? do_work_for_cpu+0x0/0x30 [] local_pci_probe+0x12/0x20 [] do_work_for_cpu+0x13/0x30 [] kthread+0x56/0x90 [] child_rip+0xa/0x20 [] ? restore_args+0x0/0x30 [] ? kthread+0x0/0x90 [] ? child_rip+0x0/0x20 ---[ end trace 4480af29bc755c6a ]--- Bart. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vlad at lists.openfabrics.org Sun Aug 9 03:01:46 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 9 Aug 2009 03:01:46 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090809-0200 daily build status Message-ID: <20090809100146.50D46E2814C@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one': /home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class' /home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of 
function 'dev_set_name' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory 
`/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:765: warning: pointer targets in passing argument 2 of 'wait_for_sndbuf' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:783: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.c:800: warning: pointer targets in passing argument 2 of 'sdp_wait_rdmardcompl' differ in signedness make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_zcopy.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090809-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From vlad at dev.mellanox.co.il Sun Aug 9 08:49:49 2009 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 09 Aug 2009 18:49:49 +0300 Subject: [ofa-general] Re: [ANNOUNCE] uDAPL v2.0 - dapl-2.0.21 release In-Reply-To: <1039212EEA944CE5A17E8C5ACFC9276E@amr.corp.intel.com> References: <1039212EEA944CE5A17E8C5ACFC9276E@amr.corp.intel.com> Message-ID: <4A7EF01D.8090907@dev.mellanox.co.il> Arlin Davis wrote: > > Vlad, please pull new v2 
package into OFED 1.5 and install the following: > > NOTE: the reorder... v2 first and then v1 > > dapl-2.0.21-1 > dapl-utils-2.0.21-1 > dapl-devel-2.0.21-1 > dapl-debuginfo-2.0.21-1 > compat-dapl-1.2.14-1 > compat-dapl-devel-1.2.14-1 > > See http://www.openfabrics.org/downloads/dapl/ for more details. > > -arlin Done, Regards, Vladimir From worleys at gmail.com Sun Aug 9 10:09:15 2009 From: worleys at gmail.com (Chris Worley) Date: Sun, 9 Aug 2009 11:09:15 -0600 Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs Message-ID: I'm running a target comprised of: RHEL5.2/2.6.18-92.el5 (fresh off the CD, never updated) and its embedded IB stack (not the latest OFED) w/ SCST rev 1029 8-Aug-2009 ("svn info"). I'm running a W2008S (fully patched) initiator w/ MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453. Using Mellanox QDR cards/switch. Writes over SRP, as measured from the initiator using IOMeter, get proper performance (i.e. 1.2GB/s). Reads get about 30% performance (i.e. 500MB/s instead of 1.6GB/s). And while reading, IOMeter eventually hangs the system (Windows becomes unresponsive to GUI interaction). In this state, I see iostat reporting transfers at the same low read rate from the target... so there's IB traffic, but, given IOMeter's tasks are 10 minutes each, it acts like a "skipping record" (sorry if you young folks don't know what that is... but I can't think of another way to describe it), never moving on to the next benchmark, just endlessly repeating the same I/O over and over again. If I unload then reload the mlx4_ib driver on the target, then the Windows system quickly returns, but IOMeter remains hung and needs to be killed. So, I have a lot of experimentation to do on the target in 1) upgrading the target or changing the distro altogether and 2) using OFED instead of built-in IB stack on the target to try to see if I can budge this issue. 
But, I was wondering if somebody might have a hint on this _or_ have a known target distro/kernel setup that works reliably w/ Windows-based SRP initiators. Thanks, Chris From landman at scalableinformatics.com Sun Aug 9 10:26:29 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sun, 09 Aug 2009 13:26:29 -0400 Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs In-Reply-To: References: Message-ID: <4A7F06C5.7030203@scalableinformatics.com> Chris Worley wrote: > I'm running a target comprised of: RHEL5.2/2.6.18-92.el5 (fresh off > the CD.. never updated) and it's embedded IB stack (not the latest > OFED) w/ SCST rev 1029 8-Aug-2009 ("svn info"). > > I'm running a W2008S (fully patched) initiator w/ > MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453. > > Using Mellanox QDR cards/switch. > > Writes over SRP, as measured from the initiator using IOMeter, get > proper performance (i.e. 1.2GB/s). > > Reads get about 30% performance (i.e. 500MB/s instead of 1.6GB/s). Chris: What is the backing store capable of? That is, if you are doing, say dd's streaming from disk, what rate do you see? Or are you doing this with a RAMDISK to check protocol performance? The dd's should look something like this on the RHEL machine write: dd if=/dev/zero of=/path/to/target bs=1M count=32k (make sure the product of count * bs is greater than 2x system ram) read: dd if=/path/to/target of=/dev/null bs=1M count=32k If you are not getting 1.6 GB/s out of the file system locally, you won't get it out of the target over the network. The backing store is usually one of the slower aspects. For our units, this is what we are seeing: dd if=/dev/zero of=/data/big.file ... 
10240+0 records in 10240+0 records out 171798691840 bytes (172 GB) copied, 94.8258 seconds, 1.8 GB/s [root at jr5 ~]# dd if=/data/big.file of=/dev/null bs=16M 10240+0 records in 10240+0 records out 171798691840 bytes (172 GB) copied, 76.6224 seconds, 2.2 GB/s So our writes and reads through SCST should be less than 1.8 and 2.2 GB/s respectively. > And while reading, IOMeter eventually hangs the system (Windows > becomes unresponsive to GUI interaction). In this state, I see iostat Hmmm.... We had IOMeter running continuously over a 10GbE link to a SCST-based target at SC09. The backing store could provide ~700 MB/s, and we saw 500 MB/s for ~4 days running during the day (running benchmarks continuously all day long). > reporting transfers at the same low read rate from the target... so > there's IB traffic, but, given IOMeter's tasks are 10 minutes each, it > acts like it's a "skipping record" (sorry of you young folks don't > know what that is... but I can't think of another way to describe it) > and never moving on to the next benchmark, just endlessly repeating > the same I/O over and over again. If I unload then reload the mlx4_ib > driver on the target, then the Windows system quickly returns, but > IOMeter remains hung and needs killed. > > So, I have a lot of experimentation to do on the target in 1) > upgrading the target or changing the distro altogether and 2) using > OFED instead of built-in IB stack on the target to try to see if I can > budge this issue. > > But, I was wondering if somebody might have a hint on this _or_ have a > known target distro/kernel setup that works reliably w/ Windows-based > SRP initiators. SCST works (the versions we have used, 1.0.0, 1.0.1, ...) reliably with Windows initiator for XP, XP64, 2003, and 2008. Look in the windows error log, and see if you are getting driver timeouts. See if you have an updated driver. Regards, Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From worleys at gmail.com Sun Aug 9 11:19:23 2009 From: worleys at gmail.com (Chris Worley) Date: Sun, 9 Aug 2009 12:19:23 -0600 Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs In-Reply-To: <4A7F06C5.7030203@scalableinformatics.com> References: <4A7F06C5.7030203@scalableinformatics.com> Message-ID: On Sun, Aug 9, 2009 at 11:26 AM, Joe Landman wrote: > Chris Worley wrote: >> >> I'm running a target comprised of: RHEL5.2/2.6.18-92.el5 (fresh off >> the CD.. never updated) and it's embedded IB stack (not the latest >> OFED) w/ SCST rev 1029 8-Aug-2009 ("svn info"). >> >> I'm running a W2008S (fully patched) initiator w/ >> MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453. >> >> Using Mellanox QDR cards/switch. >> >> Writes over SRP, as measured from the initiator using IOMeter, get >> proper performance (i.e. 1.2GB/s). >> >> Reads get about 30% performance (i.e. 500MB/s instead of 1.6GB/s). > > Chris: > >  What is the backing store capable of?  That is, if you are doing, say dd's > streaming from disk, what rate do you see?  Or are you doing this with a > RAMDISK to check protocol performance? I tested my local performance before testing SRP. These are ioDrives. I'm running two, so the local performance is 1.6GB/s for reads. I've run up to four ioDrives through one QDR IB link w/ Linux host and initiator, and get 2.7GB/s to the initiator. This was using an upgraded distro on the target, and I'm testing this on someone elses machine and don't have permission to upgrade it yet. This could also be the rev of WinOF. 
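A minimal sketch of the sizing rule Joe gives for the dd test (count * bs greater than 2x system RAM) and of the throughput figure dd prints, assuming dd's decimal GB (10^9 bytes). The names dd_min_count and dd_gb_per_s are hypothetical helpers for illustration, not part of dd or any other tool:

```c
#include <stdint.h>

/*
 * Illustration only (hypothetical helper): smallest dd "count" such that
 * count * bs_bytes is strictly greater than 2 * ram_bytes, so the test
 * cannot be served from the page cache.
 */
static uint64_t dd_min_count(uint64_t ram_bytes, uint64_t bs_bytes)
{
	return (2 * ram_bytes) / bs_bytes + 1;
}

/*
 * Illustration only (hypothetical helper): throughput in decimal GB/s,
 * the unit dd uses in its summary line (1 GB = 1e9 bytes).
 */
static double dd_gb_per_s(uint64_t bytes, double seconds)
{
	return (double)bytes / seconds / 1e9;
}
```

With 16 GiB of RAM and bs=1M this gives a count of 32769, just above the 32k used in the example commands, so count=32k only satisfies the rule on machines with less than 16 GiB. Applied to the figures Joe quotes, 171798691840 bytes in 94.8258 seconds is about 1.81 GB/s and the same amount in 76.6224 seconds is about 2.24 GB/s, consistent with the 1.8 GB/s and 2.2 GB/s that dd reports.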
> >  The dd's should look something like this on the RHEL machine > >        write: > >                dd if=/dev/zero of=/path/to/target bs=1M count=32k > > (make sure the product of count * bs is greater than 2x system ram) > >        read: > >                dd if=/path/to/target of=/dev/null bs=1M count=32k > > If you are not getting 1.6 GB/s out of the file system locally, you won't > get it out of the target over the network. 1.6GB/s out of two ioDrives is no problem locally. >  The backing store is usually one > of the slower aspects. > > For our units, this is what we are seeing: > > dd if=/dev/zero of=/data/big.file ... > 10240+0 records in > 10240+0 records out > 171798691840 bytes (172 GB) copied, 94.8258 seconds, 1.8 GB/s > > [root at jr5 ~]# dd if=/data/big.file of=/dev/null bs=16M > 10240+0 records in > 10240+0 records out > 171798691840 bytes (172 GB) copied, 76.6224 seconds,  2.2 GB/s > > So our writes and reads through SCST should be less than 1.8 and 2.2 GB/s > respectively. > >> And while reading, IOMeter eventually hangs the system (Windows >> becomes unresponsive to GUI interaction).  In this state, I see iostat > > Hmmm....  We had IOMeter running continuously over a 10GbE link to a > SCST-based target at SC09.  The backing store could provide ~700 MB/s, and > we saw 500 MB/s for ~4 days  running during the day (running benchmarks > continuously all day long). > >> reporting transfers at the same low read rate from the target... so >> there's IB traffic, but, given IOMeter's tasks are 10 minutes each, it >> acts like it's a "skipping record" (sorry of you young folks don't >> know what that is... but I can't think of another way to describe it) >> and never moving on to the next benchmark, just endlessly repeating >> the same I/O over and over again.  If I unload then reload the mlx4_ib >> driver on the target, then the Windows system quickly returns, but >> IOMeter remains hung and needs killed. 
>> >> So, I have a lot of experimentation to do on the target in 1) >> upgrading the target or changing the distro altogether and 2) using >> OFED instead of built-in IB stack on the target to try to see if I can >> budge this issue. >> >> But, I was wondering if somebody might have a hint on this _or_ have a >> known target distro/kernel setup that works reliably w/ Windows-based >> SRP initiators. > > SCST works (the versions we have used, 1.0.0, 1.0.1, ...) reliably with > Windows initiator for XP, XP64, 2003, and 2008.  Look in the windows error > log, and see if you are getting driver timeouts.  See if you have an updated > driver. I'm worried more about the underlying IB stack and kernel on the target side. It would be best to know exactly which distro, kernel, and OFED revisions (unless you're using the distro's built-in IB stack) you're using on the target. The WinOF version you're using on the Windows side would be helpful info too. Can you relay these? Thanks, Chris > > Regards, > > Joe > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics, Inc. > email: landman at scalableinformatics.com > web  : http://scalableinformatics.com >       http://scalableinformatics.com/jackrabbit > phone: +1 734 786 8423 x121 > fax  : +1 866 888 3112 > cell : +1 734 612 4615 > From landman at scalableinformatics.com Sun Aug 9 12:06:03 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sun, 09 Aug 2009 15:06:03 -0400 Subject: ***SPAM*** Re: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs In-Reply-To: References: <4A7F06C5.7030203@scalableinformatics.com> Message-ID: <4A7F1E1B.5010300@scalableinformatics.com> Chris Worley wrote: > I'm worried more about the underlying IB stack and kernel on the > target side. It would be best to know exactly which distro, kernel, We see good performance with Centos/RedHat and Ubuntu. We build our own kernel (due to performance/stability issues we see under load with distro kernels). 
Ours is a 2.6.28.7. We are testing some 2.6.30.x kernels now as well. OFED 1.4 now, we used 1.3.x last year. > and OFED revisions (unless you're using the distro's built-in IB > stack) you're using on the target. The WinOF version you're using on > the Windows side would be helpful info too. Can you relay these? WinOF version ... whatever was default installed on the units. > > Thanks, > > Chris >> Regards, >> >> Joe >> >> -- >> Joseph Landman, Ph.D >> Founder and CEO >> Scalable Informatics, Inc. >> email: landman at scalableinformatics.com >> web : http://scalableinformatics.com >> http://scalableinformatics.com/jackrabbit >> phone: +1 734 786 8423 x121 >> fax : +1 866 888 3112 >> cell : +1 734 612 4615 >> -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From marcin.slusarz at gmail.com Sun Aug 9 12:54:05 2009 From: marcin.slusarz at gmail.com (Marcin Slusarz) Date: Sun, 9 Aug 2009 21:54:05 +0200 Subject: [ofa-general] [PATCH 10/14] infiniband: use printk_once In-Reply-To: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com> References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com> Message-ID: <1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com> Signed-off-by: Marcin Slusarz Cc: Roland Dreier Cc: Sean Hefty Cc: Hal Rosenstock Cc: general at lists.openfabrics.org --- drivers/infiniband/hw/cxgb3/iwch.c | 4 +--- drivers/infiniband/hw/mlx4/main.c | 6 +----- 2 files changed, 2 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c index 26fc0a4..9cc99df 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.c +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -105,11 +105,9 @@ static void rnic_init(struct iwch_dev *rnicp) static void open_rnic_dev(struct t3cdev *tdev) { struct iwch_dev 
*rnicp; - static int vers_printed; PDBG("%s t3cdev %p\n", __func__, tdev); - if (!vers_printed++) - printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n", + printk_once(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n", DRV_VERSION); rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp)); if (!rnicp) { diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index ae3d759..0b2f77a 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -540,15 +540,11 @@ static struct device_attribute *mlx4_class_attributes[] = { static void *mlx4_ib_add(struct mlx4_dev *dev) { - static int mlx4_ib_version_printed; struct mlx4_ib_dev *ibdev; int num_ports = 0; int i; - if (!mlx4_ib_version_printed) { - printk(KERN_INFO "%s", mlx4_ib_version); - ++mlx4_ib_version_printed; - } + printk_once(KERN_INFO "%s", mlx4_ib_version); mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) num_ports++; -- 1.6.3.3 From nashwath at gmail.com Sun Aug 9 19:07:57 2009 From: nashwath at gmail.com (Ashwath Narasimhan) Date: Sun, 9 Aug 2009 22:07:57 -0400 Subject: [ofa-general] Setting the Credits. Message-ID: Hi Jason/All, Thank you for your response. Do you mean the link-layer flow control (VLs), or the end-to-end flow control credits of the transport layer? How do I set the end-to-end flow control credits? I looked at the driver source code and the file ipath_qp.c interested me. Here they calculate the credits based on the difference between the head and tail pointers of the QP's receive queue (see drivers/infiniband/hw/ipath). Should I change the size of these queues? Am I even looking at the right file? Regards, Ashwath. On Thu, Aug 6, 2009 at 5:12 PM, Jason Gunthorpe < jgunthorpe at obsidianresearch.com> wrote: > On Wed, Aug 05, 2009 at 08:03:04PM -0400, Ashwath Narasimhan wrote: > > > The reason why I need such small rates is because I interface the > > Infiniband HCA to an FPGA via an Infiniband physical link. 
Imagine > > the FPGA as a simple repeater that simply forwards the infiniband > > signals to the Target HCA. The FPGA cannot handle such a high data > > rate and neither do I have as much memory as required to buffer it > > on the FPGA (I might drop packets if the buffer becomes full). Hence > > I wish to limit the rate to say 100Mbps instead of 2.5Gbps. > > The correct thing to do is manage the flow control credits you are > giving to the IB network so you don't lose packets. > > Jason > -- regards, Ashwath From rdreier at cisco.com Sun Aug 9 22:00:31 2009 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 09 Aug 2009 22:00:31 -0700 Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once In-Reply-To: <1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com> (Marcin Slusarz's message of "Sun, 9 Aug 2009 21:54:05 +0200") References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com> <1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com> Message-ID: > drivers/infiniband/hw/cxgb3/iwch.c | 4 +--- > drivers/infiniband/hw/mlx4/main.c | 6 +----- > --- a/drivers/infiniband/hw/mlx4/main.c > +++ b/drivers/infiniband/hw/mlx4/main.c > @@ -540,15 +540,11 @@ static struct device_attribute *mlx4_class_attributes[] = { > > static void *mlx4_ib_add(struct mlx4_dev *dev) > { > - static int mlx4_ib_version_printed; > struct mlx4_ib_dev *ibdev; > int num_ports = 0; > int i; > > - if (!mlx4_ib_version_printed) { > - printk(KERN_INFO "%s", mlx4_ib_version); > - ++mlx4_ib_version_printed; > - } > + printk_once(KERN_INFO "%s", mlx4_ib_version); > > mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) > num_ports++; Looks fine, but there is near-identical code in drivers/infiniband/hw/mthca/mthca_main.c that you might as well convert too. 
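For reference, the idiom under discussion (a static flag guarding a single printk, folded into a print-once macro) can be sketched in user-space C. This is an illustration under assumptions, not the kernel's definition of printk_once: PRINT_ONCE and add_device are hypothetical names, printf stands in for printk, and the ({ ... }) statement expression is a GCC/Clang extension.

```c
#include <stdio.h>

/*
 * Illustration only: a print-once macro in the style of printk_once().
 * Each place the macro is used gets its own static flag, so the message
 * is printed the first time that call site runs and never again.
 * The macro evaluates to 1 if it printed on this call, 0 otherwise.
 */
#define PRINT_ONCE(...)                                 \
	({                                              \
		static int printed_once_;               \
		int did_print_ = 0;                     \
		if (!printed_once_) {                   \
			printed_once_ = 1;              \
			printf(__VA_ARGS__);            \
			did_print_ = 1;                 \
		}                                       \
		did_print_;                             \
	})

/*
 * Hypothetical stand-in for a per-device probe routine such as
 * mlx4_ib_add(): it may run once per device, but the version banner
 * should appear only on the first call.
 */
static int add_device(const char *version)
{
	return PRINT_ONCE("%s", version);
}
```

Calling add_device() for a second and third device prints nothing further, which is the behaviour the open-coded static counters in the patch provided before the conversion.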
Thanks, Roland From jackm at dev.mellanox.co.il Sun Aug 9 23:36:26 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 10 Aug 2009 09:36:26 +0300 Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once In-Reply-To: References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com> <1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com> Message-ID: <200908100936.26963.jackm@dev.mellanox.co.il> I'm a bit nervous about this one. printk_once will print once ONLY if CONFIG_PRINTK is set in include/linux/autoconf.h (i.e., when the kernel is configured). Otherwise, it gets defined to printk -- and it will always print in this case. (see 2.6.30.xx kernel include file "include/linux/kernel.h", lines 235, 249, and 272). Do you think that distributions will ALWAYS have CONFIG_PRINTK defined? I would prefer to wait until printk_once printing only once is not config-dependent. -Jack On Monday 10 August 2009 08:00, Roland Dreier wrote: > > > drivers/infiniband/hw/cxgb3/iwch.c | 4 +--- > > drivers/infiniband/hw/mlx4/main.c | 6 +----- > > > --- a/drivers/infiniband/hw/mlx4/main.c > > +++ b/drivers/infiniband/hw/mlx4/main.c > > @@ -540,15 +540,11 @@ static struct device_attribute *mlx4_class_attributes[] = { > > > > static void *mlx4_ib_add(struct mlx4_dev *dev) > > { > > - static int mlx4_ib_version_printed; > > struct mlx4_ib_dev *ibdev; > > int num_ports = 0; > > int i; > > > > - if (!mlx4_ib_version_printed) { > > - printk(KERN_INFO "%s", mlx4_ib_version); > > - ++mlx4_ib_version_printed; > > - } > > + printk_once(KERN_INFO "%s", mlx4_ib_version); > > > > mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) > > num_ports++; > > Looks fine but there is near-identical code in > drivers/infiniband/hw/mthca/mthca_main.c that you might as well convert > too. 
> > Thanks, > Roland > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eli at dev.mellanox.co.il Mon Aug 10 01:45:27 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 10 Aug 2009 11:45:27 +0300 Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has not allocated In-Reply-To: References: Message-ID: <20090810084527.GA2446@mtls03> Looking at mlx4_write_mtt_chunk() I see that it calls mlx4_table_find() with a pointer to single dma_addr_t - dma_handle - while the dma addresses for the ICM memory is actually a list of different addresses covering possibly different sizes. I think mlx4_table_find() should be changed to support that, and then we can use calls to dma_sync_single_for_cpu()/dma_sync_single_for_device() with the correct dma addresses. Roland, what do you think? On Sat, Aug 08, 2009 at 07:49:22PM +0200, Bart Van Assche wrote: > Hello, > > Has anyone ever encountered a message like the one below ? This message was > generated while booting a 2.6.30.4 kernel with CONFIG_DMA_API_DEBUG=y and > before any out-of-tree kernel modules were loaded. 
> > ------------[ cut here ]------------ > WARNING: at lib/dma-debug.c:635 check_sync+0x47c/0x4b0() > Hardware name: P5Q DELUXE > mlx4_core 0000:01:00.0: DMA-API: device driver tries to sync DMA memory it > has not allocated [device address=0x0000000139482000] [size=4096 bytes] > Modules linked in: snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel > snd_hda_codec snd_hwdep snd_pcm snd_timer snd rtc_cmos soundcore i2c_i801 > rtc_core hid_belkin mlx4_core( > +) rtc_lib sr_mod sg snd_page_alloc pcspkr button intel_agp i2c_core joydev > serio_raw cdrom usbhid hid raid456 raid6_pq async_xor async_memcpy async_tx > xor raid0 sd_mod crc_t10dif > ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd fan ide_pci_generic > ide_core ata_generic ata_piix pata_marvell ahci libata scsi_mod thermal > processor thermal_sys hwmon > Pid: 1325, comm: work_for_cpu Not tainted 2.6.30.4-scst-debug #6 > Call Trace: > [] ? check_sync+0x47c/0x4b0 > [] warn_slowpath_common+0x78/0xd0 > [] warn_slowpath_fmt+0x3c/0x40 > [] ? _spin_lock_irqsave+0x49/0x60 > [] ? check_sync+0xab/0x4b0 > [] check_sync+0x47c/0x4b0 > [] ? mark_held_locks+0x6c/0x90 > [] debug_dma_sync_single_for_cpu+0x1d/0x20 > [] mlx4_write_mtt+0x159/0x1e0 [mlx4_core] > [] mlx4_create_eq+0x222/0x650 [mlx4_core] > [] ? trace_hardirqs_on+0xd/0x10 > [] mlx4_init_eq_table+0x1c5/0x4a0 [mlx4_core] > [] mlx4_setup_hca+0x98/0x550 [mlx4_core] > [] ? __mlx4_init_one+0x8d1/0x920 [mlx4_core] > [] __mlx4_init_one+0x371/0x920 [mlx4_core] > [] mlx4_init_one+0x22/0x44 [mlx4_core] > [] ? do_work_for_cpu+0x0/0x30 > [] local_pci_probe+0x12/0x20 > [] do_work_for_cpu+0x13/0x30 > [] kthread+0x56/0x90 > [] child_rip+0xa/0x20 > [] ? restore_args+0x0/0x30 > [] ? kthread+0x0/0x90 > [] ? child_rip+0x0/0x20 > ---[ end trace 4480af29bc755c6a ]--- > > Bart. 
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From vlad at lists.openfabrics.org Mon Aug 10 03:00:05 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 10 Aug 2009 03:00:05 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090810-0200 daily build status Message-ID: <20090810100005.E7592E61E29@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_add_one': 
/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2365: error: 'struct device' has no member named 'class' /home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.c:2367: error: implicit declaration of function 'dev_set_name' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt/ib_srpt.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/srpt] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090810-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From bart.vanassche at gmail.com Mon Aug 10 03:40:48 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Mon, 10 Aug 2009 12:40:48 +0200 Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs In-Reply-To: References: Message-ID: On Sun, Aug 9, 2009 at 7:09 PM, Chris 
Worley wrote: > I'm running a target comprised of: RHEL5.2/2.6.18-92.el5 (fresh off > the CD.. never updated) and it's embedded IB stack (not the latest > OFED) w/ SCST rev 1029 8-Aug-2009 ("svn info"). > > I'm running a W2008S (fully patched) initiator w/ > MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453. > > Using Mellanox QDR cards/switch. > > Writes over SRP, as measured from the initiator using IOMeter, get > proper performance (i.e. 1.2GB/s). > > Reads get about 30% performance (i.e. 500MB/s instead of 1.6GB/s). > And while reading, IOMeter eventually hangs the system (Windows > becomes unresponsive to GUI interaction). In this state, I see iostat > reporting transfers at the same low read rate from the target... so > there's IB traffic, but, given IOMeter's tasks are 10 minutes each, it > acts like it's a "skipping record" (sorry of you young folks don't > know what that is... but I can't think of another way to describe it) > and never moving on to the next benchmark, just endlessly repeating > the same I/O over and over again. If I unload then reload the mlx4_ib > driver on the target, then the Windows system quickly returns, but > IOMeter remains hung and needs killed. > The throughput of the SRP protocol strongly depends on the block size used for I/O. The results I obtained with IOmeter are: * For a block size of 32 KB: 396 MB/s for reading and 321 MB/s for writing. * For a block size of 1 MB: 1383 MB/s for reading and 1151 MB/s for writing. These results are about 90% of the throughput obtained with dd. Setup details: * Two Mellanox ConnectX DDR cards connected back to back, operating in PCIe 2.0 mode. * Target: vanilla 2.6.30.4 kernel + SCST patches + the two patches attached to http://bugzilla.kernel.org/show_bug.cgi?id=13757 + SCST r1030. * Initiator: openSUSE 11.0 (contains a patched 2.6.27.25 kernel) with openSUSE 11.0 OFED components + Linux version of IOmeter's dynamo + IOmeter GUI running in a virtual machine. 
* I/O-scheduler used by SRP initiator: noop. Bart. From hnrose at comcast.net Mon Aug 10 06:13:20 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Mon, 10 Aug 2009 09:13:20 -0400 Subject: [ofa-general] [PATCH] opensm/osm_sm_mad_ctrl.c: In sm_mad_ctrl_send_err_cb, set init failure on PKeyTable and QoS initialization failure Message-ID: <20090810131319.GA14915@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_sm_mad_ctrl.c b/opensm/opensm/osm_sm_mad_ctrl.c index 791c848..f0bc407 100644 --- a/opensm/opensm/osm_sm_mad_ctrl.c +++ b/opensm/opensm/osm_sm_mad_ctrl.c @@ -723,7 +723,10 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw) (p_smp->attr_id == IB_MAD_ATTR_PORT_INFO || p_smp->attr_id == IB_MAD_ATTR_MCAST_FWD_TBL || p_smp->attr_id == IB_MAD_ATTR_SWITCH_INFO || - p_smp->attr_id == IB_MAD_ATTR_LIN_FWD_TBL)) { + p_smp->attr_id == IB_MAD_ATTR_LIN_FWD_TBL || + p_smp->attr_id == IB_MAD_ATTR_P_KEY_TABLE || + p_smp->attr_id == IB_MAD_ATTR_SLVL_TABLE || + p_smp->attr_id == IB_MAD_ATTR_VL_ARBITRATION)) { OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3119: " "Set method failed for attribute 0x%X (%s)\n", cl_ntoh16(p_smp->attr_id), From hal.rosenstock at gmail.com Mon Aug 10 06:45:58 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 10 Aug 2009 09:45:58 -0400 Subject: [ofa-general] [PATCHv4 0/10] RDMAoE support In-Reply-To: <20090805082751.GA5599@mtls03> References: <20090805082751.GA5599@mtls03> Message-ID: On Wed, Aug 5, 2009 at 4:27 AM, Eli Cohen wrote: > RDMA over Ethernet (RDMAoE) allows running the IB transport protocol using > Ethernet frames, enabling the deployment of IB semantics on lossless > Ethernet > fabrics. RDMAoE packets are standard Ethernet frames with an IEEE assigned > Ethertype, a GRH, unmodified IB transport headers and payload. 
IB subnet > management and SA services are not required for RDMAoE operation; Ethernet > management practices are used instead. RDMAoE encodes IP addresses into its > GIDs and resolves MAC addresses using the host IP stack. For multicast > GIDs, > standard IP to MAC mappings apply. > > To support RDMAoE, a new transport protocol was added to the IB core. An > RDMA > device can have ports with different transports, which are identified by a > port > transport attribute. The RDMA Verbs API is syntactically unmodified. When > referring to RDMAoE ports, Address handles are required to contain GIDs > while > LID fields are ignored. The Ethernet L2 information is subsequently > obtained by > the vendor-specific driver (both in kernel- and user-space) while modifying > QPs > to RTR and creating address handles. As there is no SA in RDMAoE, the CMA > code > is modified to fill the necessary path record attributes locally before > sending > CM packets. Similarly, the CMA provides to the user the required address > handle > attributes when processing SIDR requests and joining multicast groups. > > In this patch set, an RDMAoE port is currently assigned a single GID, > encoding > the IPv6 link-local address of the corresponding netdev; the CMA RDMAoE > code > temporarily uses IPv6 link-local addresses as GIDs instead of the IP > address > provided by the user, thereby supporting any IP address. In addition, > multicast > packets currently use the broadcast MAC. > > To enable RDMAoE with the mlx4 driver stack, both the mlx4_en and mlx4_ib > drivers must be loaded, and the netdevice for the corresponding RDMAoE port > must be running. Individual ports of a multi port HCA can be independently > configured as Ethernet (with support for RDMAoE) or IB, as is already the > case. How is port configuration (RDMAoE v. IB) accomplished ? Is it prior to boot time or dynamic ? -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hal.rosenstock at gmail.com Mon Aug 10 06:56:43 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 10 Aug 2009 09:56:43 -0400 Subject: [ofa-general] [PATCHv4 06/10] ib_core: CMA device binding In-Reply-To: <20090805082929.GG5599@mtls03> References: <20090805082929.GG5599@mtls03> Message-ID: On Wed, Aug 5, 2009 at 4:29 AM, Eli Cohen wrote: > Add support for RDMAoE device binding and IP --> GID resolution. Path > resolving > and multicast joining are implemented within cma.c by filling the responses > and > pushing the callbacks to the cma work queue. IP->GID resolution always > yield > IPv6 link local addresses - remote GIDs are derived from the destination > MAC > address of the remote port. Multicast GIDs are always mapped to broadcast > MAC > (all FFs). Some helper functions are added to ib_addr.h. > > Signed-off-by: Eli Cohen > --- > drivers/infiniband/core/cma.c | 150 > ++++++++++++++++++++++++++++++++++++++- > drivers/infiniband/core/ucma.c | 25 +++++-- > include/rdma/ib_addr.h | 87 +++++++++++++++++++++++ > 3 files changed, 251 insertions(+), 11 deletions(-) > > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index 866ff7f..8f5675b 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -58,6 +58,7 @@ MODULE_LICENSE("Dual BSD/GPL"); > #define CMA_CM_RESPONSE_TIMEOUT 20 > #define CMA_MAX_CM_RETRIES 15 > #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24) > +#define RDMAOE_PACKET_LIFETIME 18 > > static void cma_add_one(struct ib_device *device); > static void cma_remove_one(struct ib_device *device); > @@ -174,6 +175,12 @@ struct cma_ndev_work { > struct rdma_cm_event event; > }; > > +struct rdmaoe_mcast_work { > + struct work_struct work; > + struct rdma_id_private *id; > + struct cma_multicast *mc; > +}; > + > union cma_ip_addr { > struct in6_addr ip6; > struct { > @@ -348,6 +355,9 @@ static int cma_acquire_dev(struct rdma_id_private > *id_priv) > case 
RDMA_TRANSPORT_IWARP: > iw_addr_get_sgid(dev_addr, &gid); > break; > + case RDMA_TRANSPORT_RDMAOE: > + rdmaoe_addr_get_sgid(dev_addr, &gid); > + break; > default: > return -ENODEV; > } > @@ -576,10 +586,16 @@ static int cma_ib_init_qp_attr(struct rdma_id_private > *id_priv, > { > struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; > int ret; > + u16 pkey; > + > + if (rdma_port_get_transport(id_priv->id.device, > id_priv->id.port_num) == > + RDMA_TRANSPORT_IB) > + pkey = ib_addr_get_pkey(dev_addr); > + else > + pkey = 0xffff; > > ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, > - ib_addr_get_pkey(dev_addr), > - &qp_attr->pkey_index); > + pkey, &qp_attr->pkey_index); > if (ret) > return ret; > > @@ -609,6 +625,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct > ib_qp_attr *qp_attr, > id_priv = container_of(id, struct rdma_id_private, id); > switch (rdma_port_get_transport(id_priv->id.device, > id_priv->id.port_num)) { > case RDMA_TRANSPORT_IB: > + case RDMA_TRANSPORT_RDMAOE: > if (!id_priv->cm_id.ib || cma_is_ud_ps(id_priv->id.ps)) > ret = cma_ib_init_qp_attr(id_priv, qp_attr, > qp_attr_mask); > else > @@ -836,7 +853,9 @@ static void cma_leave_mc_groups(struct rdma_id_private > *id_priv) > mc = container_of(id_priv->mc_list.next, > struct cma_multicast, list); > list_del(&mc->list); > - ib_sa_free_multicast(mc->multicast.ib); > + if (rdma_port_get_transport(id_priv->cma_dev->device, > id_priv->id.port_num) == > + RDMA_TRANSPORT_IB) > + ib_sa_free_multicast(mc->multicast.ib); > kref_put(&mc->mcref, release_mc); > } > } > @@ -855,6 +874,7 @@ void rdma_destroy_id(struct rdma_cm_id *id) > mutex_unlock(&lock); > switch (rdma_port_get_transport(id_priv->id.device, > id_priv->id.port_num)) { > case RDMA_TRANSPORT_IB: > + case RDMA_TRANSPORT_RDMAOE: > if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) > ib_destroy_cm_id(id_priv->cm_id.ib); > break; > @@ -1512,6 +1532,7 @@ int rdma_listen(struct rdma_cm_id *id, int backlog) > if 
(id->device) { > switch (rdma_port_get_transport(id->device, id->port_num)) { > case RDMA_TRANSPORT_IB: > + case RDMA_TRANSPORT_RDMAOE: > ret = cma_ib_listen(id_priv); > if (ret) > goto err; > @@ -1727,6 +1748,65 @@ static int cma_resolve_iw_route(struct > rdma_id_private *id_priv, int timeout_ms) > return 0; > } > > +static int cma_resolve_rdmaoe_route(struct rdma_id_private *id_priv) > +{ > + struct rdma_route *route = &id_priv->id.route; > + struct rdma_addr *addr = &route->addr; > + struct cma_work *work; > + int ret; > + struct sockaddr_in *src_addr = (struct sockaddr_in > *)&route->addr.src_addr; > + struct sockaddr_in *dst_addr = (struct sockaddr_in > *)&route->addr.dst_addr; > + > + if (src_addr->sin_family != dst_addr->sin_family) > + return -EINVAL; > + > + work = kzalloc(sizeof *work, GFP_KERNEL); > + if (!work) > + return -ENOMEM; > + > + work->id = id_priv; > + INIT_WORK(&work->work, cma_work_handler); > + > + route->path_rec = kzalloc(sizeof *route->path_rec, GFP_KERNEL); > + if (!route->path_rec) { > + ret = -ENOMEM; > + goto err; > + } > + > + route->num_paths = 1; > + > + rdmaoe_mac_to_ll(&route->path_rec->sgid, > addr->dev_addr.src_dev_addr); > + rdmaoe_mac_to_ll(&route->path_rec->dgid, > addr->dev_addr.dst_dev_addr); > + > + route->path_rec->hop_limit = 2; Does HopLimit need to be 2 ? Isn't this all subnet local ? 
> > + route->path_rec->reversible = 1; > + route->path_rec->pkey = cpu_to_be16(0xffff); > + route->path_rec->mtu_selector = 2; > + route->path_rec->mtu = rdmaoe_get_mtu(addr->dev_addr.src_dev->mtu); > + route->path_rec->rate_selector = 2; > + route->path_rec->rate = rdmaoe_get_rate(addr->dev_addr.src_dev); > + route->path_rec->packet_life_time_selector = 2; > + route->path_rec->packet_life_time = RDMAOE_PACKET_LIFETIME; > + > + work->old_state = CMA_ROUTE_QUERY; > + work->new_state = CMA_ROUTE_RESOLVED; > + if (!route->path_rec->mtu || !route->path_rec->rate) { > + work->event.event = RDMA_CM_EVENT_ROUTE_ERROR; > + work->event.status = -1; > + } else { > + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; > + work->event.status = 0; > + } > + > + queue_work(cma_wq, &work->work); > + > + return 0; > + > +err: > + kfree(work); > + return ret; > +} > + > int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) > { > struct rdma_id_private *id_priv; > @@ -1744,6 +1824,9 @@ int rdma_resolve_route(struct rdma_cm_id *id, int > timeout_ms) > case RDMA_TRANSPORT_IWARP: > ret = cma_resolve_iw_route(id_priv, timeout_ms); > break; > + case RDMA_TRANSPORT_RDMAOE: > + ret = cma_resolve_rdmaoe_route(id_priv); > + break; > default: > ret = -ENOSYS; > break; > @@ -2419,6 +2502,7 @@ int rdma_connect(struct rdma_cm_id *id, struct > rdma_conn_param *conn_param) > > switch (rdma_port_get_transport(id->device, id->port_num)) { > case RDMA_TRANSPORT_IB: > + case RDMA_TRANSPORT_RDMAOE: > if (cma_is_ud_ps(id->ps)) > ret = cma_resolve_ib_udp(id_priv, conn_param); > else > @@ -2532,6 +2616,7 @@ int rdma_accept(struct rdma_cm_id *id, struct > rdma_conn_param *conn_param) > > switch (rdma_port_get_transport(id->device, id->port_num)) { > case RDMA_TRANSPORT_IB: > + case RDMA_TRANSPORT_RDMAOE: > if (cma_is_ud_ps(id->ps)) > ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS, > conn_param->private_data, > @@ -2593,6 +2678,7 @@ int rdma_reject(struct rdma_cm_id *id, const void > *private_data, 
> > switch (rdma_port_get_transport(id->device, id->port_num)) { > case RDMA_TRANSPORT_IB: > + case RDMA_TRANSPORT_RDMAOE: > if (cma_is_ud_ps(id->ps)) > ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, > private_data, > private_data_len); > @@ -2624,6 +2710,7 @@ int rdma_disconnect(struct rdma_cm_id *id) > > switch (rdma_port_get_transport(id->device, id->port_num)) { > case RDMA_TRANSPORT_IB: > + case RDMA_TRANSPORT_RDMAOE: > ret = cma_modify_qp_err(id_priv); > if (ret) > goto out; > @@ -2752,6 +2839,55 @@ static int cma_join_ib_multicast(struct > rdma_id_private *id_priv, > return 0; > } > > + > +static void rdmaoe_mcast_work_handler(struct work_struct *work) > +{ > + struct rdmaoe_mcast_work *mw = container_of(work, struct > rdmaoe_mcast_work, work); > + struct cma_multicast *mc = mw->mc; > + struct ib_sa_multicast *m = mc->multicast.ib; > + > + mc->multicast.ib->context = mc; > + cma_ib_mc_handler(0, m); > + kfree(m); > + kfree(mw); > +} > + > +static int cma_rdmaoe_join_multicast(struct rdma_id_private *id_priv, > + struct cma_multicast *mc) > +{ > + struct rdmaoe_mcast_work *work; > + struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; > + > + if (cma_zero_addr((struct sockaddr *)&mc->addr)) > + return -EINVAL; > + > + work = kzalloc(sizeof *work, GFP_KERNEL); > + if (!work) > + return -ENOMEM; > + > + mc->multicast.ib = kzalloc(sizeof(struct ib_sa_multicast), > GFP_KERNEL); > + if (!mc->multicast.ib) { > + kfree(work); > + return -ENOMEM; > + } > + > + cma_set_mgid(id_priv, (struct sockaddr *)&mc->addr, > &mc->multicast.ib->rec.mgid); > + mc->multicast.ib->rec.pkey = cpu_to_be16(0xffff); > + if (id_priv->id.ps == RDMA_PS_UDP) > + mc->multicast.ib->rec.qkey = cpu_to_be32(RDMA_UDP_QKEY); > + mc->multicast.ib->rec.rate = rdmaoe_get_rate(dev_addr->src_dev); > + mc->multicast.ib->rec.hop_limit = 1; Similar to the unicast comment above, is HopLimit 1 needed for multicast ? 
-- Hal > > + mc->multicast.ib->rec.mtu = rdmaoe_get_mtu(dev_addr->src_dev->mtu); > + rdmaoe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid); > + work->id = id_priv; > + work->mc = mc; > + INIT_WORK(&work->work, rdmaoe_mcast_work_handler); > + > + queue_work(cma_wq, &work->work); > + > + return 0; > +} > + > int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, > void *context) > { > @@ -2782,6 +2918,9 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct > sockaddr *addr, > case RDMA_TRANSPORT_IB: > ret = cma_join_ib_multicast(id_priv, mc); > break; > + case RDMA_TRANSPORT_RDMAOE: > + ret = cma_rdmaoe_join_multicast(id_priv, mc); > + break; > default: > ret = -ENOSYS; > break; > @@ -2793,6 +2932,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct > sockaddr *addr, > spin_unlock_irq(&id_priv->lock); > kfree(mc); > } > + > return ret; > } > EXPORT_SYMBOL(rdma_join_multicast); > @@ -2813,7 +2953,9 @@ void rdma_leave_multicast(struct rdma_cm_id *id, > struct sockaddr *addr) > ib_detach_mcast(id->qp, > &mc->multicast.ib->rec.mgid, > mc->multicast.ib->rec.mlid); > - ib_sa_free_multicast(mc->multicast.ib); > + if > (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num) == > + RDMA_TRANSPORT_IB) > + ib_sa_free_multicast(mc->multicast.ib); > kref_put(&mc->mcref, release_mc); > return; > } > diff --git a/drivers/infiniband/core/ucma.c > b/drivers/infiniband/core/ucma.c > index 24d9510..c7c9e92 100644 > --- a/drivers/infiniband/core/ucma.c > +++ b/drivers/infiniband/core/ucma.c > @@ -553,7 +553,8 @@ static ssize_t ucma_resolve_route(struct ucma_file > *file, > } > > static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp, > - struct rdma_route *route) > + struct rdma_route *route, > + enum rdma_transport_type tt) > { > struct rdma_dev_addr *dev_addr; > > @@ -561,10 +562,17 @@ static void ucma_copy_ib_route(struct > rdma_ucm_query_route_resp *resp, > switch (route->num_paths) { > case 0: > dev_addr = 
&route->addr.dev_addr; > - ib_addr_get_dgid(dev_addr, > - (union ib_gid *) &resp->ib_route[0].dgid); > - ib_addr_get_sgid(dev_addr, > - (union ib_gid *) &resp->ib_route[0].sgid); > + if (tt == RDMA_TRANSPORT_IB) { > + ib_addr_get_dgid(dev_addr, > + (union ib_gid *) > &resp->ib_route[0].dgid); > + ib_addr_get_sgid(dev_addr, > + (union ib_gid *) > &resp->ib_route[0].sgid); > + } else { > + rdmaoe_mac_to_ll((union ib_gid *) > &resp->ib_route[0].dgid, > + dev_addr->dst_dev_addr); > + rdmaoe_addr_get_sgid(dev_addr, > + (union ib_gid *) > &resp->ib_route[0].sgid); > + } > resp->ib_route[0].pkey = > cpu_to_be16(ib_addr_get_pkey(dev_addr)); > break; > case 2: > @@ -589,6 +597,7 @@ static ssize_t ucma_query_route(struct ucma_file *file, > struct ucma_context *ctx; > struct sockaddr *addr; > int ret = 0; > + enum rdma_transport_type tt; > > if (out_len < sizeof(resp)) > return -ENOSPC; > @@ -614,9 +623,11 @@ static ssize_t ucma_query_route(struct ucma_file > *file, > > resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid; > resp.port_num = ctx->cm_id->port_num; > - switch (rdma_port_get_transport(ctx->cm_id->device, > ctx->cm_id->port_num)) { > + tt = rdma_port_get_transport(ctx->cm_id->device, > ctx->cm_id->port_num); > + switch (tt) { > case RDMA_TRANSPORT_IB: > - ucma_copy_ib_route(&resp, &ctx->cm_id->route); > + case RDMA_TRANSPORT_RDMAOE: > + ucma_copy_ib_route(&resp, &ctx->cm_id->route, tt); > break; > default: > break; > diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h > index 483057b..66a848e 100644 > --- a/include/rdma/ib_addr.h > +++ b/include/rdma/ib_addr.h > @@ -39,6 +39,8 @@ > #include > #include > #include > +#include > +#include > > struct rdma_addr_client { > atomic_t refcount; > @@ -157,4 +159,89 @@ static inline void iw_addr_get_dgid(struct > rdma_dev_addr *dev_addr, > memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid); > } > > +static inline void rdmaoe_mac_to_ll(union ib_gid *gid, u8 *mac) > +{ > + memset(gid->raw, 0, 16); > + *((u32 
*)gid->raw) = cpu_to_be32(0xfe800000); > + gid->raw[12] = 0xfe; > + gid->raw[11] = 0xff; > + memcpy(gid->raw + 13, mac + 3, 3); > + memcpy(gid->raw + 8, mac, 3); > + gid->raw[8] ^= 2; > +} > + > +static inline void rdmaoe_addr_get_sgid(struct rdma_dev_addr *dev_addr, > + union ib_gid *gid) > +{ > + rdmaoe_mac_to_ll(gid, dev_addr->src_dev_addr); > +} > + > +static inline enum ib_mtu rdmaoe_get_mtu(int mtu) > +{ > + /* > + * reduce IB headers from effective RDMAoE MTU. 28 stands for > + * atomic header which is the biggest possible header after BTH > + */ > + mtu = mtu - IB_GRH_BYTES - IB_BTH_BYTES - 28; > + > + if (mtu >= ib_mtu_enum_to_int(IB_MTU_4096)) > + return IB_MTU_4096; > + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_2048)) > + return IB_MTU_2048; > + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_1024)) > + return IB_MTU_1024; > + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_512)) > + return IB_MTU_512; > + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_256)) > + return IB_MTU_256; > + else > + return 0; > +} > + > +static inline int rdmaoe_get_rate(struct net_device *dev) > +{ > + struct ethtool_cmd cmd; > + > + if (!dev->ethtool_ops || !dev->ethtool_ops->get_settings || > + dev->ethtool_ops->get_settings(dev, &cmd)) > + return IB_RATE_PORT_CURRENT; > + > + if (cmd.speed >= 40000) > + return IB_RATE_40_GBPS; > + else if (cmd.speed >= 30000) > + return IB_RATE_30_GBPS; > + else if (cmd.speed >= 20000) > + return IB_RATE_20_GBPS; > + else if (cmd.speed >= 10000) > + return IB_RATE_10_GBPS; > + else > + return IB_RATE_PORT_CURRENT; > +} > + > +static inline int rdma_link_local_addr(struct in6_addr *addr) > +{ > + if (addr->s6_addr32[0] == cpu_to_be32(0xfe800000) && > + addr->s6_addr32[1] == 0) > + return 1; > + else > + return 0; > +} > + > +static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac) > +{ > + memcpy(mac, &addr->s6_addr[8], 3); > + memcpy(mac + 3, &addr->s6_addr[13], 3); > + mac[0] ^= 2; > +} > + > +static inline int rdma_is_multicast_addr(struct 
in6_addr *addr)
> +{
> +       return addr->s6_addr[0] == 0xff ? 1 : 0;
> +}
> +
> +static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac)
> +{
> +       memset(mac, 0xff, 6);
> +}
> +
>  #endif /* IB_ADDR_H */
> --
> 1.6.3.3
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

From hal.rosenstock at gmail.com Mon Aug 10 07:01:54 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 10 Aug 2009 10:01:54 -0400
Subject: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for RDMAoE ports
In-Reply-To: <20090807032901.GB20589@mtls03>
References: <20090805082910.GE5599@mtls03> <376E5C8569F4456FBDD942F907DF919A@amr.corp.intel.com> <20090807032901.GB20589@mtls03>
Message-ID:

On Thu, Aug 6, 2009 at 11:29 PM, Eli Cohen wrote:

> On Thu, Aug 06, 2009 at 11:05:47AM -0700, Sean Hefty wrote:
> >
> > Is there a need to expose QP1 to user space? The CM is in the kernel, and
> > there's not an SA.
> >
>
> Good point. There seems to be no reason to expose it.

Might there be some GS service to expose ? Vendor MADs perhaps ? If not, then not exposing QP1 should be OK.

-- Hal

> Will fix.
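[Editorial sketch] The rdmaoe_mac_to_ll() and rdma_get_ll_mac() helpers in the [PATCHv4 06/10] diff quoted earlier build a link-local GID from a MAC address using what looks like the usual modified EUI-64 layout (fe80::/64 prefix, ff:fe inserted in the middle, universal/local bit flipped). The following is a minimal user-space sketch of that mapping, not the kernel code itself; the function names mac_to_ll_gid() and ll_gid_to_mac() are hypothetical and only mirror the byte operations visible in the patch.

```c
#include <stdint.h>
#include <string.h>

/*
 * Build a 16-byte link-local GID from a 6-byte MAC, following the byte
 * layout used by rdmaoe_mac_to_ll() in the quoted patch.
 * (Hypothetical user-space helper, for illustration only.)
 */
static void mac_to_ll_gid(uint8_t gid[16], const uint8_t mac[6])
{
    memset(gid, 0, 16);
    gid[0] = 0xfe;                 /* fe80:0000:0000:0000 prefix */
    gid[1] = 0x80;
    memcpy(gid + 8, mac, 3);       /* first three MAC bytes (OUI) */
    gid[8] ^= 0x02;                /* flip the universal/local bit */
    gid[11] = 0xff;                /* ff:fe filler between the halves */
    gid[12] = 0xfe;
    memcpy(gid + 13, mac + 3, 3);  /* last three MAC bytes */
}

/* Inverse mapping, mirroring rdma_get_ll_mac() in the quoted patch. */
static void ll_gid_to_mac(uint8_t mac[6], const uint8_t gid[16])
{
    memcpy(mac, gid + 8, 3);
    memcpy(mac + 3, gid + 13, 3);
    mac[0] ^= 0x02;                /* undo the universal/local bit flip */
}
```

Under these assumptions a MAC of 00:02:c9:01:02:03 maps to the GID fe80::0202:c9ff:fe01:0203, and ll_gid_to_mac() recovers the original MAC from it.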
From gmpc at sanger.ac.uk Mon Aug 10 07:30:27 2009
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Mon, 10 Aug 2009 15:30:27 +0100
Subject: [ofa-general] ofed kernel config.mk / BACKPORT_INCLUDES
Message-ID: <4A802F03.2000507@sanger.ac.uk>

Hi all,

I am trying to build lustre 1.8.1 against OFED 1.4.2 and have uncovered a couple of bugs regarding how BACKPORT_INCLUDES is handled in the ofa-kernel config.mk file.

The ofed_patch.sh script in the ofa_kernel package is incorrectly escaped, and results in a mangled BACKPORT_INCLUDES path.

The lustre ./configure script is also broken, and prepends an extra "-I" in front of the BACKPORT_INCLUDES path.

Patches for both are attached.

Cheers,

Guy

--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed_patch.diff
Type: text/x-patch
Size: 563 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configure.patch
Type: text/x-patch
Size: 480 bytes
Desc: not available
URL:

From rgmiller at ornl.gov Mon Aug 10 07:37:25 2009
From: rgmiller at ornl.gov (Miller, Ross G.)
Date: Mon, 10 Aug 2009 10:37:25 -0400
Subject: [ofa-general] Baseboard Management API
Message-ID:

I read the posts back in March where the bm_call_via API was discussed and accepted. I'm trying to write a simple utility that uses that function to query several IB switches, but all I get back is an error code. Has anyone else used this function, and is there any sample code available that I could reference?

Thanks very much,

Ross G.
Miller
Systems Integration Programmer
National Center for Computational Sciences
Oak Ridge National Laboratory

From hal.rosenstock at gmail.com Mon Aug 10 07:49:01 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 10 Aug 2009 10:49:01 -0400
Subject: [ofa-general] Baseboard Management API
In-Reply-To:
References:
Message-ID:

On Mon, Aug 10, 2009 at 10:37 AM, Miller, Ross G. wrote:

> I read the posts back in March where the bm_call_via API was discussed and
> accepted. I'm trying to write a simple utility that uses that function to
> query several IB switches,

Do those switches have BMAs ?

> but all I get back is an error code.

What error code ?

> Has anyone else used this function, and is there any sample code
> available that I could reference?

AFAIK no code using this has been posted but you can look at ibping or vendstat which use vendor MADs but should be similar.

-- Hal

>
> Thanks very much,
>
> Ross G. Miller
> Systems Integration Programmer
> National Center for Computational Sciences
> Oak Ridge National Laboratory

From eli at dev.mellanox.co.il Mon Aug 10 07:56:54 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Mon, 10 Aug 2009 17:56:54 +0300
Subject: [ofa-general] [PATCHv4 0/10] RDMAoE support
In-Reply-To:
References: <20090805082751.GA5599@mtls03>
Message-ID: <20090810145654.GA8688@mtls03>

On Mon, Aug 10, 2009 at 09:45:58AM -0400, Hal Rosenstock wrote:
>
> How is port configuration (RDMAoE v. IB) accomplished ? Is it prior to boot
> time or dynamic ?
>

mlx4 allows changing port designation dynamically by writing to the sysfs.
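[Editorial sketch] The reply above does not name the sysfs attribute. As an assumption, mlx4_core exposes one file per port under the PCI device directory (for example /sys/bus/pci/devices/<PCI address>/mlx4_port1) that accepts "ib" or "eth" followed by a newline, the same bytes that echo would write. The helper name set_port_type() below is hypothetical; this is a minimal sketch of writing such an attribute from C, not part of the patch series.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write a port type ("ib" or "eth") to a sysfs-style attribute file.
 * attr_path is assumed to be something like
 * /sys/bus/pci/devices/<PCI address>/mlx4_port1 (an assumption, see above).
 * Returns 0 on success, -1 on any failure.
 */
static int set_port_type(const char *attr_path, const char *type)
{
    FILE *f;
    int ok;

    /* Reject anything other than the two types discussed in the thread. */
    if (strcmp(type, "ib") != 0 && strcmp(type, "eth") != 0)
        return -1;

    f = fopen(attr_path, "w");
    if (!f)
        return -1;

    /* Write the value with a trailing newline, as "echo eth > file" would. */
    ok = fprintf(f, "%s\n", type) > 0;
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A caller would typically run this as root, e.g. set_port_type(path, "eth"), and then check that the corresponding netdevice appears; whether the switch succeeds depends on the driver and firmware.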
From hal.rosenstock at gmail.com Mon Aug 10 08:03:36 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 10 Aug 2009 11:03:36 -0400 Subject: [ofa-general] [PATCHv4 0/10] RDMAoE support In-Reply-To: <20090810145654.GA8688@mtls03> References: <20090805082751.GA5599@mtls03> <20090810145654.GA8688@mtls03> Message-ID: On Mon, Aug 10, 2009 at 10:56 AM, Eli Cohen wrote: > On Mon, Aug 10, 2009 at 09:45:58AM -0400, Hal Rosenstock wrote: > > > > How is port configuration (RDMAoE v. IB) accomplished ? Is it prior to > boot > > time or dynamic ? > > > > mlx4 allows changing port designation dynamically by writing to the > sysfs. > Nice feature :-) I think that currently some of the kernel components (in addition to user space handling) will need some change to support this. I don't think that was included in your patch series (unless I missed it which is entirely possible). -------------- next part -------------- An HTML attachment was scrubbed... URL: From brian at sun.com Mon Aug 10 08:17:32 2009 From: brian at sun.com (Brian J. Murrell) Date: Mon, 10 Aug 2009 11:17:32 -0400 Subject: [ofa-general] ofed kernel config.mk / BACKPORT_INCLUDES In-Reply-To: <4A802F03.2000507@sanger.ac.uk> References: <4A802F03.2000507@sanger.ac.uk> Message-ID: <1249917452.7132.192.camel@pc.interlinx.bc.ca> [ Any further responses to this thread should drop the general at lists.openfabrics.org list. I only kept it here so that anyone on that list that wishes to follow this thread knows that it will not be continued on the openfabrics list ] On Mon, 2009-08-10 at 15:30 +0100, Guy Coates wrote: > Hi all, Hi Guy, > The lustre ./configure script is also broken, and prepends and extra > "-I" > infront of the BACKPORT_INCLUDES path. I don't think so, but xtrace output from configure would verify. 
> --- lustre-1.8.1/configure 2009-07-24 23:28:51.000000000 +0100 > +++ configure 2009-08-10 15:08:22.316488430 +0100 > @@ -5595,7 +5595,7 @@ > fi > if test -n "$BACKPORT_INCLUDES"; then > OFED_BACKPORT_PATH=`echo $BACKPORT_INCLUDES | > sed "s#.*/src/ofa_kernel/#$O2IBPATH/#"` > - EXTRA_LNET_INCLUDE="-I$OFED_BACKPORT_PATH > $EXTRA_LNET_INCLUDE" > + EXTRA_LNET_INCLUDE="$OFED_BACKPORT_PATH > $EXTRA_LNET_INCLUDE" > echo "$as_me:$LINENO: result: yes" >&5 > echo "${ECHO_T}yes" >&6 > else Notice that it's "-I$OFED_BACKPORT_PATH" that we are adding to $EXTRA_LNET_INCLUDE, not "-I$BACKPORT_INCLUDES". Further notice what $OFED_BACKPORT_PATH actually is: OFED_BACKPORT_PATH=`echo $BACKPORT_INCLUDES | sed "s#.*/src/ofa_kernel/#$O2IBPATH/#"` Your patch failed to include enough context, but notice that configure sources config.mk: . $O2IBPATH/config.mk and then uses the $BACKPORT_INCLUDES to derive an $OFED_BACKPORT_PATH with a sed expression: sed "s#.*/src/ofa_kernel/#$O2IBPATH/#" Which would turn an example $BACKPORT_INCLUDES of "-I/usr/src/ofa_kernel/kernel_addons/backport/2.6.18-EL5.3/include/" into "foobar/kernel_addons/backport/2.6.18-EL5.3/include/" assuming $O2IBPATH="foobar". So as you can see, when adding $OFED_BACKPORT_PATH to gcc as an include path, you need to prefix it with "-I". Please do let me know if your experience is any different from my explanation. Please include some xtrace output if so. b. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From gmpc at sanger.ac.uk Mon Aug 10 08:35:04 2009 From: gmpc at sanger.ac.uk (Guy Coates) Date: Mon, 10 Aug 2009 16:35:04 +0100 Subject: [ofa-general] ofed kernel config.mk / BACKPORT_INCLUDES In-Reply-To: <1249917452.7132.192.camel@pc.interlinx.bc.ca> References: <4A802F03.2000507@sanger.ac.uk> <1249917452.7132.192.camel@pc.interlinx.bc.ca> Message-ID: <4A803E28.7010504@sanger.ac.uk> Hi Brian; A fresh build of the ofa-kernel package gives the following in the build dir: cat config.mk BACKPORT_INCLUDES=-I${CWD}/kernel_addons/backport/2.6.22/include/ That is obviously wrong. If I run the lustre configure, I get: ./configure --with-o2ib=/usr/src/modules/ofa-kernel --with-linux=/scratch/linux-2.6.22.19 checking whether to enable OpenIB gen2 support... no configure: error: can't compile with OpenIB gen2 headers under /usr/src/modules/ofa-kernel config.log has the following set for LNET_INCLUDES: EXTRA_LNET_INCLUDE='-I-I/kernel_addons/backport/2.6.22/include/ -I/usr/src/modules/ofa-kernel/include' which results in: configure:6885: cp conftest.c build && make -d modules CC=gcc -f /tmp/lustre-1.8.1/build/Makefile LUSTRE_LINUX_CONFIG=/scratch/linux-2.6.22.19/.config LINUXINCLUDE=-I-I/usr/src/modules/ker nel_addons/backport/2.6.22/include/ -I/usr/src/modules/ofa-kernel/include -I/scratch/linux-2.6.22.19/include -I/scratch/linux-2.6.22.19/include -I/scratch/linux-2.6.22.19/include2 -include include/linux/autoconf.h -o tmp_include_depends -o scripts -o include/config/MARKER -C /scratch/linux-2.6.22.19 EXTRA_CFLAGS=-Werror-implicit-function-declaration -g -I/tmp/lustre-1.8.1/lne t/include -I/tmp/lustre-1.8.1/lnet/include -I/tmp/lustre-1.8.1/lustre/include -I/usr/src/modules/ofa-kernel/include M=/tmp/lustre-1.8.1/build In file included from /usr/src/modules/ofa-kernel/include/rdma/ib_addr.h:41, from /usr/src/modules/ofa-kernel/include/rdma/rdma_cm.h:39, from 
/tmp/lustre-1.8.1/build/conftest.c:36: /usr/src/modules/ofa-kernel/include/rdma/ib_verbs.h:1724: warning: 'struct dma_attrs' declared inside parameter list /usr/src/modules/ofa-kernel/include/rdma/ib_verbs.h:1724: warning: its scope is only this definition or declaration, which is probably not what you want If I fix config.mk so that the correct path is present: cat config.mk BACKPORT_INCLUDES=-I/usr/src/modules/kernel_addons/backport/2.6.22/include/ configure still fails with: checking whether to enable OpenIB gen2 support... no configure: error: can't compile with OpenIB gen2 headers under /usr/src/modules/ofa-kernel config.log has the following incorrect EXTRA_LNET_INCLUDE: EXTRA_LNET_INCLUDE='-I-I/usr/src/modules/kernel_addons/backport/2.6.22/include/ -I/usr/src/modules/ofa-kernel/include' With the two patches previously sent, everything builds (modulo a separate bug in the OFED 2.6.22 backport includes). Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From rgmiller at ornl.gov Mon Aug 10 08:44:08 2009 From: rgmiller at ornl.gov (Miller, Ross G.) Date: Mon, 10 Aug 2009 11:44:08 -0400 Subject: [ofa-general] Baseboard Management API In-Reply-To: Message-ID: Supposedly, the switches have BMA's, but I guess I'd better go check with the sysadmins and find out for sure. Regarding the error code: If I set the method to IB_MAD_METHOD_GET and the attrid to IB_BM_ATTR_BKEYINFO, then I get an error out of mad_rpc saying the status field was 0xC. If I'm reading the architecture spec right, 0xC means that method/attrid combination isn't supported. 
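[Editorial sketch] The reading of status 0xC above matches the common MAD status layout as I understand it from the InfiniBand Architecture specification: bit 0 is "busy", bit 1 is "redirection required", and bits 2-4 carry an invalid-field code, where code 3 means the method/attribute combination is not supported. The helper names below are hypothetical and the status is assumed to be the 16-bit value in host byte order; this is a small decoding sketch, not libibmad code.

```c
#include <stdint.h>

/* Extract the 3-bit invalid-field code from bits 2-4 of the MAD status. */
static unsigned mad_invalid_field_code(uint16_t status)
{
    return (status >> 2) & 0x7;
}

/* Bit 0: the responder is busy. */
static int mad_status_busy(uint16_t status)
{
    return status & 0x1;
}

/* Bit 1: redirection required. */
static int mad_status_redirect(uint16_t status)
{
    return (status >> 1) & 0x1;
}

/* Map the invalid-field code to a short description. */
static const char *mad_invalid_field_str(uint16_t status)
{
    switch (mad_invalid_field_code(status)) {
    case 0: return "no invalid fields";
    case 1: return "bad version";
    case 2: return "method not supported";
    case 3: return "method/attribute combination not supported";
    case 7: return "invalid value in attribute or attribute modifier";
    default: return "reserved";
    }
}
```

With these definitions, 0xC has neither the busy nor the redirect bit set and decodes to code 3, which is consistent with the interpretation given in the message.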
If I try IB_MAD_SEND and IB_BM_ATTR_GET_MODULE_STATUS, I receive no errors back, but I also receive no data.

FWIW: The code I've written is based loosely on vendstat.c, but for simplicity I've stripped it down quite a bit and just hard-coded the parameters (method, attrid, LID, etc...)

I'm trying to write a simple utility that will query the status of the redundant power supplies on the switches so our admins don't have to physically check each morning looking for failures. Byte 5 of IB_BM_ATTR_GET_MODULE_STATUS should tell me exactly what I need, I think. The only reason for trying the BKEYINFO was just to see if I could get anything to work at all.

Thanks,

Ross G. Miller
Systems Integration Programmer
National Center for Computational Sciences
Oak Ridge National Laboratory

On 8/10/09 10:49 AM, "Hal Rosenstock" wrote:

On Mon, Aug 10, 2009 at 10:37 AM, Miller, Ross G. wrote:

I read the posts back in March where the bm_call_via API was discussed and accepted. I'm trying to write a simple utility that uses that function to query several IB switches,

Do those switches have BMAs ?

but all I get back is an error code.

What error code ?

Has anyone else used this function, and is there any sample code available that I could reference?

AFAIK no code using this has been posted but you can look at ibping or vendstat which use vendor MADs but should be similar.

-- Hal

Thanks very much,

Ross G.
Miller
Systems Integration Programmer
National Center for Computational Sciences
Oak Ridge National Laboratory

From marcin.slusarz at gmail.com Mon Aug 10 09:07:25 2009
From: marcin.slusarz at gmail.com (Marcin Slusarz)
Date: Mon, 10 Aug 2009 18:07:25 +0200
Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once
In-Reply-To:
References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com> <1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com>
Message-ID: <4A8045BD.8010803@gmail.com>

Roland Dreier wrote:
>  > drivers/infiniband/hw/cxgb3/iwch.c | 4 +---
>  > drivers/infiniband/hw/mlx4/main.c | 6 +-----
>
>  > --- a/drivers/infiniband/hw/mlx4/main.c
>  > +++ b/drivers/infiniband/hw/mlx4/main.c
>  > @@ -540,15 +540,11 @@ static struct device_attribute *mlx4_class_attributes[] = {
>  >
>  >  static void *mlx4_ib_add(struct mlx4_dev *dev)
>  >  {
>  > -       static int mlx4_ib_version_printed;
>  >         struct mlx4_ib_dev *ibdev;
>  >         int num_ports = 0;
>  >         int i;
>  >
>  > -       if (!mlx4_ib_version_printed) {
>  > -               printk(KERN_INFO "%s", mlx4_ib_version);
>  > -               ++mlx4_ib_version_printed;
>  > -       }
>  > +       printk_once(KERN_INFO "%s", mlx4_ib_version);
>  >
>  >         mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
>  >                 num_ports++;
>
> Looks fine but there is near-identical code in
> drivers/infiniband/hw/mthca/mthca_main.c that you might as well convert
> too.

Thanks for a hint. Updated patch below.
--- From: Marcin Slusarz Date: Mon, 10 Aug 2009 18:01:49 +0200 Subject: [PATCH 10/14 v2] infiniband: use printk_once Signed-off-by: Marcin Slusarz Cc: Roland Dreier Cc: Sean Hefty Cc: Hal Rosenstock Cc: general at lists.openfabrics.org --- drivers/infiniband/hw/cxgb3/iwch.c | 4 +--- drivers/infiniband/hw/mlx4/main.c | 6 +----- drivers/infiniband/hw/mthca/mthca_main.c | 6 +----- 3 files changed, 3 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c index 26fc0a4..9cc99df 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.c +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -105,11 +105,9 @@ static void rnic_init(struct iwch_dev *rnicp) static void open_rnic_dev(struct t3cdev *tdev) { struct iwch_dev *rnicp; - static int vers_printed; PDBG("%s t3cdev %p\n", __func__, tdev); - if (!vers_printed++) - printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n", + printk_once(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n", DRV_VERSION); rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp)); if (!rnicp) { diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index ae3d759..0b2f77a 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -540,15 +540,11 @@ static struct device_attribute *mlx4_class_attributes[] = { static void *mlx4_ib_add(struct mlx4_dev *dev) { - static int mlx4_ib_version_printed; struct mlx4_ib_dev *ibdev; int num_ports = 0; int i; - if (!mlx4_ib_version_printed) { - printk(KERN_INFO "%s", mlx4_ib_version); - ++mlx4_ib_version_printed; - } + printk_once(KERN_INFO "%s", mlx4_ib_version); mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) num_ports++; diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 13da9f1..2e4e043 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -1215,15 +1215,11 @@ int __mthca_restart_one(struct pci_dev 
*pdev) static int __devinit mthca_init_one(struct pci_dev *pdev, const struct pci_device_id *id) { - static int mthca_version_printed = 0; int ret; mutex_lock(&mthca_device_mutex); - if (!mthca_version_printed) { - printk(KERN_INFO "%s", mthca_version); - ++mthca_version_printed; - } + printk_once(KERN_INFO "%s", mthca_version); if (id->driver_data >= ARRAY_SIZE(mthca_hca_table)) { printk(KERN_ERR PFX "%s has invalid driver data %lx\n", -- 1.6.3.3 From nashwath at gmail.com Mon Aug 10 09:11:22 2009 From: nashwath at gmail.com (Ashwath Narasimhan) Date: Mon, 10 Aug 2009 12:11:22 -0400 Subject: [ofa-general] Manipulating Credits in Infiniband Message-ID: Hi, I looked into the infiniband driver files. As I understand, in order to limit the data rate we manipulate the credits on either end. Since the number of credits available depends on the receiver's work receive queue size, I decided to limit the queue size to say 5 instead of 8192 (reference---> ipoib.h, IPOIB_MAX_QUEUE_SIZE to say 3 since my higher layer protocol is ipoib). I just want to confirm if I am doing the right thing? -- regards, Ashwath From mdidomenico4 at gmail.com Mon Aug 10 09:12:45 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Mon, 10 Aug 2009 12:12:45 -0400 Subject: [ofa-general] sun x4100 with IB In-Reply-To: References: <1249684671.13945.68.camel@rockymtn.cumminsconsultants.com> Message-ID: The cards are MT23108. ibdiagnet -lw 4x -ls 2.5 shows no fabric errors, also confirmed on my silverstorm switch. The rack contains X4100 and X4100 M2 servers. The M2 servers do not exhibit this behavior; they're showing 750MB/sec local loopback on an IMB pingpong run. Errata 56CLK is enabled in the bios and mostly all other PCI-X or HT Bridge settings are set at their defaults (i.e. auto). I fiddled with a few of the settings, but they made it neither worse nor better.
So i'm really not sure what option got changed when i reset the bios to optimal defaults, but i wish i did... On Sat, Aug 8, 2009 at 10:22 AM, Michael Di Domenico wrote: > Yes, its an infinihost III, i believe its MT23208, but dont quote me > on that, i'm not at the machine currently > > Is there something specific you want to see in lspci -vvv, i can't > easily cut and paste from the machine > > On Fri, Aug 7, 2009 at 6:37 PM, Robert Cummins wrote: >> Can you send the output from lspci -vvv? What card are you using? Is >> it an Infinihost III SDR card? What does ibdiagnet -lw 4x -ls 5 return? >> >> On Fri, 2009-08-07 at 17:37 -0400, Michael Di Domenico wrote: >>> I have several Sun x4100 with Infiniband servers which appear to be >>> running at 400MB/sec instead of 800MB/sec. It's a freshly reformatted >>> cluster converting from solaris to linux. We also reset the bios >>> settings with "load optimal defaults". Does anyone know which bios >>> setting I changed to dump the BW? >>> >>> x4100 >>> mellanox ib >>> ofed-1.4.1-rc6 w/ openmpi >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > From brian at sun.com Mon Aug 10 09:28:25 2009 From: brian at sun.com (Brian J. 
Murrell) Date: Mon, 10 Aug 2009 12:28:25 -0400 Subject: [ofa-general] ofed kernel config.mk / BACKPORT_INCLUDES In-Reply-To: <4A803E28.7010504@sanger.ac.uk> References: <4A802F03.2000507@sanger.ac.uk> <1249917452.7132.192.camel@pc.interlinx.bc.ca> <4A803E28.7010504@sanger.ac.uk> Message-ID: <1249921705.7132.306.camel@pc.interlinx.bc.ca> On Mon, 2009-08-10 at 16:35 +0100, Guy Coates wrote: > Hi Brian; Hi Guy, > cat config.mk > BACKPORT_INCLUDES=-I${CWD}/kernel_addons/backport/2.6.22/include/ I believe that is wrong and is a result of the first patch in your previous e-mail. Certainly in the 1.4.1 build I did here for all of my testing, I have: $ cat config.mk BACKPORT_INCLUDES=-I/usr/src/ofa_kernel/kernel_addons/backport/2.6.18-EL5.3/include/ And of course, once the first issue is fixed, your second issue, with the lustre configure script, will go away. > That is obviously wrong. > > > If I run the lustre configure, I get: > > > ./configure --with-o2ib=/usr/src/modules/ofa-kernel > --with-linux=/scratch/linux-2.6.22.19 > > > checking whether to enable OpenIB gen2 support... no > configure: error: can't compile with OpenIB gen2 headers under > /usr/src/modules/ofa-kernel Of course you do, because the config.mk is wrong. > EXTRA_LNET_INCLUDE='-I-I/kernel_addons/backport/2.6.22/include/ > -I/usr/src/modules/ofa-kernel/include' Right. Because the sed failed to accomplish its replacement and took the value from the config.mk verbatim. As I said, once the root issue with config.mk is fixed, the lustre configure issue will also resolve. > If I fix config.mk so that the correct path is present: > > > cat config.mk > BACKPORT_INCLUDES=-I/usr/src/modules/kernel_addons/backport/2.6.22/include/ ^^^^^^^ That's because you are relocating the sources during your ofa_kernel build to something other than the default. The code in the lustre configure is assuming the default location. Arguably lustre's configure should handle this. Please file a bug.
> With the two patches previously sent, everything builds For you. It still does not accomplish the goals of the original design of that code in the configure script. But the lustre configure discussion really does not belong on this list. After you have filed your bug you should summarize in a followup to lustre-discuss (removing this list from your followup) given that it was not included on the CC list of this message. b. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From sean.hefty at intel.com Mon Aug 10 09:32:25 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 10 Aug 2009 09:32:25 -0700 Subject: [ofa-general] [PATCHv4 04/10] IB/umad: Enable support for RDMAoE ports In-Reply-To: References: <20090805082910.GE5599@mtls03> <376E5C8569F4456FBDD942F907DF919A@amr.corp.intel.com> <20090807032901.GB20589@mtls03> Message-ID: <1BE4788B64784417A2043CE63F6B09A3@amr.corp.intel.com> >Might there be some GS service to expose ? Vendor MADs perhaps ? If not, then >not exposing QP1 should be OK. At some point, exposing QP1 may make sense. I was thinking more along the lines of limiting the user space interfaces until things can be standardized. From rdreier at cisco.com Mon Aug 10 10:42:18 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Aug 2009 10:42:18 -0700 Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once In-Reply-To: <200908100936.26963.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 10 Aug 2009 09:36:26 +0300") References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com> <1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com> <200908100936.26963.jackm@dev.mellanox.co.il> Message-ID: > I'm a bit nervous about this one. > printk_once will print once ONLY if CONFIG_PRINTK is set in include/linux/autoconf.h > (i.e., when the kernel is configured). 
Otherwise, it gets defined to printk -- > and it will always print in this case. > (see 2.6.30.xx kernel include file "include/linux/kernel.h", lines 235, 249, and 272). Umm... if CONFIG_PRINTK is turned off nothing prints, right? > Do you think that distributions will ALWAYS have CONFIG_PRINTK defined? Yes, I suspect they do want to get kernel messages. - R. From sean.hefty at intel.com Mon Aug 10 12:31:19 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 10 Aug 2009 12:31:19 -0700 Subject: [ofa-general] [PATCHv4 06/10] ib_core: CMA device binding In-Reply-To: <20090805082929.GG5599@mtls03> References: <20090805082929.GG5599@mtls03> Message-ID: <55EE694A802442EA9F05B1007A044E02@amr.corp.intel.com> >@@ -576,10 +586,16 @@ static int cma_ib_init_qp_attr(struct rdma_id_private >*id_priv, > { > struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; > int ret; >+ u16 pkey; >+ >+ if (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num) >== nit: It looks like the if is indented by spaces, instead of a tab. 
>+static int cma_resolve_rdmaoe_route(struct rdma_id_private *id_priv) >+{ >+ struct rdma_route *route = &id_priv->id.route; >+ struct rdma_addr *addr = &route->addr; >+ struct cma_work *work; >+ int ret; >+ struct sockaddr_in *src_addr = (struct sockaddr_in *)&route- >>addr.src_addr; >+ struct sockaddr_in *dst_addr = (struct sockaddr_in *)&route- >>addr.dst_addr; >+ >+ if (src_addr->sin_family != dst_addr->sin_family) >+ return -EINVAL; >+ >+ work = kzalloc(sizeof *work, GFP_KERNEL); >+ if (!work) >+ return -ENOMEM; >+ >+ work->id = id_priv; >+ INIT_WORK(&work->work, cma_work_handler); >+ >+ route->path_rec = kzalloc(sizeof *route->path_rec, GFP_KERNEL); >+ if (!route->path_rec) { >+ ret = -ENOMEM; >+ goto err; >+ } >+ >+ route->num_paths = 1; >+ >+ rdmaoe_mac_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr); >+ rdmaoe_mac_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr); >+ >+ route->path_rec->hop_limit = 2; >+ route->path_rec->reversible = 1; >+ route->path_rec->pkey = cpu_to_be16(0xffff); >+ route->path_rec->mtu_selector = 2; >+ route->path_rec->mtu = rdmaoe_get_mtu(addr->dev_addr.src_dev->mtu); >+ route->path_rec->rate_selector = 2; >+ route->path_rec->rate = rdmaoe_get_rate(addr->dev_addr.src_dev); >+ route->path_rec->packet_life_time_selector = 2; >+ route->path_rec->packet_life_time = RDMAOE_PACKET_LIFETIME; >+ >+ work->old_state = CMA_ROUTE_QUERY; >+ work->new_state = CMA_ROUTE_RESOLVED; >+ if (!route->path_rec->mtu || !route->path_rec->rate) { >+ work->event.event = RDMA_CM_EVENT_ROUTE_ERROR; >+ work->event.status = -1; Any reason not to fail immediately here and leave the id state unchanged? 
>+ } else { >+ work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; >+ work->event.status = 0; >+ } >+ >+ queue_work(cma_wq, &work->work); >+ >+ return 0; >+ >+err: >+ kfree(work); >+ return ret; >+} >+ > int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) > { > struct rdma_id_private *id_priv; >@@ -1744,6 +1824,9 @@ int rdma_resolve_route(struct rdma_cm_id *id, int >timeout_ms) > case RDMA_TRANSPORT_IWARP: > ret = cma_resolve_iw_route(id_priv, timeout_ms); > break; >+ case RDMA_TRANSPORT_RDMAOE: >+ ret = cma_resolve_rdmaoe_route(id_priv); >+ break; > default: > ret = -ENOSYS; > break; >@@ -2419,6 +2502,7 @@ int rdma_connect(struct rdma_cm_id *id, struct >rdma_conn_param *conn_param) > > switch (rdma_port_get_transport(id->device, id->port_num)) { > case RDMA_TRANSPORT_IB: >+ case RDMA_TRANSPORT_RDMAOE: > if (cma_is_ud_ps(id->ps)) > ret = cma_resolve_ib_udp(id_priv, conn_param); > else >@@ -2532,6 +2616,7 @@ int rdma_accept(struct rdma_cm_id *id, struct >rdma_conn_param *conn_param) > > switch (rdma_port_get_transport(id->device, id->port_num)) { > case RDMA_TRANSPORT_IB: >+ case RDMA_TRANSPORT_RDMAOE: > if (cma_is_ud_ps(id->ps)) > ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS, > conn_param->private_data, >@@ -2593,6 +2678,7 @@ int rdma_reject(struct rdma_cm_id *id, const void >*private_data, > > switch (rdma_port_get_transport(id->device, id->port_num)) { > case RDMA_TRANSPORT_IB: >+ case RDMA_TRANSPORT_RDMAOE: > if (cma_is_ud_ps(id->ps)) > ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, > private_data, private_data_len); >@@ -2624,6 +2710,7 @@ int rdma_disconnect(struct rdma_cm_id *id) > > switch (rdma_port_get_transport(id->device, id->port_num)) { > case RDMA_TRANSPORT_IB: >+ case RDMA_TRANSPORT_RDMAOE: > ret = cma_modify_qp_err(id_priv); > if (ret) > goto out; >@@ -2752,6 +2839,55 @@ static int cma_join_ib_multicast(struct rdma_id_private >*id_priv, > return 0; > } > >+ >+static void rdmaoe_mcast_work_handler(struct work_struct *work) >+{ >+ struct 
rdmaoe_mcast_work *mw = container_of(work, struct >rdmaoe_mcast_work, work); >+ struct cma_multicast *mc = mw->mc; >+ struct ib_sa_multicast *m = mc->multicast.ib; >+ >+ mc->multicast.ib->context = mc; >+ cma_ib_mc_handler(0, m); >+ kfree(m); >+ kfree(mw); >+} >+ >+static int cma_rdmaoe_join_multicast(struct rdma_id_private *id_priv, >+ struct cma_multicast *mc) >+{ >+ struct rdmaoe_mcast_work *work; >+ struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; >+ >+ if (cma_zero_addr((struct sockaddr *)&mc->addr)) >+ return -EINVAL; >+ >+ work = kzalloc(sizeof *work, GFP_KERNEL); >+ if (!work) >+ return -ENOMEM; >+ >+ mc->multicast.ib = kzalloc(sizeof(struct ib_sa_multicast), GFP_KERNEL); >+ if (!mc->multicast.ib) { >+ kfree(work); >+ return -ENOMEM; >+ } nit: I'd prefer to goto a common cleanup area to make it easier to add changes in the future. >+ >+ cma_set_mgid(id_priv, (struct sockaddr *)&mc->addr, &mc->multicast.ib- >>rec.mgid); >+ mc->multicast.ib->rec.pkey = cpu_to_be16(0xffff); >+ if (id_priv->id.ps == RDMA_PS_UDP) >+ mc->multicast.ib->rec.qkey = cpu_to_be32(RDMA_UDP_QKEY); >+ mc->multicast.ib->rec.rate = rdmaoe_get_rate(dev_addr->src_dev); >+ mc->multicast.ib->rec.hop_limit = 1; >+ mc->multicast.ib->rec.mtu = rdmaoe_get_mtu(dev_addr->src_dev->mtu); Do we need to check the rate/mtu here, like in resolve route? Or should we be good since we could successfully resolve the route? Actually, can we just read the data from the path record that gets stored with the id? 
>+ rdmaoe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid); >+ work->id = id_priv; >+ work->mc = mc; >+ INIT_WORK(&work->work, rdmaoe_mcast_work_handler); >+ >+ queue_work(cma_wq, &work->work); >+ >+ return 0; >+} >+ > int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, > void *context) > { >@@ -2782,6 +2918,9 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct >sockaddr *addr, > case RDMA_TRANSPORT_IB: > ret = cma_join_ib_multicast(id_priv, mc); > break; >+ case RDMA_TRANSPORT_RDMAOE: >+ ret = cma_rdmaoe_join_multicast(id_priv, mc); >+ break; > default: > ret = -ENOSYS; > break; >@@ -2793,6 +2932,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct >sockaddr *addr, > spin_unlock_irq(&id_priv->lock); > kfree(mc); > } >+ > return ret; > } > EXPORT_SYMBOL(rdma_join_multicast); >@@ -2813,7 +2953,9 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct >sockaddr *addr) > ib_detach_mcast(id->qp, > &mc->multicast.ib->rec.mgid, > mc->multicast.ib->rec.mlid); >- ib_sa_free_multicast(mc->multicast.ib); >+ if (rdma_port_get_transport(id_priv->cma_dev->device, >id_priv->id.port_num) == >+ RDMA_TRANSPORT_IB) >+ ib_sa_free_multicast(mc->multicast.ib); > kref_put(&mc->mcref, release_mc); > return; > } >diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c >index 24d9510..c7c9e92 100644 >--- a/drivers/infiniband/core/ucma.c >+++ b/drivers/infiniband/core/ucma.c >@@ -553,7 +553,8 @@ static ssize_t ucma_resolve_route(struct ucma_file *file, > } > > static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp, >- struct rdma_route *route) >+ struct rdma_route *route, >+ enum rdma_transport_type tt) > { > struct rdma_dev_addr *dev_addr; > >@@ -561,10 +562,17 @@ static void ucma_copy_ib_route(struct >rdma_ucm_query_route_resp *resp, > switch (route->num_paths) { > case 0: > dev_addr = &route->addr.dev_addr; >- ib_addr_get_dgid(dev_addr, >- (union ib_gid *) &resp->ib_route[0].dgid); >- ib_addr_get_sgid(dev_addr, 
>- (union ib_gid *) &resp->ib_route[0].sgid); >+ if (tt == RDMA_TRANSPORT_IB) { >+ ib_addr_get_dgid(dev_addr, >+ (union ib_gid *) &resp->ib_route[0].dgid); >+ ib_addr_get_sgid(dev_addr, >+ (union ib_gid *) &resp->ib_route[0].sgid); >+ } else { >+ rdmaoe_mac_to_ll((union ib_gid *) &resp->ib_route[0].dgid, >+ dev_addr->dst_dev_addr); >+ rdmaoe_addr_get_sgid(dev_addr, >+ (union ib_gid *) &resp->ib_route[0].sgid); >+ } > resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); > break; > case 2: >@@ -589,6 +597,7 @@ static ssize_t ucma_query_route(struct ucma_file *file, > struct ucma_context *ctx; > struct sockaddr *addr; > int ret = 0; >+ enum rdma_transport_type tt; > > if (out_len < sizeof(resp)) > return -ENOSPC; >@@ -614,9 +623,11 @@ static ssize_t ucma_query_route(struct ucma_file *file, > > resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid; > resp.port_num = ctx->cm_id->port_num; >- switch (rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id- >>port_num)) { >+ tt = rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num); >+ switch (tt) { > case RDMA_TRANSPORT_IB: >- ucma_copy_ib_route(&resp, &ctx->cm_id->route); >+ case RDMA_TRANSPORT_RDMAOE: >+ ucma_copy_ib_route(&resp, &ctx->cm_id->route, tt); It seems simpler to just add a new call ucma_copy_rdmaoe_route, rather than merging those two transports into a single copy function that then branches based on the transport. 
> break; > default: > break; >diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h >index 483057b..66a848e 100644 >--- a/include/rdma/ib_addr.h >+++ b/include/rdma/ib_addr.h >@@ -39,6 +39,8 @@ > #include > #include > #include >+#include >+#include > > struct rdma_addr_client { > atomic_t refcount; >@@ -157,4 +159,89 @@ static inline void iw_addr_get_dgid(struct rdma_dev_addr >*dev_addr, > memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid); > } > >+static inline void rdmaoe_mac_to_ll(union ib_gid *gid, u8 *mac) >+{ >+ memset(gid->raw, 0, 16); >+ *((u32 *)gid->raw) = cpu_to_be32(0xfe800000); >+ gid->raw[12] = 0xfe; >+ gid->raw[11] = 0xff; >+ memcpy(gid->raw + 13, mac + 3, 3); >+ memcpy(gid->raw + 8, mac, 3); >+ gid->raw[8] ^= 2; >+} >+ >+static inline void rdmaoe_addr_get_sgid(struct rdma_dev_addr *dev_addr, >+ union ib_gid *gid) >+{ >+ rdmaoe_mac_to_ll(gid, dev_addr->src_dev_addr); >+} >+ >+static inline enum ib_mtu rdmaoe_get_mtu(int mtu) >+{ >+ /* >+ * reduce IB headers from effective RDMAoE MTU. 
28 stands for >+ * atomic header which is the biggest possible header after BTH >+ */ >+ mtu = mtu - IB_GRH_BYTES - IB_BTH_BYTES - 28; >+ >+ if (mtu >= ib_mtu_enum_to_int(IB_MTU_4096)) >+ return IB_MTU_4096; >+ else if (mtu >= ib_mtu_enum_to_int(IB_MTU_2048)) >+ return IB_MTU_2048; >+ else if (mtu >= ib_mtu_enum_to_int(IB_MTU_1024)) >+ return IB_MTU_1024; >+ else if (mtu >= ib_mtu_enum_to_int(IB_MTU_512)) >+ return IB_MTU_512; >+ else if (mtu >= ib_mtu_enum_to_int(IB_MTU_256)) >+ return IB_MTU_256; >+ else >+ return 0; >+} >+ >+static inline int rdmaoe_get_rate(struct net_device *dev) >+{ >+ struct ethtool_cmd cmd; >+ >+ if (!dev->ethtool_ops || !dev->ethtool_ops->get_settings || >+ dev->ethtool_ops->get_settings(dev, &cmd)) >+ return IB_RATE_PORT_CURRENT; >+ >+ if (cmd.speed >= 40000) >+ return IB_RATE_40_GBPS; >+ else if (cmd.speed >= 30000) >+ return IB_RATE_30_GBPS; >+ else if (cmd.speed >= 20000) >+ return IB_RATE_20_GBPS; >+ else if (cmd.speed >= 10000) >+ return IB_RATE_10_GBPS; >+ else >+ return IB_RATE_PORT_CURRENT; >+} >+ >+static inline int rdma_link_local_addr(struct in6_addr *addr) >+{ >+ if (addr->s6_addr32[0] == cpu_to_be32(0xfe800000) && >+ addr->s6_addr32[1] == 0) >+ return 1; >+ else >+ return 0; >+} just replace the 'if' with 'return' >+ >+static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac) >+{ >+ memcpy(mac, &addr->s6_addr[8], 3); >+ memcpy(mac + 3, &addr->s6_addr[13], 3); >+ mac[0] ^= 2; >+} >+ >+static inline int rdma_is_multicast_addr(struct in6_addr *addr) >+{ >+ return addr->s6_addr[0] == 0xff ? 1 : 0; >+} >+ >+static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac) >+{ >+ memset(mac, 0xff, 6); >+} I don't think we want all of these inline, in particular rdmaoe_mac_to_ll, rdmaoe_get_mtu , rdmaoe_get_rate. 
From rdreier at cisco.com Mon Aug 10 13:30:44 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Aug 2009 13:30:44 -0700 Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has not allocated In-Reply-To: <20090810084527.GA2446@mtls03> (Eli Cohen's message of "Mon, 10 Aug 2009 11:45:27 +0300") References: <20090810084527.GA2446@mtls03> Message-ID: > Looking at mlx4_write_mtt_chunk() I see that it calls > mlx4_table_find() with a pointer to single dma_addr_t - dma_handle - > while the dma addresses for the ICM memory is actually a list of > different addresses covering possibly different sizes. I think > mlx4_table_find() should be changed to support that, and then we can > use calls to dma_sync_single_for_cpu()/dma_sync_single_for_device() > with the correct dma addresses. No, I think we're careful that we write MTT ranges that don't cross a page so there shouldn't be any problem. From rdreier at cisco.com Mon Aug 10 13:32:32 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Aug 2009 13:32:32 -0700 Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has not allocated In-Reply-To: (Bart Van Assche's message of "Sat, 8 Aug 2009 19:49:22 +0200") References: Message-ID: > Has anyone ever encountered a message like the one below ? This message was > generated while booting a 2.6.30.4 kernel with CONFIG_DMA_API_DEBUG=y and > before any out-of-tree kernel modules were loaded. 
> > ------------[ cut here ]------------ > WARNING: at lib/dma-debug.c:635 check_sync+0x47c/0x4b0() > Hardware name: P5Q DELUXE > mlx4_core 0000:01:00.0: DMA-API: device driver tries to sync DMA memory it > has not allocated [device address=0x0000000139482000] [size=4096 bytes] > Modules linked in: snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel > snd_hda_codec snd_hwdep snd_pcm snd_timer snd rtc_cmos soundcore i2c_i801 > rtc_core hid_belkin mlx4_core( > +) rtc_lib sr_mod sg snd_page_alloc pcspkr button intel_agp i2c_core joydev > serio_raw cdrom usbhid hid raid456 raid6_pq async_xor async_memcpy async_tx > xor raid0 sd_mod crc_t10dif > ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd fan ide_pci_generic > ide_core ata_generic ata_piix pata_marvell ahci libata scsi_mod thermal > processor thermal_sys hwmon > Pid: 1325, comm: work_for_cpu Not tainted 2.6.30.4-scst-debug #6 > Call Trace: > [] ? check_sync+0x47c/0x4b0 > [] warn_slowpath_common+0x78/0xd0 > [] warn_slowpath_fmt+0x3c/0x40 > [] ? _spin_lock_irqsave+0x49/0x60 > [] ? check_sync+0xab/0x4b0 > [] check_sync+0x47c/0x4b0 > [] ? mark_held_locks+0x6c/0x90 > [] debug_dma_sync_single_for_cpu+0x1d/0x20 > [] mlx4_write_mtt+0x159/0x1e0 [mlx4_core] I think the problem is that there really isn't any way truly supported by the DMA API to do a partial sync on something we mapped with "map_sg". I guess we really should just give up on virtual mapping etc and use dma_map_single to map the ICM memory; I doubt it has any measurable downside, even on platforms where dma_sync_single is a NOP. - R. From rdreier at cisco.com Mon Aug 10 13:33:40 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Aug 2009 13:33:40 -0700 Subject: [ofa-general] IB kernel modules and the kobject release() method In-Reply-To: <20090808034817.GA30697@suse.de> (Greg KH's message of "Fri, 7 Aug 2009 20:48:17 -0700") References: <20090808034817.GA30697@suse.de> Message-ID: > No, it still makes sense :) So what's the fix for this? 
If even you have trouble understanding kobject lifetimes and the requirement for a release function, is there hope for anyone else? - R. From rdreier at cisco.com Mon Aug 10 13:48:28 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Aug 2009 13:48:28 -0700 Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected In-Reply-To: (Bart Van Assche's message of "Fri, 7 Aug 2009 11:58:11 +0200") References: Message-ID: > The lockdep report I obtained this morning with a 2.6.30.4 kernel and > the two patches applied has been attached to the kernel bugzilla > entry. This lockdep report was generated while testing the SRPT target > software. I have double checked that the SRPT target implementation > does not hold any spinlocks or mutexes while calling functions in the > IB core. This means that the SRPT target code cannot have caused any > of the reported lock cycles. Lockdep is not quite so simple as what you checked, but yes, in this case it does appear to be pointing a real (albeit spectacularly unlikely) deadlock in the core IB stack: ib_cm takes cm_id_priv->lock and calls ib_post_send_mad() from there, ib_mad takes mad_agent_priv->lock in another context, ib_mad takes mad_agent_priv->lock and does cancel_delayed_work(&mad_agent_priv->timed_work) (and internally cancel_delayed_work() does del_timer_sync()) finally, in another context a communication established event can occur and generate a callback (in interrupt context) to ib_cm where it takes cm_id_priv->lock So there can be a chain that deadlocks: if the timer for the timed_work is running on a CPU, and the interrupt for the communication established event occurs while the timer is running, then that interrupt handler can try to take cm_id_priv->lock. 
However on another CPU, someone could already be holding cm_id_priv->lock, have called into ib_post_send_mad(), and be spinning on mad_agent_priv->lock, while on yet another CPU, someone could be holding mad_agent_priv->lock and doing cancel_delayed_work(). And that will deadlock waiting in del_timer_sync() since the timer has been interrupted by an interrupt handler that will spin on a spinlock that is part of this chain. I'm not sure what the right fix is. It does seem to me that this should be fixed within the ib_mad module, since doing del_timer_sync() within a spinlocked region seems like the fundamental problem. However I'm not sure what the best way to rewrite the ib_mad usage is. > By the way, I noticed that while many subsystems in the Linux kernel > use event queues to report information to higher software layers, that > the IB core makes extensive use of callback functions. The combination > of nested locking and callback functions can easily lead to lock > inversion. This effect is well known in the operating system world -- > see e.g. the talk by John Ousterhout about multithreaded versus > event-driven software (http://home.pacbell.net/ouster/threads.pdf, > 1996). I'm not sure what you mean by this. What would be an example of a subsystem that uses event queues to report information? I think the design of the RDMA stack is quite parallel to most other Linux subsystems, and we don't have anything as deadlock prone as, say, the network stack's rtnl. Trying to queue events up instead of calling back from interrupt context is not all that simple, since one cannot reliably allocate memory, and one must deal with synchronization with the consuming context etc. It's probably at least as deadlock-prone to try and queue as it is to just call back.
Ousterhout's talk certainly makes sense for a certain class of userspace apps, but he explicitly says that event-driven programming only uses one CPU, and of course userspace doesn't have hard interrupt handlers or anything like that. So the kernel is more complex just because the environment it runs under is a little trickier than what the kernel provides for userspace. - R. From sean.hefty at intel.com Mon Aug 10 16:03:30 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 10 Aug 2009 16:03:30 -0700 Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected In-Reply-To: References: Message-ID: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com> >And that will deadlock waiting in del_timer_sync() since the timer has >been interrupted by an interrupt handler that will spin on a spinlock >that is part of this chain. > >I'm not sure what the right fix is. It does seem to me that this should >be fixed within the ib_mad module, since doing del_timer_sync() within a >spinlocked region seems like the fundamental problem. However I'm not >sure what the best way to rewrite the ib_mad usage is. If I followed this correctly, will moving calls to cancel_delayed_work() outside of any spinlocks fix this? (If so, it's not immediately obvious to me what the best fix is either.) - Sean From rdreier at cisco.com Mon Aug 10 18:59:03 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Aug 2009 18:59:03 -0700 Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected In-Reply-To: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com> (Sean Hefty's message of "Mon, 10 Aug 2009 16:03:30 -0700") References: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com> Message-ID: > >And that will deadlock waiting in del_timer_sync() since the timer has > >been interrupted by an interrupt handler that will spin on a spinlock > >that is part of this chain. > > > >I'm not sure what the right fix is.
It does seem to me that this should > >be fixed within the ib_mad module, since doing del_timer_sync() within a > >spinlocked region seems like the fundamental problem. However I'm not > >sure what the best way to rewrite the ib_mad usage is. > > If I followed this correctly, will moving calls to cancel_delayed_work() outside > of any spinlocks fix this? (If so, it's not immediately obvious to me what the > best fix is either.) Yes, I think that if cancel_delayed_work() and hence del_timer_sync() is outside of any other locks then there is no deadlock -- you can think of del_timer_sync() as being like a lock (which is how lockdep tracks it too). But of course we can't really do that because that leaves the timeout tracking unlocked and racy in the mad module. The best idea I can come up with so far is to move to an explicit timer in the mad module, so that we can do mod_timer() inside the lock rather than having to do the equivalent of del_timer_sync() + add_timer() (implicitly through the delayed work API). But that unfortunately is somewhat invasive surgery for the mad module... definitely doable but ideally there would be an easier way. I guess we could add a "requeue_delayed_work()" API to the kernel workqueue stuff that does mod_timer() instead of adding it, but it might be tricky to get the interface to that right. - R. From jackm at dev.mellanox.co.il Tue Aug 11 00:17:21 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 11 Aug 2009 10:17:21 +0300 Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once In-Reply-To: References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com> <200908100936.26963.jackm@dev.mellanox.co.il> Message-ID: <200908111017.22153.jackm@dev.mellanox.co.il> On Monday 10 August 2009 20:42, Roland Dreier wrote: > > > I'm a bit nervous about this one. > > printk_once will print once ONLY if CONFIG_PRINTK is set in include/linux/autoconf.h > > (i.e., when the kernel is configured). 
Otherwise, it gets defined to printk -- > > and it will always print in this case. > > (see 2.6.30.xx kernel include file "include/linux/kernel.h", lines 235, 249, and 272). > > Umm... if CONFIG_PRINTK is turned off nothing prints, right? Jiri Slaby pointed that out to me -- i.e., that printk itself is defined to do nothing but return 0 if CONFIG_PRINTK is not defined. (I missed that when looking at file kernel.h). I thought no answer was needed (sorry about that) -- Jiri was so obviously correct. I've got no problem with the patch. -Jack From jackm at dev.mellanox.co.il Tue Aug 11 00:21:01 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 11 Aug 2009 10:21:01 +0300 Subject: [ofa-general] [PATCH V2] mlx4: Do not allow ib userspace open while device is being removed Message-ID: <200908111021.01612.jackm@dev.mellanox.co.il> Userspace apps are supposed to release all ib device resources if they receive a fatal async event (IBV_EVENT_DEVICE_FATAL). However, the app has no way of knowing when the device has come back up, except to repeatedly attempt ibv_open_device() until it succeeds. However, currently there is no protection against open succeeding when the device is in the midst of the removal following the fatal event. In this case, the open will succeed, but as a result the device waits in the middle of its removal until the new app releases its ib resources -- and the new app will not do so, since the open succeeded at a point following the fatal event generation. This patch adds an "active" flag to the device. The active flag is set to false (in the fatal event flow) before the "fatal" event is generated, so any subsequent ibv_dev_open() call to the device will fail until the device comes back up, thus preventing the above deadlock. V2: move active flag from net to hw/mlx4, and use only for fatal event flow. (per feedback from Roland). 
Signed-off-by: Jack Morgenstein
---
Roland,
this is a continuation of thread:
http://lists.openfabrics.org/pipermail/general/2009-July/060668.html

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index ae3d759..4effc19 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -342,6 +342,9 @@ static struct ib_ucontext *mlx4_ib_alloc_ucontext(struct ib_device *ibdev,
 	struct mlx4_ib_alloc_ucontext_resp resp;
 	int err;
 
+	if (!dev->ib_active)
+		return ERR_PTR(-EAGAIN);
+
 	resp.qp_tab_size = dev->dev->caps.num_qps;
 	resp.bf_reg_size = dev->dev->caps.bf_reg_size;
 	resp.bf_regs_per_page = dev->dev->caps.bf_regs_per_page;
@@ -673,6 +676,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 		goto err_reg;
 	}
 
+	ibdev->ib_active = 1;
+
 	return ibdev;
 
 err_reg:
@@ -729,6 +734,7 @@ static void mlx4_ib_event(struct mlx4_dev *dev, void *ibdev_ptr,
 		break;
 
 	case MLX4_DEV_EVENT_CATASTROPHIC_ERROR:
+		ibdev->ib_active = 0;
 		ibev.event = IB_EVENT_DEVICE_FATAL;
 		break;
 
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 8a7dd67..b22df97 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -175,6 +175,7 @@ struct mlx4_ib_dev {
 	spinlock_t sm_lock;
 
 	struct mutex cap_mask_mutex;
+	int ib_active;
 };
 
 static inline struct mlx4_ib_dev *to_mdev(struct ib_device *ibdev)

From vlad at lists.openfabrics.org Tue Aug 11 03:01:35 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 11 Aug 2009 03:01:35 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090811-0200 daily build status
Message-ID: <20090811100136.03E06E4026E@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters:

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with
linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:300: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:311: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 
---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090811-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From ofedrnicuser at yahoo.com Tue Aug 11 03:58:51 2009 From: ofedrnicuser at yahoo.com (Bill N) Date: Tue, 11 Aug 2009 03:58:51 -0700 (PDT) Subject: [ofa-general] which ofed-1.4 module ensure compliance with NSMR for having same size pbl entries? Message-ID: <272817.39129.qm@web111203.mail.gq1.yahoo.com> Hi, I am trying to figure out which module (usr/kernel) ensures the compliance with RDMA iWarp verb specification 9.2.6.2 (Register Non-Shared MR)? ibv_reg_mr() of libibverbs takes address and length. So someone has to ensure that given address-length region(s) are exactly divided in to the small same size regions before it reaches the lowest layer iwarp/ib drivers. (I guess thats page size!) After looking at ib_uverbs_reg_mr() below if ((cmd.start & ~PAGE_MASK) != (cmd.hca_va & ~PAGE_MASK)) return -EINVAL; Looks like this is achieved by having page aligned start address. As start address is page aligned all the incoming blocks of memory will be of same size (provided length is multiple of page size). 1. Is caller of the ibv_reg_mr() should ensure that it allocates page aligned memory? 2. Should caller of ibv_reg_mr() ensure that length always multiple of page size? 3. Assuming yes, to Q-2, which function/module in the kernel checks that length is multiple of page size? 
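A minimal userspace sketch of what the quoted ib_uverbs_reg_mr() test computes may help with the questions above. It assumes a 4096-byte page; SKETCH_PAGE_SIZE, SKETCH_PAGE_MASK and the three helper functions are illustrative names, not kernel identifiers. As written, the test rejects a registration only when the in-page offsets of start and hca_va differ; by itself it does not require start to be page aligned or length to be a multiple of the page size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel takes PAGE_SIZE/PAGE_MASK from the
 * architecture; these SKETCH_* names are illustrative only. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Offset of an address inside its page: what `addr & ~PAGE_MASK` yields. */
static uint64_t page_offset(uint64_t addr)
{
	return addr & ~SKETCH_PAGE_MASK;
}

/* Mirrors the quoted condition: true when start and hca_va share the same
 * in-page offset (the quoted code returns -EINVAL when this is false). */
static bool same_page_offset(uint64_t start, uint64_t hca_va)
{
	return page_offset(start) == page_offset(hca_va);
}

/* Hypothetical helper: number of whole pages touched by the byte range
 * [start, start + length), showing that an unaligned range still maps
 * onto equally sized pages. */
static uint64_t pages_spanned(uint64_t start, uint64_t length)
{
	uint64_t first = start & SKETCH_PAGE_MASK;
	uint64_t last = (start + length + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK;

	return (last - first) / SKETCH_PAGE_SIZE;
}
```

For example, same_page_offset(0x1234, 0x7234) is true even though neither address is page aligned, while same_page_offset(0x1000, 0x1010) is false.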
Regards, Bill From rdreier at cisco.com Tue Aug 11 09:23:11 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Aug 2009 09:23:11 -0700 Subject: [ofa-general] Re: [PATCH V2] mlx4: Do not allow ib userspace open while device is being removed In-Reply-To: <200908111021.01612.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 11 Aug 2009 10:21:01 +0300") References: <200908111021.01612.jackm@dev.mellanox.co.il> Message-ID: > this is a continuation of thread: > http://lists.openfabrics.org/pipermail/general/2009-July/060668.html Thanks for the pointer... it lets me reload my context. I see you didn't answer the question about mthca -- does it suffer from this problem as well? - R. From rdreier at cisco.com Tue Aug 11 09:40:19 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Aug 2009 09:40:19 -0700 Subject: [ofa-general] Re: [PATCH 10/14] infiniband: use printk_once In-Reply-To: <4A8045BD.8010803@gmail.com> (Marcin Slusarz's message of "Mon, 10 Aug 2009 18:07:25 +0200") References: <1249847649-11631-1-git-send-email-marcin.slusarz@gmail.com> <1249847649-11631-11-git-send-email-marcin.slusarz@gmail.com> <4A8045BD.8010803@gmail.com> Message-ID: thanks, applied. From weiny2 at llnl.gov Tue Aug 11 10:04:28 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 11 Aug 2009 10:04:28 -0700 Subject: [ofa-general] [PATCH] libibmad: clear packet buffer correctly before formating and sending Message-ID: <20090811100428.c4fb6c5e.weiny2@llnl.gov> I found this bug a while back but forgot to submit a patch. I don't think this will affect the issues Mr. Miller was having with BM, as I believe the BM stuff he was trying all expected a response (thereby calling mad_rpc instead). But it could be worth a try. 
Ira From: Ira Weiny Date: Tue, 11 Aug 2009 10:00:25 -0700 Subject: [PATCH] libibmad: clear packet buffer correctly before formating and sending Signed-off-by: Ira Weiny --- libibmad/src/serv.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c index fad1e5b..4d557c2 100644 --- a/libibmad/src/serv.c +++ b/libibmad/src/serv.c @@ -59,7 +59,7 @@ int mad_send_via(ib_rpc_t * rpc, ib_portid_t * dport, ib_rmpp_hdr_t * rmpp, uint8_t pktbuf[1024]; void *umad = pktbuf; - memset(pktbuf, 0, umad_size()); + memset(pktbuf, 0, umad_size() + IB_MAD_SIZE); DEBUG("rmpp %p data %p", rmpp, data); -- 1.5.4.5 From hnrose at comcast.net Tue Aug 11 11:40:11 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 11 Aug 2009 14:40:11 -0400 Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: Some minor simplifications Message-ID: <20090811184011.GA5666@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 7826578..febd7f6 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -135,10 +135,8 @@ osm_qos_port_t *osm_qos_policy_port_create(osm_physp_t *p_physp) { osm_qos_port_t *p = (osm_qos_port_t *) calloc(1, sizeof(osm_qos_port_t)); - if (!p) - return NULL; - - p->p_physp = p_physp; + if (p) + p->p_physp = p_physp; return p; } @@ -149,11 +147,8 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() { osm_qos_port_group_t *p = (osm_qos_port_group_t *) calloc(1, sizeof(osm_qos_port_group_t)); - if (!p) - return NULL; - - cl_qmap_init(&p->port_map); - + if (p) + cl_qmap_init(&p->port_map); return p; } @@ -192,14 +187,12 @@ osm_qos_vlarb_scope_t *osm_qos_policy_vlarb_scope_create() { osm_qos_vlarb_scope_t *p = (osm_qos_vlarb_scope_t *) calloc(1, sizeof(osm_qos_vlarb_scope_t)); - if (!p) - return NULL; - - cl_list_init(&p->group_list, 10); - cl_list_init(&p->across_list, 10); - cl_list_init(&p->vlarb_high_list, 10); - 
cl_list_init(&p->vlarb_low_list, 10); - + if (p) { + cl_list_init(&p->group_list, 10); + cl_list_init(&p->across_list, 10); + cl_list_init(&p->vlarb_high_list, 10); + cl_list_init(&p->vlarb_low_list, 10); + } return p; } @@ -236,13 +229,11 @@ osm_qos_sl2vl_scope_t *osm_qos_policy_sl2vl_scope_create() { osm_qos_sl2vl_scope_t *p = (osm_qos_sl2vl_scope_t *) calloc(1, sizeof(osm_qos_sl2vl_scope_t)); - if (!p) - return NULL; - - cl_list_init(&p->group_list, 10); - cl_list_init(&p->across_from_list, 10); - cl_list_init(&p->across_to_list, 10); - + if (p) { + cl_list_init(&p->group_list, 10); + cl_list_init(&p->across_from_list, 10); + cl_list_init(&p->across_to_list, 10); + } return p; } @@ -276,8 +267,6 @@ osm_qos_level_t *osm_qos_policy_qos_level_create() { osm_qos_level_t *p = (osm_qos_level_t *) calloc(1, sizeof(osm_qos_level_t)); - if (!p) - return NULL; return p; } @@ -355,14 +344,12 @@ osm_qos_match_rule_t *osm_qos_policy_match_rule_create() { osm_qos_match_rule_t *p = (osm_qos_match_rule_t *) calloc(1, sizeof(osm_qos_match_rule_t)); - if (!p) - return NULL; - - cl_list_init(&p->source_list, 10); - cl_list_init(&p->source_group_list, 10); - cl_list_init(&p->destination_list, 10); - cl_list_init(&p->destination_group_list, 10); - + if (p) { + cl_list_init(&p->source_list, 10); + cl_list_init(&p->source_group_list, 10); + cl_list_init(&p->destination_list, 10); + cl_list_init(&p->destination_group_list, 10); + } return p; } From bart.vanassche at gmail.com Tue Aug 11 13:29:42 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Tue, 11 Aug 2009 22:29:42 +0200 Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected In-Reply-To: References: Message-ID: On Mon, Aug 10, 2009 at 10:48 PM, Roland Dreier wrote: > >  > The lockdep report I obtained this morning with a 2.6.30.4 kernel and >  > the two patches applied has been attached to the kernel bugzilla >  > entry. 
This lockdep report was generated while testing the SRPT target >  > software. I have double checked that the SRPT target implementation >  > does not hold any spinlocks or mutexes while calling functions in the >  > IB core. This means that the SRPT target code cannot have caused any >  > of the reported lock cycles. > > Lockdep is not quite so simple as what you checked, but yes, in this > case it does appear to be pointing a real (albeit spectacularly > unlikely) deadlock in the core IB stack: > >  ib_cm takes cm_id_priv->lock and calls ib_post_send_mad() >  from there, ib_mad takes mad_agent_priv->lock > >  in another context, ib_mad takes mad_agent_priv->lock and does >  cancel_delayed_work(&mad_agent_priv->timed_work) (and internally >  cancel_delayed_work() does del_timer_sync()) > >  finally, in another context a communication established event can >  occur and generate a callback (in interrupt context) to ib_cm where it >  takes cm_id_priv->lock > > So there can be a chain that deadlocks: if the timer for the timed_work > is running on a CPU, and the interrupt for the communication established > event occurs while the timer is running, then that interrupt handler can > try to take cm_id_priv->lock. > > However on another CPU, someone could already be holding > cm_id_priv->lock and call into ib_post_send_mad(), and spinning on > mad_agent_priv->lock, while on yet another CPU, someone could be holding > mad_agent_priv->lock and doing cancel_delayed_work(). > > And that will deadlock waiting in del_timer_sync() since the timer has > been interrupted by an interrupt handler that will spin on a spinlock > that is part of this chain. > > I'm not sure what the right fix is.  It does seem to me that this should > be fixed within the ib_mad module, since doing del_timer_sync() within a > spinlocked region seems like the fundamental problem.  However I'm not > sure what the best way to rewrite the ib_mad usage is. 
It's already good news that the potential lock cycle has been deduced
from the lockdep reports. I know that it can take a lot of work to
analyze such reports.

Even if it is really unlikely that this lock cycle would cause a
deadlock, it would be great if this lock cycle could be removed. I'm
not the only developer of kernel modules who runs tests with lockdep
enabled, and it is impractical to analyze long logfiles full of known
lock cycles to find a single lock cycle caused by newly added or
recently modified code.

>  > By the way, I noticed that while many subsystems in the Linux kernel
>  > use event queues to report information to higher software layers, that
>  > the IB core makes extensive use of callback functions. The combination
>  > of nested locking and callback functions can easily lead to lock
>  > inversion. This effect is well known in the operating system world --
>  > see e.g. the talk by John Ousterhout about multithreaded versus
>  > event-driven software (http://home.pacbell.net/ouster/threads.pdf,
>  > 1996).
>
> I'm not sure what you mean by this.  What would be an example of a
> subsystem that uses event queues to report information?  I think the
> design of the RDMA stack is quite parallel to most other Linux
> subsystems, and we don't have anything as deadlock prone as, say, the
> network stack's rtnl.

What I had in mind as an example is the netlink socket mechanism,
although this is a mechanism for sending notifications from the kernel
to userspace.

> Trying to queue events up instead of calling back from interrupt context
> is not all that simple, since one cannot reliably allocate memory, and
> one must deal with synchronization with the consuming context etc.  It's
> probably at least as deadlock-prone to try and queue as it is to just
> call back.
One possible approach when having to queue events from interrupt
context is to queue these events in a fixed size queue that has been
allocated outside interrupt context, and make it possible for the
event consumer to detect the queue overflow condition. When a queue
overflow happens it is the responsibility of the event consumer to
query the state of the event producer. This is a more complex approach
than callback functions but has the advantage that there never can be
a lock cycle involving locks of both the event producer and the event
consumer. I'm not inventing anything new here -- this is exactly how
netlink sockets work.

Bart.

From halves at linux.vnet.ibm.com Tue Aug 11 13:37:54 2009
From: halves at linux.vnet.ibm.com (Higor Aparecido Vieira Alves)
Date: Tue, 11 Aug 2009 17:37:54 -0300
Subject: [ofa-general] Chelsio cards
Message-ID: <1250023074.16631.1.camel@halves-ltc>

Hi guys,

I have a doubt, which ofed features are supported by Chelsio cards?

Thanks,

-- 
Higor Aparecido Vieira Alves
Software Engineer
Linux Technology Center
IBM Systems & Technology Group

From swise at opengridcomputing.com Tue Aug 11 14:15:42 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 11 Aug 2009 16:15:42 -0500
Subject: [ofa-general] Chelsio cards
In-Reply-To: <1250023074.16631.1.camel@halves-ltc>
References: <1250023074.16631.1.camel@halves-ltc>
Message-ID: <4A81DF7E.6020907@opengridcomputing.com>

Higor Aparecido Vieira Alves wrote:
> Hi guys,
>
> I have a doubt, which ofed features are supported by Chelsio cards?
>
> Thanks,
>
>

The following are supported via rdma on chelsio cards with ofed-1.4.x
and 1.5:

User mode:
    openmpi
    mvapich2
    udapl 1.2 and 2.0 (and thus most ULPs using udapl like Intel MPI, HP
    MPI, Scali MPI)
    rdmacm (required for connection setup)
    ibverbs (RC QP only)
    perftest (ib_rdma_bw and ib_rdma_lat only)
    qperf
    rping

Kernel mode:
    core verbs (RC QP only)
    rdmacm (required for connection setup)
    nfsrdma

Steve.

From sashak at voltaire.com Tue Aug 11 14:22:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 00:22:21 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090806181928.GA21698@comcast.net>
References: <20090806181928.GA21698@comcast.net>
Message-ID: <20090811212221.GG25501@me>

Hi Hal,

On 14:19 Thu 06 Aug , Hal Rosenstock wrote:
> @@ -136,7 +137,7 @@ static int do_ucast_file_load(void *context)
>  		OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE,
>  			"LFTs file name is not given; "
>  			"using default routing algorithm\n");
> -		return 1;
> +		return -1;

This "fix" is not correct. Routing engine method returns "> 0" value
when fallback to default is requested. In particular in case of 'file'
engine it is legal to provide only LFTs file and not provide LID matrix
file.

Sasha

From hal.rosenstock at gmail.com Tue Aug 11 14:33:20 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 11 Aug 2009 17:33:20 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090811212221.GG25501@me>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me>
Message-ID:

Hi Sasha,

On Tue, Aug 11, 2009 at 5:22 PM, Sasha Khapyorsky wrote:

> Hi Hal,
>
> On 14:19 Thu 06 Aug , Hal Rosenstock wrote:
> > @@ -136,7 +137,7 @@ static int do_ucast_file_load(void *context)
> >  		OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE,
> >  			"LFTs file name is not given; "
> >  			"using default routing algorithm\n");
> > -		return 1;
> > +		return -1;
>
> This "fix" is not correct. Routing engine method returns "> 0" value
> when fallback to default is requested. In particular in case of 'file'
> engine it is legal to provide only LFTs file and not provide LID matrix
> file.

Is it supposed to use file when no files (LFT or LID matrix) are supplied ?
That's what seems to happen (with no fallback). -- Hal > > > Sasha From worleys at gmail.com Tue Aug 11 14:52:13 2009 From: worleys at gmail.com (Chris Worley) Date: Tue, 11 Aug 2009 15:52:13 -0600 Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs In-Reply-To: References: Message-ID: On Mon, Aug 10, 2009 at 4:40 AM, Bart Van Assche wrote: > On Sun, Aug 9, 2009 at 7:09 PM, Chris Worley wrote: >> >> I'm running a target comprised of: RHEL5.2/2.6.18-92.el5 (fresh off >> the CD.. never updated) and it's embedded IB stack (not the latest >> OFED) w/ SCST rev 1029 8-Aug-2009 ("svn info"). >> >> I'm running a W2008S (fully patched) initiator w/ >> MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453. >> >> Using Mellanox QDR cards/switch. >> >> Writes over SRP, as measured from the initiator using IOMeter, get >> proper performance (i.e. 1.2GB/s). >> >> Reads get about 30% performance (i.e. 500MB/s instead of 1.6GB/s). >> And while reading, IOMeter eventually hangs the system (Windows >> becomes unresponsive to GUI interaction).  In this state, I see iostat >> reporting transfers at the same low read rate from the target... so >> there's IB traffic, but, given IOMeter's tasks are 10 minutes each, it >> acts like it's a "skipping record" (sorry of you young folks don't >> know what that is... but I can't think of another way to describe it) >> and never moving on to the next benchmark, just endlessly repeating >> the same I/O over and over again.  If I unload then reload the mlx4_ib >> driver on the target, then the Windows system quickly returns, but >> IOMeter remains hung and needs killed.
> > The throughput of the SRP protocol strongly depends on the block size used > for I/O. The results I obtained with IOmeter are: > * For a block size of 32 KB: 396 MB/s for reading and 321 MB/s for writing. > * For a block size of 1 MB: 1383 MB/s for reading and 1151 MB/s for writing. > These results are about 90% of the throughput obtained with dd. > > Setup details: > * Two Mellanox ConnectX DDR cards connected back to back, operating in PCIe > 2.0 mode. > * Target: vanilla 2.6.30.4 kernel + SCST patches + the two patches attached > to http://bugzilla.kernel.org/show_bug.cgi?id=13757 + SCST r1030. > * Initiator: openSUSE 11.0 (contains a patched 2.6.27.25 kernel) with > openSUSE 11.0 OFED components + Linux version of IOmeter's dynamo + IOmeter > GUI running in a virtual machine. > * I/O-scheduler used by SRP initiator: noop. Thanks for the recommendations (to both Bart and Joe). I setup my target exactly as you prescribe... but my initiator is still Windows (version of WInOF at top): performance as relayed by IOMeter starts high and the average slowly decreases. Watching the instantaneous throughput, there seem to be longer and longer lags of poor performance. between moments of good performance. I need to run this against a Linux initiator to see if the problems are w/ WinOF. Using OFED 1.4.1 (w/ the stock RHEL kernel) on the target, the performance was steady and getting close to acceptable. In a 15 hour test that cycles through sequential and random LBA's and R/W mixes from block sizes from 1MB to 512B, it worked well and got decent performance until it hit 1KB sequential reads which hung IOMeter; no messages on the Linux side (all looked okay). IBSRP on the Windows side just said "a reset to device was issued" every 15 to 30 seconds after the problem started. I reloaded the IB stack on the Linux side, and was able to get it restarted. Still a lot of combinations to test. Thanks, Chris > > Bart. 
> From chenyon1 at iit.edu Fri Aug 7 12:55:38 2009 From: chenyon1 at iit.edu (Yong Chen) Date: Fri, 07 Aug 2009 19:55:38 +0000 (GMT) Subject: [ofa-general] [hpc-announce] Call for Attendance: P2S2-2009 Workshop Message-ID: Dear Colleagues, The Second International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) will be held in Vienna, Austria, on Sept. 22nd, 2009 in conjunction with The 38th International Conference on Parallel Processing (ICPP-2009). The workshop program has been finalized and can be found here: http://www.mcs.anl.gov/events/workshops/p2s2/pro.html (listed below for your reference). We welcome you attend the P2S2-2009 workshop and look forward to seeing you in Vienna, Austria! =============================================================================== Session 1: Opening Time: 09:00 - 10:30, Location: Room F3 (89), Chair: Pavan Balaji, Argonne National Laboratory Opening Remarks (D. K. Panda, Pavan Balaji and Abhinav Vishnu) Invited Keynote by Dr. Pete Beckman, Argonne National Laboratory, "Challenges for System Software on Exascale Platforms" 10:30 - 11:00 Coffee Break Session 2: Software for Large-scale Systems Time: 11:00 - 12:30, Location: Room F3 (89), Chair: Tom Peterka, Argonne National Laboratory 1. "Characterizing the Performance of Big Memory on Blue Gene Linux" Kazutomo Yoshii, Kamil Iskra, P. Chris Broekema, Harish Naik and Pete Beckman 2. "Optimization of Preconditioned Parallel Iterative Solvers for Finite-Element Applications using Hybrid Parallel Programming Models on T2K Open Supercomputer (Todai Combined Cluster)" Kengo Nakajima 3. "Analyzing Checkpointing Trends for Applications on Peta-scale Systems" Harish Naik, Rinku Gupta and Pete Beckman 12:30 - 14:00 Lunch Session 3: Communication and I/O Time: 14:00 - 15:30, Location: Room F3 (89), Chair: Abhinav Vishnu, Pacific Northwest National Laboratory 1. 
"Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand" Tejus Gangadharappa, Matthew Koop and Dhabaleswar K Panda 2. "CkDirect: Unsynchronized One-Sided Communication in a Message-Driven Paradigm" Eric Bohm, Sayantan Chakravorty, Pritish Jetley, Abhinav Bhatele and Laxmikant Kale 3. "Exploiting Latent I/O Asynchrony in Petascale Science Applications" Patrick Widener, Matthew Wolf, Hasan Abbasi, Scott McManus, Mary Payne, Patrick Bridges and Karsten Schwan 4. "Gears4Net - An Asynchronous Programming Model" Martin Saternus, Torben Weis, Sebastian Holzapfel and Arno Wacker 15:30 - 16:00 Coffee Break Session 4: Software for Multicore Architectures Time: 16:00 - 17:30, Location: Room F3 (89), Chair: Ron Brightwell, Sandia National Laboratory 1. "Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms" Changjun Hu, Yali Liu and Jianjiang Li 2. "Open Source Software Support for the OpenMP Runtime API for Profiling" Oscar Hernandez, Van Bui, Richard Kufrin and Barbara Chapman 3. "Just-In-Time Renaming and Lazy Write-Back on the Cell/B.E." Pieter Bellens, Rosa Badia and Jesus Labarta From niftyompi at niftyegg.com Tue Aug 11 19:37:59 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 11 Aug 2009 19:37:59 -0700 Subject: [ofa-general] Manipulating Credits in Infiniband In-Reply-To: References: Message-ID: <20090812023759.GA3060@tosh2egg.ca.sanfran.comcast.net> On Mon, Aug 10, 2009 at 12:11:22PM -0400, Ashwath Narasimhan wrote: > > I looked into the infiniband driver files. As I understand, in order to > limit the data rate we manipulate the credits on either ends. Since the > number of credits available depends on the receiver's work receive > queue size, I decided to limit the queue size to say 5 instead of 8192 > (reference---> ipoib.h, IPOIB_MAX_QUEUE_SIZE to say 3 since my higher > layer protocol is ipoib). I just want to confirm if I am doing the > right thing? 
Data rate is not manipulated by credits. Credits and queue sizes are different and have different purposes. Visit the Infiniband Trade Association web site and grab the IB specifications to understand some of the hardware level parts. http://www.infinibandta.org/ InfiniBand offers credit based flow control and given the nature of modern IB switches and processors a very small credit count can still result in full data rate. Having said that flow control is the lowest level throttle in the system. Reducing the credit count forces the higher levels in the protocol stack to source or sink the data through the hardware before any more can be delivered. Thus flow control can simplify the implementation of higher level protocols. It can also be used to cost reduce or simplify hardware design (smaller hardware buffers). The IB specifications are way too long. Start with this FAQ. http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf The IB specification is way too full of optional features. A vendor may have XYZ working fine and dandy on one card and since it is optional not at all on another. The various queue sizes for the various protocols built on top of IB establish transfer behavior in keeping with system interrupt, system process time slice, system kernel activity loads and needs. It is counter intuitive but in some cases small queues result in more responsive and agile systems, especially in the presence of errors. Since there are often multiple protocols on the IB stack all protocols will be impacted by credit tinkering. Most vendors know their hardware so most drivers will have credit related code optimum. In the case of TCP/IP the interaction between IB bandwidths&MTU (IPoIB), ethernet bandwidth&MTU and even localhost (127.0.0.1) bandwidth&MTU can be "interesting" depending on host names, subnets, routing etc. TCP/IP has lots of tuning flags well above the IB driver. I see 500+ net.* sysctl knobs on this system. 
As you change things do make the changes on all the moving parts, benchmark and keep a log. Since there are multiple IB hardware vendors it is important to track hardware specifics. "lspci" is a good tool to gather chip info. With some cards you also need specifics about the active firmware. So go forth (RPN forever) and conquer. -- T o m M i t c h e l l Found me a new hat, now what? From rdreier at cisco.com Tue Aug 11 20:34:04 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Aug 2009 20:34:04 -0700 Subject: [ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected In-Reply-To: (Bart Van Assche's message of "Tue, 11 Aug 2009 22:29:42 +0200") References: Message-ID: > Even if it is really unlikely that this lock cycle would cause a > deadlock, it would be great if this lock cycle could be removed. I'm > not the only developer of kernel modules who runs tests with lockdep > enabled, and it is unpractical to analyze long logfiles full of known > lock cycles to find a single lock cycle caused by newly added or > recently modified code. I agree that we should fix all lockdep issues -- the impact of them is even worse than you realize, because once a single cycle is detected, lockdep must turn itself off until a reboot, because of course you can't detect new cycles in a graph that already has a cycle. > One possible approach when having to queue events from interrupt > context is to queue these events in a fixed size queue that has been > allocated outside interrupt context, and make it possible for the > event consumer to detect the queue overflow condition. When a queue > overflow happens it is the responsibility of the event consumer to > query the state of the event producer. This is a more complex approach > than callback functions but has the advantage that there never can be > a lock cycle involving locks of both the event producer and the event > consumer. 
I think in most cases dealing with queue overflow is going to lead to way more bugs than callbacks in interrupt context. Of course, when passing events on to userspace, we don't have the luxury of being able to call userspace in interrupt context, so we have to look for the next best thing. But within the kernel we can take the simpler more robust approach. - R. From alekseys at voltaire.com Tue Aug 11 23:13:35 2009 From: alekseys at voltaire.com (Aleksey Senin) Date: Wed, 12 Aug 2009 09:13:35 +0300 Subject: [ofa-general] Chelsio cards In-Reply-To: <4A81DF7E.6020907@opengridcomputing.com> References: <1250023074.16631.1.camel@halves-ltc> <4A81DF7E.6020907@opengridcomputing.com> Message-ID: <4A825D8F.8030102@voltaire.com> > > User mode: > openmpi > mvapich2 > udapl 1.2 and 2.0 (and thus most ULPs using udapl like Intel MPI, HP > MPI, Scali MPI) > rdmacm (required for connection setup) > ibverbs (RC QP only) > perftest (ib_rdma_bw and ib_rdma_lat only) > qperf > rping Another two binaries in user space ib_rdma_lat ib_rdma_bw From bart.vanassche at gmail.com Tue Aug 11 23:15:03 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Wed, 12 Aug 2009 08:15:03 +0200 Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs In-Reply-To: References: Message-ID: On Tue, Aug 11, 2009 at 11:52 PM, Chris Worley wrote: > I setup my target exactly as you prescribe... but my initiator is > still Windows (version of WInOF at top): performance as relayed by > IOMeter starts high and the average slowly decreases.  Watching the > instantaneous throughput, there seem to be longer and longer lags of > poor performance. between moments of good performance.  I need to run > this against a Linux initiator to see if the problems are w/ WinOF. > > Using OFED 1.4.1 (w/ the stock RHEL kernel) on the target, the > performance was steady and getting close to acceptable.  
In a 15 hour > test that cycles through sequential and random LBA's and R/W mixes > from block sizes from 1MB to 512B, it worked well and got decent > performance until it hit 1KB sequential reads which hung IOMeter; no > messages on the Linux side (all looked okay). IBSRP on the Windows > side just said "a reset to device was issued" every 15 to 30 seconds > after the problem started. I reloaded the IB stack on the Linux side, > and was able to get it restarted. > > Still a lot of combinations to test. Which trace settings are you using on the target? Enabling the proper trace settings via /proc/scsi_tgt/trace_level might reveal whether you are e.g. hitting the QUEUE_FULL condition. See also scst/README. Bart. From eli at dev.mellanox.co.il Tue Aug 11 23:17:49 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 12 Aug 2009 09:17:49 +0300 Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has not allocated In-Reply-To: References: <20090810084527.GA2446@mtls03> Message-ID: <20090812061749.GA20719@mtls03> On Mon, Aug 10, 2009 at 01:30:44PM -0700, Roland Dreier wrote: > > > Looking at mlx4_write_mtt_chunk() I see that it calls > > mlx4_table_find() with a pointer to single dma_addr_t - dma_handle - > > while the dma addresses for the ICM memory is actually a list of > > different addresses covering possibly different sizes. I think > > mlx4_table_find() should be changed to support that, and then we can > > use calls to dma_sync_single_for_cpu()/dma_sync_single_for_device() > > with the correct dma addresses. > > No, I think we're careful that we write MTT ranges that don't cross a > page so there shouldn't be any problem. But contiguous ICM memory does not, in general, map to contiguous DMA memory, so if dma_sync_single_for_*() does not harm anything then it does not do anything useful either.
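The disagreement above comes down to whether a range in a logically contiguous table can straddle two separately mapped pieces of memory. A minimal userspace sketch of that question follows. It is a model only, not mlx4 code: struct seg, seg_find() and seg_syncs_needed() are hypothetical names, and the assumption is simply that the table is backed by a list of pieces, each with its own bus address and length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one DMA-mapped piece of a table's backing memory. */
struct seg {
    uint64_t dma_addr; /* bus address of the first byte of this piece */
    size_t   len;      /* length of this piece in bytes */
};

/*
 * Translate a byte offset into the logically contiguous table into a bus
 * address. Returns 0 and stores the address in *dma on success, or -1 if
 * the offset lies past the end of the table.
 */
static int seg_find(const struct seg *segs, size_t nsegs, size_t offset,
                    uint64_t *dma)
{
    assert(segs != NULL && dma != NULL);
    for (size_t i = 0; i < nsegs; i++) {
        if (offset < segs[i].len) {
            *dma = segs[i].dma_addr + offset;
            return 0;
        }
        offset -= segs[i].len;
    }
    return -1;
}

/*
 * Count how many separate sync calls the range [offset, offset + len)
 * would need: one per piece it touches. Returns 0 if the range is empty
 * or runs past the end of the table.
 */
static size_t seg_syncs_needed(const struct seg *segs, size_t nsegs,
                               size_t offset, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < nsegs && len > 0; i++) {
        if (offset >= segs[i].len) {
            offset -= segs[i].len; /* range starts in a later piece */
            continue;
        }
        size_t in_seg = segs[i].len - offset;
        size_t take = len < in_seg ? len : in_seg;
        count++;
        len -= take;
        offset = 0; /* any remainder starts at the next piece's first byte */
    }
    return len == 0 ? count : 0;
}
```

In this model a write that stays inside one piece needs a single address and a single sync, which is the case Roland describes; a write that crosses a piece boundary needs one sync per piece, which is the case Eli is worried about.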
From rdreier at cisco.com Tue Aug 11 23:23:48 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Aug 2009 23:23:48 -0700 Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has not allocated In-Reply-To: <20090812061749.GA20719@mtls03> (Eli Cohen's message of "Wed, 12 Aug 2009 09:17:49 +0300") References: <20090810084527.GA2446@mtls03> <20090812061749.GA20719@mtls03> Message-ID: > But a contiguous ICM memory does not map, in general, to a contiguous > DMA memory, so if dma_sync_single_for_*() does not harm anything > than it does not do anything useful either. Maybe I'm missing your point, but mlx4_table_find() does go to some trouble to find the right DMA address for the object being looked up. Of course it could be buggy but I still don't see why we would need a list of DMA addresses when we know we are only going to sync part of one page? I think the right thing to do to fix this is to switch to using map_single instead of map_sg, and then use dma_sync_single_range_for_xxx to sync the subset we care about. - R. From eli at dev.mellanox.co.il Wed Aug 12 00:05:58 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 12 Aug 2009 10:05:58 +0300 Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has not allocated In-Reply-To: References: <20090810084527.GA2446@mtls03> <20090812061749.GA20719@mtls03> Message-ID: <20090812070558.GA23123@mtls03> On Tue, Aug 11, 2009 at 11:23:48PM -0700, Roland Dreier wrote: > > Maybe I'm missing your point, but mlx4_table_find() does go to some > trouble to find the right DMA address for the object being looked up. > Of course it could be buggy but I still don't see why we would need a > list of DMA addresses when we know we are only going to sync part of one > page? Is this always true? What if you allocated an ICM buffer that uses non-adjacent pages? In that case you would need more than one call to dma_sync_single_for_*(), wouldn't you?
> > I think the right thing to do to fix this is to switch to using > map_single instead of map_sg, and then use dma_sync_single_range_for_xxx > to sync the subset we care about. > > - R. From eli at dev.mellanox.co.il Wed Aug 12 01:20:45 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 12 Aug 2009 11:20:45 +0300 Subject: [ofa-general] [PATCHv4 06/10] ib_core: CMA device binding In-Reply-To: <55EE694A802442EA9F05B1007A044E02@amr.corp.intel.com> References: <20090805082929.GG5599@mtls03> <55EE694A802442EA9F05B1007A044E02@amr.corp.intel.com> Message-ID: <20090812082045.GB23123@mtls03> On Mon, Aug 10, 2009 at 12:31:19PM -0700, Sean Hefty wrote: > >@@ -576,10 +586,16 @@ static int cma_ib_init_qp_attr(struct rdma_id_private > >*id_priv, > > { > > struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; > > int ret; > >+ u16 pkey; > >+ > >+ if (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num) > >== > > nit: It looks like the if is indented by spaces, instead of a tab. Will fix, thanks. > > >+static int cma_resolve_rdmaoe_route(struct rdma_id_private *id_priv) > >+{ > >+ work->old_state = CMA_ROUTE_QUERY; > >+ work->new_state = CMA_ROUTE_RESOLVED; > >+ if (!route->path_rec->mtu || !route->path_rec->rate) { > >+ work->event.event = RDMA_CM_EVENT_ROUTE_ERROR; > >+ work->event.status = -1; > > Any reason not to fail immediately here and leave the id state unchanged? No real reason. Immediate failure is just as good as a deferred one. I will change that and also change the check to omit "!route->path_rec->rate".
> > >+ } else { > >+ work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; > >+ work->event.status = 0; > >+ } > >+ > >+ queue_work(cma_wq, &work->work); > >+ > >+ kfree(mw); > >+} > >+ > >+static int cma_rdmaoe_join_multicast(struct rdma_id_private *id_priv, > >+ struct cma_multicast *mc) > >+{ > >+ struct rdmaoe_mcast_work *work; > >+ struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; > >+ > >+ if (cma_zero_addr((struct sockaddr *)&mc->addr)) > >+ return -EINVAL; > >+ > >+ work = kzalloc(sizeof *work, GFP_KERNEL); > >+ if (!work) > >+ return -ENOMEM; > >+ > >+ mc->multicast.ib = kzalloc(sizeof(struct ib_sa_multicast), GFP_KERNEL); > >+ if (!mc->multicast.ib) { > >+ kfree(work); > >+ return -ENOMEM; > >+ } > > nit: I'd prefer to goto a common cleanup area to make it easier to add changes > in the future. Will change that. > > >+ > >+ cma_set_mgid(id_priv, (struct sockaddr *)&mc->addr, &mc->multicast.ib- > >>rec.mgid); > >+ mc->multicast.ib->rec.pkey = cpu_to_be16(0xffff); > >+ if (id_priv->id.ps == RDMA_PS_UDP) > >+ mc->multicast.ib->rec.qkey = cpu_to_be32(RDMA_UDP_QKEY); > >+ mc->multicast.ib->rec.rate = rdmaoe_get_rate(dev_addr->src_dev); > >+ mc->multicast.ib->rec.hop_limit = 1; > >+ mc->multicast.ib->rec.mtu = rdmaoe_get_mtu(dev_addr->src_dev->mtu); > > Do we need to check the rate/mtu here, like in resolve route? Or should we be > good since we could successfully resolve the route? Actually, can we just read > the data from the path record that gets stored with the id? 
> I believe that querying the MTU again to get an up-to-date value will be better, plus adding a check that the MTU is nonzero and otherwise returning immediately with -EINVAL. > >+ rdmaoe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid); > >+ work->id = id_priv; > >+ work->mc = mc; > >+ INIT_WORK(&work->work, rdmaoe_mcast_work_handler); > >+ > >+ queue_work(cma_wq, &work->work); > >+ > >+ return 0; > >+} > >+ > > int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, > > void *context) > > { > >@@ -2782,6 +2918,9 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct > >sockaddr *addr, > > case RDMA_TRANSPORT_IB: > > ret = cma_join_ib_multicast(id_priv, mc); > > break; > >+ case RDMA_TRANSPORT_RDMAOE: > >+ ret = cma_rdmaoe_join_multicast(id_priv, mc); > >+ break; > > default: > > ret = -ENOSYS; > > break; > >@@ -2793,6 +2932,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct > >sockaddr *addr, > > spin_unlock_irq(&id_priv->lock); > > kfree(mc); > > } > >+ > > return ret; > > } > > EXPORT_SYMBOL(rdma_join_multicast); > >>port_num)) { > >+ tt = rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num); > >+ switch (tt) { > > case RDMA_TRANSPORT_IB: > >- ucma_copy_ib_route(&resp, &ctx->cm_id->route); > >+ case RDMA_TRANSPORT_RDMAOE: > >+ ucma_copy_ib_route(&resp, &ctx->cm_id->route, tt); > > It seems simpler to just add a new call ucma_copy_rdmaoe_route, rather than > merging those two transports into a single copy function that then branches > based on the transport. Agreed, will change.
> > > break; > > default: > > break; > >diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h > >index 483057b..66a848e 100644 > >--- a/include/rdma/ib_addr.h > >+++ b/include/rdma/ib_addr.h > >@@ -39,6 +39,8 @@ > > #include > > #include > > #include > >+#include > >+#include > > > > struct rdma_addr_client { > > atomic_t refcount; > >@@ -157,4 +159,89 @@ static inline void iw_addr_get_dgid(struct rdma_dev_addr > >*dev_addr, > > memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid); > > } > > > >+ > >+static inline int rdma_link_local_addr(struct in6_addr *addr) > >+{ > >+ if (addr->s6_addr32[0] == cpu_to_be32(0xfe800000) && > >+ addr->s6_addr32[1] == 0) > >+ return 1; > >+ else > >+ return 0; > >+} > > just replace the 'if' with 'return' Will do. > > >+ > >+static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac) > >+{ > >+ memcpy(mac, &addr->s6_addr[8], 3); > >+ memcpy(mac + 3, &addr->s6_addr[13], 3); > >+ mac[0] ^= 2; > >+} > >+ > >+static inline int rdma_is_multicast_addr(struct in6_addr *addr) > >+{ > >+ return addr->s6_addr[0] == 0xff ? 1 : 0; > >+} > >+ > >+static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac) > >+{ > >+ memset(mac, 0xff, 6); > >+} > > I don't think we want all of these inline, in particular rdmaoe_mac_to_ll, > rdmaoe_get_mtu , rdmaoe_get_rate. They're quite simple functions - what would you prefer, export them? Why? 
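The helpers quoted above, rdma_link_local_addr() and rdma_get_ll_mac(), depend on the modified EUI-64 layout of an IPv6 link-local address. A minimal userspace sketch of that mapping follows, assuming plain 16-byte arrays in place of struct in6_addr. The ll_addr_* names are hypothetical and are not part of the patch; only ll_addr_to_mac() mirrors the byte moves shown in the quoted rdma_get_ll_mac().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if the 16-byte IPv6 address starts with fe80:0000:0000:0000. */
static int ll_addr_is_link_local(const uint8_t addr[16])
{
    static const uint8_t prefix[8] = { 0xfe, 0x80, 0, 0, 0, 0, 0, 0 };

    return memcmp(addr, prefix, sizeof prefix) == 0;
}

/* Build the modified EUI-64 link-local address for a 6-byte MAC. */
static void ll_addr_from_mac(const uint8_t mac[6], uint8_t addr[16])
{
    assert(mac != NULL && addr != NULL);
    memset(addr, 0, 16);
    addr[0] = 0xfe;
    addr[1] = 0x80;
    memcpy(addr + 8, mac, 3);      /* first three MAC bytes */
    addr[8] ^= 2;                  /* flip the universal/local bit */
    addr[11] = 0xff;               /* fixed ff:fe filler in the middle */
    addr[12] = 0xfe;
    memcpy(addr + 13, mac + 3, 3); /* last three MAC bytes */
}

/* Inverse mapping: same byte moves as the quoted rdma_get_ll_mac(). */
static void ll_addr_to_mac(const uint8_t addr[16], uint8_t mac[6])
{
    assert(mac != NULL && addr != NULL);
    memcpy(mac, addr + 8, 3);
    memcpy(mac + 3, addr + 13, 3);
    mac[0] ^= 2;                   /* undo the universal/local flip */
}
```

Converting a MAC with ll_addr_from_mac() and back with ll_addr_to_mac() returns the original six bytes, which is the property the quoted helper relies on.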
> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Wed Aug 12 01:51:11 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 12 Aug 2009 11:51:11 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided In-Reply-To: References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me> Message-ID: <20090812085111.GH25501@me> On 17:33 Tue 11 Aug , Hal Rosenstock wrote: > > Is it supposed to use file when no files (LFT or LID matrix) are supplied ? No. > That's what seems to happen (with no fallback). Look in osm_ucast_mgr.c how ucast_mgr_route() works - it will run default ucast mgr methods in case if 'file' returns '1'. Sasha From sashak at voltaire.com Wed Aug 12 02:03:29 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 12 Aug 2009 12:03:29 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_mcast_tbl.c: osm_mcast_tbl_get_block returns boolean In-Reply-To: <20090807134305.GA30766@comcast.net> References: <20090807134305.GA30766@comcast.net> Message-ID: <20090812090329.GI25501@me> On 09:43 Fri 07 Aug , Hal Rosenstock wrote: > > so use TRUE/FALSE rather than IB_INVALID_PARMETER > > Signed-off-by: Hal Rosenstock > --- > diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c > index 82850be..38c06c1 100644 > --- a/opensm/opensm/osm_mcast_tbl.c > +++ b/opensm/opensm/osm_mcast_tbl.c > @@ -273,7 +273,7 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl, > mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE); > > if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho) > - return (IB_INVALID_PARAMETER); > + return (TRUE); In this case p_block array is not initialized, so just returning 
'TRUE' is not a good idea. Actually if we are hitting this case it can indicate an inconsistent mcast_tbl initialization - I would suggest to rework this part and likely to drop this check at all. Sasha From sashak at voltaire.com Wed Aug 12 02:10:22 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 12 Aug 2009 12:10:22 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_qos_policy.c: Some minor simplifications In-Reply-To: <20090811184011.GA5666@comcast.net> References: <20090811184011.GA5666@comcast.net> Message-ID: <20090812091022.GL25501@me> On 14:40 Tue 11 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From jackm at dev.mellanox.co.il Wed Aug 12 02:15:39 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 12 Aug 2009 12:15:39 +0300 Subject: [ofa-general] Re: [PATCH V2] mlx4: Do not allow ib userspace open while device is being removed In-Reply-To: References: <200908111021.01612.jackm@dev.mellanox.co.il> Message-ID: <200908121215.39767.jackm@dev.mellanox.co.il> On Tuesday 11 August 2009 19:23, Roland Dreier wrote: > > > this is a continuation of thread: > > http://lists.openfabrics.org/pipermail/general/2009-July/060668.html > > I see you > didn't answer the question about mthca -- does it suffer from this > problem as well? > Sorry about that. Yes, mthca also suffers from this problem. I'm sending a patch right now: mthca: Do not allow ib userspace open following device internal error -Jack From jackm at dev.mellanox.co.il Wed Aug 12 02:15:46 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 12 Aug 2009 12:15:46 +0300 Subject: [ofa-general] [PATCH] mthca: Do not allow ib userspace open following device internal error Message-ID: <200908121215.46221.jackm@dev.mellanox.co.il> Userspace apps are supposed to release all ib device resources if they receive a fatal async event (IBV_EVENT_DEVICE_FATAL). 
However, the app has no way of knowing when the device has come back up, except to repeatedly attempt ibv_open_device() until it succeeds. Currently, though, there is no protection against open succeeding when the device is in the midst of the removal following the fatal event. In this case, the open will succeed, but as a result the device waits in the middle of its removal until the new app releases its ib resources -- and the new app will not do so, since the open succeeded at a point following the fatal event generation. This patch adds an "active" flag to the device. The active flag is set to false (in the fatal event flow) before the "fatal" event is generated, so any subsequent ibv_open_device() call to the device will fail until the device comes back up, thus preventing the above deadlock. Signed-off-by: Jack Morgenstein --- Roland, You are right, mthca also needs such a patch. This will prevent user-level apps from allocating a device context following a device internal catastrophic error. BTW, if the administrator has disabled device reset on fatal (by default, it is enabled), user-apps will simply need to wait for admin intervention (rmmod and insmod on low-level driver). IMHO, this is OK -- following an internal error, the device must be reset anyway, so there is no point in allowing new apps to attempt to run.
diff --git a/drivers/infiniband/hw/mthca/mthca_catas.c b/drivers/infiniband/hw/mthca/mthca_catas.c index 65ad359..ad8b26b 100644 --- a/drivers/infiniband/hw/mthca/mthca_catas.c +++ b/drivers/infiniband/hw/mthca/mthca_catas.c @@ -88,6 +88,7 @@ static void handle_catas(struct mthca_dev *dev) event.device = &dev->ib_dev; event.event = IB_EVENT_DEVICE_FATAL; event.element.port_num = 0; + dev->active = 0; ib_dispatch_event(&event); diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index 9ef611f..c1e2bcb 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -357,6 +357,7 @@ struct mthca_dev { struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; spinlock_t sm_lock; u8 rate[MTHCA_MAX_PORTS]; + int active; }; #ifdef CONFIG_INFINIBAND_MTHCA_DEBUG diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 13da9f1..118a386 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -1116,6 +1116,8 @@ static int __mthca_init_one(struct pci_dev *pdev, int hca_type) pci_set_drvdata(pdev, mdev); mdev->hca_type = hca_type; + mdev->active = 1; + return 0; err_unregister: diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 87ad889..bcf7a40 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -334,6 +334,9 @@ static struct ib_ucontext *mthca_alloc_ucontext(struct ib_device *ibdev, struct mthca_ucontext *context; int err; + if (!(to_mdev(ibdev)->active)) + return ERR_PTR(-EAGAIN); + memset(&uresp, 0, sizeof uresp); uresp.qp_tab_size = to_mdev(ibdev)->limits.num_qps; From vlad at lists.openfabrics.org Wed Aug 12 03:03:46 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 12 Aug 2009 03:03:46 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090812-0200 daily build status 
Message-ID: <20090812100346.B57A9E61D61@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:300: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:311: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' 
/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090812-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hal.rosenstock at gmail.com Wed Aug 12 03:06:48 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 12 Aug 2009 06:06:48 -0400 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided In-Reply-To: <20090812085111.GH25501@me> References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me> <20090812085111.GH25501@me> Message-ID: On Wed, Aug 12, 2009 at 4:51 AM, Sasha Khapyorsky wrote: > > On 17:33 Tue 11 Aug , Hal Rosenstock wrote: > > > > Is it supposed to use file when no files (LFT or LID matrix) are supplied ? > > No. > > > That's what seems to happen (with no fallback). > > Look in osm_ucast_mgr.c how ucast_mgr_route() works - it will run > default ucast mgr methods in case if 'file' returns '1'. Yes, I had looked at that. 
What I see is the following when no files are specified: osm_ucast_mgr_process: file tables configured on all switches so file doesn't appear to be falling back in this case. -- Hal > > > > Sasha From sashak at voltaire.com Wed Aug 12 03:25:01 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 12 Aug 2009 13:25:01 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided In-Reply-To: References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me> <20090812085111.GH25501@me> Message-ID: <20090812102501.GN25501@me> On 06:06 Wed 12 Aug , Hal Rosenstock wrote: > > What I see is the following when no files are specified: > osm_ucast_mgr_process: file tables configured on all switches > > so file doesn't appear to be falling back in this case. if (!r->build_lid_matrices || (ret = r->build_lid_matrices(r->context)) > 0) ret = osm_ucast_mgr_build_lid_matrices(&osm->sm.ucast_mgr); So when method is defined and it returns a positive value (file name is not specified) OpenSM will build lid matrices using default algorithm and will continue with LFT file. This is how things were supposed to work. Sasha From sashak at voltaire.com Wed Aug 12 03:26:32 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 12 Aug 2009 13:26:32 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: clear packet buffer correctly before formating and sending In-Reply-To: <20090811100428.c4fb6c5e.weiny2@llnl.gov> References: <20090811100428.c4fb6c5e.weiny2@llnl.gov> Message-ID: <20090812102632.GO25501@me> On 10:04 Tue 11 Aug , Ira Weiny wrote: > I found this bug a while back but forgot to submit a patch. > > I don't think this will affect the issues Mr. Miller was having with BM, as I believe the BM stuff he was trying all expected a response (thereby calling mad_rpc instead). But it could be worth a try. 
> > Ira > > > From: Ira Weiny > Date: Tue, 11 Aug 2009 10:00:25 -0700 > Subject: [PATCH] libibmad: clear packet buffer correctly before formating and sending > > > Signed-off-by: Ira Weiny Applied. Thanks. Sasha From hal.rosenstock at gmail.com Wed Aug 12 03:27:56 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 12 Aug 2009 06:27:56 -0400 Subject: [ofa-general] Re: [PATCH] opensm/osm_mcast_tbl.c: osm_mcast_tbl_get_block returns boolean In-Reply-To: <20090812090329.GI25501@me> References: <20090807134305.GA30766@comcast.net> <20090812090329.GI25501@me> Message-ID: On Wed, Aug 12, 2009 at 5:03 AM, Sasha Khapyorsky wrote: > On 09:43 Fri 07 Aug , Hal Rosenstock wrote: > > > > so use TRUE/FALSE rather than IB_INVALID_PARMETER > > > > Signed-off-by: Hal Rosenstock > > --- > > diff --git a/opensm/opensm/osm_mcast_tbl.c > b/opensm/opensm/osm_mcast_tbl.c > > index 82850be..38c06c1 100644 > > --- a/opensm/opensm/osm_mcast_tbl.c > > +++ b/opensm/opensm/osm_mcast_tbl.c > > @@ -273,7 +273,7 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const > p_tbl, > > mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE); > > > > if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho) > > - return (IB_INVALID_PARAMETER); > > + return (TRUE); > > In this case p_block array is not initialized, so just returning 'TRUE' > is not a good idea. That's how it's handled now in the code. > > > Actually if we are hitting this case it can indicate an inconsistent > mcast_tbl initialization That's one possibility. The other would be a SA query for a specific block that is not validated first. > - I would suggest to rework this part and likely to drop this check at all. Dropping it should be OK as long as all callers validate prior to calling (and it looks like they do). 
-- Hal

>
> Sasha

From hal.rosenstock at gmail.com Wed Aug 12 03:36:22 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 06:36:22 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090812102501.GN25501@me>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me> <20090812085111.GH25501@me> <20090812102501.GN25501@me>
Message-ID:

On Wed, Aug 12, 2009 at 6:25 AM, Sasha Khapyorsky wrote:

> On 06:06 Wed 12 Aug , Hal Rosenstock wrote:
> >
> > What I see is the following when no files are specified:
> > osm_ucast_mgr_process: file tables configured on all switches
> >
> > so file doesn't appear to be falling back in this case.
>
> 	if (!r->build_lid_matrices ||
> 	    (ret = r->build_lid_matrices(r->context)) > 0)
> 		ret = osm_ucast_mgr_build_lid_matrices(&osm->sm.ucast_mgr);
>
> So when method is defined and it returns a positive value (file name is
> not specified) OpenSM will build lid matrices using default algorithm
> and will continue with LFT file. This is how things were supposed to
> work.

Does that really make sense for this case when there no files supplied and
file is specified ?

-- Hal

>
> Sasha

From sashak at voltaire.com Wed Aug 12 03:42:39 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 13:42:39 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_mcast_tbl.c: osm_mcast_tbl_get_block returns boolean
In-Reply-To:
References: <20090807134305.GA30766@comcast.net> <20090812090329.GI25501@me>
Message-ID: <20090812104239.GP25501@me>

On 06:27 Wed 12 Aug , Hal Rosenstock wrote:
>
> That's one possibility. The other would be a SA query for a specific
> block that is not validated first.

And there is no any reason to return "success" status with a garbage
data.

Sasha

From sashak at voltaire.com Wed Aug 12 03:46:49 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 13:46:49 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided
In-Reply-To:
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me> <20090812085111.GH25501@me> <20090812102501.GN25501@me>
Message-ID: <20090812104649.GQ25501@me>

On 06:36 Wed 12 Aug , Hal Rosenstock wrote:
>
> Does that really make sense for this case when there no files supplied and
> file is specified ?

This is for case when one of the files (or both) is *not* specified,
like this:

	opensm ... -R file -U lft-file

, which is pretty useful.
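The fallback being described can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the OpenSM code: struct routing_engine, build_matrices, default_build_lid_matrices and the two example hooks are hypothetical names. Only the condition (hook missing, or hook returning a positive value) is taken from the snippet quoted earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified routing-engine hooks (not the OpenSM structs). */
struct routing_engine {
	void *context;
	/* >0 means "nothing was done here, use the default builder" */
	int (*build_lid_matrices)(void *context);
};

/* Counts how often the default builder ran, so the effect is observable. */
static int default_builds;

static int default_build_lid_matrices(void)
{
	default_builds++;
	return 0;
}

/*
 * Use the engine's builder when present. Fall back to the default
 * builder when the hook is missing or returns a positive value.
 * Any other return value is passed through unchanged.
 */
static int build_matrices(const struct routing_engine *r)
{
	int ret = 1;

	assert(r != NULL);
	if (!r->build_lid_matrices ||
	    (ret = r->build_lid_matrices(r->context)) > 0)
		ret = default_build_lid_matrices();
	return ret;
}

/* Example hook for the "-R file -U lft-file" case: no matrix file given. */
static int file_build_no_name(void *context)
{
	(void)context;
	return 1;
}

/* Example hook that loaded its own matrices successfully. */
static int file_build_ok(void *context)
{
	(void)context;
	return 0;
}
```

With file_build_no_name installed, build_matrices() runs the default builder once and returns 0. With file_build_ok installed, the default builder is not called, so the file engine's own matrices stay in effect.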
Sasha

From hal.rosenstock at gmail.com Wed Aug 12 03:48:30 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 06:48:30 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090812104649.GQ25501@me>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me> <20090812085111.GH25501@me> <20090812102501.GN25501@me> <20090812104649.GQ25501@me>
Message-ID:

On Wed, Aug 12, 2009 at 6:46 AM, Sasha Khapyorsky wrote:

> On 06:36 Wed 12 Aug , Hal Rosenstock wrote:
> >
> > Does that really make sense for this case when there no files supplied and
> > file is specified ?
>
> This is for case when one of the files (or both) is *not* specified,
> like this:
>
> 	opensm ... -R file -U lft-file
>
> , which is pretty useful.

I'm asking about the utility of:

	opensm -R file

(no -U and no -M)

Why shouldn't that case fallback ?

-- Hal

>
> Sasha

From sashak at voltaire.com Wed Aug 12 03:57:34 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 13:57:34 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided
In-Reply-To:
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me> <20090812085111.GH25501@me> <20090812102501.GN25501@me> <20090812104649.GQ25501@me>
Message-ID: <20090812105734.GR25501@me>

On 06:48 Wed 12 Aug , Hal Rosenstock wrote:
>
> I'm asking about the utility of:
>
> opensm -R file
>
> (no -U and no -M)

(but then the discussion is not related to proposed patch since this
breaks '-R file -U lft-file' case)

>
> Why shouldn't that case fallback ?

Why it should if user is asking to run such configuration?
Sasha

From hal.rosenstock at gmail.com Wed Aug 12 03:59:53 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 06:59:53 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090812105734.GR25501@me>
References: <20090806181928.GA21698@comcast.net> <20090811212221.GG25501@me> <20090812085111.GH25501@me> <20090812102501.GN25501@me> <20090812104649.GQ25501@me> <20090812105734.GR25501@me>
Message-ID:

On Wed, Aug 12, 2009 at 6:57 AM, Sasha Khapyorsky wrote:

> On 06:48 Wed 12 Aug , Hal Rosenstock wrote:
> >
> > I'm asking about the utility of:
> >
> > opensm -R file
> >
> > (no -U and no -M)
>
> (but then the discussion is not related to proposed patch since this
> breaks '-R file -U lft-file' case)

Right; I was discussing the original motivation for the patch.

> >
> > Why shouldn't that case fallback ?
>
> Why it should if user is asking to run such configuration?

Because it's broken request (no warning of nothing useful is going to be
done). Don't we try to fallback in broken scenarios ?

-- Hal

>
> Sasha

From sashak at voltaire.com Wed Aug 12 04:16:12 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 14:16:12 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided
In-Reply-To:
References: <20090811212221.GG25501@me> <20090812085111.GH25501@me> <20090812102501.GN25501@me> <20090812104649.GQ25501@me> <20090812105734.GR25501@me>
Message-ID: <20090812111612.GS25501@me>

On 06:59 Wed 12 Aug , Hal Rosenstock wrote:
>
> Because it's broken request (no warning of nothing useful is going to be
> done).

There are warnings about unspecified files.

> Don't we try to fallback in broken scenarios ?
It is not obvious in this case - for instance one may want to run OpenSM,
to fetch LFT as template, modify them and reload by just adding '-U',
etc..

Anyway it is user's decision there.

Sasha

From hal.rosenstock at gmail.com Wed Aug 12 04:26:31 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Aug 2009 07:26:31 -0400
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided
In-Reply-To: <20090812111612.GS25501@me>
References: <20090811212221.GG25501@me> <20090812085111.GH25501@me> <20090812102501.GN25501@me> <20090812104649.GQ25501@me> <20090812105734.GR25501@me> <20090812111612.GS25501@me>
Message-ID:

On Wed, Aug 12, 2009 at 7:16 AM, Sasha Khapyorsky wrote:

> On 06:59 Wed 12 Aug , Hal Rosenstock wrote:
> >
> > Because it's broken request (no warning of nothing useful is going to be
> > done).
>
> There are warnings about unspecified files.

These messages are VERBOSE in log level so they're not normally seen. INFO
might be better.

> >
> > Don't we try to fallback in broken scenarios ?
>
> It is not obvious in this case - for instance one may want to run OpenSM,
> to fetch LFT as template, modify them and reload by just adding '-U',
> etc..

dumping LFTs in case of no files doesn't make for a very good template to
start with.

> Anyway it is user's decision there.

Perhaps but we usually help with poor decisions.

-- Hal

>
> Sasha

From halves at linux.vnet.ibm.com Wed Aug 12 05:17:53 2009
From: halves at linux.vnet.ibm.com (Higor Aparecido Vieira Alves)
Date: Wed, 12 Aug 2009 09:17:53 -0300
Subject: [ofa-general] Chelsio cards
In-Reply-To: <4A825D8F.8030102@voltaire.com>
References: <1250023074.16631.1.camel@halves-ltc> <4A81DF7E.6020907@opengridcomputing.com> <4A825D8F.8030102@voltaire.com>
Message-ID: <1250079473.7238.4.camel@halves-ltc>

Hi guys,

Thanks a lot.

On Wed, 2009-08-12 at 09:13 +0300, Aleksey Senin wrote:
> >
> > User mode:
> > openmpi
> > mvapich2
> > udapl 1.2 and 2.0 (and thus most ULPs using udapl like Intel MPI, HP
> > MPI, Scali MPI)
> > rdmacm (required for connection setup)
> > ibverbs (RC QP only)
> > perftest (ib_rdma_bw and ib_rdma_lat only)
> > qperf
> > rping
> Another two binaries in user space
> ib_rdma_lat
> ib_rdma_bw

--
Higor Aparecido Vieira Alves
Software Engineer
Linux Technology Center
IBM Systems & Technology Group

From hnrose at comcast.net Wed Aug 12 06:22:47 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 12 Aug 2009 09:22:47 -0400
Subject: [ofa-general] [PATCH] opensm/osm_mcast_tbl.c: In osm_mcast_tbl_get_block, eliminate unneeded check
Message-ID: <20090812132247.GA15084@comcast.net>

Signed-off-by: Hal Rosenstock
---
diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c
index 82850be..029a735 100644
--- a/opensm/opensm/osm_mcast_tbl.c
+++ b/opensm/opensm/osm_mcast_tbl.c
@@ -272,9 +272,6 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl,
 
 	mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
 
-	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
-		return (IB_INVALID_PARAMETER);
-
 	for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
 		p_block[i] = (*p_tbl->p_mask_tbl)[mlid_start_ho + i][position];
 

From sashak at voltaire.com Wed Aug 12 08:53:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Aug 2009 18:53:10 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_file.c: Fix return status from do_ucast_file_load when file name is not provided
In-Reply-To:
References: <20090812085111.GH25501@me> <20090812102501.GN25501@me> <20090812104649.GQ25501@me>
<20090812105734.GR25501@me> <20090812111612.GS25501@me> Message-ID: <20090812155310.GU25501@me> On 07:26 Wed 12 Aug , Hal Rosenstock wrote: > > These messages are VERBOSE in log level so they're not normally seen. INFO > might be better. VERBOSE is more than enough IMO. > dumping LFTs in case of no files doesn't make for a very good template to > start with. Why not? It will be typical minhop. > Perhaps but we usually help with poor decisions. We can indicate an invalid decision, in this case it is *not* invalid (may be useless, but we cannot know for sure). Sasha From rdreier at cisco.com Wed Aug 12 08:52:36 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 12 Aug 2009 08:52:36 -0700 Subject: [ofa-general] mlx4: device driver tries to sync DMA memory it has not allocated In-Reply-To: <20090812070558.GA23123@mtls03> (Eli Cohen's message of "Wed, 12 Aug 2009 10:05:58 +0300") References: <20090810084527.GA2446@mtls03> <20090812061749.GA20719@mtls03> <20090812070558.GA23123@mtls03> Message-ID: > Is this is always true? What if you allocated an ICM buffer that uses > none adjacent pages? In this case you would need more than one call to > dma_sync_single_for_*(), isn't it? As I said originally, the code only writes one page at a time from CPU into ICM, so you never need more than one dma_sync at a time. - R. From hnrose at comcast.net Wed Aug 12 10:06:14 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 12 Aug 2009 13:06:14 -0400 Subject: [ofa-general] [PATCH] IB/mad: Allow tuning of QP0 and QP1 sizes Message-ID: <20090812170614.GA16298@comcast.net> IB/mad: Allow tuning of QP0 and QP1 sizes Signed-off-by: Hal Rosenstock --- diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index de922a0..7e553c3 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. 
* Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -45,6 +46,14 @@ MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_AUTHOR("Hal Rosenstock"); MODULE_AUTHOR("Sean Hefty"); +int mad_sendq_size = IB_MAD_QP_SEND_SIZE; +int mad_recvq_size = IB_MAD_QP_RECV_SIZE; + +module_param_named(send_queue_size, mad_sendq_size, int, 0444); +MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests"); +module_param_named(recv_queue_size, mad_recvq_size, int, 0444); +MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests"); + static struct kmem_cache *ib_mad_cache; static struct list_head ib_mad_port_list; @@ -2736,8 +2745,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info, qp_init_attr.send_cq = qp_info->port_priv->cq; qp_init_attr.recv_cq = qp_info->port_priv->cq; qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; - qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_wr = mad_sendq_size; + qp_init_attr.cap.max_recv_wr = mad_recvq_size; qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; qp_init_attr.qp_type = qp_type; @@ -2752,8 +2761,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info, goto error; } /* Use minimum queue sizes unless the CQ is resized */ - qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; - qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + qp_info->send_queue.max_active = mad_sendq_size; + qp_info->recv_queue.max_active = mad_recvq_size; return 0; error: @@ -2792,7 +2801,7 @@ static int ib_mad_port_open(struct ib_device *device, init_mad_qp(port_priv, &port_priv->qp_info[0]); init_mad_qp(port_priv, &port_priv->qp_info[1]); - cq_size = 
(IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + cq_size = (mad_sendq_size + mad_recvq_size) * 2; port_priv->cq = ib_create_cq(port_priv->device, ib_mad_thread_completion_handler, NULL, port_priv, cq_size, 0); @@ -2984,6 +2993,14 @@ static int __init ib_mad_init_module(void) { int ret; + mad_recvq_size = roundup_pow_of_two(mad_recvq_size); + mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE); + mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE); + + mad_sendq_size = roundup_pow_of_two(mad_sendq_size); + mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE); + mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE); + spin_lock_init(&ib_mad_port_list_lock); ib_mad_cache = kmem_cache_create("ib_mad", diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h index 05ce331..9430ab4 100644 --- a/drivers/infiniband/core/mad_priv.h +++ b/drivers/infiniband/core/mad_priv.h @@ -2,6 +2,7 @@ * Copyright (c) 2004, 2005, Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -49,6 +50,8 @@ /* QP and CQ parameters */ #define IB_MAD_QP_SEND_SIZE 128 #define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_QP_MIN_SIZE 64 +#define IB_MAD_QP_MAX_SIZE 8192 #define IB_MAD_SEND_REQ_MAX_SG 2 #define IB_MAD_RECV_REQ_MAX_SG 1 From sean.hefty at intel.com Wed Aug 12 10:09:58 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 12 Aug 2009 10:09:58 -0700 Subject: [ofa-general] RE: [PATCH] IB/mad: Allow tuning of QP0 and QP1 sizes In-Reply-To: <20090812170614.GA16298@comcast.net> References: <20090812170614.GA16298@comcast.net> Message-ID: <81E288C717914E548EE8023C330E10C3@amr.corp.intel.com> >+ mad_recvq_size = roundup_pow_of_two(mad_recvq_size); >+ mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE); >+ mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE); >+ >+ mad_sendq_size = roundup_pow_of_two(mad_sendq_size); >+ mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE); >+ mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE); Why round up to a power of two or have min/max restrictions? From hal.rosenstock at gmail.com Wed Aug 12 10:13:37 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 12 Aug 2009 13:13:37 -0400 Subject: [ofa-general] RE: [PATCH] IB/mad: Allow tuning of QP0 and QP1 sizes In-Reply-To: <81E288C717914E548EE8023C330E10C3@amr.corp.intel.com> References: <20090812170614.GA16298@comcast.net> <81E288C717914E548EE8023C330E10C3@amr.corp.intel.com> Message-ID: On Wed, Aug 12, 2009 at 1:09 PM, Sean Hefty wrote: > >+ mad_recvq_size = roundup_pow_of_two(mad_recvq_size); > >+ mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE); > >+ mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE); > >+ > >+ mad_sendq_size = roundup_pow_of_two(mad_sendq_size); > >+ mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE); > >+ mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE); > > Why round up to a power of two or have min/max restrictions? 
power of two is arbitrary and could be removed. min is also somewhat arbitrary but didn't want to allow it too much smaller than it already is (default for this patch). max truly is a maximum as create QP fails with larger size (didn't try this across all HCAs though). > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Wed Aug 12 10:17:15 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 12 Aug 2009 10:17:15 -0700 Subject: [ofa-general] Re: [PATCH] IB/mad: Allow tuning of QP0 and QP1 sizes In-Reply-To: <20090812170614.GA16298@comcast.net> (Hal Rosenstock's message of "Wed, 12 Aug 2009 13:06:14 -0400") References: <20090812170614.GA16298@comcast.net> Message-ID: > IB/mad: Allow tuning of QP0 and QP1 sizes -ENOCHANGELOG Why allow this tuning? In a couple years, when I'm reading the kernel changelog, what am I going to need to know to see why we did this? - R. From hnrose at comcast.net Wed Aug 12 10:22:51 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 12 Aug 2009 13:22:51 -0400 Subject: [ofa-general] [PATCHv2] IB/mad: Allow tuning of QP0 and QP1 sizes Message-ID: <20090812172251.GA16446@comcast.net> IB/mad: Allow tuning of QP0 and QP1 sizes MADs are UD and can be dropped if there are no receives posted. Send side tuning is done for symmetry with receive. Signed-off-by: Hal Rosenstock --- Changes since v1: Added changelog diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index de922a0..7e553c3 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. 
* Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -45,6 +46,14 @@ MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_AUTHOR("Hal Rosenstock"); MODULE_AUTHOR("Sean Hefty"); +int mad_sendq_size = IB_MAD_QP_SEND_SIZE; +int mad_recvq_size = IB_MAD_QP_RECV_SIZE; + +module_param_named(send_queue_size, mad_sendq_size, int, 0444); +MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests"); +module_param_named(recv_queue_size, mad_recvq_size, int, 0444); +MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests"); + static struct kmem_cache *ib_mad_cache; static struct list_head ib_mad_port_list; @@ -2736,8 +2745,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info, qp_init_attr.send_cq = qp_info->port_priv->cq; qp_init_attr.recv_cq = qp_info->port_priv->cq; qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; - qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_wr = mad_sendq_size; + qp_init_attr.cap.max_recv_wr = mad_recvq_size; qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; qp_init_attr.qp_type = qp_type; @@ -2752,8 +2761,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info, goto error; } /* Use minimum queue sizes unless the CQ is resized */ - qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; - qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + qp_info->send_queue.max_active = mad_sendq_size; + qp_info->recv_queue.max_active = mad_recvq_size; return 0; error: @@ -2792,7 +2801,7 @@ static int ib_mad_port_open(struct ib_device *device, init_mad_qp(port_priv, &port_priv->qp_info[0]); init_mad_qp(port_priv, &port_priv->qp_info[1]); - cq_size = 
(IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + cq_size = (mad_sendq_size + mad_recvq_size) * 2; port_priv->cq = ib_create_cq(port_priv->device, ib_mad_thread_completion_handler, NULL, port_priv, cq_size, 0); @@ -2984,6 +2993,14 @@ static int __init ib_mad_init_module(void) { int ret; + mad_recvq_size = roundup_pow_of_two(mad_recvq_size); + mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE); + mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE); + + mad_sendq_size = roundup_pow_of_two(mad_sendq_size); + mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE); + mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE); + spin_lock_init(&ib_mad_port_list_lock); ib_mad_cache = kmem_cache_create("ib_mad", diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h index 05ce331..9430ab4 100644 --- a/drivers/infiniband/core/mad_priv.h +++ b/drivers/infiniband/core/mad_priv.h @@ -2,6 +2,7 @@ * Copyright (c) 2004, 2005, Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -49,6 +50,8 @@ /* QP and CQ parameters */ #define IB_MAD_QP_SEND_SIZE 128 #define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_QP_MIN_SIZE 64 +#define IB_MAD_QP_MAX_SIZE 8192 #define IB_MAD_SEND_REQ_MAX_SG 2 #define IB_MAD_RECV_REQ_MAX_SG 1 From suri at baymicrosystems.com Wed Aug 12 11:48:43 2009 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Wed, 12 Aug 2009 14:48:43 -0400 Subject: [ofa-general] [PATCHv2] IB/mad: Allow tuning of QP0 and QP1 sizes In-Reply-To: <20090812172251.GA16446@comcast.net> References: <20090812172251.GA16446@comcast.net> Message-ID: <9985926A1C2A496B89AC63ACE13093D5@md.baymicrosystems.com> Hal: 1. 
Aren't you going to remove the power_of_two? 2. Also, don't you need permissions to be 644? -Suri > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > Of Hal Rosenstock > Sent: Wednesday, August 12, 2009 1:23 PM > To: rdreier at cisco.com; sean.hefty at intel.com > Cc: general at lists.openfabrics.org > Subject: [ofa-general] [PATCHv2] IB/mad: Allow tuning of QP0 and QP1 sizes > > > IB/mad: Allow tuning of QP0 and QP1 sizes > > MADs are UD and can be dropped if there are no receives posted. > Send side tuning is done for symmetry with receive. > > Signed-off-by: Hal Rosenstock > --- > Changes since v1: > Added changelog > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index de922a0..7e553c3 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -2,6 +2,7 @@ > * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > * Copyright (c) 2005 Intel Corporation. All rights reserved. > * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -45,6 +46,14 @@ MODULE_DESCRIPTION("kernel IB MAD API"); > MODULE_AUTHOR("Hal Rosenstock"); > MODULE_AUTHOR("Sean Hefty"); > > +int mad_sendq_size = IB_MAD_QP_SEND_SIZE; > +int mad_recvq_size = IB_MAD_QP_RECV_SIZE; > + > +module_param_named(send_queue_size, mad_sendq_size, int, 0444); > +MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests"); > +module_param_named(recv_queue_size, mad_recvq_size, int, 0444); > +MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests"); > + > static struct kmem_cache *ib_mad_cache; > > static struct list_head ib_mad_port_list; > @@ -2736,8 +2745,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info, > qp_init_attr.send_cq = qp_info->port_priv->cq; > qp_init_attr.recv_cq = qp_info->port_priv->cq; > qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; > - qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; > - qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; > + qp_init_attr.cap.max_send_wr = mad_sendq_size; > + qp_init_attr.cap.max_recv_wr = mad_recvq_size; > qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; > qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; > qp_init_attr.qp_type = qp_type; > @@ -2752,8 +2761,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info, > goto error; > } > /* Use minimum queue sizes unless the CQ is resized */ > - qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; > - qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; > + qp_info->send_queue.max_active = mad_sendq_size; > + qp_info->recv_queue.max_active = mad_recvq_size; > return 0; > > error: > @@ -2792,7 +2801,7 @@ static int ib_mad_port_open(struct ib_device *device, > init_mad_qp(port_priv, &port_priv->qp_info[0]); > init_mad_qp(port_priv, &port_priv->qp_info[1]); > > - cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; > + cq_size = (mad_sendq_size + mad_recvq_size) * 2; > port_priv->cq = 
ib_create_cq(port_priv->device, > ib_mad_thread_completion_handler, > NULL, port_priv, cq_size, 0); > @@ -2984,6 +2993,14 @@ static int __init ib_mad_init_module(void) > { > int ret; > > + mad_recvq_size = roundup_pow_of_two(mad_recvq_size); > + mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE); > + mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE); > + > + mad_sendq_size = roundup_pow_of_two(mad_sendq_size); > + mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE); > + mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE); > + > spin_lock_init(&ib_mad_port_list_lock); > > ib_mad_cache = kmem_cache_create("ib_mad", > diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h > index 05ce331..9430ab4 100644 > --- a/drivers/infiniband/core/mad_priv.h > +++ b/drivers/infiniband/core/mad_priv.h > @@ -2,6 +2,7 @@ > * Copyright (c) 2004, 2005, Voltaire, Inc. All rights reserved. > * Copyright (c) 2005 Intel Corporation. All rights reserved. > * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -49,6 +50,8 @@ > /* QP and CQ parameters */ > #define IB_MAD_QP_SEND_SIZE 128 > #define IB_MAD_QP_RECV_SIZE 512 > +#define IB_MAD_QP_MIN_SIZE 64 > +#define IB_MAD_QP_MAX_SIZE 8192 > #define IB_MAD_SEND_REQ_MAX_SG 2 > #define IB_MAD_RECV_REQ_MAX_SG 1 > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hnrose at comcast.net Wed Aug 12 12:27:05 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 12 Aug 2009 15:27:05 -0400 Subject: [ofa-general] [PATCHv3] IB/mad: Allow tuning of QP0 and QP1 sizes Message-ID: <20090812192705.GA16704@comcast.net> IB/mad: Allow tuning of QP0 and QP1 sizes MADs are UD and can be dropped if there are no receives posted. Send side tuning is done for symmetry with receive. Signed-off-by: Hal Rosenstock --- Changes since v2: Removed roundup_pow_of_two of receive and send sizes Changed module paramater permissions to 0644 Changes since v1: Added changelog diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index de922a0..ff9bc22 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -45,6 +46,14 @@ MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_AUTHOR("Hal Rosenstock"); MODULE_AUTHOR("Sean Hefty"); +int mad_sendq_size = IB_MAD_QP_SEND_SIZE; +int mad_recvq_size = IB_MAD_QP_RECV_SIZE; + +module_param_named(send_queue_size, mad_sendq_size, int, 0644); +MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests"); +module_param_named(recv_queue_size, mad_recvq_size, int, 0644); +MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests"); + static struct kmem_cache *ib_mad_cache; static struct list_head ib_mad_port_list; @@ -2736,8 +2745,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info, qp_init_attr.send_cq = qp_info->port_priv->cq; qp_init_attr.recv_cq = qp_info->port_priv->cq; qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; - qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_wr = mad_sendq_size; + qp_init_attr.cap.max_recv_wr = mad_recvq_size; qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; qp_init_attr.qp_type = qp_type; @@ -2752,8 +2761,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info, goto error; } /* Use minimum queue sizes unless the CQ is resized */ - qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; - qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + qp_info->send_queue.max_active = mad_sendq_size; + qp_info->recv_queue.max_active = mad_recvq_size; return 0; error: @@ -2792,7 +2801,7 @@ static int ib_mad_port_open(struct ib_device *device, init_mad_qp(port_priv, &port_priv->qp_info[0]); init_mad_qp(port_priv, &port_priv->qp_info[1]); - cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + cq_size = (mad_sendq_size + mad_recvq_size) * 2; port_priv->cq = ib_create_cq(port_priv->device, ib_mad_thread_completion_handler, NULL, port_priv, cq_size, 0); @@ 
-2984,6 +2993,12 @@ static int __init ib_mad_init_module(void) { int ret; + mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE); + mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE); + + mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE); + mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE); + spin_lock_init(&ib_mad_port_list_lock); ib_mad_cache = kmem_cache_create("ib_mad", diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h index 05ce331..9430ab4 100644 --- a/drivers/infiniband/core/mad_priv.h +++ b/drivers/infiniband/core/mad_priv.h @@ -2,6 +2,7 @@ * Copyright (c) 2004, 2005, Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -49,6 +50,8 @@ /* QP and CQ parameters */ #define IB_MAD_QP_SEND_SIZE 128 #define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_QP_MIN_SIZE 64 +#define IB_MAD_QP_MAX_SIZE 8192 #define IB_MAD_SEND_REQ_MAX_SG 2 #define IB_MAD_RECV_REQ_MAX_SG 1 From rdreier at cisco.com Wed Aug 12 14:20:02 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 12 Aug 2009 14:20:02 -0700 Subject: [ofa-general] [PATCHv3] IB/mad: Allow tuning of QP0 and QP1 sizes In-Reply-To: <20090812192705.GA16704@comcast.net> (Hal Rosenstock's message of "Wed, 12 Aug 2009 15:27:05 -0400") References: <20090812192705.GA16704@comcast.net> Message-ID: > Changed module paramater permissions to 0644 Does it really work if someone changes the module parameter at runtime after the module is loaded? - R. From akepner at sgi.com Wed Aug 12 15:59:26 2009 From: akepner at sgi.com (akepner at sgi.com) Date: Wed, 12 Aug 2009 15:59:26 -0700 Subject: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas? 
Message-ID: <20090812225926.GD24786@sgi.com>

We have a customer who has repeatedly had system panics with the
following signature:

Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP:
{:ib_cm:ib_cm_init_qp_attr+580}
PGD 3a2db6067 PUD 0
Oops: 0000 [1] SMP
last sysfs file: /class/infiniband/mlx4_0/node_guid
CPU 4
Modules linked in: i2c_dev sg sd_mod crc32c libcrc32c iscsi_tcp libiscsi
scsi_transport_iscsi rdma_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6
ib_uverbs ib_umad iw_cxgb3 cxgb3 firmware_class mlx4_ib ib_mthca ib_mad
ib_core loop numatools xpmem worm mlx4_core libata i2c_i801 scsi_mod i2c_core
shpchp pci_hotplug nfs lockd nfs_acl af_packet sunrpc e1000
Pid: 3256, comm: star Tainted: G U 2.6.16.60-0.34-smp #1
RIP: 0010:[] {:ib_cm:ib_cm_init_qp_attr+580}
RSP: 0018:ffff810369d09d38 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff810419678c00 RCX: 0000000000000008
RDX: 0000000000000246 RSI: ffff810419678d18 RDI: ffff810369d09e70
RBP: ffff810369d09e18 R08: 000000030000003d R09: 0000000000000000
R10: ffff810369d09e18 R11: 0000000000000088 R12: ffff810369d09d88
R13: 0000000000000000 R14: ffff810419678c80 R15: 00000000403500b0
FS: 0000000040354940(0063) GS:ffff810420ffbbc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 000000039f0c4000 CR4: 00000000000006e0
Process star (pid: 3256, threadinfo ffff810369d08000, task ffff8103b81b5830)
Stack: ffff810419678a00 ffff810369d09d88 ffff810369d09e18 ffff810369d09e18
       0000000040143430 ffffffff882fb6d5 ffff810376261540 ffff81040bea4740
       ffff810376261540 ffffffff88309285
Call Trace: {:rdma_cm:rdma_init_qp_attr+209}
       {:rdma_ucm:ucma_init_qp_attr+160}
       {thread_return+0}
       {:rdma_ucm:ucma_write+115}
       {vfs_write+215} {sys_write+69}
       {system_call+126}

Code: 8a 40 10 88 85 85 00 00 00 8b 83 38 01 00 00 66 89 45 7a 8a
RIP {:ib_cm:ib_cm_init_qp_attr+580} RSP

From a crash dump, I determined that we died in cm_init_qp_rts_attr()
(it's inline, so it doesn't show
up in the traceback) on the line labeled below:

static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv,
                               struct ib_qp_attr *qp_attr,
                               int *qp_attr_mask)
{
        ........
        if (cm_id_priv->id.lap_state == IB_CM_LAP_UNINIT) {
                .....
        } else {
                *qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE;
                qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; <-die

cm_id_priv->alt_av.port is NULL, so it looks as if there's a race
initializing 'alt_av'.

They are running quite old code (OFED 1.3.1), but I'm not aware
of anything which would change this behavior in more recent
versions, though I certainly may have missed something.

Anyone seen similar? Any ideas for a fix, or workaround?

--
Arthur

From sean.hefty at intel.com Wed Aug 12 16:20:41 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 12 Aug 2009 16:20:41 -0700
Subject: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?
In-Reply-To: <20090812225926.GD24786@sgi.com>
References: <20090812225926.GD24786@sgi.com>
Message-ID:

>Call Trace: {:rdma_cm:rdma_init_qp_attr+209}
>       {:rdma_ucm:ucma_init_qp_attr+160}
>       {thread_return+0}
>{:rdma_ucm:ucma_write+115}
>       {vfs_write+215} {sys_write+69}
>       {system_call+126}

The rdma_cm is being used, so alternate path information is not used.

>static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv,
>                               struct ib_qp_attr *qp_attr,
>                               int *qp_attr_mask)
>{
>        ........
>        if (cm_id_priv->id.lap_state == IB_CM_LAP_UNINIT) {
>                .....
>        } else {
>                *qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE;
>                qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; <-die

The rdma_cm should always send us through the if portion, and I would expect
alt_av to be NULL. Maybe the cm_id is corrupted..? Is there any chance that
the remote side is trying to load an alternate path? Getting the value of the
lap_state may help, to see if it's at least a valid lap_state value.
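
To make the branch under discussion concrete, here is a minimal user-space model of it. It is only a sketch: every `model_*` / `MODEL_*` name is a hypothetical stand-in invented for illustration, the elided parts of the real function are not reproduced, and the NULL check is not a proposed kernel fix. It only shows how the else branch reaches `alt_av.port` once `lap_state` is anything other than "uninit".

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types in the report. */
enum model_lap_state { MODEL_LAP_UNINIT, MODEL_LAP_IDLE, MODEL_MRA_LAP_SENT };

struct model_port { unsigned char port_num; };
struct model_av { struct model_port *port; };
struct model_cm_id {
	enum model_lap_state lap_state;
	struct model_av alt_av;
};
struct model_qp_attr { unsigned char alt_port_num; };

#define MODEL_QP_ALT_PATH       0x1
#define MODEL_QP_PATH_MIG_STATE 0x2

/*
 * Models only the if/else quoted in the thread. The real "if" branch sets
 * other attributes (elided in the post); this model just leaves the mask 0.
 */
static int model_init_rts_attr(const struct model_cm_id *id,
			       struct model_qp_attr *attr, int *mask)
{
	*mask = 0;
	if (id->lap_state == MODEL_LAP_UNINIT)
		return 0;	/* branch an rdma_cm user is expected to take */

	if (!id->alt_av.port)
		return -EINVAL;	/* in the report this is the NULL dereference */

	*mask = MODEL_QP_ALT_PATH | MODEL_QP_PATH_MIG_STATE;
	attr->alt_port_num = id->alt_av.port->port_num;
	return 0;
}
```

With `lap_state` left at the "uninit" value the alternate-path fields are never touched; any other state with an unset `alt_av.port` hits the guarded line, which matches the faulting address pattern described above.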

- Sean

From akepner at sgi.com Wed Aug 12 16:14:08 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 12 Aug 2009 16:14:08 -0700
Subject: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?
In-Reply-To:
References: <20090812225926.GD24786@sgi.com>
Message-ID: <20090812231408.GE24786@sgi.com>

On Wed, Aug 12, 2009 at 04:20:41PM -0700, Sean Hefty wrote:
> ....
> The rdma_cm should always send us through the if portion, and I would expect
> alt_av to be NULL. Maybe the cm_id is corrupted..? Is there any chance that
> the remote side is trying to load an alternate path? Getting the value of the
> lap_state may help, to see if it's at least a valid lap_state value.
>

Ah, I've got that - lap_state is IB_CM_MRA_LAP_SENT.

--
Arthur

From sean.hefty at intel.com Wed Aug 12 16:29:28 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 12 Aug 2009 16:29:28 -0700
Subject: [ofa-general] crash in cm_init_qp_rts_attr() - any ideas?
In-Reply-To: <20090812231408.GE24786@sgi.com>
References: <20090812225926.GD24786@sgi.com> <20090812231408.GE24786@sgi.com>
Message-ID:

>Ah, I've got that - lap_state is IB_CM_MRA_LAP_SENT.

Errr... not sure how that happened. I don't know if ofed 1.3 has this feature
or not, but can you cat:

/sys/class/infiniband_cm///cm_tx_msgs/lap

if it exists?

Are both sides using the rdma_cm to communicate? Does anything in the app
(either side) try to do something with alternate paths?

- Sean

From weiny2 at llnl.gov Wed Aug 12 16:53:20 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 12 Aug 2009 16:53:20 -0700
Subject: [ofa-general] [PATCH] infiniband-diags/libibnetdisc: remove all IBPANIC's and clean up error handling
Message-ID: <20090812165320.66ea08a5.weiny2@llnl.gov>

This patch applies after:

   libibnetdisc: fix potential memory leak of port object

Which I sent last week but I don't think has made it up stream.

Ira


From: Ira Weiny
Date: Wed, 12 Aug 2009 16:13:56 -0700
Subject: [PATCH] infiniband-diags/libibnetdisc: remove all IBPANIC's and clean up error handling


Signed-off-by: Ira Weiny
---
 infiniband-diags/libibnetdisc/src/chassis.c   | 124 ++++++++++++++++---------
 infiniband-diags/libibnetdisc/src/chassis.h   |   2 +-
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |  75 ++++++++++-----
 3 files changed, 132 insertions(+), 69 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c
index 76a02a6..efa4ed5 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.c
+++ b/infiniband-diags/libibnetdisc/src/chassis.c
@@ -323,7 +323,7 @@ char spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0
 char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
 /* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */

-static void get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport)
+static int get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport)
 {
 	ibnd_node_t *n = (ibnd_node_t *)node;

@@ -345,12 +345,14 @@ static void get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport)
 		n->ch_slotnum = spine4_slot_2_slb[lineport->portnum];
 		n->ch_anafanum = anafa_spine4_slot_2_slb[lineport->portnum];
 	} else {
-		IBPANIC("Unexpected node found: guid 0x%016" PRIx64,
-		node->node.guid);
+		IBND_ERROR("Unexpected node found: guid 0x%016" PRIx64,
+			node->node.guid);
+		return (-1);
 	}
+	return (0);
 }

-static void get_router_slot(struct ibnd_node *node, ibnd_port_t *spineport)
+static int get_router_slot(struct ibnd_node *node, ibnd_port_t *spineport)
 {
 	ibnd_node_t *n = (ibnd_node_t *)node;
 	uint64_t guessnum = 0;
@@ -385,12 +387,14 @@ static void get_router_slot(struct ibnd_node *node, ibnd_port_t *spineport)
 		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
 		n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum];
 	} else {
-		IBPANIC("Unexpected node found: guid 0x%016" PRIx64,
-		spineport->node->guid);
+		IBND_ERROR("Unexpected node found: guid 0x%016" PRIx64,
+			spineport->node->guid);
+		return (-1);
 	}
+	return (0);
 }

-static void get_slb_slot(ibnd_node_t *n, ibnd_port_t *spineport)
+static int get_slb_slot(ibnd_node_t *n, ibnd_port_t *spineport)
 {
 	n->ch_slot = LINE_CS;
 	if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) {
@@ -410,9 +414,11 @@ static void get_slb_slot(ibnd_node_t *n, ibnd_port_t *spineport)
 		n->ch_slotnum = line_slot_2_sfb4[spineport->portnum];
 		n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum];
 	} else {
-		IBPANIC("Unexpected node found: guid 0x%016" PRIx64,
-		spineport->node->guid);
+		IBND_ERROR("Unexpected node found: guid 0x%016" PRIx64,
+			spineport->node->guid);
+		return (-1);
 	}
+	return (0);
 }

 /* forward declare this */
@@ -422,7 +428,7 @@ static void voltaire_portmap(ibnd_port_t *port);
 	It could be optimized so, but time overhead is very small
 	and its only diag.util
 */
-static void fill_voltaire_chassis_record(struct ibnd_node *node)
+static int fill_voltaire_chassis_record(struct ibnd_node *node)
 {
 	ibnd_node_t *n = (ibnd_node_t *)node;
 	int p = 0;
@@ -430,7 +436,7 @@ static void fill_voltaire_chassis_record(struct ibnd_node *node)
 	struct ibnd_node *remnode = 0;

 	if (node->ch_found) /* somehow this node has already been passed */
-		return;
+		return (0);
 	node->ch_found = 1;

 	/* node is router only in case of using unique lid */
@@ -456,7 +462,8 @@ static void fill_voltaire_chassis_record(struct ibnd_node *node)
 			}
 			if (!n->ch_type)
 				/* we assume here that remoteport belongs to line */
-				get_sfb_slot(node, port->remoteport);
+				if (get_sfb_slot(node, port->remoteport))
+					return (-1);

 			/* we could break here, but need to find if more routers connected */
 		}
@@ -467,7 +474,8 @@ static void fill_voltaire_chassis_record(struct ibnd_node *node)
 			if (!port || port->portnum > 12 || !port->remoteport)
 				continue;
 			/* we assume here that remoteport belongs to spine */
-			get_slb_slot(n, port->remoteport);
+			if (get_slb_slot(n, port->remoteport))
+				return (-1);
 			break;
 		}
 	}
@@ -480,15 +488,17 @@ static void fill_voltaire_chassis_record(struct ibnd_node *node)
 		voltaire_portmap(port);
 	}

-	return;
+	return (0);
 }

 static int get_line_index(ibnd_node_t *node)
 {
 	int retval = 3 * (node->ch_slotnum - 1) + node->ch_anafanum;

-	if (retval > LINES_MAX_NUM || retval < 1)
-		IBPANIC("Internal error");
+	if (retval > LINES_MAX_NUM || retval < 1) {
+		IBND_ERROR("Internal error\n");
+		return (-1);
+	}
 	return retval;
 }

@@ -501,34 +511,44 @@ static int get_spine_index(ibnd_node_t *node)
 	else
 		retval = node->ch_slotnum;

-	if (retval > SPINES_MAX_NUM || retval < 1)
-		IBPANIC("Internal error");
+	if (retval > SPINES_MAX_NUM || retval < 1) {
+		IBND_ERROR("Internal error\n");
+		return (-1);
+	}
 	return retval;
 }

-static void insert_line_router(ibnd_node_t *node, ibnd_chassis_t *chassis)
+static int insert_line_router(ibnd_node_t *node, ibnd_chassis_t *chassis)
 {
 	int i = get_line_index(node);

+	if (i < 0)
+		return (i);
+
 	if (chassis->linenode[i])
-		return; /* already filled slot */
+		return (0); /* already filled slot */

 	chassis->linenode[i] = node;
 	node->chassis = chassis;
+	return (0);
 }

-static void insert_spine(ibnd_node_t *node, ibnd_chassis_t *chassis)
+static int insert_spine(ibnd_node_t *node, ibnd_chassis_t *chassis)
 {
 	int i = get_spine_index(node);

+	if (i < 0)
+		return (i);
+
 	if (chassis->spinenode[i])
-		return; /* already filled slot */
+		return (0); /* already filled slot */

 	chassis->spinenode[i] = node;
 	node->chassis = chassis;
+	return (0);
 }

-static void pass_on_lines_catch_spines(ibnd_chassis_t *chassis)
+static int pass_on_lines_catch_spines(ibnd_chassis_t *chassis)
 {
 	ibnd_node_t *node, *remnode;
 	ibnd_port_t *port;
@@ -549,12 +569,14 @@ static void pass_on_lines_catch_spines(ibnd_chassis_t *chassis)

 			if (!CONV_NODE_INTERNAL(remnode)->ch_found)
 				continue; /* some error - spine not initialized ? FIXME */
-			insert_spine(remnode, chassis);
+			if (insert_spine(remnode, chassis))
+				return (-1);
 		}
 	}
+	return (0);
 }

-static void pass_on_spines_catch_lines(ibnd_chassis_t *chassis)
+static int pass_on_spines_catch_lines(ibnd_chassis_t *chassis)
 {
 	ibnd_node_t *node, *remnode;
 	ibnd_port_t *port;
@@ -572,9 +594,11 @@ static void pass_on_spines_catch_lines(ibnd_chassis_t *chassis)

 			if (!CONV_NODE_INTERNAL(remnode)->ch_found)
 				continue; /* some error - line/router not initialized ? FIXME */
-			insert_line_router(remnode, chassis);
+			if (insert_line_router(remnode, chassis))
+				return (-1);
 		}
 	}
+	return (0);
 }

 /*
@@ -602,14 +626,15 @@ static void pass_on_spines_interpolate_chguid(ibnd_chassis_t *chassis)
 	in that chassis
 	chassis structure = structure of one standalone chassis
 */
-static void build_chassis(struct ibnd_node *node, ibnd_chassis_t *chassis)
+static int build_chassis(struct ibnd_node *node, ibnd_chassis_t *chassis)
 {
 	int p = 0;
 	struct ibnd_node *remnode = 0;
 	ibnd_port_t *port = 0;

 	/* we get here with node = chassis_spine */
-	insert_spine((ibnd_node_t *)node, chassis);
+	if (insert_spine((ibnd_node_t *)node, chassis))
+		return (-1);

 	/* loop: pass on all ports of node */
 	for (p = 1; p <= node->node.numports; p++ ) {
@@ -624,17 +649,23 @@ static void build_chassis(struct ibnd_node *node, ibnd_chassis_t *chassis)
 			insert_line_router(&(remnode->node), chassis);
 	}

-	pass_on_lines_catch_spines(chassis);
+	if (pass_on_lines_catch_spines(chassis))
+		return (-1);
 	/* this pass needed for to catch routers, since routers connected only */
 	/* to spines in slot 1 or 4 and we could miss them first time */
-	pass_on_spines_catch_lines(chassis);
+	if (pass_on_spines_catch_lines(chassis))
+		return (-1);

 	/* additional 2 passes needed for to overcome a problem of pure "in-chassis" */
 	/* connectivity - extra pass to ensure that all related chips/modules */
 	/* inserted into the chassis */
-	pass_on_lines_catch_spines(chassis);
-	pass_on_spines_catch_lines(chassis);
+	if (pass_on_lines_catch_spines(chassis))
+		return (-1);
+	if (pass_on_spines_catch_lines(chassis))
+		return (-1);
 	pass_on_spines_interpolate_chguid(chassis);
+
+	return (0);
 }

 /*========================================================*/
@@ -724,10 +755,12 @@ voltaire_portmap(ibnd_port_t *port)
 	port->ext_portnum = int2ext_map_slb8[chipnum][portnum];
 }

-static void add_chassis(struct ibnd_fabric *fabric)
+static int add_chassis(struct ibnd_fabric *fabric)
 {
-	if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t))))
-		IBPANIC("out of mem");
+	if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) {
+		IBND_ERROR("OOM: failed to allocate chassis object\n");
+		return (-1);
+	}

 	if (fabric->first_chassis == NULL) {
 		fabric->first_chassis = fabric->current_chassis;
@@ -736,6 +769,7 @@ static void add_chassis(struct ibnd_fabric *fabric)
 		fabric->last_chassis->next = fabric->current_chassis;
 		fabric->last_chassis = fabric->current_chassis;
 	}
+	return (0);
 }

 static void
@@ -756,10 +790,9 @@ add_node_to_chassis(ibnd_chassis_t *chassis, ibnd_node_t *node)
 	3. pass on non Voltaire nodes (SystemImageGUID based grouping)
 	4. now group non Voltaire nodes by SystemImageGUID
 	Returns:
-	Pointer to the first chassis in a NULL terminated list of chassis in
-	the fabric specified.
+	0 on success, -1 on failure
 */
-ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric)
+int group_nodes(struct ibnd_fabric *fabric)
 {
 	struct ibnd_node *node;
 	int dist;
@@ -776,7 +809,8 @@ ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric)
 	for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) {
 		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
 			if (mad_get_field(node->node.info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID)
-				fill_voltaire_chassis_record(node);
+				if (fill_voltaire_chassis_record(node))
+					return (-1);
 		}
 	}

@@ -791,9 +825,11 @@ ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric)
 				|| (node->node.chassis && node->node.chassis->chassisnum)
 				|| !is_spine(node))
 				continue;
-			add_chassis(fabric);
+			if (add_chassis(fabric))
+				return (-1);
 			fabric->current_chassis->chassisnum = ++chassisnum;
-			build_chassis(node, fabric->current_chassis);
+			if (build_chassis(node, fabric->current_chassis))
+				return (-1);
 		}
 	}

@@ -809,7 +845,8 @@ ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric)
 					chassis->nodecount++;
 				else {
 					/* Possible new chassis */
-					add_chassis(fabric);
+					if (add_chassis(fabric))
+						return (-1);
 					fabric->current_chassis->chassisguid =
 						get_chassisguid((ibnd_node_t *)node);
 					fabric->current_chassis->nodecount = 1;
@@ -842,5 +879,6 @@ ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric)
 		dist++;
 	}

-	return (fabric->first_chassis);
+	fabric->fabric.chassis = fabric->first_chassis;
+	return (0);
 }
diff --git a/infiniband-diags/libibnetdisc/src/chassis.h b/infiniband-diags/libibnetdisc/src/chassis.h
index 16dad49..ecb21c9 100644
--- a/infiniband-diags/libibnetdisc/src/chassis.h
+++ b/infiniband-diags/libibnetdisc/src/chassis.h
@@ -80,6 +80,6 @@
 enum ibnd_chassis_type { UNRESOLVED_CT, ISR9288_CT, ISR9096_CT, ISR2012_CT, ISR2004_CT };
 enum ibnd_chassis_slot_type { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS };

-ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric);
+int group_nodes(struct ibnd_fabric *fabric);

 #endif /* _CHASSIS_H_ */
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 27ae9f3..6c31300 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -121,12 +121,13 @@ static int
 query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
 	   struct ibnd_node *inode, struct ibnd_port *iport, ib_portid_t *portid)
 {
+	int rc = 0;
 	ibnd_node_t *node = &(inode->node);
 	ibnd_port_t *port = &(iport->port);
 	void *nd = inode->node.nodedesc;

-	if (query_node_info(ibmad_port, fabric, inode, portid))
-		return -1;
+	if ((rc = query_node_info(ibmad_port, fabric, inode, portid)) != 0)
+		return rc;

 	port->portnum = mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F);
 	port->guid = mad_get_field64(node->info, 0, IB_NODE_PORT_GUID_F);
@@ -169,8 +170,10 @@ query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
 static int
 add_port_to_dpath(ib_dr_path_t *path, int nextport)
 {
-	if (path->cnt+2 >= sizeof(path->p))
+	if (path->cnt+2 >= sizeof(path->p)) {
+		IBND_ERROR("DR path has grown too long\n");
 		return -1;
+	}
 	++path->cnt;
 	path->p[path->cnt] = (uint8_t) nextport;
 	return path->cnt;
@@ -186,8 +189,10 @@ extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f,
 	/* If we were LID routed we need to set up the drslid */
 	if (!f->selfportid.lid)
 		if (ib_resolve_self_via(&f->selfportid, NULL, NULL,
-				ibmad_port) < 0)
+				ibmad_port) < 0) {
+			IBND_ERROR("Failed to resolve self\n");
 			return -1;
+		}

 	portid->drpath.drslid = (uint16_t) f->selfportid.lid;
 	portid->drpath.drdlid = 0xFFFF;
@@ -413,8 +418,10 @@ create_node(struct ibnd_fabric *fabric, struct ibnd_node *temp, ib_portid_t *pat
 	struct ibnd_node *node;

 	node = malloc(sizeof(*node));
-	if (!node)
-		IBPANIC("OOM: node creation failed\n");
+	if (!node) {
+		IBND_ERROR("OOM: node creation failed\n");
+		return (NULL);
+	}

 	memcpy(node, temp, sizeof(*node));
 	node->node.dist = dist;
@@ -455,8 +462,10 @@ add_port_to_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd
 	}

 	port = malloc(sizeof(*port));
-	if (!port)
+	if (!port) {
+		IBND_ERROR("Failed to allocate port\n");
 		return NULL;
+	}

 	memcpy(port, temp, sizeof(*port));
 	port->port.node = (ibnd_node_t *)node;
@@ -489,6 +498,7 @@ get_remote_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,
 		struct ibnd_node *node, struct ibnd_port *port, ib_portid_t *path,
 		int portnum, int dist)
 {
+	int rc = 0;
 	struct ibnd_node node_buf;
 	struct ibnd_port port_buf;
 	struct ibnd_node *remotenode, *oldnode;
@@ -501,43 +511,51 @@ get_remote_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric,

 	if (mad_get_field(port->port.info, 0, IB_PORT_PHYS_STATE_F)
 			!= IB_PORT_PHYS_STATE_LINKUP)
-		return -1;
+		return 1; /* positive == non-fatal error */

 	if (extend_dpath(ibmad_port, fabric, path, portnum) < 0)
 		return -1;

 	if (query_node(ibmad_port, fabric, &node_buf, &port_buf, path)) {
-		IBND_DEBUG("NodeInfo on %s failed, skipping port",
+		IBND_ERROR("Query remote node (%s) failed, skipping port\n",
 			portid2str(path));
 		path->drpath.cnt--; /* restore path */
-		return -1;
+		return 1; /* positive == non-fatal error */
 	}

 	oldnode = find_existing_node(fabric, &node_buf);
 	if (oldnode)
 		remotenode = oldnode;
-	else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1)))
-		IBPANIC("no memory");
+	else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1))) {
+		rc = -1;
+		goto error;
+	}

 	oldport = find_existing_port_node(remotenode, &port_buf);
 	if (oldport) {
 		remoteport = oldport;
-	} else if (!(remoteport = add_port_to_node(fabric, remotenode, &port_buf)))
-		IBPANIC("no memory");
+	} else if (!(remoteport = add_port_to_node(fabric, remotenode,
+						   &port_buf))) {
+		IBND_ERROR("OOM failed to add port to node\n");
+		rc = -1;
+		goto error;
+	}

 	dump_endnode(path, oldnode ? "known remote" : "new remote",
 		     remotenode, remoteport);

 	link_ports(node, port, remotenode, remoteport);

+error:
 	path->drpath.cnt--; /* restore path */
-	return 0;
+	return (rc);
 }

 ibnd_fabric_t *
 ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 		     ib_portid_t *from, int hops)
 {
+	int rc = 0;
 	struct ibnd_fabric *fabric = NULL;
 	ib_portid_t my_portid = {0};
 	struct ibnd_node node_buf;
@@ -563,8 +581,10 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port,

 	fabric = malloc(sizeof(*fabric));

-	if (!fabric)
-		IBPANIC("OOM: failed to malloc ibnd_fabric_t\n");
+	if (!fabric) {
+		IBND_ERROR("OOM: failed to malloc ibnd_fabric_t\n");
+		return (NULL);
+	}

 	memset(fabric, 0, sizeof(*fabric));

@@ -586,11 +606,14 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port,

 	port = add_port_to_node(fabric, node, &port_buf);
 	if (!port)
-		IBPANIC("out of memory");
+		goto error;

-	if(get_remote_node(ibmad_port, fabric, node, port, from,
+	rc = get_remote_node(ibmad_port, fabric, node, port, from,
 				mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F),
-				0) < 0)
+				0);
+	if (rc < 0)
+		goto error;
+	if (rc > 0) /* non-fatal error, nothing more to be done */
 		return ((ibnd_fabric_t *)fabric);

 	for (dist = 0; dist <= max_hops; dist++) {
@@ -608,7 +631,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 					continue;

 				if (get_port_info(ibmad_port, fabric, &port_buf, i, path)) {
-					IBND_DEBUG("can't reach node %s port %d", portid2str(path), i);
+					IBND_ERROR("can't reach node %s port %d", portid2str(path), i);
 					continue;
 				}

@@ -618,7 +641,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port,

 				port = add_port_to_node(fabric, node, &port_buf);
 				if (!port)
-					IBPANIC("out of memory");
+					goto error;

 				/* If switch, set port GUID to node port GUID */
 				if (node->node.type == IB_NODE_SWITCH) {
@@ -626,13 +649,15 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 							0, IB_NODE_PORT_GUID_F);
 				}

-				get_remote_node(ibmad_port, fabric, node, port,
-						path, i, dist);
+				if (get_remote_node(ibmad_port, fabric, node, port,
+						path, i, dist) < 0)
+					goto error;
 			}
 		}
 	}

-	fabric->fabric.chassis = group_nodes(fabric);
+	if (group_nodes(fabric))
+		goto error;

 	return ((ibnd_fabric_t *)fabric);
 error:
--
1.5.4.5

From nashwath at gmail.com Wed Aug 12 23:41:37 2009
From: nashwath at gmail.com (Ashwath Narasimhan)
Date: Thu, 13 Aug 2009 02:41:37 -0400
Subject: [ofa-general] Manipulating Credits in Infiniband
In-Reply-To: <20090812023759.GA3060@tosh2egg.ca.sanfran.comcast.net>
References: <20090812023759.GA3060@tosh2egg.ca.sanfran.comcast.net>
Message-ID:

Dear Tom/all,

I understand the end-to-end credit-based flow control at the link layer,
where a 32-bit flow control packet is sent for each VL (with FCCL
and FCTBS fields), but I fail to understand where this scheme is implemented
in the driver (OFED linux-1.4 stack, hw-mthca). I can see a file with a
credit table mapped to different credit counts and another that computes
the AETH based on this credit table.

1. Is this the place where the flow control packets are formulated?

2. If yes, I don't see them computing this for each VL. Why? If no, is it a
mid-layer flow control?

3. And that's why I have this basic question: is the link layer
implemented as part of the OFED stack at all, or does it go into the hardware
HCA as firmware? As I understand it, the hardware vendor only provides verbs
to communicate with the HCA.

Pardon me if I am bombarding you all with a lot of questions. I am new to all
this and I am trying my best to understand the stack.

Thank you,
Ashwath

On Tue, Aug 11, 2009 at 10:37 PM, Nifty Tom Mitchell wrote:

> On Mon, Aug 10, 2009 at 12:11:22PM -0400, Ashwath Narasimhan wrote:
> >
> > I looked into the infiniband driver files. As I understand, in order to
> > limit the data rate we manipulate the credits on either ends.
Since > the > > number of credits available depends on the receiver's work receive > > queue size, I decided to limit the queue size to say 5 instead of 8192 > > (reference---> ipoib.h, IPOIB_MAX_QUEUE_SIZE to say 3 since my higher > > layer protocol is ipoib). I just want to confirm if I am doing the > > right thing? > > Data rate is not manipulated by credits. > Credits and queue sizes are different and have different purposes. > > Visit the Infiniband Trade Association web site and grab the IB > specifications to understand some of the hardware level parts. > > http://www.infinibandta.org/ > > InfiniBand offers credit based flow control and given the nature of > modern IB switches and processors a very small credit count can still > result in full data rate. Having said that flow control is the lowest > level throttle in the system. Reducing the credit count forces the > higher levels in the protocol stack to source or sink the data through > the hardware before any more can be delivered. Thus flow control can > simplify the implementation of higher level protocols. It can also be > used > to cost reduce or simplify hardware design (smaller hardware buffers). > > The IB specifications are way too long. Start with this FAQ. > > http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf > > The IB specification is way too full of optional features. A vendor may > have XYZ working fine and dandy on one card and since it is optional not > at all on another. > > The various queue sizes for the various protocols built on top of > IB establish transfer behavior in keeping with system interrupt, > system process time slice, system kernel activity loads and needs. > It is counter intuitive but in some cases small queues result in > more responsive and agile systems, especially in the presence of errors. > > Since there are often multiple protocols on the IB stack all protocols > will be impacted by credit tinkering. 
> Most vendors know their hardware
> so most drivers will have credit related code optimum.
>
> In the case of TCP/IP the interaction between IB bandwidths&MTU (IPoIB),
> ethernet bandwidth&MTU and even localhost (127.0.0.1) bandwidth&MTU can
> be "interesting" depending on host names, subnets, routing etc. TCP/IP
> has lots of tuning flags well above the IB driver. I see 500+ net.*
> sysctl knobs on this system.
>
> As you change things do make the changes on all the moving parts, benchmark
> and keep a log. Since there are multiple IB hardware vendors
> it is important to track hardware specifics. "lspci" is a good tool
> to gather chip info. With some cards you also need specifics about
> the active firmware.
>
> So go forth (RPN forever) and conquer.
>
>
> --
> T o m M i t c h e l l
> Found me a new hat, now what?
>
>

--
regards,
Ashwath
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From vlad at lists.openfabrics.org Thu Aug 13 03:05:38 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 13 Aug 2009 03:05:38 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090813-0200 daily build status
Message-ID: <20090813100538.76A16E28273@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters:

Passed:
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090813-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory 
`/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From sashak at voltaire.com Thu Aug 13 04:36:20 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Aug 2009 14:36:20 +0300
Subject: [ofa-general] Re: [PATCH] opensm/complib: account for nsec overflow in timeout values
In-Reply-To: <20090806183716.c08bbea3.weiny2@llnl.gov>
References: <20090806183716.c08bbea3.weiny2@llnl.gov>
Message-ID: <20090813113620.GV25501@me>

Hi Ira,

On 18:37 Thu 06 Aug , Ira Weiny wrote:
>
> From: Ira Weiny
> Date: Thu, 6 Aug 2009 18:31:46 -0700
> Subject: [PATCH] opensm/complib: account for nsec overflow in timeout values
>
>
> Signed-off-by: Ira Weiny
> ---
>  opensm/complib/cl_event.c | 8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/opensm/complib/cl_event.c b/opensm/complib/cl_event.c
> index d14b2f4..4bc8d37 100644
> --- a/opensm/complib/cl_event.c
> +++ b/opensm/complib/cl_event.c
> @@ -148,9 +148,11 @@ cl_event_wait_on(IN cl_event_t * const p_event,
>  	} else {
>  		/* Get the current time */
>  		if (gettimeofday(&curtime, NULL) == 0) {
> -			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000);
> -			timeout.tv_nsec =
> -			    (curtime.tv_usec + (wait_us % 1000000)) * 1000;
> +			uint32_t n_sec = (curtime.tv_usec + (wait_us % 1000000))

Do you really need fixed size (uint32_t) variable here?

> +						* 1000;
> +			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000)
> +						+ (n_sec % 1000000000);

Did you mean (n_sec / 1000000000)?

Sasha

> +			timeout.tv_nsec = n_sec % 1000000000;
>
>  			wait_ret = pthread_cond_timedwait(&p_event->condvar,
>  							  &p_event->mutex,
> --
> 1.5.4.5
>

From sashak at voltaire.com Thu Aug 13 04:41:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Aug 2009 14:41:04 +0300
Subject: [ofa-general] Re: [PATCH] libibnetdisc: fix potential memory leak of port object
In-Reply-To: <20090807090703.2b857dea.weiny2@llnl.gov>
References: <20090807090703.2b857dea.weiny2@llnl.gov>
Message-ID: <20090813114104.GW25501@me>

On 09:07 Fri 07 Aug , Ira Weiny wrote:
>
> From: Ira Weiny
> Date: Fri, 7 Aug 2009 09:05:44 -0700
> Subject: [PATCH] libibnetdisc: fix potential memory leak of port object
>
> NOTE: This moves the port allocation below the port array allocation
> failure rather than free the port allocation after port array
> allocation fails.
>
> Signed-off-by: Ira Weiny

Applied. Thanks.

Sasha

From sashak at voltaire.com Thu Aug 13 04:49:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Aug 2009 14:49:10 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/libibnetdisc: remove all IBPANIC's and clean up error handling
In-Reply-To: <20090812165320.66ea08a5.weiny2@llnl.gov>
References: <20090812165320.66ea08a5.weiny2@llnl.gov>
Message-ID: <20090813114910.GX25501@me>

On 16:53 Wed 12 Aug , Ira Weiny wrote:
> This patch applies after:
>
> libibnetdisc: fix potential memory leak of port object
>
> Which I sent last week but I don't think has made it up stream.
>
> Ira
>
>
> From: Ira Weiny
> Date: Wed, 12 Aug 2009 16:13:56 -0700
> Subject: [PATCH] infiniband-diags/libibnetdisc: remove all IBPANIC's and clean up error handling
>
>
> Signed-off-by: Ira Weiny

Applied. Thanks.

Sasha

From sashak at voltaire.com Thu Aug 13 05:28:11 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Aug 2009 15:28:11 +0300
Subject: [ofa-general] [PATCH] libibnetdisc/ibnetdisc.c: typo fix
In-Reply-To: <20090812165320.66ea08a5.weiny2@llnl.gov>
References: <20090812165320.66ea08a5.weiny2@llnl.gov>
Message-ID: <20090813122811.GY25501@me>

Fix statement completion typo ',' -> ';'.

Signed-off-by: Sasha Khapyorsky
---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 6c31300..b4bf52d 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -213,7 +213,7 @@ dump_endnode(ib_portid_t *path, char *prompt,
 	if (!show_progress)
 		return;

-	mad_dump_node_type(type, 64, &(node->node.type), sizeof(int)),
+	mad_dump_node_type(type, 64, &(node->node.type), sizeof(int));

 	printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n",
 		portid2str(path), prompt, type,
--
1.6.4

From sashak at voltaire.com Thu Aug 13 05:51:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Aug 2009 15:51:25 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_mcast_tbl.c: In osm_mcast_tbl_get_block, eliminate unneeded check
In-Reply-To: <20090812132247.GA15084@comcast.net>
References: <20090812132247.GA15084@comcast.net>
Message-ID: <20090813125125.GZ25501@me>

On 09:22 Wed 12 Aug , Hal Rosenstock wrote:
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From weiny2 at llnl.gov Thu Aug 13 09:06:02 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 13 Aug 2009 09:06:02 -0700
Subject: [ofa-general] Re: [PATCH v2] opensm/complib: account for nsec overflow in timeout values
In-Reply-To: <20090813113620.GV25501@me>
References: <20090806183716.c08bbea3.weiny2@llnl.gov> <20090813113620.GV25501@me>
Message-ID: <20090813090602.226b2695.weiny2@llnl.gov>

On Thu, 13 Aug 2009 14:36:20 +0300
Sasha Khapyorsky wrote:

> Hi Ira,
>
> On 18:37 Thu 06 Aug , Ira Weiny wrote:
> >
> > From: Ira Weiny
> > Date: Thu, 6 Aug 2009 18:31:46 -0700
> > Subject: [PATCH] opensm/complib: account for nsec overflow in timeout values
> >
> >
> > Signed-off-by: Ira Weiny
> > ---
> >  opensm/complib/cl_event.c | 8 +++++---
> >  1 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/opensm/complib/cl_event.c b/opensm/complib/cl_event.c
> > index d14b2f4..4bc8d37 100644
> > --- a/opensm/complib/cl_event.c
> > +++ b/opensm/complib/cl_event.c
> > @@ -148,9 +148,11 @@ cl_event_wait_on(IN cl_event_t * const p_event,
> >  	} else {
> >  		/* Get the current time */
> >  		if (gettimeofday(&curtime, NULL) == 0) {
> > -			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000);
> > -			timeout.tv_nsec =
> > -			    (curtime.tv_usec + (wait_us % 1000000)) * 1000;
> > +			uint32_t n_sec = (curtime.tv_usec + (wait_us % 1000000))
>
> Do you really need fixed size (uint32_t) variable here?

Well I need at least int32_t. I chose unsigned because we are not trying to go
back in time. I don't like leaving this as "int". As rare as it might be, a
compiler could choose 16 bits for an int and that is not big enough, right?

> > +						* 1000;
> > +			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000)
> > +						+ (n_sec % 1000000000);
>
> Did you mean (n_sec / 1000000000)?

yes...
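
The carry discussed in the review comments above can be sketched as a standalone helper. This is only an illustration under stated assumptions, not the complib code: deadline_after is a hypothetical name, and wait_us is assumed to be a 32-bit microsecond count as in the quoted hunk. The sub-second sum is at most (999999 + 999999) * 1000 = 1999998000 ns, so it fits in 32 bits; dividing it by 1000000000 gives the whole seconds to carry, and the remainder stays in tv_nsec.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/*
 * Hypothetical helper (not part of complib): build an absolute deadline
 * "now + wait_us" with tv_nsec normalized to the range [0, 999999999].
 */
static struct timespec deadline_after(struct timeval now, uint32_t wait_us)
{
	struct timespec ts;
	uint32_t n_sec;

	/* gettimeofday() reports tv_usec in [0, 999999]. */
	assert(now.tv_usec >= 0 && now.tv_usec < 1000000);

	/* Sub-second part in nanoseconds; at most 1999998000, fits in 32 bits. */
	n_sec = ((uint32_t)now.tv_usec + (wait_us % 1000000)) * 1000;

	/* Whole seconds of the wait, plus the carry out of the nanosecond sum. */
	ts.tv_sec = now.tv_sec + (wait_us / 1000000) + (n_sec / 1000000000);
	/* What remains after the carry is a valid tv_nsec. */
	ts.tv_nsec = n_sec % 1000000000;
	return ts;
}
```

For instance, with tv_sec = 100, tv_usec = 900000 and wait_us = 1200000 the helper yields tv_sec = 102 and tv_nsec = 100000000, while using '%' in place of '/' for the carry would add 100000000 to tv_sec.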
:-(

New patch below,
Ira


From: Ira Weiny
Date: Thu, 6 Aug 2009 18:31:46 -0700
Subject: [PATCH] opensm/complib: account for nsec overflow in timeout values


Signed-off-by: Ira Weiny
---
 opensm/complib/cl_event.c | 8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/opensm/complib/cl_event.c b/opensm/complib/cl_event.c
index d14b2f4..3f17262 100644
--- a/opensm/complib/cl_event.c
+++ b/opensm/complib/cl_event.c
@@ -148,9 +148,11 @@ cl_event_wait_on(IN cl_event_t * const p_event,
 	} else {
 		/* Get the current time */
 		if (gettimeofday(&curtime, NULL) == 0) {
-			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000);
-			timeout.tv_nsec =
-			    (curtime.tv_usec + (wait_us % 1000000)) * 1000;
+			uint32_t n_sec = (curtime.tv_usec + (wait_us % 1000000))
+						* 1000;
+			timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000)
+						+ (n_sec / 1000000000);
+			timeout.tv_nsec = n_sec % 1000000000;

 			wait_ret = pthread_cond_timedwait(&p_event->condvar,
 							  &p_event->mutex,
--
1.5.4.5

From sean.hefty at intel.com Thu Aug 13 11:19:00 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 13 Aug 2009 11:19:00 -0700
Subject: [ofa-general] will opensm respond to requests that do not originate from qp1
Message-ID: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>

Does anyone know off the top of their heads if opensm will respond correctly to
SA MADs that are not sent from QP1?

- Sean

From hal.rosenstock at gmail.com Thu Aug 13 12:41:11 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 13 Aug 2009 15:41:11 -0400
Subject: [ofa-general] will opensm respond to requests that do not originate from qp1
In-Reply-To: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
Message-ID:

On 8/13/09, Sean Hefty wrote:
>
> Does anyone know off the top of their heads if opensm will respond
> correctly to
> SA MADs that are not sent from QP1?

I don't have the code in front of me right now (I can validate tomorrow) but
don't think that should be a problem as for responses it just takes the
incoming source QP and uses that for the dest QP. Are you suspecting some
issue here ?

-- Hal

- Sean

From sean.hefty at intel.com Thu Aug 13 12:51:52 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 13 Aug 2009 12:51:52 -0700
Subject: [ofa-general] will opensm respond to requests that do not originate from qp1
In-Reply-To:
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com>
Message-ID: <6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com>

>I don't have the code in front of me right now (I can validate tomorrow) but
>don't think that should be a problem as for responses it just takes the
>incoming source QP and uses that for the dest QP. Are you suspecting some issue
>here ?

I just wanted to verify it before going too far down the path of sending MADs to
the SA on a different QP. I have nothing that indicates any issue. Not being
familiar with the opensm code, nothing jumped out at me as the place to look,
but between your reply and Jim's, I'm good assuming that it'll work. Thanks.

- Sean

From jgunthorpe at obsidianresearch.com Thu Aug 13 13:00:23 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 13 Aug 2009 14:00:23 -0600
Subject: [ofa-general] will opensm respond to requests that do not originate from qp1
In-Reply-To: <6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com> <6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com>
Message-ID: <20090813200023.GO16677@obsidianresearch.com>

On Thu, Aug 13, 2009 at 12:51:52PM -0700, Sean Hefty wrote:
> >I don't have the code in front of me right now (I can validate tomorrow) but
> >don't think that should be a problem as for responses it just takes the
> >incoming source QP and uses that for the dest QP. Are you suspecting some issue
> >here ?
>
> I just wanted to verify it before going too far down the path of sending MADs to
> the SA on a different QP. I have nothing that indicates any issue. Not being
> familiar with the opensm code, nothing jumped out at me as the place to look,
> but between your reply and Jim's, I'm good assuming that it'll work. Thanks.

Speaking of which, do we have an API to get the node's SM_Key for SA
packet construction?

Jason

From sean.hefty at intel.com Thu Aug 13 13:14:19 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 13 Aug 2009 13:14:19 -0700
Subject: [ofa-general] will opensm respond to requests that do not originate from qp1
In-Reply-To: <20090813200023.GO16677@obsidianresearch.com>
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com> <6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com> <20090813200023.GO16677@obsidianresearch.com>
Message-ID:

>Speaking of which, do we have an API to get the node's SM_Key for SA
>packet construction?

Not that I'm aware of. The ib-diags take the smkey as a command line option.

- Sean

From rdreier at cisco.com Thu Aug 13 13:25:04 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 13 Aug 2009 13:25:04 -0700
Subject: [ofa-general] Re: [PATCH v3] ib/core: fix for send multicast group send leave retry
In-Reply-To: <48A06A66.7070605@Voltaire.COM> (Yossi Etigin's message of "Mon, 11 Aug 2008 19:35:50 +0300")
References: <48A06A66.7070605@Voltaire.COM>
Message-ID:

thanks, applied at long long last.

From rdreier at cisco.com Thu Aug 13 13:29:18 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 13 Aug 2009 13:29:18 -0700
Subject: [ofa-general] [PATCH] uverbs: return ENOSYS for unimplemented commands (not EINVAL)
In-Reply-To: <200812021943.44732.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 2 Dec 2008 19:43:44 +0200")
References: <200812021943.44732.jackm@dev.mellanox.co.il>
Message-ID:

after meditating about this, I really think this is the right approach.
So I applied this patch.

From jgunthorpe at obsidianresearch.com Thu Aug 13 14:09:24 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 13 Aug 2009 15:09:24 -0600
Subject: [ofa-general] will opensm respond to requests that do not originate from qp1
In-Reply-To:
References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com> <6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com> <20090813200023.GO16677@obsidianresearch.com>
Message-ID: <20090813210924.GQ16677@obsidianresearch.com>

On Thu, Aug 13, 2009 at 01:14:19PM -0700, Sean Hefty wrote:
> >Speaking of which, do we have an API to get the node's SM_Key for SA
> >packet construction?
>
> Not that I'm aware of. The ib-diags take the smkey as a command line option.

Hmm, and the kernel wires it to zero. That's uncool.

So, any process that can create a QP can alter, say, the node's
multicast group membership. That's a bit of a security problem.

I admit though, I haven't been able to discern what the SM_Key should
be set to from the spec..

--
Jason Gunthorpe (780)4406067x832
Chief Technology Officer, Obsidian Research Corp
Edmonton, Canada

From weiny2 at llnl.gov Thu Aug 13 20:42:36 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 13 Aug 2009 20:42:36 -0700
Subject: [ofa-general] [PATCH 0/5] Further clean up of libibnetdisc interface
Message-ID: <20090813204236.36a161f3.weiny2@llnl.gov>

The following patches clean up the interface for the libibnetdisc.

The main reasons for these changes are 3 fold.

   1) there were some problems with having the structures split between
      internal and external data. (I thought I was being clever but it is
      not worth it.)

   2) I have, waiting in the wings, a multi-threaded implementation which
      further improves performance, especially on a fabric with problems
      (unresponsive nodes etc). These patches lay the groundwork for some
      of the changes I will need for this implementation.

   3) I would really like to get the interface changed before this goes out
      with OFED 1.5 or any infiniband-diags release.

I have split the patches up to chunks which I think are pretty manageable. Let
me know if there are issues or you prefer combined patches.
Ira -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab 925-423-8008 weiny2 at llnl.gov From weiny2 at llnl.gov Thu Aug 13 20:42:42 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 13 Aug 2009 20:42:42 -0700 Subject: [ofa-general] [PATCH 1/5] libibnetdisc: make all fields of ibnd_node_t public Message-ID: <20090813204242.b659d8f5.weiny2@llnl.gov> From: Ira Weiny Date: Tue, 11 Aug 2009 15:15:21 -0700 Subject: [PATCH] libibnetdisc: make all fields of ibnd_node_t public Signed-off-by: Ira Weiny --- .../libibnetdisc/include/infiniband/ibnetdisc.h | 12 +- infiniband-diags/libibnetdisc/src/chassis.c | 147 ++++++++--------- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 173 ++++++++++---------- infiniband-diags/libibnetdisc/src/internal.h | 22 +-- 4 files changed, 166 insertions(+), 188 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index 121709d..e7f5f6a 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -45,8 +45,8 @@ struct port; /* forward declare */ /** ========================================================================= * Node */ -typedef struct node { - struct node *next; /* all node list in fabric */ +typedef struct ibnd_node { + struct ibnd_node *next; /* all node list in fabric */ ib_portid_t path_portid; /* path from "from_node" */ int dist; /* num of hops from "from_node" */ @@ -72,12 +72,18 @@ typedef struct node { items MAY BE NULL! 
(ie 0 == switches only) */ /* chassis info */ - struct node *next_chassis_node; /* next node in ibnd_chassis_t->nodes */ + struct ibnd_node *next_chassis_node; /* next node in ibnd_chassis_t->nodes */ struct chassis *chassis; /* if != NULL the chassis this node belongs to */ unsigned char ch_type; unsigned char ch_anafanum; unsigned char ch_slotnum; unsigned char ch_slot; + + /* internal use only */ + unsigned char ch_found; + struct ibnd_node *htnext; /* hash table list */ + struct ibnd_node *dnext; /* nodesdist next */ + struct ibnd_node *type_next; /* next based on type */ } ibnd_node_t; /** ========================================================================= diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c index 120b4b6..0dd259a 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.c +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -239,68 +239,68 @@ uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum) return 0; } -static int is_router(struct ibnd_node *n) +static int is_router(ibnd_node_t * n) { - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); return (devid == VTR_DEVID_IB_FC_ROUTER || devid == VTR_DEVID_IB_IP_ROUTER); } -static int is_spine_9096(struct ibnd_node *n) +static int is_spine_9096(ibnd_node_t * n) { - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); return (devid == VTR_DEVID_SFB4 || devid == VTR_DEVID_SFB4_DDR); } -static int is_spine_9288(struct ibnd_node *n) +static int is_spine_9288(ibnd_node_t * n) { - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); return (devid == VTR_DEVID_SFB12 || devid == VTR_DEVID_SFB12_DDR); } -static int is_spine_2004(struct ibnd_node *n) +static int is_spine_2004(ibnd_node_t * n) { - 
uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); return (devid == VTR_DEVID_SFB2004); } -static int is_spine_2012(struct ibnd_node *n) +static int is_spine_2012(ibnd_node_t * n) { - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); return (devid == VTR_DEVID_SFB2012); } -static int is_spine(struct ibnd_node *n) +static int is_spine(ibnd_node_t * n) { return (is_spine_9096(n) || is_spine_9288(n) || is_spine_2004(n) || is_spine_2012(n)); } -static int is_line_24(struct ibnd_node *n) +static int is_line_24(ibnd_node_t * n) { - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); - return (devid == VTR_DEVID_SLB24 || devid == VTR_DEVID_SLB24_DDR || - devid == VTR_DEVID_SRB2004); + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); + return (devid == VTR_DEVID_SLB24 || + devid == VTR_DEVID_SLB24_DDR || devid == VTR_DEVID_SRB2004); } -static int is_line_8(struct ibnd_node *n) +static int is_line_8(ibnd_node_t * n) { - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); return (devid == VTR_DEVID_SLB8); } -static int is_line_2024(struct ibnd_node *n) +static int is_line_2024(ibnd_node_t * n) { - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); return (devid == VTR_DEVID_SLB2024); } -static int is_line(struct ibnd_node *n) +static int is_line(ibnd_node_t * n) { return (is_line_24(n) || is_line_8(n) || is_line_2024(n)); } -int is_chassis_switch(struct ibnd_node *n) +int is_chassis_switch(ibnd_node_t * n) { return (is_spine(n) || is_line(n)); } @@ -349,7 +349,7 @@ char anafa_spine4_slot_2_slb[25] = { /* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ -static int get_sfb_slot(struct 
ibnd_node *node, ibnd_port_t * lineport) +static int get_sfb_slot(ibnd_node_t * node, ibnd_port_t * lineport) { ibnd_node_t *n = (ibnd_node_t *) node; @@ -372,25 +372,24 @@ static int get_sfb_slot(struct ibnd_node *node, ibnd_port_t * lineport) n->ch_anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; } else { IBND_ERROR("Unexpected node found: guid 0x%016" PRIx64, - node->node.guid); + node->guid); return (-1); } return (0); } -static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport) +static int get_router_slot(ibnd_node_t * n, ibnd_port_t * spineport) { - ibnd_node_t *n = (ibnd_node_t *) node; uint64_t guessnum = 0; - node->ch_found = 1; + n->ch_found = 1; n->ch_slot = SRBD_CS; - if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) { + if (is_spine_9096(spineport->node)) { n->ch_type = ISR9096_CT; n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; - } else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) { + } else if (is_spine_9288(spineport->node)) { n->ch_type = ISR9288_CT; n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; /* this is a smart guess based on nodeguids order on sFB-12 module */ @@ -399,7 +398,7 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport) /* module 2 <--> remote anafa 2 */ /* module 3 <--> remote anafa 1 */ n->ch_anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2)); - } else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) { + } else if (is_spine_2012(spineport->node)) { n->ch_type = ISR2012_CT; n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; /* this is a smart guess based on nodeguids order on sFB-12 module */ @@ -408,7 +407,7 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport) // module 2 <--> remote anafa 2 // module 3 <--> remote anafa 1 n->ch_anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 
3 : 2)); - } else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) { + } else if (is_spine_2004(spineport->node)) { n->ch_type = ISR2004_CT; n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; @@ -423,19 +422,19 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport) static int get_slb_slot(ibnd_node_t * n, ibnd_port_t * spineport) { n->ch_slot = LINE_CS; - if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) { + if (is_spine_9096(spineport->node)) { n->ch_type = ISR9096_CT; n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; - } else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) { + } else if (is_spine_9288(spineport->node)) { n->ch_type = ISR9288_CT; n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; - } else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) { + } else if (is_spine_2012(spineport->node)) { n->ch_type = ISR2012_CT; n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; - } else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) { + } else if (is_spine_2004(spineport->node)) { n->ch_type = ISR2004_CT; n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; @@ -454,12 +453,11 @@ static void voltaire_portmap(ibnd_port_t * port); It could be optimized so, but time overhead is very small and its only diag.util */ -static int fill_voltaire_chassis_record(struct ibnd_node *node) +static int fill_voltaire_chassis_record(ibnd_node_t * node) { - ibnd_node_t *n = (ibnd_node_t *) node; int p = 0; ibnd_port_t *port; - struct ibnd_node *remnode = 0; + ibnd_node_t *remnode = 0; if (node->ch_found) /* somehow this node has already been passed */ return (0); @@ -470,25 +468,23 @@ static int 
fill_voltaire_chassis_record(struct ibnd_node *node) /* in such case node->ports is actually a requested port... */ if (is_router(node)) { /* find the remote node */ - for (p = 1; p <= node->node.numports; p++) { - port = node->node.ports[p]; - if (port && - is_spine(CONV_NODE_INTERNAL - (port->remoteport->node))) + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (port && is_spine(port->remoteport->node)) get_router_slot(node, port->remoteport); } } else if (is_spine(node)) { - for (p = 1; p <= node->node.numports; p++) { - port = node->node.ports[p]; + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; if (!port || !port->remoteport) continue; - remnode = CONV_NODE_INTERNAL(port->remoteport->node); - if (remnode->node.type != IB_NODE_SWITCH) { + remnode = port->remoteport->node; + if (remnode->type != IB_NODE_SWITCH) { if (!remnode->ch_found) get_router_slot(remnode, port); continue; } - if (!n->ch_type) + if (!node->ch_type) /* we assume here that remoteport belongs to line */ if (get_sfb_slot(node, port->remoteport)) return (-1); @@ -497,20 +493,20 @@ static int fill_voltaire_chassis_record(struct ibnd_node *node) } } else if (is_line(node)) { - for (p = 1; p <= node->node.numports; p++) { - port = node->node.ports[p]; + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; if (!port || port->portnum > 12 || !port->remoteport) continue; /* we assume here that remoteport belongs to spine */ - if (get_slb_slot(n, port->remoteport)) + if (get_slb_slot(node, port->remoteport)) return (-1); break; } } /* for each port of this node, map external ports */ - for (p = 1; p <= node->node.numports; p++) { - port = node->node.ports[p]; + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; if (!port) continue; voltaire_portmap(port); @@ -534,8 +530,7 @@ static int get_spine_index(ibnd_node_t * node) { int retval; - if (is_spine_9288(CONV_NODE_INTERNAL(node)) - || is_spine_2012(CONV_NODE_INTERNAL(node))) + if 
(is_spine_9288(node) || is_spine_2012(node)) retval = 3 * (node->ch_slotnum - 1) + node->ch_anafanum; else retval = node->ch_slotnum; @@ -586,7 +581,7 @@ static int pass_on_lines_catch_spines(ibnd_chassis_t * chassis) for (i = 1; i <= LINES_MAX_NUM; i++) { node = chassis->linenode[i]; - if (!(node && is_line(CONV_NODE_INTERNAL(node)))) + if (!(node && is_line(node))) continue; /* empty slot or router */ for (p = 1; p <= node->numports; p++) { @@ -596,7 +591,7 @@ static int pass_on_lines_catch_spines(ibnd_chassis_t * chassis) remnode = port->remoteport->node; - if (!CONV_NODE_INTERNAL(remnode)->ch_found) + if (!remnode->ch_found) continue; /* some error - spine not initialized ? FIXME */ if (insert_spine(remnode, chassis)) return (-1); @@ -621,7 +616,7 @@ static int pass_on_spines_catch_lines(ibnd_chassis_t * chassis) continue; remnode = port->remoteport->node; - if (!CONV_NODE_INTERNAL(remnode)->ch_found) + if (!remnode->ch_found) continue; /* some error - line/router not initialized ? FIXME */ if (insert_line_router(remnode, chassis)) return (-1); @@ -655,10 +650,10 @@ static void pass_on_spines_interpolate_chguid(ibnd_chassis_t * chassis) in that chassis chassis structure = structure of one standalone chassis */ -static int build_chassis(struct ibnd_node *node, ibnd_chassis_t * chassis) +static int build_chassis(ibnd_node_t * node, ibnd_chassis_t * chassis) { int p = 0; - struct ibnd_node *remnode = 0; + ibnd_node_t *remnode = 0; ibnd_port_t *port = 0; /* we get here with node = chassis_spine */ @@ -666,16 +661,16 @@ static int build_chassis(struct ibnd_node *node, ibnd_chassis_t * chassis) return (-1); /* loop: pass on all ports of node */ - for (p = 1; p <= node->node.numports; p++) { - port = node->node.ports[p]; + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; if (!port || !port->remoteport) continue; - remnode = CONV_NODE_INTERNAL(port->remoteport->node); + remnode = port->remoteport->node; if (!remnode->ch_found) continue; /* some error - 
line or router not initialized ? FIXME */ - insert_line_router(&(remnode->node), chassis); + insert_line_router(remnode, chassis); } if (pass_on_lines_catch_spines(chassis)) @@ -764,13 +759,11 @@ int int2ext_map_slb2024[2][25] = { /* map internal ports to external ports if appropriate */ static void voltaire_portmap(ibnd_port_t * port) { - struct ibnd_node *n = CONV_NODE_INTERNAL(port->node); int portnum = port->portnum; int chipnum = 0; ibnd_node_t *node = port->node; - if (!n->ch_found || !is_line(CONV_NODE_INTERNAL(node)) - || (portnum < 13 || portnum > 24)) { + if (!node->ch_found || !is_line(node) || (portnum < 13 || portnum > 24)) { port->ext_portnum = 0; return; } @@ -782,9 +775,9 @@ static void voltaire_portmap(ibnd_port_t * port) chipnum = port->node->ch_anafanum - 1; - if (is_line_24(CONV_NODE_INTERNAL(node))) + if (is_line_24(node)) port->ext_portnum = int2ext_map_slb24[chipnum][portnum]; - else if (is_line_2024(CONV_NODE_INTERNAL(node))) + else if (is_line_2024(node)) port->ext_portnum = int2ext_map_slb2024[chipnum][portnum]; else port->ext_portnum = int2ext_map_slb8[chipnum][portnum]; @@ -828,7 +821,7 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node) */ int group_nodes(struct ibnd_fabric *fabric) { - struct ibnd_node *node; + ibnd_node_t *node; int dist; int chassisnum = 0; ibnd_chassis_t *chassis; @@ -842,7 +835,7 @@ int group_nodes(struct ibnd_fabric *fabric) /* not very efficient but clear code so... 
*/ for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { for (node = fabric->nodesdist[dist]; node; node = node->dnext) { - if (mad_get_field(node->node.info, 0, + if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) if (fill_voltaire_chassis_record(node)) return (-1); @@ -853,13 +846,11 @@ int group_nodes(struct ibnd_fabric *fabric) /* algorithm: catch spine and find all surrounding nodes */ for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { for (node = fabric->nodesdist[dist]; node; node = node->dnext) { - if (mad_get_field(node->node.info, 0, + if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) != VTR_VENDOR_ID) continue; - //if (!node->node.chrecord || node->node.chrecord->chassisnum || !is_spine(node)) if (!node->ch_found - || (node->node.chassis - && node->node.chassis->chassisnum) + || (node->chassis && node->chassis->chassisnum) || !is_spine(node)) continue; if (add_chassis(fabric)) @@ -874,10 +865,10 @@ int group_nodes(struct ibnd_fabric *fabric) /* grouped by common SystemImageGUID */ for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { for (node = fabric->nodesdist[dist]; node; node = node->dnext) { - if (mad_get_field(node->node.info, 0, + if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) continue; - if (mad_get_field64(node->node.info, 0, + if (mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F)) { chassis = find_chassisguid(fabric, @@ -901,10 +892,10 @@ int group_nodes(struct ibnd_fabric *fabric) /* (defined as chassis->nodecount > 1) */ for (dist = 0; dist <= MAXHOPS;) { for (node = fabric->nodesdist[dist]; node; node = node->dnext) { - if (mad_get_field(node->node.info, 0, + if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) continue; - if (mad_get_field64(node->node.info, 0, + if (mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F)) { chassis = find_chassisguid(fabric, diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c 
b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index b33be8d..b883d4a 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -98,18 +98,17 @@ static int get_port_info(struct ibmad_port *ibmad_port, * Returns -1 if error. */ static int query_node_info(struct ibmad_port *ibmad_port, - struct ibnd_fabric *fabric, struct ibnd_node *node, + struct ibnd_fabric *fabric, ibnd_node_t * node, ib_portid_t * portid) { - if (!smp_query_via(&(node->node.info), portid, IB_ATTR_NODE_INFO, 0, 0, + if (!smp_query_via(&(node->info), portid, IB_ATTR_NODE_INFO, 0, 0, ibmad_port)) return -1; /* decode just a couple of fields for quicker reference. */ - mad_decode_field(node->node.info, IB_NODE_GUID_F, &(node->node.guid)); - mad_decode_field(node->node.info, IB_NODE_TYPE_F, &(node->node.type)); - mad_decode_field(node->node.info, IB_NODE_NPORTS_F, - &(node->node.numports)); + mad_decode_field(node->info, IB_NODE_GUID_F, &(node->guid)); + mad_decode_field(node->info, IB_NODE_TYPE_F, &(node->type)); + mad_decode_field(node->info, IB_NODE_NPORTS_F, &(node->numports)); return (0); } @@ -118,15 +117,14 @@ static int query_node_info(struct ibmad_port *ibmad_port, * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. 
*/ static int query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric, - struct ibnd_node *inode, struct ibnd_port *iport, + ibnd_node_t * node, struct ibnd_port *iport, ib_portid_t * portid) { int rc = 0; - ibnd_node_t *node = &(inode->node); ibnd_port_t *port = &(iport->port); - void *nd = inode->node.nodedesc; + void *nd = node->nodedesc; - if ((rc = query_node_info(ibmad_port, fabric, inode, portid)) != 0) + if ((rc = query_node_info(ibmad_port, fabric, node, portid)) != 0) return rc; port->portnum = mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F); @@ -204,30 +202,30 @@ static int extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f, } static void dump_endnode(ib_portid_t * path, char *prompt, - struct ibnd_node *node, struct ibnd_port *port) + ibnd_node_t * node, struct ibnd_port *port) { char type[64]; if (!show_progress) return; - mad_dump_node_type(type, 64, &(node->node.type), sizeof(int)); - - printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n", - portid2str(path), prompt, type, node->node.guid, - node->node.type == IB_NODE_SWITCH ? 0 : port->port.portnum, - port->port.base_lid, - port->port.base_lid + (1 << port->port.lmc) - 1, - node->node.nodedesc); + mad_dump_node_type(type, 64, &(node->type), sizeof(int)), + printf("%s -> %s %s {%016" PRIx64 + "} portnum %d base lid %d-%d\"%s\"\n", portid2str(path), + prompt, type, node->guid, + node->type == IB_NODE_SWITCH ? 
0 : port->port.portnum, + port->port.base_lid, + port->port.base_lid + (1 << port->port.lmc) - 1, + node->nodedesc); } -static struct ibnd_node *find_existing_node(struct ibnd_fabric *fabric, - struct ibnd_node *new) +static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric, + ibnd_node_t * new) { - int hash = HASHGUID(new->node.guid) % HTSZ; - struct ibnd_node *node; + int hash = HASHGUID(new->guid) % HTSZ; + ibnd_node_t *node; for (node = fabric->nodestbl[hash]; node; node = node->htnext) - if (node->node.guid == new->node.guid) + if (node->guid == new->guid) return node; return NULL; @@ -237,7 +235,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid) { struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); int hash = HASHGUID(guid) % HTSZ; - struct ibnd_node *node; + ibnd_node_t *node; if (!fabric) { IBND_DEBUG("fabric parameter NULL\n"); @@ -245,7 +243,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid) } for (node = f->nodestbl[hash]; node; node = node->htnext) - if (node->node.guid == guid) + if (node->guid == guid) return (ibnd_node_t *) node; return NULL; @@ -273,7 +271,6 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, void *nd = node->nodedesc; int p = 0; struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); - struct ibnd_node *n = CONV_NODE_INTERNAL(node); if (_check_ibmad_port(ibmad_port) < 0) return (NULL); @@ -288,36 +285,36 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, return (NULL); } - if (query_node_info(ibmad_port, f, n, &(n->node.path_portid))) + if (query_node_info(ibmad_port, f, node, &(node->path_portid))) return (NULL); - if (!smp_query_via(nd, &(n->node.path_portid), IB_ATTR_NODE_DESC, 0, 0, + if (!smp_query_via(nd, &(node->path_portid), IB_ATTR_NODE_DESC, 0, 0, ibmad_port)) return (NULL); /* update all the port info's */ - for (p = 1; p >= n->node.numports; p++) { - get_port_info(ibmad_port, f, - CONV_PORT_INTERNAL(n->node.ports[p]), p, - 
&(n->node.path_portid)); + for (p = 1; p >= node->numports; p++) { + get_port_info(ibmad_port, f, CONV_PORT_INTERNAL(node->ports[p]), + p, &(node->path_portid)); } - if (n->node.type != IB_NODE_SWITCH) + if (node->type != IB_NODE_SWITCH) goto done; - if (!smp_query_via(portinfo_port0, &(n->node.path_portid), - IB_ATTR_PORT_INFO, 0, 0, ibmad_port)) + if (!smp_query_via + (portinfo_port0, &(node->path_portid), IB_ATTR_PORT_INFO, 0, 0, + ibmad_port)) return (NULL); - n->node.smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F); - n->node.smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F); + node->smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F); + node->smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F); - if (!smp_query_via(node->switchinfo, &(n->node.path_portid), + if (!smp_query_via(node->switchinfo, &(node->path_portid), IB_ATTR_SWITCH_INFO, 0, 0, ibmad_port)) node->smaenhsp0 = 0; /* assume base SP0 */ else mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F, - &n->node.smaenhsp0); + &node->smaenhsp0); done: return (node); @@ -358,10 +355,9 @@ ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str) return (rc); } -static void add_to_nodeguid_hash(struct ibnd_node *node, - struct ibnd_node *hash[]) +static void add_to_nodeguid_hash(ibnd_node_t * node, ibnd_node_t * hash[]) { - int hash_idx = HASHGUID(node->node.guid) % HTSZ; + int hash_idx = HASHGUID(node->guid) % HTSZ; node->htnext = hash[hash_idx]; hash[hash_idx] = node; @@ -376,9 +372,9 @@ static void add_to_portguid_hash(struct ibnd_port *port, hash[hash_idx] = port; } -static void add_to_type_list(struct ibnd_node *node, struct ibnd_fabric *fabric) +static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric) { - switch (node->node.type) { + switch (node->type) { case IB_NODE_CA: node->type_next = fabric->ch_adapters; fabric->ch_adapters = node; @@ -394,21 +390,21 @@ static void add_to_type_list(struct ibnd_node *node, struct ibnd_fabric *fabric) } } 
-static void add_to_nodedist(struct ibnd_node *node, struct ibnd_fabric *fabric) +static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric *fabric) { - int dist = node->node.dist; - if (node->node.type != IB_NODE_SWITCH) + int dist = node->dist; + if (node->type != IB_NODE_SWITCH) dist = MAXHOPS; /* special Ca list */ node->dnext = fabric->nodesdist[dist]; fabric->nodesdist[dist] = node; } -static struct ibnd_node *create_node(struct ibnd_fabric *fabric, - struct ibnd_node *temp, ib_portid_t * path, - int dist) +static ibnd_node_t *create_node(struct ibnd_fabric *fabric, + ibnd_node_t * temp, ib_portid_t * path, + int dist) { - struct ibnd_node *node; + ibnd_node_t *node; node = malloc(sizeof(*node)); if (!node) { @@ -417,13 +413,13 @@ static struct ibnd_node *create_node(struct ibnd_fabric *fabric, } memcpy(node, temp, sizeof(*node)); - node->node.dist = dist; - node->node.path_portid = *path; + node->dist = dist; + node->path_portid = *path; add_to_nodeguid_hash(node, fabric->nodestbl); /* add this to the all nodes list */ - node->node.next = fabric->fabric.nodes; + node->next = fabric->fabric.nodes; fabric->fabric.nodes = (ibnd_node_t *) node; add_to_type_list(node, fabric); @@ -432,26 +428,24 @@ static struct ibnd_node *create_node(struct ibnd_fabric *fabric, return node; } -static struct ibnd_port *find_existing_port_node(struct ibnd_node *node, +static struct ibnd_port *find_existing_port_node(ibnd_node_t * node, struct ibnd_port *port) { - if (port->port.portnum > node->node.numports - || node->node.ports == NULL) + if (port->port.portnum > node->numports || node->ports == NULL) return (NULL); - return (CONV_PORT_INTERNAL(node->node.ports[port->port.portnum])); + return (CONV_PORT_INTERNAL(node->ports[port->port.portnum])); } static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric, - struct ibnd_node *node, + ibnd_node_t * node, struct ibnd_port *temp) { struct ibnd_port *port; - if (node->node.ports == NULL) { - node->node.ports = - 
calloc(sizeof(*node->node.ports), node->node.numports + 1); - if (!node->node.ports) { + if (node->ports == NULL) { + node->ports = calloc(sizeof(*node->ports), node->numports + 1); + if (!node->ports) { IBND_ERROR("Failed to allocate the ports array\n"); return (NULL); } @@ -467,20 +461,19 @@ static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric, port->port.node = (ibnd_node_t *) node; port->port.ext_portnum = 0; - node->node.ports[temp->port.portnum] = (ibnd_port_t *) port; + node->ports[temp->port.portnum] = (ibnd_port_t *) port; add_to_portguid_hash(port, fabric->portstbl); return port; } -static void link_ports(struct ibnd_node *node, struct ibnd_port *port, - struct ibnd_node *remotenode, - struct ibnd_port *remoteport) +static void link_ports(ibnd_node_t * node, struct ibnd_port *port, + ibnd_node_t * remotenode, struct ibnd_port *remoteport) { IBND_DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 - " %p->%p:%u\n", node->node.guid, node, port, - port->port.portnum, remotenode->node.guid, remotenode, - remoteport, remoteport->port.portnum); + " %p->%p:%u\n", node->guid, node, port, port->port.portnum, + remotenode->guid, remotenode, remoteport, + remoteport->port.portnum); if (port->port.remoteport) port->port.remoteport->remoteport = NULL; if (remoteport->port.remoteport) @@ -490,14 +483,14 @@ static void link_ports(struct ibnd_node *node, struct ibnd_port *port, } static int get_remote_node(struct ibmad_port *ibmad_port, - struct ibnd_fabric *fabric, struct ibnd_node *node, + struct ibnd_fabric *fabric, ibnd_node_t * node, struct ibnd_port *port, ib_portid_t * path, int portnum, int dist) { int rc = 0; - struct ibnd_node node_buf; + ibnd_node_t node_buf; struct ibnd_port port_buf; - struct ibnd_node *remotenode, *oldnode; + ibnd_node_t *remotenode, *oldnode; struct ibnd_port *remoteport, *oldport; memset(&node_buf, 0, sizeof(node_buf)); @@ -554,9 +547,9 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, int rc = 0; 
struct ibnd_fabric *fabric = NULL; ib_portid_t my_portid = { 0 }; - struct ibnd_node node_buf; + ibnd_node_t node_buf; struct ibnd_port port_buf; - struct ibnd_node *node; + ibnd_node_t *node; struct ibnd_port *port; int i; int dist = 0; @@ -605,7 +598,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, goto error; rc = get_remote_node(ibmad_port, fabric, node, port, from, - mad_get_field(node->node.info, 0, + mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F), 0); if (rc < 0) goto error; @@ -616,13 +609,13 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, for (node = fabric->nodesdist[dist]; node; node = node->dnext) { - path = &node->node.path_portid; + path = &node->path_portid; IBND_DEBUG("dist %d node %p\n", dist, node); dump_endnode(path, "processing", node, port); - for (i = 1; i <= node->node.numports; i++) { - if (i == mad_get_field(node->node.info, 0, + for (i = 1; i <= node->numports; i++) { + if (i == mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F)) continue; @@ -644,9 +637,9 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, goto error; /* If switch, set port GUID to node port GUID */ - if (node->node.type == IB_NODE_SWITCH) { + if (node->type == IB_NODE_SWITCH) { port->port.guid = - mad_get_field64(node->node.info, 0, + mad_get_field64(node->info, 0, IB_NODE_PORT_GUID_F); } @@ -666,14 +659,14 @@ error: return (NULL); } -static void destroy_node(struct ibnd_node *node) +static void destroy_node(ibnd_node_t * node) { int p = 0; - for (p = 0; p <= node->node.numports; p++) { - free(node->node.ports[p]); + for (p = 0; p <= node->numports; p++) { + free(node->ports[p]); } - free(node->node.ports); + free(node->ports); free(node); } @@ -681,8 +674,8 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric) { struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); int dist = 0; - struct ibnd_node *node = NULL; - struct ibnd_node *next = NULL; + ibnd_node_t *node = NULL; + ibnd_node_t *next = NULL; 
ibnd_chassis_t *ch, *ch_next; if (!fabric) @@ -747,8 +740,8 @@ void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func, int node_type, void *user_data) { struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); - struct ibnd_node *list = NULL; - struct ibnd_node *cur = NULL; + ibnd_node_t *list = NULL; + ibnd_node_t *cur = NULL; if (!fabric) { IBND_DEBUG("fabric parameter NULL\n"); diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index 38555a0..449bd70 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -49,18 +49,6 @@ #define IBND_ERROR(fmt, ...) \ fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__) -struct ibnd_node { - /* This member MUST BE FIRST */ - ibnd_node_t node; - - /* internal use only */ - unsigned char ch_found; - struct ibnd_node *htnext; /* hash table list */ - struct ibnd_node *dnext; /* nodesdist next */ - struct ibnd_node *type_next; /* next based on type */ -}; -#define CONV_NODE_INTERNAL(node) ((struct ibnd_node *)node) - struct ibnd_port { /* This member MUST BE FIRST */ ibnd_port_t port; @@ -79,15 +67,15 @@ struct ibnd_fabric { ibnd_fabric_t fabric; /* internal use only */ - struct ibnd_node *nodestbl[HTSZ]; + ibnd_node_t *nodestbl[HTSZ]; struct ibnd_port *portstbl[HTSZ]; - struct ibnd_node *nodesdist[MAXHOPS + 1]; + ibnd_node_t *nodesdist[MAXHOPS + 1]; ibnd_chassis_t *first_chassis; ibnd_chassis_t *current_chassis; ibnd_chassis_t *last_chassis; - struct ibnd_node *switches; - struct ibnd_node *ch_adapters; - struct ibnd_node *routers; + ibnd_node_t *switches; + ibnd_node_t *ch_adapters; + ibnd_node_t *routers; ib_portid_t selfportid; }; #define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric) -- 1.5.4.5 From weiny2 at llnl.gov Thu Aug 13 20:42:46 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 13 Aug 2009 20:42:46 -0700 Subject: [ofa-general] [PATCH 2/5] libibnetdisc: make all 
fields of ibnd_port_t public Message-ID: <20090813204246.59efeb5e.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 13 Aug 2009 19:54:00 -0700 Subject: [PATCH] libibnetdisc: make all fields of ibnd_port_t public Signed-off-by: Ira Weiny --- .../libibnetdisc/include/infiniband/ibnetdisc.h | 15 ++-- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 87 ++++++++++---------- infiniband-diags/libibnetdisc/src/internal.h | 11 +-- 3 files changed, 52 insertions(+), 61 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index e7f5f6a..4a57855 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -40,7 +40,7 @@ struct ib_fabric; /* forward declare */ struct chassis; /* forward declare */ -struct port; /* forward declare */ +struct ibnd_port; /* forward declare */ /** ========================================================================= * Node @@ -67,7 +67,7 @@ typedef struct ibnd_node { char nodedesc[IB_SMP_DATA_SIZE]; - struct port **ports; /* in order array of port pointers + struct ibnd_port **ports; /* in order array of port pointers the size of this array is info.numports + 1 items MAY BE NULL! 
(ie 0 == switches only) */ @@ -89,17 +89,20 @@ typedef struct ibnd_node { /** ========================================================================= * Port */ -typedef struct port { +typedef struct ibnd_port { uint64_t guid; int portnum; - int ext_portnum; /* optional if != 0 external port num */ - ibnd_node_t *node; /* node this port belongs to */ - struct port *remoteport; /* null if SMA, or does not exist */ + int ext_portnum; /* optional if != 0 external port num */ + ibnd_node_t *node; /* node this port belongs to */ + struct ibnd_port *remoteport; /* null if SMA, or does not exist */ /* quick cache of info below */ uint16_t base_lid; uint8_t lmc; /* use libibmad decoder functions for info */ uint8_t info[IB_SMP_DATA_SIZE]; + + /* internal use only */ + struct ibnd_port *htnext; } ibnd_port_t; /** ========================================================================= diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index b883d4a..1fc964c 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -67,28 +67,28 @@ void decode_port_info(ibnd_port_t * port) } static int get_port_info(struct ibmad_port *ibmad_port, - struct ibnd_fabric *fabric, struct ibnd_port *port, + struct ibnd_fabric *fabric, ibnd_port_t * port, int portnum, ib_portid_t * portid) { char width[64], speed[64]; int iwidth; int ispeed; - port->port.portnum = portnum; - iwidth = mad_get_field(port->port.info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); - ispeed = mad_get_field(port->port.info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); + port->portnum = portnum; + iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); + ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); - if (!smp_query_via(port->port.info, portid, IB_ATTR_PORT_INFO, + if (!smp_query_via(port->info, portid, IB_ATTR_PORT_INFO, portnum, 0, ibmad_port)) return -1; - decode_port_info(&(port->port)); + 
decode_port_info(port); IBND_DEBUG ("portid %s portnum %d: base lid %d state %d physstate %d %s %s\n", - portid2str(portid), portnum, port->port.base_lid, - mad_get_field(port->port.info, 0, IB_PORT_STATE_F), - mad_get_field(port->port.info, 0, IB_PORT_PHYS_STATE_F), + portid2str(portid), portnum, port->base_lid, + mad_get_field(port->info, 0, IB_PORT_STATE_F), + mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F), mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, width, 64, &iwidth), mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, speed, 64, &ispeed)); return 0; @@ -117,11 +117,10 @@ static int query_node_info(struct ibmad_port *ibmad_port, * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. */ static int query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric, - ibnd_node_t * node, struct ibnd_port *iport, + ibnd_node_t * node, ibnd_port_t * port, ib_portid_t * portid) { int rc = 0; - ibnd_port_t *port = &(iport->port); void *nd = node->nodedesc; if ((rc = query_node_info(ibmad_port, fabric, node, portid)) != 0) @@ -202,7 +201,7 @@ static int extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f, } static void dump_endnode(ib_portid_t * path, char *prompt, - ibnd_node_t * node, struct ibnd_port *port) + ibnd_node_t * node, ibnd_port_t * port) { char type[64]; if (!show_progress) @@ -212,10 +211,9 @@ static void dump_endnode(ib_portid_t * path, char *prompt, printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n", portid2str(path), prompt, type, node->guid, - node->type == IB_NODE_SWITCH ? 0 : port->port.portnum, - port->port.base_lid, - port->port.base_lid + (1 << port->port.lmc) - 1, - node->nodedesc); + node->type == IB_NODE_SWITCH ? 
0 : port->portnum, + port->base_lid, + port->base_lid + (1 << port->lmc) - 1, node->nodedesc); } static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric, @@ -294,7 +292,7 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, /* update all the port info's */ for (p = 1; p >= node->numports; p++) { - get_port_info(ibmad_port, f, CONV_PORT_INTERNAL(node->ports[p]), + get_port_info(ibmad_port, f, node->ports[p], p, &(node->path_portid)); } @@ -363,10 +361,9 @@ static void add_to_nodeguid_hash(ibnd_node_t * node, ibnd_node_t * hash[]) hash[hash_idx] = node; } -static void add_to_portguid_hash(struct ibnd_port *port, - struct ibnd_port *hash[]) +static void add_to_portguid_hash(ibnd_port_t * port, ibnd_port_t * hash[]) { - int hash_idx = HASHGUID(port->port.guid) % HTSZ; + int hash_idx = HASHGUID(port->guid) % HTSZ; port->htnext = hash[hash_idx]; hash[hash_idx] = port; @@ -429,19 +426,19 @@ static ibnd_node_t *create_node(struct ibnd_fabric *fabric, } static struct ibnd_port *find_existing_port_node(ibnd_node_t * node, - struct ibnd_port *port) + ibnd_port_t * port) { - if (port->port.portnum > node->numports || node->ports == NULL) + if (port->portnum > node->numports || node->ports == NULL) return (NULL); - return (CONV_PORT_INTERNAL(node->ports[port->port.portnum])); + return (node->ports[port->portnum]); } static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric, ibnd_node_t * node, - struct ibnd_port *temp) + ibnd_port_t * temp) { - struct ibnd_port *port; + ibnd_port_t *port; if (node->ports == NULL) { node->ports = calloc(sizeof(*node->ports), node->numports + 1); @@ -458,40 +455,40 @@ static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric, } memcpy(port, temp, sizeof(*port)); - port->port.node = (ibnd_node_t *) node; - port->port.ext_portnum = 0; + port->node = (ibnd_node_t *) node; + port->ext_portnum = 0; - node->ports[temp->port.portnum] = (ibnd_port_t *) port; + node->ports[temp->portnum] = (ibnd_port_t *) port; 
add_to_portguid_hash(port, fabric->portstbl); return port; } -static void link_ports(ibnd_node_t * node, struct ibnd_port *port, - ibnd_node_t * remotenode, struct ibnd_port *remoteport) +static void link_ports(ibnd_node_t * node, ibnd_port_t * port, + ibnd_node_t * remotenode, ibnd_port_t * remoteport) { IBND_DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 - " %p->%p:%u\n", node->guid, node, port, port->port.portnum, + " %p->%p:%u\n", node->guid, node, port, port->portnum, remotenode->guid, remotenode, remoteport, - remoteport->port.portnum); - if (port->port.remoteport) - port->port.remoteport->remoteport = NULL; - if (remoteport->port.remoteport) - remoteport->port.remoteport->remoteport = NULL; - port->port.remoteport = (ibnd_port_t *) remoteport; - remoteport->port.remoteport = (ibnd_port_t *) port; + remoteport->portnum); + if (port->remoteport) + port->remoteport->remoteport = NULL; + if (remoteport->remoteport) + remoteport->remoteport->remoteport = NULL; + port->remoteport = (ibnd_port_t *) remoteport; + remoteport->remoteport = (ibnd_port_t *) port; } static int get_remote_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric, ibnd_node_t * node, - struct ibnd_port *port, ib_portid_t * path, + ibnd_port_t * port, ib_portid_t * path, int portnum, int dist) { int rc = 0; ibnd_node_t node_buf; - struct ibnd_port port_buf; + ibnd_port_t port_buf; ibnd_node_t *remotenode, *oldnode; - struct ibnd_port *remoteport, *oldport; + ibnd_port_t *remoteport, *oldport; memset(&node_buf, 0, sizeof(node_buf)); memset(&port_buf, 0, sizeof(port_buf)); @@ -499,7 +496,7 @@ static int get_remote_node(struct ibmad_port *ibmad_port, IBND_DEBUG("handle node %p port %p:%d dist %d\n", node, port, portnum, dist); - if (mad_get_field(port->port.info, 0, IB_PORT_PHYS_STATE_F) + if (mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F) != IB_PORT_PHYS_STATE_LINKUP) return 1; /* positive == non-fatal error */ @@ -548,9 +545,9 @@ ibnd_fabric_t *ibnd_discover_fabric(struct 
ibmad_port * ibmad_port, struct ibnd_fabric *fabric = NULL; ib_portid_t my_portid = { 0 }; ibnd_node_t node_buf; - struct ibnd_port port_buf; + ibnd_port_t port_buf; ibnd_node_t *node; - struct ibnd_port *port; + ibnd_port_t *port; int i; int dist = 0; ib_portid_t *path; @@ -638,7 +635,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, /* If switch, set port GUID to node port GUID */ if (node->type == IB_NODE_SWITCH) { - port->port.guid = + port->guid = mad_get_field64(node->info, 0, IB_NODE_PORT_GUID_F); } diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index 449bd70..f06d2c3 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -49,15 +49,6 @@ #define IBND_ERROR(fmt, ...) \ fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__) -struct ibnd_port { - /* This member MUST BE FIRST */ - ibnd_port_t port; - - /* internal use only */ - struct ibnd_port *htnext; -}; -#define CONV_PORT_INTERNAL(port) ((struct ibnd_port *)port) - /* HASH table defines */ #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) #define HTSZ 137 @@ -68,7 +59,7 @@ struct ibnd_fabric { /* internal use only */ ibnd_node_t *nodestbl[HTSZ]; - struct ibnd_port *portstbl[HTSZ]; + ibnd_port_t *portstbl[HTSZ]; ibnd_node_t *nodesdist[MAXHOPS + 1]; ibnd_chassis_t *first_chassis; ibnd_chassis_t *current_chassis; -- 1.5.4.5 From weiny2 at llnl.gov Thu Aug 13 20:42:51 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 13 Aug 2009 20:42:51 -0700 Subject: [ofa-general] [PATCH 3/5] libibnetdisc: make all fields of ibnd_fabric_t public Message-ID: <20090813204251.df6446c1.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 13 Aug 2009 20:08:51 -0700 Subject: [PATCH] libibnetdisc: make all fields of ibnd_fabric_t public In addition clean up the name of the chassis struct Signed-off-by: Ira Weiny --- 
.../libibnetdisc/include/infiniband/ibnetdisc.h | 41 +++++++++---- infiniband-diags/libibnetdisc/src/chassis.c | 23 ++++---- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 63 +++++++++----------- infiniband-diags/libibnetdisc/src/internal.h | 24 -------- 4 files changed, 69 insertions(+), 82 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index 4a57855..414e068 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -38,8 +38,7 @@ #include #include -struct ib_fabric; /* forward declare */ -struct chassis; /* forward declare */ +struct ibnd_chassis; /* forward declare */ struct ibnd_port; /* forward declare */ /** ========================================================================= @@ -67,13 +66,13 @@ typedef struct ibnd_node { char nodedesc[IB_SMP_DATA_SIZE]; - struct ibnd_port **ports; /* in order array of port pointers - the size of this array is info.numports + 1 - items MAY BE NULL! (ie 0 == switches only) */ + struct ibnd_port **ports; /* in order array of port pointers + the size of this array is info.numports + 1 + items MAY BE NULL! 
(ie 0 == switches only) */ /* chassis info */ struct ibnd_node *next_chassis_node; /* next node in ibnd_chassis_t->nodes */ - struct chassis *chassis; /* if != NULL the chassis this node belongs to */ + struct ibnd_chassis *chassis; /* if != NULL the chassis this node belongs to */ unsigned char ch_type; unsigned char ch_anafanum; unsigned char ch_slotnum; @@ -92,9 +91,9 @@ typedef struct ibnd_node { typedef struct ibnd_port { uint64_t guid; int portnum; - int ext_portnum; /* optional if != 0 external port num */ - ibnd_node_t *node; /* node this port belongs to */ - struct ibnd_port *remoteport; /* null if SMA, or does not exist */ + int ext_portnum; /* optional if != 0 external port num */ + ibnd_node_t *node; /* node this port belongs to */ + struct ibnd_port *remoteport; /* null if SMA, or does not exist */ /* quick cache of info below */ uint16_t base_lid; uint8_t lmc; @@ -108,8 +107,8 @@ typedef struct ibnd_port { /** ========================================================================= * Chassis */ -typedef struct chassis { - struct chassis *next; +typedef struct ibnd_chassis { + struct ibnd_chassis *next; uint64_t chassisguid; unsigned char chassisnum; @@ -124,11 +123,17 @@ typedef struct chassis { ibnd_node_t *linenode[LINES_MAX_NUM + 1]; } ibnd_chassis_t; +/* HASH table defines */ +#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) +#define HTSZ 137 + +#define MAXHOPS 63 + /** ========================================================================= * Fabric * Main fabric object which is returned and represents the data discovered */ -typedef struct ib_fabric { +typedef struct ibnd_fabric { /* the node the discover was initiated from * "from" parameter in ibnd_discover_fabric * or by default the node you ar running on @@ -139,6 +144,18 @@ typedef struct ib_fabric { /* NULL terminated list of all chassis found in the fabric */ ibnd_chassis_t *chassis; int maxhops_discovered; + + /* internal use only */ + 
ibnd_node_t *nodestbl[HTSZ]; + ibnd_port_t *portstbl[HTSZ]; + ibnd_node_t *nodesdist[MAXHOPS + 1]; + ibnd_chassis_t *first_chassis; + ibnd_chassis_t *current_chassis; + ibnd_chassis_t *last_chassis; + ibnd_node_t *switches; + ibnd_node_t *ch_adapters; + ibnd_node_t *routers; + ib_portid_t selfportid; } ibnd_fabric_t; /** ========================================================================= diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c index 0dd259a..4886cfc 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.c +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -91,7 +91,7 @@ char *ibnd_get_chassis_slot_str(ibnd_node_t * node, char *str, size_t size) return (str); } -static ibnd_chassis_t *find_chassisnum(struct ibnd_fabric *fabric, +static ibnd_chassis_t *find_chassisnum(ibnd_fabric_t * fabric, unsigned char chassisnum) { ibnd_chassis_t *current; @@ -207,14 +207,14 @@ static uint64_t get_chassisguid(ibnd_node_t * node) return sysimgguid; } -static ibnd_chassis_t *find_chassisguid(struct ibnd_fabric *f, +static ibnd_chassis_t *find_chassisguid(ibnd_fabric_t * fabric, ibnd_node_t * node) { ibnd_chassis_t *current; uint64_t chguid; chguid = get_chassisguid(node); - for (current = f->first_chassis; current; current = current->next) { + for (current = fabric->first_chassis; current; current = current->next) { if (current->chassisguid == chguid) return current; } @@ -224,7 +224,6 @@ static ibnd_chassis_t *find_chassisguid(struct ibnd_fabric *f, uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum) { - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); ibnd_chassis_t *chassis; if (!fabric) { @@ -232,7 +231,7 @@ uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum) return 0; } - chassis = find_chassisnum(f, chassisnum); + chassis = find_chassisnum(fabric, chassisnum); if (chassis) return chassis->chassisguid; else @@ -783,7 +782,7 @@ static void 
voltaire_portmap(ibnd_port_t * port) port->ext_portnum = int2ext_map_slb8[chipnum][portnum]; } -static int add_chassis(struct ibnd_fabric *fabric) +static int add_chassis(ibnd_fabric_t * fabric) { if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) { IBND_ERROR("OOM: failed to allocate chassis object\n"); @@ -819,7 +818,7 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node) Returns: 0 on success, -1 on failure */ -int group_nodes(struct ibnd_fabric *fabric) +int group_nodes(ibnd_fabric_t * fabric) { ibnd_node_t *node; int dist; @@ -833,7 +832,7 @@ int group_nodes(struct ibnd_fabric *fabric) /* an appropriate chassis record (slotnum and position) */ /* according to internal connectivity */ /* not very efficient but clear code so... */ - for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { for (node = fabric->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) @@ -844,7 +843,7 @@ int group_nodes(struct ibnd_fabric *fabric) /* separate every Voltaire chassis from each other and build linked list of them */ /* algorithm: catch spine and find all surrounding nodes */ - for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { for (node = fabric->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) != VTR_VENDOR_ID) @@ -863,7 +862,7 @@ int group_nodes(struct ibnd_fabric *fabric) /* now make pass on nodes for chassis which are not Voltaire */ /* grouped by common SystemImageGUID */ - for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { for (node = fabric->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) @@ -913,12 +912,12 @@ int 
group_nodes(struct ibnd_fabric *fabric) } } } - if (dist == fabric->fabric.maxhops_discovered) + if (dist == fabric->maxhops_discovered) dist = MAXHOPS; /* skip to CAs */ else dist++; } - fabric->fabric.chassis = fabric->first_chassis; + fabric->chassis = fabric->first_chassis; return (0); } diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 1fc964c..2cd2c9b 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -67,7 +67,7 @@ void decode_port_info(ibnd_port_t * port) } static int get_port_info(struct ibmad_port *ibmad_port, - struct ibnd_fabric *fabric, ibnd_port_t * port, + ibnd_fabric_t * fabric, ibnd_port_t * port, int portnum, ib_portid_t * portid) { char width[64], speed[64]; @@ -98,7 +98,7 @@ static int get_port_info(struct ibmad_port *ibmad_port, * Returns -1 if error. */ static int query_node_info(struct ibmad_port *ibmad_port, - struct ibnd_fabric *fabric, ibnd_node_t * node, + ibnd_fabric_t * fabric, ibnd_node_t * node, ib_portid_t * portid) { if (!smp_query_via(&(node->info), portid, IB_ATTR_NODE_INFO, 0, 0, @@ -116,7 +116,7 @@ static int query_node_info(struct ibmad_port *ibmad_port, /* * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. 
*/ -static int query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric, +static int query_node(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric, ibnd_node_t * node, ibnd_port_t * port, ib_portid_t * portid) { @@ -175,28 +175,28 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport) return path->cnt; } -static int extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f, +static int extend_dpath(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric, ib_portid_t * portid, int nextport) { int rc = 0; if (portid->lid) { /* If we were LID routed we need to set up the drslid */ - if (!f->selfportid.lid) - if (ib_resolve_self_via(&f->selfportid, NULL, NULL, + if (!fabric->selfportid.lid) + if (ib_resolve_self_via(&fabric->selfportid, NULL, NULL, ibmad_port) < 0) { IBND_ERROR("Failed to resolve self\n"); return -1; } - portid->drpath.drslid = (uint16_t) f->selfportid.lid; + portid->drpath.drslid = (uint16_t) fabric->selfportid.lid; portid->drpath.drdlid = 0xFFFF; } rc = add_port_to_dpath(&portid->drpath, nextport); - if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered)) - f->fabric.maxhops_discovered = portid->drpath.cnt; + if ((rc != -1) && (portid->drpath.cnt > fabric->maxhops_discovered)) + fabric->maxhops_discovered = portid->drpath.cnt; return (rc); } @@ -216,7 +216,7 @@ static void dump_endnode(ib_portid_t * path, char *prompt, port->base_lid + (1 << port->lmc) - 1, node->nodedesc); } -static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric, +static ibnd_node_t *find_existing_node(ibnd_fabric_t * fabric, ibnd_node_t * new) { int hash = HASHGUID(new->guid) % HTSZ; @@ -231,7 +231,6 @@ static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric, ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid) { - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); int hash = HASHGUID(guid) % HTSZ; ibnd_node_t *node; @@ -240,7 +239,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * 
fabric, uint64_t guid) return (NULL); } - for (node = f->nodestbl[hash]; node; node = node->htnext) + for (node = fabric->nodestbl[hash]; node; node = node->htnext) if (node->guid == guid) return (ibnd_node_t *) node; @@ -268,7 +267,6 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, char portinfo_port0[IB_SMP_DATA_SIZE]; void *nd = node->nodedesc; int p = 0; - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); if (_check_ibmad_port(ibmad_port) < 0) return (NULL); @@ -283,7 +281,7 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, return (NULL); } - if (query_node_info(ibmad_port, f, node, &(node->path_portid))) + if (query_node_info(ibmad_port, fabric, node, &(node->path_portid))) return (NULL); if (!smp_query_via(nd, &(node->path_portid), IB_ATTR_NODE_DESC, 0, 0, @@ -292,7 +290,7 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, /* update all the port info's */ for (p = 1; p >= node->numports; p++) { - get_port_info(ibmad_port, f, node->ports[p], + get_port_info(ibmad_port, fabric, node->ports[p], p, &(node->path_portid)); } @@ -320,7 +318,6 @@ done: ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str) { - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); int i = 0; ibnd_node_t *rc; ib_dr_path_t path; @@ -330,7 +327,7 @@ ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str) return (NULL); } - rc = f->fabric.from_node; + rc = fabric->from_node; if (str2drpath(&path, dr_str, 0, 0) == -1) { return (NULL); @@ -369,7 +366,7 @@ static void add_to_portguid_hash(ibnd_port_t * port, ibnd_port_t * hash[]) hash[hash_idx] = port; } -static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric) +static void add_to_type_list(ibnd_node_t * node, ibnd_fabric_t * fabric) { switch (node->type) { case IB_NODE_CA: @@ -387,7 +384,7 @@ static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric) } } -static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric 
*fabric) +static void add_to_nodedist(ibnd_node_t * node, ibnd_fabric_t * fabric) { int dist = node->dist; if (node->type != IB_NODE_SWITCH) @@ -397,7 +394,7 @@ static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric *fabric) fabric->nodesdist[dist] = node; } -static ibnd_node_t *create_node(struct ibnd_fabric *fabric, +static ibnd_node_t *create_node(ibnd_fabric_t * fabric, ibnd_node_t * temp, ib_portid_t * path, int dist) { @@ -416,8 +413,8 @@ static ibnd_node_t *create_node(struct ibnd_fabric *fabric, add_to_nodeguid_hash(node, fabric->nodestbl); /* add this to the all nodes list */ - node->next = fabric->fabric.nodes; - fabric->fabric.nodes = (ibnd_node_t *) node; + node->next = fabric->nodes; + fabric->nodes = (ibnd_node_t *) node; add_to_type_list(node, fabric); add_to_nodedist(node, fabric); @@ -434,7 +431,7 @@ static struct ibnd_port *find_existing_port_node(ibnd_node_t * node, return (node->ports[port->portnum]); } -static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric, +static struct ibnd_port *add_port_to_node(ibnd_fabric_t * fabric, ibnd_node_t * node, ibnd_port_t * temp) { @@ -480,7 +477,7 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port, } static int get_remote_node(struct ibmad_port *ibmad_port, - struct ibnd_fabric *fabric, ibnd_node_t * node, + ibnd_fabric_t * fabric, ibnd_node_t * node, ibnd_port_t * port, ib_portid_t * path, int portnum, int dist) { @@ -542,7 +539,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, ib_portid_t * from, int hops) { int rc = 0; - struct ibnd_fabric *fabric = NULL; + ibnd_fabric_t *fabric = NULL; ib_portid_t my_portid = { 0 }; ibnd_node_t node_buf; ibnd_port_t port_buf; @@ -588,7 +585,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, if (!node) goto error; - fabric->fabric.from_node = (ibnd_node_t *) node; + fabric->from_node = (ibnd_node_t *) node; port = add_port_to_node(fabric, node, &port_buf); if (!port) @@ -669,7 +666,6 @@ 
static void destroy_node(ibnd_node_t * node) void ibnd_destroy_fabric(ibnd_fabric_t * fabric) { - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); int dist = 0; ibnd_node_t *node = NULL; ibnd_node_t *next = NULL; @@ -678,21 +674,21 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric) if (!fabric) return; - ch = f->first_chassis; + ch = fabric->first_chassis; while (ch) { ch_next = ch->next; free(ch); ch = ch_next; } for (dist = 0; dist <= MAXHOPS; dist++) { - node = f->nodesdist[dist]; + node = fabric->nodesdist[dist]; while (node) { next = node->dnext; destroy_node(node); node = next; } } - free(f); + free(fabric); } void ibnd_debug(int i) @@ -736,7 +732,6 @@ void ibnd_iter_nodes(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func, void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func, int node_type, void *user_data) { - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); ibnd_node_t *list = NULL; ibnd_node_t *cur = NULL; @@ -752,13 +747,13 @@ void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func, switch (node_type) { case IB_NODE_SWITCH: - list = f->switches; + list = fabric->switches; break; case IB_NODE_CA: - list = f->ch_adapters; + list = fabric->ch_adapters; break; case IB_NODE_ROUTER: - list = f->routers; + list = fabric->routers; break; default: IBND_DEBUG("Invalid node_type specified %d\n", node_type); diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index f06d2c3..ba32291 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -40,8 +40,6 @@ #include -#define MAXHOPS 63 - #define IBND_DEBUG(fmt, ...) \ if (ibdebug) { \ printf("%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__); \ @@ -49,26 +47,4 @@ #define IBND_ERROR(fmt, ...) 
\ fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__) -/* HASH table defines */ -#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) -#define HTSZ 137 - -struct ibnd_fabric { - /* This member MUST BE FIRST */ - ibnd_fabric_t fabric; - - /* internal use only */ - ibnd_node_t *nodestbl[HTSZ]; - ibnd_port_t *portstbl[HTSZ]; - ibnd_node_t *nodesdist[MAXHOPS + 1]; - ibnd_chassis_t *first_chassis; - ibnd_chassis_t *current_chassis; - ibnd_chassis_t *last_chassis; - ibnd_node_t *switches; - ibnd_node_t *ch_adapters; - ibnd_node_t *routers; - ib_portid_t selfportid; -}; -#define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric) - #endif /* _INTERNAL_H_ */ -- 1.5.4.5 From weiny2 at llnl.gov Thu Aug 13 20:43:06 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 13 Aug 2009 20:43:06 -0700 Subject: [ofa-general] [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a context object. Message-ID: <20090813204306.dffc3237.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 13 Aug 2009 20:16:01 -0700 Subject: [PATCH] infiniband-diags/libibnetdisc: Introduce a context object. This object must be created before query functions can be used and is used to control the functionality of the queries. 
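The context pattern described above can be sketched in isolation. This is a minimal, self-contained illustration and not the library code: the demo_ names are hypothetical stand-ins for struct ibnd_ctx, ibnd_create_ctx(), ibnd_destroy_ctx() and ibnd_show_progress(), the port is an opaque void * so the sketch builds without libibmad, and the validity check is reduced to a NULL test.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct ibnd_ctx: per-context state replaces
 * the old file-scope "static int show_progress". */
struct demo_ctx {
	void *ibmad_port;	/* the patch stores a struct ibmad_port * here */
	int show_progress;
};

/* Allocate a zeroed context bound to an already opened port. */
static struct demo_ctx *demo_create_ctx(void *ibmad_port)
{
	struct demo_ctx *ctx = calloc(1, sizeof *ctx);
	if (!ctx)
		return NULL;
	ctx->ibmad_port = ibmad_port;
	return ctx;
}

/* Release the context; the port itself is owned by the caller. */
static void demo_destroy_ctx(struct demo_ctx *ctx)
{
	free(ctx);
}

/* Set the progress flag and return the previous setting,
 * or -1 if the context is missing or has no port (simplified check). */
static int demo_show_progress(struct demo_ctx *ctx, int i)
{
	int prev;

	if (!ctx || !ctx->ibmad_port)
		return -1;
	prev = ctx->show_progress;
	ctx->show_progress = i;
	return prev;
}
```

Because the flag lives in the context rather than in a global, two contexts in the same process can hold different progress settings, and a caller can restore the previous value from the return code.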
Signed-off-by: Ira Weiny --- infiniband-diags/libibnetdisc/Makefile.am | 4 +- .../libibnetdisc/include/infiniband/ibnetdisc.h | 23 ++++-- .../libibnetdisc/man/ibnd_create_ctx.3 | 2 + .../libibnetdisc/man/ibnd_destroy_ctx.3 | 2 + .../libibnetdisc/man/ibnd_discover_fabric.3 | 41 ++++++++--- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 74 ++++++++++++++------ infiniband-diags/libibnetdisc/src/internal.h | 5 ++ infiniband-diags/libibnetdisc/src/libibnetdisc.map | 2 + infiniband-diags/libibnetdisc/test/testleaks.c | 7 ++- infiniband-diags/src/iblinkinfo.c | 8 ++- infiniband-diags/src/ibnetdiscover.c | 13 +++- infiniband-diags/src/ibqueryerrors.c | 8 ++- 12 files changed, 141 insertions(+), 48 deletions(-) create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3 diff --git a/infiniband-diags/libibnetdisc/Makefile.am b/infiniband-diags/libibnetdisc/Makefile.am index 7085f14..5619aad 100644 --- a/infiniband-diags/libibnetdisc/Makefile.am +++ b/infiniband-diags/libibnetdisc/Makefile.am @@ -45,7 +45,9 @@ man_MANS = man/ibnd_debug.3 \ man/ibnd_iter_nodes.3 \ man/ibnd_iter_nodes_type.3 \ man/ibnd_show_progress.3 \ - man/ibnd_update_node.3 + man/ibnd_update_node.3 \ + man/ibnd_create_ctx.3 \ + man/ibnd_destroy_ctx.3 EXTRA_DIST = $(srcdir)/src/libibnetdisc.map libibnetdisc.ver $(man_MANS) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index 414e068..65ba74f 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -38,8 +38,11 @@ #include #include -struct ibnd_chassis; /* forward declare */ -struct ibnd_port; /* forward declare */ +typedef struct ibnd_ctx ibnd_ctx_t; + +/* forward declares */ +struct ibnd_chassis; +struct ibnd_port; /** ========================================================================= * Node @@ 
-159,15 +162,21 @@ typedef struct ibnd_fabric { } ibnd_fabric_t; /** ========================================================================= - * Initialization (fabric operations) + * Initialization */ MAD_EXPORT void ibnd_debug(int i); -MAD_EXPORT void ibnd_show_progress(int i); -MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, +MAD_EXPORT ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port); +MAD_EXPORT void ibnd_destroy_ctx(ibnd_ctx_t * ctx); +MAD_EXPORT int ibnd_show_progress(ibnd_ctx_t * ctx, int i); + +/** ========================================================================= + * Fabric Operations + */ +MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, ib_portid_t * from, int hops); /** - * open: (required) ibmad_port object from libibmad + * ctx : (required) context created by ibnd_create_ctx. * from: (optional) specify the node to start scanning from. * If NULL start from the node we are running on. * hops: (optional) Specify how much of the fabric to traverse. 
@@ -181,7 +190,7 @@ MAD_EXPORT void ibnd_destroy_fabric(ibnd_fabric_t * fabric); MAD_EXPORT ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid); MAD_EXPORT ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str); -MAD_EXPORT ibnd_node_t *ibnd_update_node(struct ibmad_port *ibmad_port, +MAD_EXPORT ibnd_node_t *ibnd_update_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, ibnd_node_t * node); diff --git a/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3 b/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3 new file mode 100644 index 0000000..8b321b0 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3 @@ -0,0 +1,2 @@ +.\".TH IBND_CREATE_CTX 3 "Aug 12, 2009" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_discover_fabric.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3 b/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3 new file mode 100644 index 0000000..bb9d96a --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3 @@ -0,0 +1,2 @@ +.\".TH IBND_DESTROY_CTX 3 "Aug 12, 2009" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_discover_fabric.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 index dfeaf47..f014977 100644 --- a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 +++ b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 @@ -1,46 +1,65 @@ .TH IBND_DISCOVER_FABRIC 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" .SH "NAME" -ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug ibnd_show_progress \- initialize ibnetdiscover library. +ibnd_create_ctx, ibnd_destroy_ctx, +ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug, ibnd_show_progress \- +initialize ibnetdiscover library and query the fabric. 
.SH "SYNOPSIS" .nf .B #include .sp -.bi "ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops)" +.BI "ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port)" +.BI "void ibnd_destroy_ctx(ibnd_ctx_t *ctx)" +.BI "ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t *ctx, ib_portid_t *from, int hops)" .BI "void ibnd_destroy_fabric(ibnd_fabric_t *fabric)" .BI "void ibnd_debug(int i)" -.BI "void ibnd_show_progress(int i)" +.BI "int ibnd_show_progress(ibnd_ctx_t *ctx, int i)" .SH "DESCRIPTION" -.B ibnd_discover_fabric() -Discover the fabric connected to the port specified by ibmad_port, using a timeout specified. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they chose. +.B ibnd_create_ctx() +Create a context for the ibnetdiscover library to be used in query operations. ibmad_port must be opened with at least IB_SMI_CLASS and IB_SMI_DIRECT_CLASS -classes for ibnd_discover_fabric to work. +classes for queries to work. + +.B ibnd_discover_fabric() +Discover the fabric using the context specified. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they choose. .B ibnd_destroy_fabric() free all memory and resources associated with the fabric. +.B ibnd_destroy_ctx() +free all memory and resources associated with the context. + .B ibnd_debug() Set the debug level to be printed as library operations take place. -.B ibnd_debug() -Indicate that the library should print debug output which shows it's progress +.B ibnd_show_progress() +Indicate that the library should print output which shows its progress through the fabric.
.SH "RETURN VALUE" +.B ibnd_create_ctx() +return NULL on failure, otherwise a valid ibnd_ctx_t object. + .B ibnd_discover_fabric() return NULL on failure, otherwise a valid ibnd_fabric_t object. -.B ibnd_destory_fabric(), ibnd_debug() +.B ibnd_show_progress() +Returns the previous setting for this value. + +.B ibnd_destroy_fabric(), ibnd_debug(), ibnd_destroy_ctx() NONE + .SH "EXAMPLES" .B Discover the entire fabric connected to device "mthca0", port 1. int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; struct ibmad_port *ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); - ibnd_fabric_t *fabric = ibnd_discover_fabric(ibmad_port, 100, NULL, 0); + ibnd_ctx_t *ctx = ibnd_create_ctx(ibmad_port); + ibnd_fabric_t *fabric = ibnd_discover_fabric(ctx, NULL, 0); ... ibnd_destroy_fabric(fabric); + ibnd_destroy_ctx(ctx); mad_rpc_close_port(ibmad_port); .B Discover only a single node and those nodes connected to it. @@ -48,7 +67,7 @@ NONE ... str2drpath(&(port_id.drpath), from, 0, 0); ... - ibnd_discover_fabric(ibmad_port, 100, &port_id, 1); + ibnd_discover_fabric(ctx, &port_id, 1); ...
.SH "SEE ALSO" libibmad, mad_rpc_open_port diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 2cd2c9b..4b320cd 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -57,9 +57,23 @@ #include "internal.h" #include "chassis.h" -static int show_progress = 0; int ibdebug; +ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port) +{ + ibnd_ctx_t *rc = calloc(1, sizeof *rc); + if (!rc) + return (NULL); + + rc->ibmad_port = ibmad_port; + return (rc); +} + +void ibnd_destroy_ctx(ibnd_ctx_t * ctx) +{ + free(ctx); +} + void decode_port_info(ibnd_port_t * port) { port->base_lid = (uint16_t) mad_get_field(port->info, 0, IB_PORT_LID_F); @@ -204,8 +218,6 @@ static void dump_endnode(ib_portid_t * path, char *prompt, ibnd_node_t * node, ibnd_port_t * port) { char type[64]; - if (!show_progress) - return; mad_dump_node_type(type, 64, &(node->type), sizeof(int)), printf("%s -> %s %s {%016" PRIx64 @@ -261,16 +273,29 @@ static int _check_ibmad_port(struct ibmad_port *ibmad_port) return (0); } -ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, - ibnd_fabric_t * fabric, ibnd_node_t * node) +static int check_ctx(ibnd_ctx_t * ctx) +{ + if (!ctx) { + IBND_DEBUG("ctx must be specified\n"); + return (-1); + } + + return (_check_ibmad_port(ctx->ibmad_port)); +} + +ibnd_node_t *ibnd_update_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, + ibnd_node_t * node) { char portinfo_port0[IB_SMP_DATA_SIZE]; void *nd = node->nodedesc; int p = 0; + struct ibmad_port *ibmad_port; - if (_check_ibmad_port(ibmad_port) < 0) + if (check_ctx(ctx) < 0) return (NULL); + ibmad_port = ctx->ibmad_port; + if (!fabric) { IBND_DEBUG("fabric parameter NULL\n"); return (NULL); @@ -476,12 +501,12 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port, remoteport->remoteport = (ibnd_port_t *) port; } -static int get_remote_node(struct ibmad_port *ibmad_port, - ibnd_fabric_t * 
fabric, ibnd_node_t * node, - ibnd_port_t * port, ib_portid_t * path, - int portnum, int dist) +static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, + ibnd_node_t * node, ibnd_port_t * port, + ib_portid_t * path, int portnum, int dist) { int rc = 0; + struct ibmad_port *ibmad_port = ctx->ibmad_port; ibnd_node_t node_buf; ibnd_port_t port_buf; ibnd_node_t *remotenode, *oldnode; @@ -525,8 +550,9 @@ static int get_remote_node(struct ibmad_port *ibmad_port, goto error; } - dump_endnode(path, oldnode ? "known remote" : "new remote", - remotenode, remoteport); + if (ctx->show_progress) + dump_endnode(path, oldnode ? "known remote" : "new remote", + remotenode, remoteport); link_ports(node, port, remotenode, remoteport); @@ -535,7 +561,7 @@ error: return (rc); } -ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, +ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, ib_portid_t * from, int hops) { int rc = 0; @@ -550,7 +576,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, ib_portid_t *path; int max_hops = MAXHOPS - 1; /* default find everything */ - if (_check_ibmad_port(ibmad_port) < 0) + if (check_ctx(ctx) < 0) return (NULL); /* if not everything how much? 
*/ @@ -576,7 +602,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, memset(&node_buf, 0, sizeof(node_buf)); memset(&port_buf, 0, sizeof(port_buf)); - if (query_node(ibmad_port, fabric, &node_buf, &port_buf, from)) { + if (query_node(ctx->ibmad_port, fabric, &node_buf, &port_buf, from)) { IBND_DEBUG("can't reach node %s\n", portid2str(from)); goto error; } @@ -591,7 +617,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, if (!port) goto error; - rc = get_remote_node(ibmad_port, fabric, node, port, from, + rc = get_remote_node(ctx, fabric, node, port, from, mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F), 0); if (rc < 0) @@ -606,14 +632,15 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, path = &node->path_portid; IBND_DEBUG("dist %d node %p\n", dist, node); - dump_endnode(path, "processing", node, port); + if (ctx->show_progress) + dump_endnode(path, "processing", node, port); for (i = 1; i <= node->numports; i++) { if (i == mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F)) continue; - if (get_port_info(ibmad_port, fabric, + if (get_port_info(ctx->ibmad_port, fabric, &port_buf, i, path)) { IBND_ERROR ("can't reach node %s port %d", @@ -637,7 +664,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, IB_NODE_PORT_GUID_F); } - if (get_remote_node(ibmad_port, fabric, node, + if (get_remote_node(ctx, fabric, node, port, path, i, dist) < 0) goto error; } @@ -704,9 +731,14 @@ void ibnd_debug(int i) } } -void ibnd_show_progress(int i) +int ibnd_show_progress(ibnd_ctx_t * ctx, int i) { - show_progress = i; + int rc = 0; + if (check_ctx(ctx)) + return (-1); + rc = ctx->show_progress; + ctx->show_progress = i; + return (rc); } void ibnd_iter_nodes(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func, diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index ba32291..8753eae 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ 
b/infiniband-diags/libibnetdisc/src/internal.h @@ -47,4 +47,9 @@ #define IBND_ERROR(fmt, ...) \ fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__) +struct ibnd_ctx { + struct ibmad_port *ibmad_port; + int show_progress; +}; + #endif /* _INTERNAL_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/libibnetdisc.map b/infiniband-diags/libibnetdisc/src/libibnetdisc.map index bd108ab..56560ec 100644 --- a/infiniband-diags/libibnetdisc/src/libibnetdisc.map +++ b/infiniband-diags/libibnetdisc/src/libibnetdisc.map @@ -2,6 +2,8 @@ IBNETDISC_1.0 { global: ibnd_debug; ibnd_show_progress; + ibnd_create_ctx; + ibnd_destroy_ctx; ibnd_discover_fabric; ibnd_destroy_fabric; ibnd_find_node_guid; diff --git a/infiniband-diags/libibnetdisc/test/testleaks.c b/infiniband-diags/libibnetdisc/test/testleaks.c index cb5651e..b121bdd 100644 --- a/infiniband-diags/libibnetdisc/test/testleaks.c +++ b/infiniband-diags/libibnetdisc/test/testleaks.c @@ -87,6 +87,7 @@ int main(int argc, char **argv) int hops = 0; ib_portid_t port_id; int iters = -1; + ibnd_ctx_t *ctx = NULL; struct ibmad_port *ibmad_port; int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS }; @@ -156,11 +157,12 @@ int main(int argc, char **argv) mad_rpc_set_timeout(ibmad_port, timeout_ms); + ctx = ibnd_create_ctx(ibmad_port); while (iters == -1 || iters-- > 0) { if (from) { /* only scan part of the fabric */ str2drpath(&(port_id.drpath), from, 0, 0); - if ((fabric = ibnd_discover_fabric(ibmad_port, + if ((fabric = ibnd_discover_fabric(ctx, &port_id, hops)) == NULL) { fprintf(stderr, "discover failed\n"); @@ -170,7 +172,7 @@ int main(int argc, char **argv) guid = 0; } else { if ((fabric = - ibnd_discover_fabric(ibmad_port, NULL, + ibnd_discover_fabric(ctx, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); rc = 1; @@ -182,6 +184,7 @@ int main(int argc, char **argv) } close_port: + ibnd_destroy_ctx(ctx); mad_rpc_close_port(ibmad_port); exit(rc); } diff --git a/infiniband-diags/src/iblinkinfo.c 
b/infiniband-diags/src/iblinkinfo.c index 5dfadee..af5be09 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -274,6 +274,7 @@ main(int argc, char **argv) int rc = 0; int resolved = -1; ibnd_fabric_t *fabric = NULL; + ibnd_ctx_t *ctx = NULL; struct ibmad_port *ibmad_port; ib_portid_t port_id = {0}; int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS}; @@ -313,6 +314,8 @@ main(int argc, char **argv) node_name_map = open_node_name_map(node_name_map_file); + ctx = ibnd_create_ctx(ibmad_port); + if (dr_path) { /* only scan part of the fabric */ if ((resolved = ib_resolve_portid_str_via(&port_id, dr_path, IB_DEST_DRPATH, @@ -327,12 +330,12 @@ main(int argc, char **argv) } if (resolved >= 0) - if ((fabric = ibnd_discover_fabric(ibmad_port, &port_id, + if ((fabric = ibnd_discover_fabric(ctx, &port_id, hops)) == NULL) IBWARN("Single node discover failed; attempting full scan\n"); if (!fabric) - if ((fabric = ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL) { + if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); rc = 1; goto close_port; @@ -364,6 +367,7 @@ main(int argc, char **argv) ibnd_destroy_fabric(fabric); close_port: + ibnd_destroy_ctx(ctx); close_node_name_map(node_name_map); mad_rpc_close_port(ibmad_port); exit(rc); diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index b04f2c6..ecb591e 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -65,6 +65,7 @@ static char *node_name_map_file = NULL; static nn_map_t *node_name_map = NULL; static int report_max_hops = 0; +static int show_progress = 0; /** * Define our own conversion functions to maintain compatibility with the old @@ -610,7 +611,7 @@ static int process_opt(void *context, int ch, char *optarg) node_name_map_file = strdup(optarg); break; case 's': - ibnd_show_progress(1); + show_progress = 1; break; case 'l': list = 
LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE; @@ -643,6 +644,7 @@ static int process_opt(void *context, int ch, char *optarg) int main(int argc, char **argv) { ibnd_fabric_t *fabric = NULL; + ibnd_ctx_t *ctx = NULL; struct ibmad_port *ibmad_port; int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; @@ -683,8 +685,14 @@ int main(int argc, char **argv) IBERROR("can't open file %s for writing", argv[0]); node_name_map = open_node_name_map(node_name_map_file); + ctx = ibnd_create_ctx(ibmad_port); - if ((fabric = ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL) + if (!ctx) + IBERROR("failed to create libibnetdisc context\n"); + + ibnd_show_progress(ctx, show_progress); + + if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL) IBERROR("discover failed\n"); if (ports_report) @@ -697,6 +705,7 @@ int main(int argc, char **argv) dump_topology(group, fabric); ibnd_destroy_fabric(fabric); + ibnd_destroy_ctx(ctx); close_node_name_map(node_name_map); mad_rpc_close_port(ibmad_port); exit(0); diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c index 2c85423..0955415 100644 --- a/infiniband-diags/src/ibqueryerrors.c +++ b/infiniband-diags/src/ibqueryerrors.c @@ -392,6 +392,7 @@ main(int argc, char **argv) ib_portid_t portid = {0}; int rc = 0; ibnd_fabric_t *fabric = NULL; + ibnd_ctx_t *ctx = NULL; int mgmt_classes[4] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS}; @@ -427,6 +428,8 @@ main(int argc, char **argv) node_name_map = open_node_name_map(node_name_map_file); + ctx = ibnd_create_ctx(ibmad_port); + /* limit the scan the fabric around the target */ if (dr_path) { if ((resolved = ib_resolve_portid_str_via(&portid, dr_path, IB_DEST_DRPATH, @@ -440,12 +443,12 @@ main(int argc, char **argv) } if (resolved >= 0) - if ((fabric = ibnd_discover_fabric(ibmad_port, &portid, + if ((fabric = ibnd_discover_fabric(ctx, &portid, 0)) == NULL) IBWARN("Single node discover failed; attempting full scan\n"); if 
(!fabric) /* do a full scan */ - if ((fabric = ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL) { + if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); rc = 1; goto close_port; @@ -479,6 +482,7 @@ main(int argc, char **argv) ibnd_destroy_fabric(fabric); close_port: + ibnd_destroy_ctx(ctx); mad_rpc_close_port(ibmad_port); close_node_name_map(node_name_map); exit(rc); -- 1.5.4.5 From weiny2 at llnl.gov Thu Aug 13 20:43:16 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 13 Aug 2009 20:43:16 -0700 Subject: [ofa-general] [PATCH 5/5] infiniband-diags/libibnetdisc: remove members of the fabric struct which are used in the scan only. Message-ID: <20090813204316.c6ce0de3.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 13 Aug 2009 20:27:41 -0700 Subject: [PATCH] infiniband-diags/libibnetdisc: remove members of the fabric struct which are used in the scan only. Signed-off-by: Ira Weiny --- .../libibnetdisc/include/infiniband/ibnetdisc.h | 7 -- infiniband-diags/libibnetdisc/src/chassis.c | 52 +++++++------- infiniband-diags/libibnetdisc/src/chassis.h | 2 +- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 80 ++++++++++++-------- infiniband-diags/libibnetdisc/src/internal.h | 13 +++ 5 files changed, 88 insertions(+), 66 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index 65ba74f..da14942 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -130,8 +130,6 @@ typedef struct ibnd_chassis { #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) #define HTSZ 137 -#define MAXHOPS 63 - /** ========================================================================= * Fabric * Main fabric object which is returned and represents the data discovered @@ -151,14 +149,9 @@ typedef struct ibnd_fabric { /* 
internal use only */ ibnd_node_t *nodestbl[HTSZ]; ibnd_port_t *portstbl[HTSZ]; - ibnd_node_t *nodesdist[MAXHOPS + 1]; - ibnd_chassis_t *first_chassis; - ibnd_chassis_t *current_chassis; - ibnd_chassis_t *last_chassis; ibnd_node_t *switches; ibnd_node_t *ch_adapters; ibnd_node_t *routers; - ib_portid_t selfportid; } ibnd_fabric_t; /** ========================================================================= diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c index 4886cfc..d11d7df 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.c +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -96,7 +96,7 @@ static ibnd_chassis_t *find_chassisnum(ibnd_fabric_t * fabric, { ibnd_chassis_t *current; - for (current = fabric->first_chassis; current; current = current->next) { + for (current = fabric->chassis; current; current = current->next) { if (current->chassisnum == chassisnum) return current; } @@ -207,14 +207,14 @@ static uint64_t get_chassisguid(ibnd_node_t * node) return sysimgguid; } -static ibnd_chassis_t *find_chassisguid(ibnd_fabric_t * fabric, +static ibnd_chassis_t *find_chassisguid(struct ibnd_chassis_ctx *ch_ctx, ibnd_node_t * node) { ibnd_chassis_t *current; uint64_t chguid; chguid = get_chassisguid(node); - for (current = fabric->first_chassis; current; current = current->next) { + for (current = ch_ctx->first_chassis; current; current = current->next) { if (current->chassisguid == chguid) return current; } @@ -782,19 +782,19 @@ static void voltaire_portmap(ibnd_port_t * port) port->ext_portnum = int2ext_map_slb8[chipnum][portnum]; } -static int add_chassis(ibnd_fabric_t * fabric) +static int add_chassis(struct ibnd_chassis_ctx *ch_ctx) { - if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) { + if (!(ch_ctx->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) { IBND_ERROR("OOM: failed to allocate chassis object\n"); return (-1); } - if (fabric->first_chassis == NULL) { - 
fabric->first_chassis = fabric->current_chassis; - fabric->last_chassis = fabric->current_chassis; + if (ch_ctx->first_chassis == NULL) { + ch_ctx->first_chassis = ch_ctx->current_chassis; + ch_ctx->last_chassis = ch_ctx->current_chassis; } else { - fabric->last_chassis->next = fabric->current_chassis; - fabric->last_chassis = fabric->current_chassis; + ch_ctx->last_chassis->next = ch_ctx->current_chassis; + ch_ctx->last_chassis = ch_ctx->current_chassis; } return (0); } @@ -818,22 +818,22 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node) Returns: 0 on success, -1 on failure */ -int group_nodes(ibnd_fabric_t * fabric) +int group_nodes(struct ibnd_scan_ctx *scan_ctx, ibnd_fabric_t * fabric) { ibnd_node_t *node; int dist; int chassisnum = 0; ibnd_chassis_t *chassis; + struct ibnd_chassis_ctx ch_ctx; - fabric->first_chassis = NULL; - fabric->current_chassis = NULL; + memset(&ch_ctx, 0, sizeof ch_ctx); /* first pass on switches and build for every Voltaire node */ /* an appropriate chassis record (slotnum and position) */ /* according to internal connectivity */ /* not very efficient but clear code so... 
*/ for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { - for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) if (fill_voltaire_chassis_record(node)) @@ -844,7 +844,7 @@ int group_nodes(ibnd_fabric_t * fabric) /* separate every Voltaire chassis from each other and build linked list of them */ /* algorithm: catch spine and find all surrounding nodes */ for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { - for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) != VTR_VENDOR_ID) continue; @@ -852,10 +852,10 @@ int group_nodes(ibnd_fabric_t * fabric) || (node->chassis && node->chassis->chassisnum) || !is_spine(node)) continue; - if (add_chassis(fabric)) + if (add_chassis(&ch_ctx)) return (-1); - fabric->current_chassis->chassisnum = ++chassisnum; - if (build_chassis(node, fabric->current_chassis)) + ch_ctx.current_chassis->chassisnum = ++chassisnum; + if (build_chassis(node, ch_ctx.current_chassis)) return (-1); } } @@ -863,25 +863,25 @@ int group_nodes(ibnd_fabric_t * fabric) /* now make pass on nodes for chassis which are not Voltaire */ /* grouped by common SystemImageGUID */ for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { - for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) continue; if (mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F)) { chassis = - find_chassisguid(fabric, + find_chassisguid(&ch_ctx, (ibnd_node_t *) node); if (chassis) chassis->nodecount++; else { /* Possible new chassis */ - if (add_chassis(fabric)) + if (add_chassis(&ch_ctx)) return (-1); - fabric->current_chassis->chassisguid = 
+ ch_ctx.current_chassis->chassisguid = get_chassisguid((ibnd_node_t *) node); - fabric->current_chassis->nodecount = 1; + ch_ctx.current_chassis->nodecount = 1; } } } @@ -890,14 +890,14 @@ int group_nodes(ibnd_fabric_t * fabric) /* now, make another pass to see which nodes are part of chassis */ /* (defined as chassis->nodecount > 1) */ for (dist = 0; dist <= MAXHOPS;) { - for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) continue; if (mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F)) { chassis = - find_chassisguid(fabric, + find_chassisguid(&ch_ctx, (ibnd_node_t *) node); if (chassis && chassis->nodecount > 1) { if (!chassis->chassisnum) @@ -918,6 +918,6 @@ int group_nodes(ibnd_fabric_t * fabric) dist++; } - fabric->chassis = fabric->first_chassis; + fabric->chassis = ch_ctx.first_chassis; return (0); } diff --git a/infiniband-diags/libibnetdisc/src/chassis.h b/infiniband-diags/libibnetdisc/src/chassis.h index 2191046..707140c 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.h +++ b/infiniband-diags/libibnetdisc/src/chassis.h @@ -82,6 +82,6 @@ enum ibnd_chassis_type { }; enum ibnd_chassis_slot_type { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS }; -int group_nodes(struct ibnd_fabric *fabric); +int group_nodes(struct ibnd_scan_ctx *scan_ctx, struct ibnd_fabric *fabric); #endif /* _CHASSIS_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 4b320cd..14f6bf1 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -189,21 +189,27 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport) return path->cnt; } -static int extend_dpath(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric, +static int extend_dpath(struct ibnd_scan_ctx *scan_ctx, + struct ibmad_port *ibmad_port, 
ibnd_fabric_t * fabric, ib_portid_t * portid, int nextport) { int rc = 0; if (portid->lid) { + if (!scan_ctx) { + IBND_ERROR("Invalid internal scan state"); + return (-1); + } /* If we were LID routed we need to set up the drslid */ - if (!fabric->selfportid.lid) - if (ib_resolve_self_via(&fabric->selfportid, NULL, NULL, - ibmad_port) < 0) { + if (!scan_ctx->selfportid.lid) + if (ib_resolve_self_via + (&scan_ctx->selfportid, NULL, NULL, + ibmad_port) < 0) { IBND_ERROR("Failed to resolve self\n"); return -1; } - portid->drpath.drslid = (uint16_t) fabric->selfportid.lid; + portid->drpath.drslid = (uint16_t) scan_ctx->selfportid.lid; portid->drpath.drdlid = 0xFFFF; } @@ -409,19 +415,25 @@ static void add_to_type_list(ibnd_node_t * node, ibnd_fabric_t * fabric) } } -static void add_to_nodedist(ibnd_node_t * node, ibnd_fabric_t * fabric) +static void add_to_nodedist(ibnd_node_t * node, struct ibnd_scan_ctx *scan_ctx) { int dist = node->dist; + + if (!scan_ctx) { + IBND_ERROR("Invalid internal scan state"); + return; + } + if (node->type != IB_NODE_SWITCH) dist = MAXHOPS; /* special Ca list */ - node->dnext = fabric->nodesdist[dist]; - fabric->nodesdist[dist] = node; + node->dnext = scan_ctx->nodesdist[dist]; + scan_ctx->nodesdist[dist] = node; } -static ibnd_node_t *create_node(ibnd_fabric_t * fabric, - ibnd_node_t * temp, ib_portid_t * path, - int dist) +static ibnd_node_t *create_node(struct ibnd_scan_ctx *scan_ctx, + ibnd_fabric_t * fabric, ibnd_node_t * temp, + ib_portid_t * path, int dist) { ibnd_node_t *node; @@ -442,7 +454,7 @@ static ibnd_node_t *create_node(ibnd_fabric_t * fabric, fabric->nodes = (ibnd_node_t *) node; add_to_type_list(node, fabric); - add_to_nodedist(node, fabric); + add_to_nodedist(node, scan_ctx); return node; } @@ -501,9 +513,10 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port, remoteport->remoteport = (ibnd_port_t *) port; } -static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, - ibnd_node_t * node, 
ibnd_port_t * port, - ib_portid_t * path, int portnum, int dist) +static int get_remote_node(ibnd_ctx_t * ctx, struct ibnd_scan_ctx *scan_ctx, + ibnd_fabric_t * fabric, ibnd_node_t * node, + ibnd_port_t * port, ib_portid_t * path, + int portnum, int dist) { int rc = 0; struct ibmad_port *ibmad_port = ctx->ibmad_port; @@ -522,7 +535,7 @@ static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, != IB_PORT_PHYS_STATE_LINKUP) return 1; /* positive == non-fatal error */ - if (extend_dpath(ibmad_port, fabric, path, portnum) < 0) + if (extend_dpath(scan_ctx, ibmad_port, fabric, path, portnum) < 0) return -1; if (query_node(ibmad_port, fabric, &node_buf, &port_buf, path)) { @@ -535,7 +548,9 @@ static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, oldnode = find_existing_node(fabric, &node_buf); if (oldnode) remotenode = oldnode; - else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1))) { + else if (! + (remotenode = + create_node(scan_ctx, fabric, &node_buf, path, dist + 1))) { rc = -1; goto error; } @@ -575,10 +590,13 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, int dist = 0; ib_portid_t *path; int max_hops = MAXHOPS - 1; /* default find everything */ + struct ibnd_scan_ctx scan_ctx; if (check_ctx(ctx) < 0) return (NULL); + memset(&scan_ctx, 0, sizeof scan_ctx); + /* if not everything how much? 
*/ if (hops >= 0) { max_hops = hops; @@ -607,7 +625,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, goto error; } - node = create_node(fabric, &node_buf, from, 0); + node = create_node(&scan_ctx, fabric, &node_buf, from, 0); if (!node) goto error; @@ -617,7 +635,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, if (!port) goto error; - rc = get_remote_node(ctx, fabric, node, port, from, + rc = get_remote_node(ctx, &scan_ctx, fabric, node, port, from, mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F), 0); if (rc < 0) @@ -627,7 +645,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, for (dist = 0; dist <= max_hops; dist++) { - for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + for (node = scan_ctx.nodesdist[dist]; node; node = node->dnext) { path = &node->path_portid; @@ -664,14 +682,15 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, IB_NODE_PORT_GUID_F); } - if (get_remote_node(ctx, fabric, node, - port, path, i, dist) < 0) + if (get_remote_node + (ctx, &scan_ctx, fabric, node, port, path, + i, dist) < 0) goto error; } } } - if (group_nodes(fabric)) + if (group_nodes(&scan_ctx, fabric)) goto error; return ((ibnd_fabric_t *) fabric); @@ -693,7 +712,6 @@ static void destroy_node(ibnd_node_t * node) void ibnd_destroy_fabric(ibnd_fabric_t * fabric) { - int dist = 0; ibnd_node_t *node = NULL; ibnd_node_t *next = NULL; ibnd_chassis_t *ch, *ch_next; @@ -701,19 +719,17 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric) if (!fabric) return; - ch = fabric->first_chassis; + ch = fabric->chassis; while (ch) { ch_next = ch->next; free(ch); ch = ch_next; } - for (dist = 0; dist <= MAXHOPS; dist++) { - node = fabric->nodesdist[dist]; - while (node) { - next = node->dnext; - destroy_node(node); - node = next; - } + node = fabric->nodes; + while (node) { + next = node->next; + destroy_node(node); + node = next; } free(fabric); } diff --git a/infiniband-diags/libibnetdisc/src/internal.h 
b/infiniband-diags/libibnetdisc/src/internal.h index 8753eae..cf0b4bc 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -47,6 +47,19 @@ #define IBND_ERROR(fmt, ...) \ fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__) +#define MAXHOPS 63 + +struct ibnd_chassis_ctx { + ibnd_chassis_t *first_chassis; + ibnd_chassis_t *current_chassis; + ibnd_chassis_t *last_chassis; +}; + +struct ibnd_scan_ctx { + ibnd_node_t *nodesdist[MAXHOPS + 1]; + ib_portid_t selfportid; +}; + struct ibnd_ctx { struct ibmad_port *ibmad_port; int show_progress; -- 1.5.4.5 From vlad at lists.openfabrics.org Fri Aug 14 03:02:01 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 14 Aug 2009 03:02:01 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090814-0200 daily build status Message-ID: <20090814100201.890FB1020236@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with 
linux-2.6.26 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090814-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hnrose at comcast.net Fri Aug 14 
04:31:43 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 14 Aug 2009 07:31:43 -0400 Subject: [ofa-general] [PATCHv4] IB/mad: Allow tuning of QP0 and QP1 sizes Message-ID: <20090814113143.GA18401@comcast.net> IB/mad: Allow tuning of QP0 and QP1 sizes MADs are UD and can be dropped if there are no receives posted. Send side tuning is done for symmetry with receive. Signed-off-by: Hal Rosenstock --- Changes since v3: Reverted module parameter permissions to 0444 Changes since v2: Removed roundup_pow_of_two of receive and send sizes Changed module parameter permissions to 0644 Changes since v1: Added changelog diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index de922a0..d1127ec 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -45,6 +46,14 @@ MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_AUTHOR("Hal Rosenstock"); MODULE_AUTHOR("Sean Hefty"); +int mad_sendq_size = IB_MAD_QP_SEND_SIZE; +int mad_recvq_size = IB_MAD_QP_RECV_SIZE; + +module_param_named(send_queue_size, mad_sendq_size, int, 0444); +MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests"); +module_param_named(recv_queue_size, mad_recvq_size, int, 0444); +MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests"); + static struct kmem_cache *ib_mad_cache; static struct list_head ib_mad_port_list; @@ -2736,8 +2745,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info, qp_init_attr.send_cq = qp_info->port_priv->cq; qp_init_attr.recv_cq = qp_info->port_priv->cq; qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; - qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_wr = mad_sendq_size; + qp_init_attr.cap.max_recv_wr = mad_recvq_size; qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; qp_init_attr.qp_type = qp_type; @@ -2752,8 +2761,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info, goto error; } /* Use minimum queue sizes unless the CQ is resized */ - qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; - qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + qp_info->send_queue.max_active = mad_sendq_size; + qp_info->recv_queue.max_active = mad_recvq_size; return 0; error: @@ -2792,7 +2801,7 @@ static int ib_mad_port_open(struct ib_device *device, init_mad_qp(port_priv, &port_priv->qp_info[0]); init_mad_qp(port_priv, &port_priv->qp_info[1]); - cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + cq_size = (mad_sendq_size + mad_recvq_size) * 2; port_priv->cq = ib_create_cq(port_priv->device, ib_mad_thread_completion_handler, NULL, port_priv, cq_size, 0); @@ 
-2984,6 +2993,12 @@ static int __init ib_mad_init_module(void) { int ret; + mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE); + mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE); + + mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE); + mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE); + spin_lock_init(&ib_mad_port_list_lock); ib_mad_cache = kmem_cache_create("ib_mad", diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h index 05ce331..9430ab4 100644 --- a/drivers/infiniband/core/mad_priv.h +++ b/drivers/infiniband/core/mad_priv.h @@ -2,6 +2,7 @@ * Copyright (c) 2004, 2005, Voltaire, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -49,6 +50,8 @@ /* QP and CQ parameters */ #define IB_MAD_QP_SEND_SIZE 128 #define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_QP_MIN_SIZE 64 +#define IB_MAD_QP_MAX_SIZE 8192 #define IB_MAD_SEND_REQ_MAX_SG 2 #define IB_MAD_RECV_REQ_MAX_SG 1 From hnrose at comcast.net Fri Aug 14 04:56:07 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 14 Aug 2009 07:56:07 -0400 Subject: [ofa-general] [PATCH] opensm/osm_sm_mad_ctrl.c: Fix endian of status in error message Message-ID: <20090814115607.GA18583@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_sm_mad_ctrl.c b/opensm/opensm/osm_sm_mad_ctrl.c index 791c848..c211bf8 100644 --- a/opensm/opensm/osm_sm_mad_ctrl.c +++ b/opensm/opensm/osm_sm_mad_ctrl.c @@ -637,7 +637,7 @@ static void sm_mad_ctrl_rcv_callback(IN osm_madw_t * p_madw, if (status != 0) { OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 3111: " - "Error status = 0x%X\n", status); + "Error status = 0x%X\n", cl_ntoh16(status)); 
osm_dump_dr_smp(p_ctrl->p_log, p_smp, OSM_LOG_ERROR); } From hnrose at comcast.net Fri Aug 14 04:58:23 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 14 Aug 2009 07:58:23 -0400 Subject: [ofa-general] [PATCH] opensm/osm_helper.c: In osm_dump_dr_smp, fix endian of status Message-ID: <20090814115823.GB18583@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index 1b16ad9..23392a4 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -1911,7 +1911,7 @@ void osm_dump_dr_smp(IN osm_log_t * p_log, IN const ib_smp_t * p_smp, "\t\t\t\tD bit...................0x%X\n" "\t\t\t\tstatus..................0x%X\n", ib_smp_is_d(p_smp), - ib_smp_get_status(p_smp)); + cl_ntoh16(ib_smp_get_status(p_smp))); } else { n += snprintf(buf + n, sizeof(buf) - n, "\t\t\t\tstatus..................0x%X\n", From hnrose at comcast.net Fri Aug 14 07:11:32 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 14 Aug 2009 10:11:32 -0400 Subject: [ofa-general] [PATCH] opensm/libvendor/osm_vendor_ibumad.c: Handle umad_alloc failure in osm_vendor_get Message-ID: <20090814141132.GA31087@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c index a551493..e5f0c54 100644 --- a/opensm/libvendor/osm_vendor_ibumad.c +++ b/opensm/libvendor/osm_vendor_ibumad.c @@ -973,7 +973,7 @@ ib_mad_t *osm_vendor_get(IN osm_bind_handle_t h_bind, "Acquired UMAD %p, size = %u\n", p_vw->umad, p_vw->size); OSM_LOG_EXIT(p_vend->p_log); - return umad_get_mad(p_vw->umad); + return (p_vw->umad ? 
umad_get_mad(p_vw->umad) : NULL); } /********************************************************************** From hal.rosenstock at gmail.com Fri Aug 14 07:24:10 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 14 Aug 2009 10:24:10 -0400 Subject: [ofa-general] will opensm respond to requests that do not originate from qp1 In-Reply-To: References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com> Message-ID: On Thu, Aug 13, 2009 at 3:41 PM, Hal Rosenstock wrote: > > > On 8/13/09, Sean Hefty wrote: >> >> Does anyone know off the top of their heads if opensm will respond >> correctly to >> SA MADs that are not sent from QP1? > > > I don't have the code in front of me right now (I can validate tomorrow) > but don't think that should be a problem as for responses it just takes the > incoming source QP and uses that for the dest QP. > Based on a code audit, I've confirmed that this should work (osm_vendor_ibumad.c:osm_vendor_send takes care of doing this). I'm not sure it's been tried for SA but it has been exercised for other GS classes (sending to some QP other than QP1). -- Hal > Are you suspecting some issue here ? > > -- Hal > > - Sean 
From hnrose at comcast.net Fri Aug 14 07:52:49 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 14 Aug 2009 10:52:49 -0400 Subject: [ofa-general] [PATCH] infiniband-diags/saquery.c: Fix typo in option name Message-ID: <20090814145249.GA8448@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 330c6aa..313f9a7 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -1638,7 +1638,7 @@ int main(int argc, char **argv) {"reversible", 'r', 1, NULL, "Reversible path (PathRecord)"}, {"numb_path", 'n', 1, NULL, "Number of paths (PathRecord)"}, {"pkey", 18, 1, NULL, "P_Key (PathRecord, MCMemberRecord)"}, - {"qos_calss", 'Q', 1, NULL, "QoS Class (PathRecord)"}, + {"qos_class", 'Q', 1, NULL, "QoS Class (PathRecord)"}, {"sl", 19, 1, NULL, "Service level (PathRecord, MCMemberRecord)"}, {"mtu", 'M', 1, NULL, From nmehrotra at riorey.com Fri Aug 14 08:42:42 2009 From: nmehrotra at riorey.com (Nitin Mehrotra) Date: Fri, 14 Aug 2009 10:42:42 -0500 (GMT-05:00) Subject: [ofa-general] Help - RDMA event files remain open after acknowledging them In-Reply-To: <940427960.2111250264532375.JavaMail.root@zmail.riorey.com> Message-ID: <822888568.2131250264562447.JavaMail.root@zmail.riorey.com> Folks, This may be a newbie question but I can't seem to find the answer and I'm hoping someone can point me in the right direction. I'm building an IB application where the two ends are required to robustly connect when present. Either of the ends may fail for extended periods of time and the other needs to handle this and reconnect when the peer recovers. The server is trivial since it passively listens for connections but the client is giving me some trouble. I have used a model similar to the one described in http://linux.die.net/man/7/rdma_cm. 
The general connection flow on the client is rdma_create_id/rdma_resolve_addr/rdma_create_qp/rdma_resolve_route/rdma_connect, handling the events as appropriate. This works when the peer (server) is present. However, when the server is not present I have observed that rdma_resolve_addr and rdma_resolve_route succeed (since the local HCA and SM are present) and then I get an RDMA_CM_EVENT_REJECTED or an RDMA_CM_EVENT_UNREACHABLE event. At this point I delete the IB resources allocated between steps 1 & 2 (QP, CQE, CQ, etc.) and restart from rdma_resolve_addr. As an aside, I found that I could not just restart rdma_resolve_route - that returned error EINVAL, so I had to restart from rdma_resolve_addr. The problem I am facing is that every RDMA event I receive (from uverbs, it appears) seems to create a special file that is linked to "infinibandevent:". See below. Even though I am careful to acknowledge every RDMA event I receive (rdma_ack_cm_event for every rdma_get_cm_event), these files don't get closed or deleted, so that eventually the application fails with error EMFILE (too many open files) when trying to create the completion event queue (as part of creating the QP). What am I doing wrong? Is there something more I need to do than calling rdma_ack_cm_event after every rdma_get_cm_event to get these event files to be closed? As an FYI, I have even tried closing the rdma_id and destroying the event channel when the connection fails to force the event files to be closed, without success. Btw, this is a user space application and I am using OFED 1.4.1 on Linux 2.6.27 (Gentoo distro). It should be irrelevant but just in case: this is using a Mellanox HCA, and both peers are on a local subnet with only one IB interface per peer. 
Thanks, Nitin filter-1 ib # ls -l /proc/8072/fd total 0 lrwx------ 1 root root 64 Aug 14 06:44 0 -> /dev/pts/0 lrwx------ 1 root root 64 Aug 14 06:44 1 -> /dev/pts/0 lr-x------ 1 root root 64 Aug 14 06:44 10 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 11 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 12 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 13 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 14 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 15 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 16 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 17 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 18 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 19 -> infinibandevent: lrwx------ 1 root root 64 Aug 14 06:44 2 -> /dev/pts/0 lr-x------ 1 root root 64 Aug 14 06:44 20 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 21 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 22 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 23 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 24 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 25 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 26 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 27 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 28 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 29 -> infinibandevent: lrwx------ 1 root root 64 Aug 14 06:44 3 -> socket:[223603] lr-x------ 1 root root 64 Aug 14 06:44 30 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 31 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 32 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 33 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 34 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 35 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 36 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 37 -> 
infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 38 -> infinibandevent: lr-x------ 1 root root 64 Aug 14 06:44 39 -> infinibandevent: These grow until 999 files and then the app fails. From sean.hefty at intel.com Fri Aug 14 09:34:15 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 14 Aug 2009 09:34:15 -0700 Subject: [ofa-general] Help - RDMA event files remain open after acknowledging them In-Reply-To: <822888568.2131250264562447.JavaMail.root@zmail.riorey.com> References: <940427960.2111250264532375.JavaMail.root@zmail.riorey.com> <822888568.2131250264562447.JavaMail.root@zmail.riorey.com> Message-ID: >What am I doing wrong? Is there something more I need to do than calling >rdma_ack_cm_event after every rdma_get_cm_event to get these event files to be >closed? As an fyi, I have even tried closing the rdma_id and destroying the >event channel when the connection fails to force the event files to be closed >without success. The following calls result in opening files to the kernel: ibv_create_comp_channel() - used to report CQ events rdma_create_event_channel() - used to report RDMA CM events Be sure that there are corresponding calls to: ibv_destroy_comp_channel() rdma_destroy_event_channel() These are the calls that close the opened files. - Sean From sean.hefty at intel.com Fri Aug 14 10:15:23 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 14 Aug 2009 10:15:23 -0700 Subject: [ofa-general] will opensm respond to requests that do not originate from qp1 In-Reply-To: References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com> Message-ID: >Based on a code audit, I've confirmed that this should work >(osm_vendor_ibumad.c:osm_vendor_send takes care of doing this). I'm not sure >it's been tried for SA but it has been exercised for other GS classes (sending >to some QP other than QP1). Thanks for checking and pointing me at the right source file. 
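On the event-file thread above: the pairing Sean describes (each ibv_create_comp_channel() and rdma_create_event_channel() matched by its destroy call on every failed attempt) can be illustrated with a minimal sketch. To keep it compilable without OFED installed, pipe(2) is swapped in as a stand-in for the two channel-creating calls; the struct and function names are hypothetical, and /proc/self/fd is assumed to exist (Linux). It only shows that the descriptor count stays flat across retries when every open is matched by a close, and grows otherwise.

```c
#include <assert.h>
#include <dirent.h>
#include <unistd.h>

/* Count entries in /proc/self/fd (Linux-specific assumption).
 * The count includes the descriptor opendir() itself uses, which is
 * the same on every call, so differences between calls are exact. */
static int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *ent;
    int n = 0;

    if (!dir)
        return -1;
    while ((ent = readdir(dir)) != NULL)
        if (ent->d_name[0] != '.')
            n++;
    closedir(dir);
    return n;
}

/* Hypothetical stand-in for the per-attempt channels. */
struct attempt_channels {
    int cm_fds[2];   /* stands in for rdma_create_event_channel() */
    int comp_fds[2]; /* stands in for ibv_create_comp_channel()   */
};

static int channels_open(struct attempt_channels *ch)
{
    if (pipe(ch->cm_fds) < 0)
        return -1;
    if (pipe(ch->comp_fds) < 0) {
        close(ch->cm_fds[0]);
        close(ch->cm_fds[1]);
        return -1;
    }
    return 0;
}

/* Stands in for ibv_destroy_comp_channel() followed by
 * rdma_destroy_event_channel(). */
static void channels_close(struct attempt_channels *ch)
{
    assert(ch != NULL);
    close(ch->comp_fds[0]);
    close(ch->comp_fds[1]);
    close(ch->cm_fds[0]);
    close(ch->cm_fds[1]);
}

/* Simulate `attempts` failed connection attempts and return how many
 * descriptors the process gained. `release` selects whether each
 * attempt tears its channels down before retrying. */
static int fd_growth_after_retries(int attempts, int release)
{
    int before = count_open_fds();
    int i;

    for (i = 0; i < attempts; i++) {
        struct attempt_channels ch;

        if (channels_open(&ch) < 0)
            break;
        /* ... connect fails here (REJECTED / UNREACHABLE) ... */
        if (release)
            channels_close(&ch);
    }
    return count_open_fds() - before;
}
```

Calling fd_growth_after_retries() with release set reports no growth, while the leaking variant gains four descriptors per attempt in this stand-in. In a real client the same before/after check around one failed attempt would show whether a channel is being created without its matching destroy call.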
From robertacummins at gmail.com Fri Aug 14 10:16:14 2009
From: robertacummins at gmail.com (Robert Cummins)
Date: Fri, 14 Aug 2009 11:16:14 -0600
Subject: [ofa-general] RHEL 5.3 (2.6.18-128.1.1.el5 kernel) and connected mode
Message-ID: <1250270174.6330.135.camel@rockymtn.cumminsconsultants.com>

Hello,

IHAC that is experiencing a problem with IB. Specifically, when placing
the Infinihost III card in connected mode with
'echo connected > /sys/class/net/ib0/mode' some nodes stop responding.
By 'stop responding' I mean:

 - ping doesn't work (no packets returned; 100% packet loss)
 - ib_rdma_bw -b node never runs
 - ibping does work

Since the customer is mounting their nfs server over IB, nfs services stop
working when in connected mode. What is interesting is if I leave the nfs
server in datagram mode then the affected nodes can still interact with the
nfs server, i.e., nfs service continues to work, but I cannot communicate
over IB with other nodes that are also in connected mode.

At first I thought this was only a problem with IPoIB. I note the following
difference between nodes that do not work in connected mode and nodes that
do. The first output is from a node that stops working, the second from a
node that continues to work.
[root at ws3 ~]# modinfo ib_ipoib filename: /lib/modules/2.6.18-128.el5/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko license: Dual BSD/GPL description: IP-over-InfiniBand net driver author: Roland Dreier srcversion: E3C28A100A995101E2AB934 depends: ib_cm,ipv6,ib_core,ib_sa vermagic: 2.6.18-128.el5 SMP mod_unload gcc-4.1 parm: max_nonsrq_conn_qp:Max number of connected-mode QPs per interface (applied only if shared receive queue is not available) (int) parm: set_nonsrq:set to dictate working in none SRQ mode, otherwise act according to device capabilities (int) parm: mcast_debug_level:Enable multicast debug tracing if > 0 (int) parm: send_queue_size:Number of descriptors in send queue (int) parm: recv_queue_size:Number of descriptors in receive queue (int) parm: debug_level:Enable debug tracing if > 0 (int) module_sig: 883f35049492f615cdc734e64d24fa112659309d1b9619270a5e84a97a46cbc6e4ac0908b21f20a0a75b803bc72eba1ce62d2a8eec53fd9c2d7288c [root at ws3 ~]# [root at scyld ~]# modinfo ib_ipoib filename: /lib/modules/2.6.18-128.1.1.el5.530g0000/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko license: Dual BSD/GPL description: IP-over-InfiniBand net driver author: Roland Dreier srcversion: 8E47481E21B330BFE32B7CE depends: ib_cm,ipv6,ib_core,ib_sa vermagic: 2.6.18-128.1.1.el5.530g0000 SMP mod_unload gcc-4.1 parm: max_nonsrq_conn_qp:Max number of connected-mode QPs per interface (applied only if shared receive queue is not available) (int) parm: set_nonsrq:set to dictate working in none SRQ mode, otherwise act according to device capabilities (int) parm: mcast_debug_level:Enable multicast debug tracing if > 0 (int) parm: send_queue_size:Number of descriptors in send queue (int) parm: recv_queue_size:Number of descriptors in receive queue (int) parm: debug_level:Enable debug tracing if > 0 (int) module_sig: 883f35049c0555e56ccec1c0ba19c3112535c09b5f5dbc8607465f947d60f2be7fa26132d43309f5dc241bebfe2f2f88fc7c93fbe5ea12cd721a59 [root at scyld ~]# However, after retesting with 
ib_rdma_bw I can see that even the verbs layer is not working. I have not
tried using the ib_ipoib.ko from the 'working' configuration in the
non-working system since I assumed it would not load due to the slight
kernel difference.

It should be noted that I have four nodes that fail and nearly 20 that
'work'. The failing nodes are running the same kernel (2.6.18-128.el5)
while the working nodes are running the 2.6.18-128.1.1.el5 kernel. I am at
a loss as to how to proceed with debugging this short of getting the latest
OFED distro and building it.

Has anyone else run into this problem and if so, how did you get around it?

TIA

R.

From nmehrotra at riorey.com Fri Aug 14 12:02:21 2009
From: nmehrotra at riorey.com (Nitin Mehrotra)
Date: Fri, 14 Aug 2009 14:02:21 -0500 (GMT-05:00)
Subject: How to destroy IB resources (was Re: [ofa-general] Help - RDMA event files remain open after acknowledging them)
In-Reply-To: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com>
Message-ID: <1048182029.3001250276541490.JavaMail.root@zmail.riorey.com>

Sean,

Thanks for your reply. It turns out the problem is the file created by the
ibv_create_comp_channel() call. I do make sure to call the destroy call for
each create call; the problem is that it is failing with error 16 (device
or resource busy) and I missed that fact.

So this brings me to another newbie question which I haven't been able to
completely solve, and that is how to cleanly and successfully destroy all IB
resources. Since this is a new subject I have changed the thread subject
appropriately.
I init IB as follows:

 - ibv_create_comp_channel()
 - make ccc_fd non-blocking
 - ibv_create_cq()
 - ibv_req_notify_cq()
 - ibv_alloc_pd()
 - ibv_create_qp()

I shut down in the reverse order:

 - drain_cq()
 - ibv_destroy_qp()
 - ibv_dealloc_pd()
 - ibv_destroy_cq()
 - ibv_destroy_comp_channel()

My drain_cq() function is:

 loop:
 - ibv_get_cq_event()
 - ibv_ack_cq_events() all unacknowledged events pending, if any
 - ibv_req_notify_cq()
 - ibv_poll_cq()
 until either ibv_get_cq_event() returns an error, ibv_poll_cq() returns 0
 completions or I have looped the depth of the cq.

It works, in a fashion. Without the drain function ibv_destroy_cq() hangs.
However now, ibv_get_cq_event() returns EAGAIN continuously so I exit after
I have looped the depth of cq, and then ibv_dealloc_pd(), ibv_destroy_cq()
and ibv_destroy_comp_channel() all return error EBUSY. This leaves the file
open in the system.

I guess my question is, what's the best way to destroy IB resources?
(Perhaps even, what's the best way to init them in the first place).

Thanks,
Nitin

----- Original Message -----
From: "Sean Hefty"
To: "Nitin Mehrotra" , general at lists.openfabrics.org
Sent: Friday, August 14, 2009 12:34:15 PM GMT -05:00 US/Canada Eastern
Subject: RE: [ofa-general] Help - RDMA event files remain open after acknowledging them

>What am I doing wrong? Is there something more I need to do than calling
>rdma_ack_cm_event after every rdma_ack_cm_event to get these event files to be
>closed? As an fyi, I have even tried closing the rdma_id and destroying the
>event channel when the connection fails to force the event files to be closed
>without success.

The following calls result in opening files to the kernel:

ibv_create_comp_channel() - used to report cq events
rdma_create_event_channel() - used to report rdma cm events

Be sure that there are corresponding calls to:

ibv_destroy_comp_channel()
rdma_destroy_event_channel()

These are the calls that close the opened files.
- Sean

From sean.hefty at intel.com Fri Aug 14 12:45:36 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 14 Aug 2009 12:45:36 -0700
Subject: How to destroy IB resources (was Re: [ofa-general] Help - RDMA event files remain open after acknowledging them)
In-Reply-To: <1048182029.3001250276541490.JavaMail.root@zmail.riorey.com>
References: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com> <1048182029.3001250276541490.JavaMail.root@zmail.riorey.com>
Message-ID: <9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com>

>I guess my question is, what's the best way to destroy IB resources? (Perhaps
>even, what's the best way to init them in the first place).

If you're destroying the CQ, there's no need to call ibv_get_cq_event() or
ibv_poll_cq(), unless you need completion information (for example, from flushed
receives).

However, every successful call to ibv_get_cq_event() needs a corresponding call
to ibv_ack_cq_events(). You can call ack(1) for each cq event, or count the
number of times that get returns success and call ack(get_cnt) once before
calling destroy. Note that the count refers to the number of cq events, and not
the number of completions returned through ibv_poll_cq.

For your drain_cq() function, you should be safe doing something like this:

while (ibv_poll_cq(...) > 0)
	/* optional processing of any left over completions */;

ibv_ack_cq_events(...this_cqs_total_event_cnt); /* or ack after get */
ibv_destroy_cq(...);

>ibv_dealloc_pd(), ibv_destroy_cq() and ibv_destroy_comp_channel() all return
>error EBUSY

This sounds like a QP isn't being destroyed. I'm not sure that anything else
fails CQ destruction with EBUSY.

Btw, if you're using the rdma_cm interface, then it's simpler to use the
rdma_create_qp/rdma_destroy_qp calls, which allows the rdma_cm to perform the QP
state transitions for you.
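The counting approach described in this reply can be sketched as follows. This is a bookkeeping sketch only, under stated assumptions: struct cq_event_counter and its two helpers are hypothetical names that do not exist in libibverbs, and the verbs calls appear only in comments so that the fragment compiles without the library.

```c
#include <assert.h>

/* Hypothetical per-CQ bookkeeping; not a libibverbs type. */
struct cq_event_counter {
    unsigned int unacked;  /* successful ibv_get_cq_event() calls not yet acked */
};

/* Call once after every ibv_get_cq_event() that returned 0. */
static void cq_event_got(struct cq_event_counter *c)
{
    c->unacked++;
}

/* Return the count to hand to ibv_ack_cq_events() and reset the counter. */
static unsigned int cq_event_take_unacked(struct cq_event_counter *c)
{
    unsigned int n = c->unacked;

    c->unacked = 0;
    return n;
}

/*
 * Assumed teardown sequence (QP before CQ, CQ before its channel):
 *
 *   n = cq_event_take_unacked(&ctr);
 *   if (n)
 *       ibv_ack_cq_events(cq, n);      // one batched ack for all gets
 *   ibv_destroy_qp(qp);
 *   ibv_destroy_cq(cq);
 *   ibv_destroy_comp_channel(channel);
 *   ibv_dealloc_pd(pd);
 */
```

Because ibv_ack_cq_events() takes the CQ and a count rather than an event object, the shutdown path does not have to hold on to the last event returned by ibv_get_cq_event(); keeping the number of unacknowledged events is enough.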
- Sean From rdreier at cisco.com Fri Aug 14 15:15:44 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 14 Aug 2009 15:15:44 -0700 Subject: [ofa-general] [PATCH/RFC] IB/mad: Fix possible deadlock (cancel_delayed_work inside spinlock) In-Reply-To: (Roland Dreier's message of "Mon, 10 Aug 2009 18:59:03 -0700") References: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com> Message-ID: How about this approach? Basically it just open-codes delayed work by splitting the timer and the work struct, and switches to mod_timer() instead of del_timer() + add_timer(). It passes very light testing here (basically I started ipoib and nothing blew up). --- drivers/infiniband/core/mad.c | 51 +++++++++++++++++------------------ drivers/infiniband/core/mad_priv.h | 3 +- 2 files changed, 27 insertions(+), 27 deletions(-) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 5cef8f8..16ff496 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -174,6 +174,15 @@ int ib_response_mad(struct ib_mad *mad) } EXPORT_SYMBOL(ib_response_mad); +static void timeout_callback(unsigned long data) +{ + struct ib_mad_agent_private *mad_agent_priv = + (struct ib_mad_agent_private *) data; + + queue_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->timeout_work); +} + /* * ib_register_mad_agent - Register to send/receive MADs */ @@ -305,7 +314,9 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, INIT_LIST_HEAD(&mad_agent_priv->wait_list); INIT_LIST_HEAD(&mad_agent_priv->done_list); INIT_LIST_HEAD(&mad_agent_priv->rmpp_list); - INIT_DELAYED_WORK(&mad_agent_priv->timed_work, timeout_sends); + INIT_WORK(&mad_agent_priv->timeout_work, timeout_sends); + setup_timer(&mad_agent_priv->timeout_timer, timeout_callback, + (unsigned long) mad_agent_priv); INIT_LIST_HEAD(&mad_agent_priv->local_list); INIT_WORK(&mad_agent_priv->local_work, local_completions); atomic_set(&mad_agent_priv->refcount, 1); @@ -512,7 +523,8 
@@ static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) */ cancel_mads(mad_agent_priv); port_priv = mad_agent_priv->qp_info->port_priv; - cancel_delayed_work(&mad_agent_priv->timed_work); + del_timer_sync(&mad_agent_priv->timeout_timer); + cancel_work_sync(&mad_agent_priv->timeout_work); spin_lock_irqsave(&port_priv->reg_lock, flags); remove_mad_reg_req(mad_agent_priv); @@ -1970,10 +1982,9 @@ out: static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) { struct ib_mad_send_wr_private *mad_send_wr; - unsigned long delay; if (list_empty(&mad_agent_priv->wait_list)) { - cancel_delayed_work(&mad_agent_priv->timed_work); + del_timer(&mad_agent_priv->timeout_timer); } else { mad_send_wr = list_entry(mad_agent_priv->wait_list.next, struct ib_mad_send_wr_private, @@ -1982,13 +1993,8 @@ static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) if (time_after(mad_agent_priv->timeout, mad_send_wr->timeout)) { mad_agent_priv->timeout = mad_send_wr->timeout; - cancel_delayed_work(&mad_agent_priv->timed_work); - delay = mad_send_wr->timeout - jiffies; - if ((long)delay <= 0) - delay = 1; - queue_delayed_work(mad_agent_priv->qp_info-> - port_priv->wq, - &mad_agent_priv->timed_work, delay); + mod_timer(&mad_agent_priv->timeout_timer, + mad_send_wr->timeout); } } } @@ -2015,17 +2021,14 @@ static void wait_for_response(struct ib_mad_send_wr_private *mad_send_wr) temp_mad_send_wr->timeout)) break; } - } - else + } else list_item = &mad_agent_priv->wait_list; list_add(&mad_send_wr->agent_list, list_item); /* Reschedule a work item if we have a shorter timeout */ - if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { - cancel_delayed_work(&mad_agent_priv->timed_work); - queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, - &mad_agent_priv->timed_work, delay); - } + if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) + mod_timer(&mad_agent_priv->timeout_timer, + mad_send_wr->timeout); } void 
ib_reset_mad_timeout(struct ib_mad_send_wr_private *mad_send_wr, @@ -2469,10 +2472,10 @@ static void timeout_sends(struct work_struct *work) struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_send_wr_private *mad_send_wr; struct ib_mad_send_wc mad_send_wc; - unsigned long flags, delay; + unsigned long flags; mad_agent_priv = container_of(work, struct ib_mad_agent_private, - timed_work.work); + timeout_work); mad_send_wc.vendor_err = 0; spin_lock_irqsave(&mad_agent_priv->lock, flags); @@ -2482,12 +2485,8 @@ static void timeout_sends(struct work_struct *work) agent_list); if (time_after(mad_send_wr->timeout, jiffies)) { - delay = mad_send_wr->timeout - jiffies; - if ((long)delay <= 0) - delay = 1; - queue_delayed_work(mad_agent_priv->qp_info-> - port_priv->wq, - &mad_agent_priv->timed_work, delay); + mod_timer(&mad_agent_priv->timeout_timer, + mad_send_wr->timeout); break; } diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h index 05ce331..1526fa2 100644 --- a/drivers/infiniband/core/mad_priv.h +++ b/drivers/infiniband/core/mad_priv.h @@ -99,7 +99,8 @@ struct ib_mad_agent_private { struct list_head send_list; struct list_head wait_list; struct list_head done_list; - struct delayed_work timed_work; + struct work_struct timeout_work; + struct timer_list timeout_timer; unsigned long timeout; struct list_head local_list; struct work_struct local_work; From nmehrotra at riorey.com Fri Aug 14 16:41:08 2009 From: nmehrotra at riorey.com (Nitin Mehrotra) Date: Fri, 14 Aug 2009 19:41:08 -0400 Subject: How to destroy IB resources (was Re: [ofa-general] Help - RDMA event files remain open after acknowledging them) In-Reply-To: <9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com> References: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com> <1048182029.3001250276541490.JavaMail.root@zmail.riorey.com> <9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com> Message-ID: <4A85F614.7040106@riorey.com> Hmm, I do amortize the cost of 
ibv_ack_cq_event() over multiple ibv_get_cq_event() calls; however when the shutdown is in progress I don't have the last event that was "gotten" so I have to call ibv_get_cq_event() one last time to get an event to acknowledge against. I suppose it's probably better to keep the last event processed if it hasn't been acknowledged and use it to issue the final acknowledge when shutting down. Then I wouldn't have to make that ibv_get_cq_event() call. One last question, when I create the completing event queue I set it to non-blocking but I find that during shutdown I have to do that again before making the final call to ibv_get_cq_event() otherwise it blocks. Which I suppose is why it returns EAGAIN when there are no pending events, but I don't understand why I have to set it to non-blocking again. Anyway, much thanks for all your help. Nitin Sean Hefty wrote: >> I guess my question is, what's the best way to destroy IB resources? (Perhaps >> even, what's the best way to init them in the first place). >> > > If you're destroying the CQ, there's no need to call ibv_get_cq_event() or > ibv_poll_cq(), unless you need completion information (for example, from flushed > receives). > > However, every successful call to ibv_get_cq_event() needs a corresponding call > to ibv_ack_cq_event(). You can call ack(1) for each cq event, or count the > number of times that get returns success and call ack(get_cnt) once before > calling destroy. Note that the count refers to the number of cq events, and not > the number of completions returned through ibv_poll_cq. > > For your drain_cq() function, you should be safe doing something like this: > > while (ibv_poll_cq(...) > 0) > /* optional processing of any left over completions */; > > ibv_ack_cq_event(...this_cqs_total_event_cnt); /* or ack after get */ > ibv_destroy_cq(...); > > >> ibv_dealloc_pd(), ibv_destroy_cq() and ibv_destroy_comp_channel() all return >> error EBUSY >> > > This sounds like a QP isn't being destroyed. 
I'm not sure that anything else > fails CQ destruction with EBUSY. > > Btw, if you're using the rdma_cm interface, then it's simpler to use the > rdma_create_qp/rdma_destroy_qp calls, which allows the rdma_cm to perform the QP > state transitions for you. > > - Sean From sean.hefty at intel.com Fri Aug 14 21:36:32 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 14 Aug 2009 21:36:32 -0700 Subject: [ofa-general] RE: [PATCH/RFC] IB/mad: Fix possible deadlock (cancel_delayed_work inside spinlock) In-Reply-To: References: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com> Message-ID: <956908184AA146C5B1131524100A0EFE@amr.corp.intel.com> >How about this approach? Basically it just open-codes delayed work by >splitting the timer and the work struct, and switches to mod_timer() >instead of del_timer() + add_timer(). It passes very light testing here >(basically I started ipoib and nothing blew up). The approach looks okay to me. >@@ -512,7 +523,8 @@ static void unregister_mad_agent(struct >ib_mad_agent_private *mad_agent_priv) > */ > cancel_mads(mad_agent_priv); > port_priv = mad_agent_priv->qp_info->port_priv; >- cancel_delayed_work(&mad_agent_priv->timed_work); >+ del_timer_sync(&mad_agent_priv->timeout_timer); >+ cancel_work_sync(&mad_agent_priv->timeout_work); I had to check if there was a race between del_timer_sync() and the worker thread, but the call to cancel_mads() looks like it prevents any issues.
- Sean From rdreier at cisco.com Fri Aug 14 21:59:17 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 14 Aug 2009 21:59:17 -0700 Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock (cancel_delayed_work inside spinlock) In-Reply-To: <956908184AA146C5B1131524100A0EFE@amr.corp.intel.com> (Sean Hefty's message of "Fri, 14 Aug 2009 21:36:32 -0700") References: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com> <956908184AA146C5B1131524100A0EFE@amr.corp.intel.com> Message-ID: > > cancel_mads(mad_agent_priv); > > port_priv = mad_agent_priv->qp_info->port_priv; > >- cancel_delayed_work(&mad_agent_priv->timed_work); > >+ del_timer_sync(&mad_agent_priv->timeout_timer); > >+ cancel_work_sync(&mad_agent_priv->timeout_work); > > I had to check if there was a race between del_timer_sync() and the worker > thread, but the call to cancel_mads() looks like it prevents any issues. Yeah, I think it's OK, and in any case any race is already there I think (since cancel_delayed_work is essentially equivalent to del_timer_sync) Thanks. From bart.vanassche at gmail.com Fri Aug 14 22:56:18 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Sat, 15 Aug 2009 07:56:18 +0200 Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock (cancel_delayed_work inside spinlock) In-Reply-To: References: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com> Message-ID: On Sat, Aug 15, 2009 at 12:15 AM, Roland Dreier wrote: > How about this approach?  Basically it just open-codes delayed work by > splitting the timer and the work struct, and switches to mod_timer() > instead of del_timer() + add_timer().  It passes very light testing here > (basically I started ipoib and nothing blew up). The patch looks fine to me. Bart. 
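The patch discussed in this thread drops the relative delay computation (timeout - jiffies, clamped to at least 1) and hands the absolute expiry to mod_timer() instead, while still using time_after() to compare expiries. The two pieces of arithmetic can be sketched in user space as below, assuming unsigned long tick counters on a two's-complement machine; tick_after() and delay_until() are illustrative names, not kernel APIs.

```c
#include <assert.h>
#include <limits.h>

/*
 * Stand-in for the kernel's time_after(a, b): true when tick value a is
 * later than b, even if the counter wrapped around between the two.
 */
static int tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * The relative delay the pre-patch code computed before queueing delayed
 * work: ticks remaining until 'expires', clamped to at least one tick when
 * the expiry is now or already in the past.
 */
static unsigned long delay_until(unsigned long expires, unsigned long now)
{
    long delay = (long)(expires - now);

    return delay <= 0 ? 1UL : (unsigned long)delay;
}
```

With an absolute expiry there is no subtraction against the current tick count at the call site, which is why the delay variable and its clamp disappear from adjust_timeout() and timeout_sends() in the diff.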
From vlad at lists.openfabrics.org Sat Aug 15 03:01:37 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 15 Aug 2009 03:01:37 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090815-0200 daily build status Message-ID: <20090815100138.18BB8E61D07@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': 
/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build 
failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090815-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hnrose at comcast.net Sat Aug 15 06:46:24 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sat, 15 Aug 2009 09:46:24 -0400 Subject: [ofa-general] [PATCH] infiniband-diags/saquery.c: Fix CHECK_AND_SET_VAL macro Message-ID: <20090815134624.GA25048@comcast.net> Changed check from > to != since using integer comparison and some masks can use full range and hence be negative Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 330c6aa..e1e2cfc 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -3,6 +3,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. 
* Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * Produced at Lawrence Livermore National Laboratory. * Written by Ira Weiny . @@ -922,7 +923,7 @@ static int parse_lid_and_ports(bind_handle_t h, #define cl_hton8(x) (x) #define CHECK_AND_SET_VAL(val, size, comp_with, target, name, mask) \ - if ((int##size##_t) val > (int##size##_t) comp_with) { \ + if ((int##size##_t) val != (int##size##_t) comp_with) { \ target = cl_hton##size((uint##size##_t) val); \ comp_mask |= IB_##name##_COMPMASK_##mask; \ } From hal.rosenstock at gmail.com Sat Aug 15 07:06:31 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 15 Aug 2009 10:06:31 -0400 Subject: [ofa-general] will opensm respond to requests that do not originate from qp1 In-Reply-To: <20090813210924.GQ16677@obsidianresearch.com> References: <358B3524FE2744959DAE588F0F5457D5@amr.corp.intel.com> <6F281F1FB20A411C88BEDE76C539AF95@amr.corp.intel.com> <20090813200023.GO16677@obsidianresearch.com> <20090813210924.GQ16677@obsidianresearch.com> Message-ID: On 8/13/09, Jason Gunthorpe wrote: > > On Thu, Aug 13, 2009 at 01:14:19PM -0700, Sean Hefty wrote: > > >Speaking of which, do we have an API to get the node's SM_Key for SA > > >packet construction? > > > > Not that I'm aware of. The ib-diags take the smkey as a command line > option. > > Hmm, and the kernel wires it to zero. What are you referring to being wired by kernel to zero ? AFAIK neither use (there are two) of SM_Key is wired to zero. > That's uncool. > > So, any process that can create a QP can alter, say, the nodes > multicast group membership. > > Thats a bit of a security problem. > > I admit though, I haven't been able to discern what the SM_Key should > be set to from the spec.. It's a policy (SM admin) decision. 
-- Hal > > > -- > Jason Gunthorpe (780)4406067x832 > Chief Technology Officer, Obsidian Research Corp Edmonton, Canada > From bryan.d.green at nasa.gov Sat Aug 15 15:55:38 2009 From: bryan.d.green at nasa.gov (Bryan Green) Date: Sat, 15 Aug 2009 15:55:38 -0700 Subject: [ofa-general] librdmacm - okay to select on a cm channel's file descriptor? Message-ID: <20090815225538.ABA412391C7@ece06.nas.nasa.gov> Hi, I'm using librdmacm for connection management (on Linux). In an attempt to get unexpected DISCONNECT notifications during ib communication, I'm trying to use 'select()' on the cm channel's file descriptor, testing it for readability. I've found that this works some of the time, but not all of the time. Is this a legitimate way to test for disconnections, or am I required to either make the descriptor nonblocking and just poll, or use a background thread for receiving cm events? I'd rather not use the nonblocking approach, because I'd like to simultaneously select on the cm channel descriptor and an ibv_comp_channel descriptor. I'm not sure if selecting on the ibv_comp_channel descriptor is acceptable either, but it appears to work. I'd appreciate it if anyone can enlighten me on this. Thanks, -bryan From sashak at voltaire.com Sun Aug 16 01:21:16 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 11:21:16 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: make accessors function for timeout values used in libibmad In-Reply-To: <20090806160106.4725041e.weiny2@llnl.gov> References: <20090806160106.4725041e.weiny2@llnl.gov> Message-ID: <20090816082116.GE25501@me> On 16:01 Thu 06 Aug , Ira Weiny wrote: > Sasha, > > In using mad_send_via and mad_receive_via I have found getting the timeout and retry values from the mad layer to be beneficial. > > This and the patch that follows export functions to get those values as well as standardize the use of them internally.
> > Ira > > > From: Ira Weiny > Date: Mon, 27 Jul 2009 13:48:17 -0700 > Subject: [PATCH] libibmad: make accessors function for timeout values used in libibmad > > In addition use this function to determine the timeout to be used throughout the library. > > Signed-off-by: Ira Weiny Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 16 01:30:53 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 11:30:53 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: make accessors function for retry values used in libibmad In-Reply-To: <20090806160107.83193923.weiny2@llnl.gov> References: <20090806160107.83193923.weiny2@llnl.gov> Message-ID: <20090816083053.GF25501@me> On 16:01 Thu 06 Aug , Ira Weiny wrote: > > From: Ira Weiny > Date: Thu, 6 Aug 2009 15:27:30 -0700 > Subject: [PATCH] libibmad: make accessors function for retry values used in libibmad > > In addition use this function to determine the retries used throughout the library. > > Signed-off-by: Ira Weiny Applied. Thanks. Sasha From tziporet at dev.mellanox.co.il Sun Aug 16 02:23:18 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 16 Aug 2009 12:23:18 +0300 Subject: [ofa-general] RHEL 5.3 (2.6.18-128.1.1.el5 kernel) and connected mode In-Reply-To: <1250270174.6330.135.camel@rockymtn.cumminsconsultants.com> References: <1250270174.6330.135.camel@rockymtn.cumminsconsultants.com> Message-ID: <4A87D006.8030308@mellanox.co.il> Robert Cummins wrote: > Hello, > > IHAC that is experiencing a problem with IB. Specifically, when placing > the Infinihost III card in connected mode with 'echo connected > >> /sys/class/net/ib0/mode' some nodes stop responding. By 'stop >> > responding' I mean: > > - ping doesn't work (no packets returned; 100% packet > loss) > - ib_rdma_bw -b node never runs > - ibping does work > > ... > It should be noted that the I have four nodes that fail and nearly 20 > that 'work'. 
The failing nodes are running the same kernel > (2.6.18-128.el5) while the working nodes are running the > 2.6.18-128.1.1.el5 kernel. I am at a loss as to how to proceed with > debugging this short of getting the latest OFED distro and building it. > > Has anyone else run into this problem and if so, how did you get around > it? > > What is the FW version you use? Can you see if there are any interesting messages in /var/log/messages, especially from mthca driver Tziporet From sashak at voltaire.com Sun Aug 16 02:49:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 12:49:27 +0300 Subject: [ofa-general] [PATCH] libibmad: fix warnings In-Reply-To: <20090806160107.83193923.weiny2@llnl.gov> References: <20090806160107.83193923.weiny2@llnl.gov> Message-ID: <20090816094927.GG25501@me> Fix compilation warnings "passing argument 1 of ‘mad_get_retries’ discards qualifiers from pointer target type" for mad_get_timeout() and mad_get_retries() functions. Signed-off-by: Sasha Khapyorsky --- libibmad/include/infiniband/mad.h | 4 ++-- libibmad/src/mad.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 0d0dcf1..d8053b4 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -803,8 +803,8 @@ MAD_EXPORT void mad_rpc_set_retries(struct ibmad_port *port, int retries); MAD_EXPORT void mad_rpc_set_timeout(struct ibmad_port *port, int timeout); MAD_EXPORT int mad_rpc_class_agent(struct ibmad_port *srcport, int cls); -MAD_EXPORT int mad_get_timeout(struct ibmad_port *srcport, int override_ms); -MAD_EXPORT int mad_get_retries(struct ibmad_port *srcport); +MAD_EXPORT int mad_get_timeout(const struct ibmad_port *srcport, int override_ms); +MAD_EXPORT int mad_get_retries(const struct ibmad_port *srcport); /* register.c */ diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c index 7192dd6..1361e2b 100644 --- a/libibmad/src/mad.c +++ 
b/libibmad/src/mad.c @@ -64,13 +64,13 @@ uint64_t mad_trid(void) return next; } -int mad_get_timeout(struct ibmad_port *srcport, int override_ms) +int mad_get_timeout(const struct ibmad_port *srcport, int override_ms) { return (override_ms ? override_ms : srcport->timeout ? srcport->timeout : madrpc_timeout); } -int mad_get_retries(struct ibmad_port *srcport) +int mad_get_retries(const struct ibmad_port *srcport) { return (srcport->retries ? srcport->retries : madrpc_retries); } -- 1.6.4 From vlad at lists.openfabrics.org Sun Aug 16 03:07:48 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 16 Aug 2009 03:07:48 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090816-0200 daily build status Message-ID: <20090816100748.A55C5E28204@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with 
linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090816-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From sashak at voltaire.com Sun Aug 16 03:02:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 13:02:45 +0300 Subject: [ofa-general] Re: [PATCH] 
infiniband-diags/saquery.c: Fix CHECK_AND_SET_VAL macro In-Reply-To: <20090815134624.GA25048@comcast.net> References: <20090815134624.GA25048@comcast.net> Message-ID: <20090816100245.GJ25501@me> On 09:46 Sat 15 Aug , Hal Rosenstock wrote: > > Changed check from > to != since using integer comparison > and some masks can use full range and hence be negative Any example? Sasha From sashak at voltaire.com Sun Aug 16 03:03:03 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 13:03:03 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix typo in option name In-Reply-To: <20090814145249.GA8448@comcast.net> References: <20090814145249.GA8448@comcast.net> Message-ID: <20090816100303.GK25501@me> On 10:52 Fri 14 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 16 03:05:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 13:05:25 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_sm_mad_ctrl.c: Fix endian of status in error message In-Reply-To: <20090814115607.GA18583@comcast.net> References: <20090814115607.GA18583@comcast.net> Message-ID: <20090816100525.GL25501@me> On 07:56 Fri 14 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 16 03:06:03 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 13:06:03 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.c: In osm_dump_dr_smp, fix endian of status In-Reply-To: <20090814115823.GB18583@comcast.net> References: <20090814115823.GB18583@comcast.net> Message-ID: <20090816100603.GM25501@me> On 07:58 Fri 14 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Sun Aug 16 03:11:01 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 13:11:01 +0300 Subject: [ofa-general] Re: [PATCH] opensm/libvendor/osm_vendor_ibumad.c: Handle umad_alloc failure in osm_vendor_get In-Reply-To: <20090814141132.GA31087@comcast.net> References: <20090814141132.GA31087@comcast.net> Message-ID: <20090816101101.GN25501@me> On 10:11 Fri 14 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From hal.rosenstock at gmail.com Sun Aug 16 03:20:39 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 16 Aug 2009 06:20:39 -0400 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix CHECK_AND_SET_VAL macro In-Reply-To: <20090816100245.GJ25501@me> References: <20090815134624.GA25048@comcast.net> <20090816100245.GJ25501@me> Message-ID: On Sun, Aug 16, 2009 at 6:02 AM, Sasha Khapyorsky wrote: > On 09:46 Sat 15 Aug , Hal Rosenstock wrote: > > > > Changed check from > to != since using integer comparison > > and some masks can use full range and hence be negative > > Any example? Pkey for one. I think there are others too but didn't do a full audit of all the components which use CHECK_AND_SET_VAL. -- Hal > > > Sasha 
From sashak at voltaire.com Sun Aug 16 03:29:04 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 13:29:04 +0300 Subject: [ofa-general] Re: [PATCH 1/5] libibnetdisc: make all fields of ibnd_node_t public In-Reply-To: <20090813204242.b659d8f5.weiny2@llnl.gov> References: <20090813204242.b659d8f5.weiny2@llnl.gov> Message-ID: <20090816102904.GP25501@me> On 20:42 Thu 13 Aug , Ira Weiny wrote: > > static void dump_endnode(ib_portid_t * path, char *prompt, > - struct ibnd_node *node, struct ibnd_port *port) > + ibnd_node_t * node, struct ibnd_port *port) > { > char type[64]; > if (!show_progress) > return; > > - mad_dump_node_type(type, 64, &(node->node.type), sizeof(int)); > - > - printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n", > - portid2str(path), prompt, type, node->node.guid, > - node->node.type == IB_NODE_SWITCH ? 0 : port->port.portnum, > - port->port.base_lid, > - port->port.base_lid + (1 << port->port.lmc) - 1, > - node->node.nodedesc); > + mad_dump_node_type(type, 64, &(node->type), sizeof(int)), There is a ',' instead of ';' at the end of this statement. I'm fixing this (again :)) Sasha > + printf("%s -> %s %s {%016" PRIx64 > + "} portnum %d base lid %d-%d\"%s\"\n", portid2str(path), > + prompt, type, node->guid, > + node->type == IB_NODE_SWITCH ? 0 : port->port.portnum, > + port->port.base_lid, > + port->port.base_lid + (1 << port->port.lmc) - 1, > + node->nodedesc); > } From sashak at voltaire.com Sun Aug 16 03:39:02 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 13:39:02 +0300 Subject: [ofa-general] Re: [PATCH 1/5] libibnetdisc: make all fields of ibnd_node_t public In-Reply-To: <20090813204242.b659d8f5.weiny2@llnl.gov> References: <20090813204242.b659d8f5.weiny2@llnl.gov> Message-ID: <20090816103902.GQ25501@me> On 20:42 Thu 13 Aug , Ira Weiny wrote: > It would be really nice to have a commit message (in addition to the subject) for each patch. The cover email ([PATCH 0/N]) is not saved in the change history. 
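[Archive note, not part of the original message.] Since the patch carries no commit message, here is a rough sketch of what "make all fields of ibnd_node_t public" appears to amount to, judging from the quoted hunks: accesses such as `node->node.type` become `node->type`, and `CONV_NODE_INTERNAL()` conversions are dropped. The types below are hypothetical, trimmed-down stand-ins (`old_public_node_t`, `struct old_internal_node`, `new_node_t`, `OLD_CONV_NODE_INTERNAL`), and the cast-style macro is an assumption about how the old conversion might have worked; none of this is the real libibnetdisc code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, trimmed-down types for illustration only; these are NOT the
 * real libibnetdisc definitions.
 */

/* Old layout (assumed): the public struct is the first member of a private
 * wrapper, so a public pointer can be cast back to the wrapper. */
typedef struct old_public_node {
	uint64_t guid;
	int type;
} old_public_node_t;

struct old_internal_node {
	old_public_node_t node;	/* must stay first for the cast to be valid */
	unsigned char ch_found;	/* private bookkeeping */
};

/* Assumed shape of a CONV_NODE_INTERNAL-style helper. */
#define OLD_CONV_NODE_INTERNAL(n) ((struct old_internal_node *)(n))

static int old_is_found(old_public_node_t *n)
{
	assert(n != NULL);
	/* Private state is reachable only through the conversion. */
	return OLD_CONV_NODE_INTERNAL(n)->ch_found;
}

/* New layout: one struct; private fields are only marked by a comment. */
typedef struct new_node {
	uint64_t guid;
	int type;
	/* internal use only */
	unsigned char ch_found;
} new_node_t;

static int new_is_found(const new_node_t *n)
{
	assert(n != NULL);
	/* No conversion needed: the field lives in the public struct. */
	return n->ch_found;
}
```

The trade-off, under these assumptions, is simpler library code with no casts or double indirection, at the cost of exposing bookkeeping fields that callers are merely asked not to touch.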
> From: Ira Weiny > Date: Tue, 11 Aug 2009 15:15:21 -0700 > Subject: [PATCH] libibnetdisc: make all fields of ibnd_node_t public > > > Signed-off-by: Ira Weiny Applied. Thanks. Sasha > --- > .../libibnetdisc/include/infiniband/ibnetdisc.h | 12 +- > infiniband-diags/libibnetdisc/src/chassis.c | 147 ++++++++--------- > infiniband-diags/libibnetdisc/src/ibnetdisc.c | 173 ++++++++++---------- > infiniband-diags/libibnetdisc/src/internal.h | 22 +-- > 4 files changed, 166 insertions(+), 188 deletions(-) > > diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > index 121709d..e7f5f6a 100644 > --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > @@ -45,8 +45,8 @@ struct port; /* forward declare */ > /** ========================================================================= > * Node > */ > -typedef struct node { > - struct node *next; /* all node list in fabric */ > +typedef struct ibnd_node { > + struct ibnd_node *next; /* all node list in fabric */ > > ib_portid_t path_portid; /* path from "from_node" */ > int dist; /* num of hops from "from_node" */ > @@ -72,12 +72,18 @@ typedef struct node { > items MAY BE NULL! 
(ie 0 == switches only) */ > > /* chassis info */ > - struct node *next_chassis_node; /* next node in ibnd_chassis_t->nodes */ > + struct ibnd_node *next_chassis_node; /* next node in ibnd_chassis_t->nodes */ > struct chassis *chassis; /* if != NULL the chassis this node belongs to */ > unsigned char ch_type; > unsigned char ch_anafanum; > unsigned char ch_slotnum; > unsigned char ch_slot; > + > + /* internal use only */ > + unsigned char ch_found; > + struct ibnd_node *htnext; /* hash table list */ > + struct ibnd_node *dnext; /* nodesdist next */ > + struct ibnd_node *type_next; /* next based on type */ > } ibnd_node_t; > > /** ========================================================================= > diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c > index 120b4b6..0dd259a 100644 > --- a/infiniband-diags/libibnetdisc/src/chassis.c > +++ b/infiniband-diags/libibnetdisc/src/chassis.c > @@ -239,68 +239,68 @@ uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum) > return 0; > } > > -static int is_router(struct ibnd_node *n) > +static int is_router(ibnd_node_t * n) > { > - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); > + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); > return (devid == VTR_DEVID_IB_FC_ROUTER || > devid == VTR_DEVID_IB_IP_ROUTER); > } > > -static int is_spine_9096(struct ibnd_node *n) > +static int is_spine_9096(ibnd_node_t * n) > { > - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); > + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); > return (devid == VTR_DEVID_SFB4 || devid == VTR_DEVID_SFB4_DDR); > } > > -static int is_spine_9288(struct ibnd_node *n) > +static int is_spine_9288(ibnd_node_t * n) > { > - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); > + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); > return (devid == VTR_DEVID_SFB12 || devid == VTR_DEVID_SFB12_DDR); > } 
> > -static int is_spine_2004(struct ibnd_node *n) > +static int is_spine_2004(ibnd_node_t * n) > { > - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); > + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); > return (devid == VTR_DEVID_SFB2004); > } > > -static int is_spine_2012(struct ibnd_node *n) > +static int is_spine_2012(ibnd_node_t * n) > { > - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); > + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); > return (devid == VTR_DEVID_SFB2012); > } > > -static int is_spine(struct ibnd_node *n) > +static int is_spine(ibnd_node_t * n) > { > return (is_spine_9096(n) || is_spine_9288(n) || > is_spine_2004(n) || is_spine_2012(n)); > } > > -static int is_line_24(struct ibnd_node *n) > +static int is_line_24(ibnd_node_t * n) > { > - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); > - return (devid == VTR_DEVID_SLB24 || devid == VTR_DEVID_SLB24_DDR || > - devid == VTR_DEVID_SRB2004); > + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); > + return (devid == VTR_DEVID_SLB24 || > + devid == VTR_DEVID_SLB24_DDR || devid == VTR_DEVID_SRB2004); > } > > -static int is_line_8(struct ibnd_node *n) > +static int is_line_8(ibnd_node_t * n) > { > - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); > + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); > return (devid == VTR_DEVID_SLB8); > } > > -static int is_line_2024(struct ibnd_node *n) > +static int is_line_2024(ibnd_node_t * n) > { > - uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); > + uint32_t devid = mad_get_field(n->info, 0, IB_NODE_DEVID_F); > return (devid == VTR_DEVID_SLB2024); > } > > -static int is_line(struct ibnd_node *n) > +static int is_line(ibnd_node_t * n) > { > return (is_line_24(n) || is_line_8(n) || is_line_2024(n)); > } > > -int is_chassis_switch(struct ibnd_node *n) > +int is_chassis_switch(ibnd_node_t * n) > { > return (is_spine(n) || 
is_line(n)); > } > @@ -349,7 +349,7 @@ char anafa_spine4_slot_2_slb[25] = { > > /* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ > > -static int get_sfb_slot(struct ibnd_node *node, ibnd_port_t * lineport) > +static int get_sfb_slot(ibnd_node_t * node, ibnd_port_t * lineport) > { > ibnd_node_t *n = (ibnd_node_t *) node; > > @@ -372,25 +372,24 @@ static int get_sfb_slot(struct ibnd_node *node, ibnd_port_t * lineport) > n->ch_anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; > } else { > IBND_ERROR("Unexpected node found: guid 0x%016" PRIx64, > - node->node.guid); > + node->guid); > return (-1); > } > return (0); > } > > -static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport) > +static int get_router_slot(ibnd_node_t * n, ibnd_port_t * spineport) > { > - ibnd_node_t *n = (ibnd_node_t *) node; > uint64_t guessnum = 0; > > - node->ch_found = 1; > + n->ch_found = 1; > > n->ch_slot = SRBD_CS; > - if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) { > + if (is_spine_9096(spineport->node)) { > n->ch_type = ISR9096_CT; > n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; > n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; > - } else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) { > + } else if (is_spine_9288(spineport->node)) { > n->ch_type = ISR9288_CT; > n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; > /* this is a smart guess based on nodeguids order on sFB-12 module */ > @@ -399,7 +398,7 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport) > /* module 2 <--> remote anafa 2 */ > /* module 3 <--> remote anafa 1 */ > n->ch_anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 
3 : 2)); > - } else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) { > + } else if (is_spine_2012(spineport->node)) { > n->ch_type = ISR2012_CT; > n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; > /* this is a smart guess based on nodeguids order on sFB-12 module */ > @@ -408,7 +407,7 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport) > // module 2 <--> remote anafa 2 > // module 3 <--> remote anafa 1 > n->ch_anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2)); > - } else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) { > + } else if (is_spine_2004(spineport->node)) { > n->ch_type = ISR2004_CT; > n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; > n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; > @@ -423,19 +422,19 @@ static int get_router_slot(struct ibnd_node *node, ibnd_port_t * spineport) > static int get_slb_slot(ibnd_node_t * n, ibnd_port_t * spineport) > { > n->ch_slot = LINE_CS; > - if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) { > + if (is_spine_9096(spineport->node)) { > n->ch_type = ISR9096_CT; > n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; > n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; > - } else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) { > + } else if (is_spine_9288(spineport->node)) { > n->ch_type = ISR9288_CT; > n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; > n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; > - } else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) { > + } else if (is_spine_2012(spineport->node)) { > n->ch_type = ISR2012_CT; > n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; > n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; > - } else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) { > + } else if (is_spine_2004(spineport->node)) { > n->ch_type = ISR2004_CT; > n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; > n->ch_anafanum = 
anafa_line_slot_2_sfb4[spineport->portnum]; > @@ -454,12 +453,11 @@ static void voltaire_portmap(ibnd_port_t * port); > It could be optimized so, but time overhead is very small > and its only diag.util > */ > -static int fill_voltaire_chassis_record(struct ibnd_node *node) > +static int fill_voltaire_chassis_record(ibnd_node_t * node) > { > - ibnd_node_t *n = (ibnd_node_t *) node; > int p = 0; > ibnd_port_t *port; > - struct ibnd_node *remnode = 0; > + ibnd_node_t *remnode = 0; > > if (node->ch_found) /* somehow this node has already been passed */ > return (0); > @@ -470,25 +468,23 @@ static int fill_voltaire_chassis_record(struct ibnd_node *node) > /* in such case node->ports is actually a requested port... */ > if (is_router(node)) { > /* find the remote node */ > - for (p = 1; p <= node->node.numports; p++) { > - port = node->node.ports[p]; > - if (port && > - is_spine(CONV_NODE_INTERNAL > - (port->remoteport->node))) > + for (p = 1; p <= node->numports; p++) { > + port = node->ports[p]; > + if (port && is_spine(port->remoteport->node)) > get_router_slot(node, port->remoteport); > } > } else if (is_spine(node)) { > - for (p = 1; p <= node->node.numports; p++) { > - port = node->node.ports[p]; > + for (p = 1; p <= node->numports; p++) { > + port = node->ports[p]; > if (!port || !port->remoteport) > continue; > - remnode = CONV_NODE_INTERNAL(port->remoteport->node); > - if (remnode->node.type != IB_NODE_SWITCH) { > + remnode = port->remoteport->node; > + if (remnode->type != IB_NODE_SWITCH) { > if (!remnode->ch_found) > get_router_slot(remnode, port); > continue; > } > - if (!n->ch_type) > + if (!node->ch_type) > /* we assume here that remoteport belongs to line */ > if (get_sfb_slot(node, port->remoteport)) > return (-1); > @@ -497,20 +493,20 @@ static int fill_voltaire_chassis_record(struct ibnd_node *node) > } > > } else if (is_line(node)) { > - for (p = 1; p <= node->node.numports; p++) { > - port = node->node.ports[p]; > + for (p = 1; p <= node->numports; 
p++) { > + port = node->ports[p]; > if (!port || port->portnum > 12 || !port->remoteport) > continue; > /* we assume here that remoteport belongs to spine */ > - if (get_slb_slot(n, port->remoteport)) > + if (get_slb_slot(node, port->remoteport)) > return (-1); > break; > } > } > > /* for each port of this node, map external ports */ > - for (p = 1; p <= node->node.numports; p++) { > - port = node->node.ports[p]; > + for (p = 1; p <= node->numports; p++) { > + port = node->ports[p]; > if (!port) > continue; > voltaire_portmap(port); > @@ -534,8 +530,7 @@ static int get_spine_index(ibnd_node_t * node) > { > int retval; > > - if (is_spine_9288(CONV_NODE_INTERNAL(node)) > - || is_spine_2012(CONV_NODE_INTERNAL(node))) > + if (is_spine_9288(node) || is_spine_2012(node)) > retval = 3 * (node->ch_slotnum - 1) + node->ch_anafanum; > else > retval = node->ch_slotnum; > @@ -586,7 +581,7 @@ static int pass_on_lines_catch_spines(ibnd_chassis_t * chassis) > for (i = 1; i <= LINES_MAX_NUM; i++) { > node = chassis->linenode[i]; > > - if (!(node && is_line(CONV_NODE_INTERNAL(node)))) > + if (!(node && is_line(node))) > continue; /* empty slot or router */ > > for (p = 1; p <= node->numports; p++) { > @@ -596,7 +591,7 @@ static int pass_on_lines_catch_spines(ibnd_chassis_t * chassis) > > remnode = port->remoteport->node; > > - if (!CONV_NODE_INTERNAL(remnode)->ch_found) > + if (!remnode->ch_found) > continue; /* some error - spine not initialized ? FIXME */ > if (insert_spine(remnode, chassis)) > return (-1); > @@ -621,7 +616,7 @@ static int pass_on_spines_catch_lines(ibnd_chassis_t * chassis) > continue; > remnode = port->remoteport->node; > > - if (!CONV_NODE_INTERNAL(remnode)->ch_found) > + if (!remnode->ch_found) > continue; /* some error - line/router not initialized ? 
FIXME */ > if (insert_line_router(remnode, chassis)) > return (-1); > @@ -655,10 +650,10 @@ static void pass_on_spines_interpolate_chguid(ibnd_chassis_t * chassis) > in that chassis > chassis structure = structure of one standalone chassis > */ > -static int build_chassis(struct ibnd_node *node, ibnd_chassis_t * chassis) > +static int build_chassis(ibnd_node_t * node, ibnd_chassis_t * chassis) > { > int p = 0; > - struct ibnd_node *remnode = 0; > + ibnd_node_t *remnode = 0; > ibnd_port_t *port = 0; > > /* we get here with node = chassis_spine */ > @@ -666,16 +661,16 @@ static int build_chassis(struct ibnd_node *node, ibnd_chassis_t * chassis) > return (-1); > > /* loop: pass on all ports of node */ > - for (p = 1; p <= node->node.numports; p++) { > - port = node->node.ports[p]; > + for (p = 1; p <= node->numports; p++) { > + port = node->ports[p]; > if (!port || !port->remoteport) > continue; > - remnode = CONV_NODE_INTERNAL(port->remoteport->node); > + remnode = port->remoteport->node; > > if (!remnode->ch_found) > continue; /* some error - line or router not initialized ? 
FIXME */ > > - insert_line_router(&(remnode->node), chassis); > + insert_line_router(remnode, chassis); > } > > if (pass_on_lines_catch_spines(chassis)) > @@ -764,13 +759,11 @@ int int2ext_map_slb2024[2][25] = { > /* map internal ports to external ports if appropriate */ > static void voltaire_portmap(ibnd_port_t * port) > { > - struct ibnd_node *n = CONV_NODE_INTERNAL(port->node); > int portnum = port->portnum; > int chipnum = 0; > ibnd_node_t *node = port->node; > > - if (!n->ch_found || !is_line(CONV_NODE_INTERNAL(node)) > - || (portnum < 13 || portnum > 24)) { > + if (!node->ch_found || !is_line(node) || (portnum < 13 || portnum > 24)) { > port->ext_portnum = 0; > return; > } > @@ -782,9 +775,9 @@ static void voltaire_portmap(ibnd_port_t * port) > > chipnum = port->node->ch_anafanum - 1; > > - if (is_line_24(CONV_NODE_INTERNAL(node))) > + if (is_line_24(node)) > port->ext_portnum = int2ext_map_slb24[chipnum][portnum]; > - else if (is_line_2024(CONV_NODE_INTERNAL(node))) > + else if (is_line_2024(node)) > port->ext_portnum = int2ext_map_slb2024[chipnum][portnum]; > else > port->ext_portnum = int2ext_map_slb8[chipnum][portnum]; > @@ -828,7 +821,7 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node) > */ > int group_nodes(struct ibnd_fabric *fabric) > { > - struct ibnd_node *node; > + ibnd_node_t *node; > int dist; > int chassisnum = 0; > ibnd_chassis_t *chassis; > @@ -842,7 +835,7 @@ int group_nodes(struct ibnd_fabric *fabric) > /* not very efficient but clear code so... 
*/ > for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { > for (node = fabric->nodesdist[dist]; node; node = node->dnext) { > - if (mad_get_field(node->node.info, 0, > + if (mad_get_field(node->info, 0, > IB_NODE_VENDORID_F) == VTR_VENDOR_ID) > if (fill_voltaire_chassis_record(node)) > return (-1); > @@ -853,13 +846,11 @@ int group_nodes(struct ibnd_fabric *fabric) > /* algorithm: catch spine and find all surrounding nodes */ > for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { > for (node = fabric->nodesdist[dist]; node; node = node->dnext) { > - if (mad_get_field(node->node.info, 0, > + if (mad_get_field(node->info, 0, > IB_NODE_VENDORID_F) != VTR_VENDOR_ID) > continue; > - //if (!node->node.chrecord || node->node.chrecord->chassisnum || !is_spine(node)) > if (!node->ch_found > - || (node->node.chassis > - && node->node.chassis->chassisnum) > + || (node->chassis && node->chassis->chassisnum) > || !is_spine(node)) > continue; > if (add_chassis(fabric)) > @@ -874,10 +865,10 @@ int group_nodes(struct ibnd_fabric *fabric) > /* grouped by common SystemImageGUID */ > for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { > for (node = fabric->nodesdist[dist]; node; node = node->dnext) { > - if (mad_get_field(node->node.info, 0, > + if (mad_get_field(node->info, 0, > IB_NODE_VENDORID_F) == VTR_VENDOR_ID) > continue; > - if (mad_get_field64(node->node.info, 0, > + if (mad_get_field64(node->info, 0, > IB_NODE_SYSTEM_GUID_F)) { > chassis = > find_chassisguid(fabric, > @@ -901,10 +892,10 @@ int group_nodes(struct ibnd_fabric *fabric) > /* (defined as chassis->nodecount > 1) */ > for (dist = 0; dist <= MAXHOPS;) { > for (node = fabric->nodesdist[dist]; node; node = node->dnext) { > - if (mad_get_field(node->node.info, 0, > + if (mad_get_field(node->info, 0, > IB_NODE_VENDORID_F) == VTR_VENDOR_ID) > continue; > - if (mad_get_field64(node->node.info, 0, > + if (mad_get_field64(node->info, 0, > IB_NODE_SYSTEM_GUID_F)) { > chassis = 
> find_chassisguid(fabric, > diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > index b33be8d..b883d4a 100644 > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > @@ -98,18 +98,17 @@ static int get_port_info(struct ibmad_port *ibmad_port, > * Returns -1 if error. > */ > static int query_node_info(struct ibmad_port *ibmad_port, > - struct ibnd_fabric *fabric, struct ibnd_node *node, > + struct ibnd_fabric *fabric, ibnd_node_t * node, > ib_portid_t * portid) > { > - if (!smp_query_via(&(node->node.info), portid, IB_ATTR_NODE_INFO, 0, 0, > + if (!smp_query_via(&(node->info), portid, IB_ATTR_NODE_INFO, 0, 0, > ibmad_port)) > return -1; > > /* decode just a couple of fields for quicker reference. */ > - mad_decode_field(node->node.info, IB_NODE_GUID_F, &(node->node.guid)); > - mad_decode_field(node->node.info, IB_NODE_TYPE_F, &(node->node.type)); > - mad_decode_field(node->node.info, IB_NODE_NPORTS_F, > - &(node->node.numports)); > + mad_decode_field(node->info, IB_NODE_GUID_F, &(node->guid)); > + mad_decode_field(node->info, IB_NODE_TYPE_F, &(node->type)); > + mad_decode_field(node->info, IB_NODE_NPORTS_F, &(node->numports)); > > return (0); > } > @@ -118,15 +117,14 @@ static int query_node_info(struct ibmad_port *ibmad_port, > * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. 
> */ > static int query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric, > - struct ibnd_node *inode, struct ibnd_port *iport, > + ibnd_node_t * node, struct ibnd_port *iport, > ib_portid_t * portid) > { > int rc = 0; > - ibnd_node_t *node = &(inode->node); > ibnd_port_t *port = &(iport->port); > - void *nd = inode->node.nodedesc; > + void *nd = node->nodedesc; > > - if ((rc = query_node_info(ibmad_port, fabric, inode, portid)) != 0) > + if ((rc = query_node_info(ibmad_port, fabric, node, portid)) != 0) > return rc; > > port->portnum = mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F); > @@ -204,30 +202,30 @@ static int extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f, > } > > static void dump_endnode(ib_portid_t * path, char *prompt, > - struct ibnd_node *node, struct ibnd_port *port) > + ibnd_node_t * node, struct ibnd_port *port) > { > char type[64]; > if (!show_progress) > return; > > - mad_dump_node_type(type, 64, &(node->node.type), sizeof(int)); > - > - printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n", > - portid2str(path), prompt, type, node->node.guid, > - node->node.type == IB_NODE_SWITCH ? 0 : port->port.portnum, > - port->port.base_lid, > - port->port.base_lid + (1 << port->port.lmc) - 1, > - node->node.nodedesc); > + mad_dump_node_type(type, 64, &(node->type), sizeof(int)), > + printf("%s -> %s %s {%016" PRIx64 > + "} portnum %d base lid %d-%d\"%s\"\n", portid2str(path), > + prompt, type, node->guid, > + node->type == IB_NODE_SWITCH ? 
0 : port->port.portnum, > + port->port.base_lid, > + port->port.base_lid + (1 << port->port.lmc) - 1, > + node->nodedesc); > } > > -static struct ibnd_node *find_existing_node(struct ibnd_fabric *fabric, > - struct ibnd_node *new) > +static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric, > + ibnd_node_t * new) > { > - int hash = HASHGUID(new->node.guid) % HTSZ; > - struct ibnd_node *node; > + int hash = HASHGUID(new->guid) % HTSZ; > + ibnd_node_t *node; > > for (node = fabric->nodestbl[hash]; node; node = node->htnext) > - if (node->node.guid == new->node.guid) > + if (node->guid == new->guid) > return node; > > return NULL; > @@ -237,7 +235,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid) > { > struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); > int hash = HASHGUID(guid) % HTSZ; > - struct ibnd_node *node; > + ibnd_node_t *node; > > if (!fabric) { > IBND_DEBUG("fabric parameter NULL\n"); > @@ -245,7 +243,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid) > } > > for (node = f->nodestbl[hash]; node; node = node->htnext) > - if (node->node.guid == guid) > + if (node->guid == guid) > return (ibnd_node_t *) node; > > return NULL; > @@ -273,7 +271,6 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, > void *nd = node->nodedesc; > int p = 0; > struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); > - struct ibnd_node *n = CONV_NODE_INTERNAL(node); > > if (_check_ibmad_port(ibmad_port) < 0) > return (NULL); > @@ -288,36 +285,36 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, > return (NULL); > } > > - if (query_node_info(ibmad_port, f, n, &(n->node.path_portid))) > + if (query_node_info(ibmad_port, f, node, &(node->path_portid))) > return (NULL); > > - if (!smp_query_via(nd, &(n->node.path_portid), IB_ATTR_NODE_DESC, 0, 0, > + if (!smp_query_via(nd, &(node->path_portid), IB_ATTR_NODE_DESC, 0, 0, > ibmad_port)) > return (NULL); > > /* update all the port info's */ > - for 
(p = 1; p >= n->node.numports; p++) { > - get_port_info(ibmad_port, f, > - CONV_PORT_INTERNAL(n->node.ports[p]), p, > - &(n->node.path_portid)); > + for (p = 1; p >= node->numports; p++) { > + get_port_info(ibmad_port, f, CONV_PORT_INTERNAL(node->ports[p]), > + p, &(node->path_portid)); > } > > - if (n->node.type != IB_NODE_SWITCH) > + if (node->type != IB_NODE_SWITCH) > goto done; > > - if (!smp_query_via(portinfo_port0, &(n->node.path_portid), > - IB_ATTR_PORT_INFO, 0, 0, ibmad_port)) > + if (!smp_query_via > + (portinfo_port0, &(node->path_portid), IB_ATTR_PORT_INFO, 0, 0, > + ibmad_port)) > return (NULL); > > - n->node.smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F); > - n->node.smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F); > + node->smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F); > + node->smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F); > > - if (!smp_query_via(node->switchinfo, &(n->node.path_portid), > + if (!smp_query_via(node->switchinfo, &(node->path_portid), > IB_ATTR_SWITCH_INFO, 0, 0, ibmad_port)) > node->smaenhsp0 = 0; /* assume base SP0 */ > else > mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F, > - &n->node.smaenhsp0); > + &node->smaenhsp0); > > done: > return (node); > @@ -358,10 +355,9 @@ ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str) > return (rc); > } > > -static void add_to_nodeguid_hash(struct ibnd_node *node, > - struct ibnd_node *hash[]) > +static void add_to_nodeguid_hash(ibnd_node_t * node, ibnd_node_t * hash[]) > { > - int hash_idx = HASHGUID(node->node.guid) % HTSZ; > + int hash_idx = HASHGUID(node->guid) % HTSZ; > > node->htnext = hash[hash_idx]; > hash[hash_idx] = node; > @@ -376,9 +372,9 @@ static void add_to_portguid_hash(struct ibnd_port *port, > hash[hash_idx] = port; > } > > -static void add_to_type_list(struct ibnd_node *node, struct ibnd_fabric *fabric) > +static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric) > { > - switch 
(node->node.type) { > + switch (node->type) { > case IB_NODE_CA: > node->type_next = fabric->ch_adapters; > fabric->ch_adapters = node; > @@ -394,21 +390,21 @@ static void add_to_type_list(struct ibnd_node *node, struct ibnd_fabric *fabric) > } > } > > -static void add_to_nodedist(struct ibnd_node *node, struct ibnd_fabric *fabric) > +static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric *fabric) > { > - int dist = node->node.dist; > - if (node->node.type != IB_NODE_SWITCH) > + int dist = node->dist; > + if (node->type != IB_NODE_SWITCH) > dist = MAXHOPS; /* special Ca list */ > > node->dnext = fabric->nodesdist[dist]; > fabric->nodesdist[dist] = node; > } > > -static struct ibnd_node *create_node(struct ibnd_fabric *fabric, > - struct ibnd_node *temp, ib_portid_t * path, > - int dist) > +static ibnd_node_t *create_node(struct ibnd_fabric *fabric, > + ibnd_node_t * temp, ib_portid_t * path, > + int dist) > { > - struct ibnd_node *node; > + ibnd_node_t *node; > > node = malloc(sizeof(*node)); > if (!node) { > @@ -417,13 +413,13 @@ static struct ibnd_node *create_node(struct ibnd_fabric *fabric, > } > > memcpy(node, temp, sizeof(*node)); > - node->node.dist = dist; > - node->node.path_portid = *path; > + node->dist = dist; > + node->path_portid = *path; > > add_to_nodeguid_hash(node, fabric->nodestbl); > > /* add this to the all nodes list */ > - node->node.next = fabric->fabric.nodes; > + node->next = fabric->fabric.nodes; > fabric->fabric.nodes = (ibnd_node_t *) node; > > add_to_type_list(node, fabric); > @@ -432,26 +428,24 @@ static struct ibnd_node *create_node(struct ibnd_fabric *fabric, > return node; > } > > -static struct ibnd_port *find_existing_port_node(struct ibnd_node *node, > +static struct ibnd_port *find_existing_port_node(ibnd_node_t * node, > struct ibnd_port *port) > { > - if (port->port.portnum > node->node.numports > - || node->node.ports == NULL) > + if (port->port.portnum > node->numports || node->ports == NULL) > return (NULL); > 
> - return (CONV_PORT_INTERNAL(node->node.ports[port->port.portnum])); > + return (CONV_PORT_INTERNAL(node->ports[port->port.portnum])); > } > > static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric, > - struct ibnd_node *node, > + ibnd_node_t * node, > struct ibnd_port *temp) > { > struct ibnd_port *port; > > - if (node->node.ports == NULL) { > - node->node.ports = > - calloc(sizeof(*node->node.ports), node->node.numports + 1); > - if (!node->node.ports) { > + if (node->ports == NULL) { > + node->ports = calloc(sizeof(*node->ports), node->numports + 1); > + if (!node->ports) { > IBND_ERROR("Failed to allocate the ports array\n"); > return (NULL); > } > @@ -467,20 +461,19 @@ static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric, > port->port.node = (ibnd_node_t *) node; > port->port.ext_portnum = 0; > > - node->node.ports[temp->port.portnum] = (ibnd_port_t *) port; > + node->ports[temp->port.portnum] = (ibnd_port_t *) port; > > add_to_portguid_hash(port, fabric->portstbl); > return port; > } > > -static void link_ports(struct ibnd_node *node, struct ibnd_port *port, > - struct ibnd_node *remotenode, > - struct ibnd_port *remoteport) > +static void link_ports(ibnd_node_t * node, struct ibnd_port *port, > + ibnd_node_t * remotenode, struct ibnd_port *remoteport) > { > IBND_DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 > - " %p->%p:%u\n", node->node.guid, node, port, > - port->port.portnum, remotenode->node.guid, remotenode, > - remoteport, remoteport->port.portnum); > + " %p->%p:%u\n", node->guid, node, port, port->port.portnum, > + remotenode->guid, remotenode, remoteport, > + remoteport->port.portnum); > if (port->port.remoteport) > port->port.remoteport->remoteport = NULL; > if (remoteport->port.remoteport) > @@ -490,14 +483,14 @@ static void link_ports(struct ibnd_node *node, struct ibnd_port *port, > } > > static int get_remote_node(struct ibmad_port *ibmad_port, > - struct ibnd_fabric *fabric, struct ibnd_node *node, > + 
struct ibnd_fabric *fabric, ibnd_node_t * node, > struct ibnd_port *port, ib_portid_t * path, > int portnum, int dist) > { > int rc = 0; > - struct ibnd_node node_buf; > + ibnd_node_t node_buf; > struct ibnd_port port_buf; > - struct ibnd_node *remotenode, *oldnode; > + ibnd_node_t *remotenode, *oldnode; > struct ibnd_port *remoteport, *oldport; > > memset(&node_buf, 0, sizeof(node_buf)); > @@ -554,9 +547,9 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, > int rc = 0; > struct ibnd_fabric *fabric = NULL; > ib_portid_t my_portid = { 0 }; > - struct ibnd_node node_buf; > + ibnd_node_t node_buf; > struct ibnd_port port_buf; > - struct ibnd_node *node; > + ibnd_node_t *node; > struct ibnd_port *port; > int i; > int dist = 0; > @@ -605,7 +598,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, > goto error; > > rc = get_remote_node(ibmad_port, fabric, node, port, from, > - mad_get_field(node->node.info, 0, > + mad_get_field(node->info, 0, > IB_NODE_LOCAL_PORT_F), 0); > if (rc < 0) > goto error; > @@ -616,13 +609,13 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, > > for (node = fabric->nodesdist[dist]; node; node = node->dnext) { > > - path = &node->node.path_portid; > + path = &node->path_portid; > > IBND_DEBUG("dist %d node %p\n", dist, node); > dump_endnode(path, "processing", node, port); > > - for (i = 1; i <= node->node.numports; i++) { > - if (i == mad_get_field(node->node.info, 0, > + for (i = 1; i <= node->numports; i++) { > + if (i == mad_get_field(node->info, 0, > IB_NODE_LOCAL_PORT_F)) > continue; > > @@ -644,9 +637,9 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, > goto error; > > /* If switch, set port GUID to node port GUID */ > - if (node->node.type == IB_NODE_SWITCH) { > + if (node->type == IB_NODE_SWITCH) { > port->port.guid = > - mad_get_field64(node->node.info, 0, > + mad_get_field64(node->info, 0, > IB_NODE_PORT_GUID_F); > } > > @@ -666,14 +659,14 @@ error: > 
return (NULL); > } > > -static void destroy_node(struct ibnd_node *node) > +static void destroy_node(ibnd_node_t * node) > { > int p = 0; > > - for (p = 0; p <= node->node.numports; p++) { > - free(node->node.ports[p]); > + for (p = 0; p <= node->numports; p++) { > + free(node->ports[p]); > } > - free(node->node.ports); > + free(node->ports); > free(node); > } > > @@ -681,8 +674,8 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric) > { > struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); > int dist = 0; > - struct ibnd_node *node = NULL; > - struct ibnd_node *next = NULL; > + ibnd_node_t *node = NULL; > + ibnd_node_t *next = NULL; > ibnd_chassis_t *ch, *ch_next; > > if (!fabric) > @@ -747,8 +740,8 @@ void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func, > int node_type, void *user_data) > { > struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); > - struct ibnd_node *list = NULL; > - struct ibnd_node *cur = NULL; > + ibnd_node_t *list = NULL; > + ibnd_node_t *cur = NULL; > > if (!fabric) { > IBND_DEBUG("fabric parameter NULL\n"); > diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h > index 38555a0..449bd70 100644 > --- a/infiniband-diags/libibnetdisc/src/internal.h > +++ b/infiniband-diags/libibnetdisc/src/internal.h > @@ -49,18 +49,6 @@ > #define IBND_ERROR(fmt, ...) 
\ > fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__) > > -struct ibnd_node { > - /* This member MUST BE FIRST */ > - ibnd_node_t node; > - > - /* internal use only */ > - unsigned char ch_found; > - struct ibnd_node *htnext; /* hash table list */ > - struct ibnd_node *dnext; /* nodesdist next */ > - struct ibnd_node *type_next; /* next based on type */ > -}; > -#define CONV_NODE_INTERNAL(node) ((struct ibnd_node *)node) > - > struct ibnd_port { > /* This member MUST BE FIRST */ > ibnd_port_t port; > @@ -79,15 +67,15 @@ struct ibnd_fabric { > ibnd_fabric_t fabric; > > /* internal use only */ > - struct ibnd_node *nodestbl[HTSZ]; > + ibnd_node_t *nodestbl[HTSZ]; > struct ibnd_port *portstbl[HTSZ]; > - struct ibnd_node *nodesdist[MAXHOPS + 1]; > + ibnd_node_t *nodesdist[MAXHOPS + 1]; > ibnd_chassis_t *first_chassis; > ibnd_chassis_t *current_chassis; > ibnd_chassis_t *last_chassis; > - struct ibnd_node *switches; > - struct ibnd_node *ch_adapters; > - struct ibnd_node *routers; > + ibnd_node_t *switches; > + ibnd_node_t *ch_adapters; > + ibnd_node_t *routers; > ib_portid_t selfportid; > }; > #define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric) > -- > 1.5.4.5 > From sashak at voltaire.com Sun Aug 16 03:41:14 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 13:41:14 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix CHECK_AND_SET_VAL macro In-Reply-To: References: <20090815134624.GA25048@comcast.net> <20090816100245.GJ25501@me> Message-ID: <20090816104114.GR25501@me> On 06:20 Sun 16 Aug , Hal Rosenstock wrote: > On Sun, Aug 16, 2009 at 6:02 AM, Sasha Khapyorsky wrote: > > > On 09:46 Sat 15 Aug , Hal Rosenstock wrote: > > > > > > Changed check from > to != since using integer comparison > > > and some masks can use full range and hence be negative > > > > Any example? > > > Pkey for one. Will you pass negative value of PKey? 
Sasha From hal.rosenstock at gmail.com Sun Aug 16 03:56:37 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 16 Aug 2009 06:56:37 -0400 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix CHECK_AND_SET_VAL macro In-Reply-To: <20090816104114.GR25501@me> References: <20090815134624.GA25048@comcast.net> <20090816100245.GJ25501@me> <20090816104114.GR25501@me> Message-ID: On Sun, Aug 16, 2009 at 6:41 AM, Sasha Khapyorsky wrote: > On 06:20 Sun 16 Aug , Hal Rosenstock wrote: > > On Sun, Aug 16, 2009 at 6:02 AM, Sasha Khapyorsky wrote: > > > > > On 09:46 Sat 15 Aug , Hal Rosenstock wrote: > > > > > > > > Changed check from > to != since using integer comparison > > > > and some masks can use full range and hence be negative > > > > > > Any example? > > > > > > Pkey for one. > > Will you pass negative value of PKey? Aren't Pkeys 0x8001 - 0xffff valid? -- Hal > > > Sasha From sashak at voltaire.com Sun Aug 16 04:02:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 14:02:00 +0300 Subject: [ofa-general] Re: [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a context object. In-Reply-To: <20090813204306.dffc3237.weiny2@llnl.gov> References: <20090813204306.dffc3237.weiny2@llnl.gov> Message-ID: <20090816110200.GS25501@me> On 20:43 Thu 13 Aug , Ira Weiny wrote: > > From: Ira Weiny > Date: Thu, 13 Aug 2009 20:16:01 -0700 > Subject: [PATCH] infiniband-diags/libibnetdisc: Introduce a context object. > > This object must be created before query functions can be used and is > used to control the functionality of the queries. Why is it needed? I see that it complicates the API, but what are the benefits?
Sasha From hnrose at comcast.net Sun Aug 16 04:02:24 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 16 Aug 2009 07:02:24 -0400 Subject: [ofa-general] [PATCH] infiniband-diags/saquery.c: Allow pkey and qkey to be hidden Message-ID: <20090816110224.GA23535@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 313f9a7..3a35aa7 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -1543,6 +1543,10 @@ static int process_opt(void *context, int ch, char *optarg) p->numb_path = strtoul(optarg, NULL, 0); break; case 18: + if (!isxdigit(*optarg) && !(optarg = getpass("P_Key: "))) { + fprintf(stderr, "cannot get P_Key\n"); + ibdiag_show_usage(); + } p->pkey = (uint16_t) strtoul(optarg, NULL, 0); break; case 'Q': @@ -1561,6 +1565,10 @@ static int process_opt(void *context, int ch, char *optarg) p->pkt_life = (uint8_t) strtoul(optarg, NULL, 0); break; case 'q': + if (!isxdigit(*optarg) && !(optarg = getpass("Q_Key: "))) { + fprintf(stderr, "cannot get Q_Key\n"); + ibdiag_show_usage(); + } p->qkey = strtoul(optarg, NULL, 0); break; case 'T': @@ -1637,7 +1645,9 @@ int main(int argc, char **argv) {"mgid", 17, 1, "", "Multicast GID (MCMemberRecord)"}, {"reversible", 'r', 1, NULL, "Reversible path (PathRecord)"}, {"numb_path", 'n', 1, NULL, "Number of paths (PathRecord)"}, - {"pkey", 18, 1, NULL, "P_Key (PathRecord, MCMemberRecord)"}, + {"pkey", 18, 1, NULL, "P_Key (PathRecord, MCMemberRecord)." 
+ " If non-numeric value (like 'x') is specified then" + " saquery will prompt for a value"}, {"qos_class", 'Q', 1, NULL, "QoS Class (PathRecord)"}, {"sl", 19, 1, NULL, "Service level (PathRecord, MCMemberRecord)"}, @@ -1647,7 +1657,9 @@ int main(int argc, char **argv) "Rate and selector (PathRecord, MCMemberRecord)"}, {"pkt_lifetime", 20, 1, NULL, "Packet lifetime and selector (PathRecord, MCMemberRecord)"}, - {"qkey", 'q', 1, NULL, "Q_Key (MCMemberRecord)"}, + {"qkey", 'q', 1, NULL, "Q_Key (MCMemberRecord)." + " If non-numeric value (like 'x') is specified then" + " saquery will prompt for a value"}, {"tclass", 'T', 1, NULL, "Traffic Class (PathRecord, MCMemberRecord)"}, {"flow_label", 'F', 1, NULL, From sashak at voltaire.com Sun Aug 16 04:21:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 14:21:25 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix CHECK_AND_SET_VAL macro In-Reply-To: References: <20090815134624.GA25048@comcast.net> <20090816100245.GJ25501@me> <20090816104114.GR25501@me> Message-ID: <20090816112125.GT25501@me> On 06:56 Sun 16 Aug , Hal Rosenstock wrote: > On Sun, Aug 16, 2009 at 6:41 AM, Sasha Khapyorsky wrote: > > > On 06:20 Sun 16 Aug , Hal Rosenstock wrote: > > > On Sun, Aug 16, 2009 at 6:02 AM, Sasha Khapyorsky > >wrote: > > > > > > > On 09:46 Sat 15 Aug , Hal Rosenstock wrote: > > > > > > > > > > Changed check from > to != since using integer comparison > > > > > and some masks can use full range and hence be negative > > > > > > > > Any example? > > > > > > > > > Pkey for one. > > > > Will you pass negative value of PKey? > > > Aren't Pkeys 0x8001 - 0xffff valid ? Yes, they are valid. And now I'm starting to understand what you were talking about: during the check the value is converted to int16_t, which yields a negative number. A more detailed patch description would be helpful.
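A minimal sketch of the effect described above, not the actual saquery macro: the helper names and the `unset` marker are hypothetical, and the only assumption taken from the thread is that the value is narrowed to int16_t before it is compared. On the usual two's-complement targets a P_Key such as 0xffff then reads as -1, so a `>` test drops it while `!=` keeps it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not the saquery macro itself: both narrow the
 * value to int16_t before comparing it with the "unset" marker. */
static int set_if_greater(uint16_t val, int16_t unset)
{
	return (int16_t)val > unset;	/* old-style check */
}

static int set_if_different(uint16_t val, int16_t unset)
{
	return (int16_t)val != unset;	/* check after the fix */
}

/* Usage: a P_Key with the top bit set is dropped by ">" but kept by "!=".
 * Assumes the common wrap-around conversion to int16_t. */
static void pkey_check_demo(void)
{
	assert(!set_if_greater(0xffff, 0));
	assert(set_if_different(0xffff, 0));
}
```

Both helpers agree for values up to 0x7fff; they only diverge once the top bit is set, which is the 0x8001 - 0xffff range mentioned in the thread.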
Sasha From sashak at voltaire.com Sun Aug 16 04:22:14 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 14:22:14 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Fix CHECK_AND_SET_VAL macro In-Reply-To: <20090815134624.GA25048@comcast.net> References: <20090815134624.GA25048@comcast.net> Message-ID: <20090816112214.GU25501@me> On 09:46 Sat 15 Aug , Hal Rosenstock wrote: > > Changed check from > to != since using integer comparison > and some masks can use full range and hence be negative > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 16 04:28:50 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 14:28:50 +0300 Subject: [ofa-general] Re: [PATCH 2/5] libibnetdisc: make all fields of ibnd_port_t public In-Reply-To: <20090813204246.59efeb5e.weiny2@llnl.gov> References: <20090813204246.59efeb5e.weiny2@llnl.gov> Message-ID: <20090816112850.GV25501@me> On 20:42 Thu 13 Aug , Ira Weiny wrote: > > From: Ira Weiny > Date: Thu, 13 Aug 2009 19:54:00 -0700 > Subject: [PATCH] libibnetdisc: make all fields of ibnd_port_t public > > > Signed-off-by: Ira Weiny Applied. Thanks. 
Sasha From sashak at voltaire.com Sun Aug 16 04:41:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 14:41:27 +0300 Subject: [ofa-general] Re: [PATCH 3/5] libibnetdisc: make all fields of ibnd_fabric_t public In-Reply-To: <20090813204251.df6446c1.weiny2@llnl.gov> References: <20090813204251.df6446c1.weiny2@llnl.gov> Message-ID: <20090816114127.GW25501@me> On 20:42 Thu 13 Aug , Ira Weiny wrote: > > @@ -108,8 +107,8 @@ typedef struct ibnd_port { > /** ========================================================================= > * Chassis > */ > -typedef struct chassis { > - struct chassis *next; > +typedef struct ibnd_chassis { > + struct ibnd_chassis *next; > uint64_t chassisguid; > unsigned char chassisnum; > > @@ -124,11 +123,17 @@ typedef struct chassis { > ibnd_node_t *linenode[LINES_MAX_NUM + 1]; > } ibnd_chassis_t; > > +/* HASH table defines */ > +#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) Why should this macro be published (by moving it from internal.h to ibnetdisc.h)? As far as I can see it is only used in ibnetdisc.c, so we can keep it internal and move it to that file. Sasha From sashak at voltaire.com Sun Aug 16 04:42:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Aug 2009 14:42:27 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Allow pkey and qkey to be hidden In-Reply-To: <20090816110224.GA23535@comcast.net> References: <20090816110224.GA23535@comcast.net> Message-ID: <20090816114227.GX25501@me> On 07:02 Sun 16 Aug , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks.
Sasha From yevgenyp at mellanox.co.il Sun Aug 16 06:12:54 2009 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Sun, 16 Aug 2009 16:12:54 +0300 Subject: [ofa-general][PATCH] mlx4_core: Avoid double icms free Message-ID: <4A8805D6.10803@mellanox.co.il> On cleanup flow on init_hca, the function calls close_hca(), followed by free_icms() and UNMAP_FA(). Both those functions are also called from close_hca(). Signed-off-by: Yevgeny Petrilin --- drivers/net/mlx4/main.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index dac621b..a0a52f1 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -786,7 +786,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev) return 0; err_close: - mlx4_close_hca(dev); + mlx4_CLOSE_HCA(dev, 0); err_free_icm: mlx4_free_icms(dev); -- 1.6.0 From bart.vanassche at gmail.com Sun Aug 16 08:51:55 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Sun, 16 Aug 2009 17:51:55 +0200 Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock (cancel_delayed_work inside spinlock) In-Reply-To: References: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com> Message-ID: On Sat, Aug 15, 2009 at 12:15 AM, Roland Dreier wrote: > How about this approach?  Basically it just open-codes delayed work by > splitting the timer and the work struct, and switches to mod_timer() > instead of del_timer() + add_timer().  It passes very light testing here > (basically I started ipoib and nothing blew up). [ ... ] Update: after two days of stress testing, still no lockdep complaints. So it seems like the posted patch solves this issue. Thanks ! Bart. From sdake at redhat.com Sun Aug 16 17:27:55 2009 From: sdake at redhat.com (Steven Dake) Date: Sun, 16 Aug 2009 17:27:55 -0700 Subject: [ofa-general] librdmacm - okay to select on a cm channel's file descriptor? 
In-Reply-To: <20090815225538.ABA412391C7@ece06.nas.nasa.gov> References: <20090815225538.ABA412391C7@ece06.nas.nasa.gov> Message-ID: <1250468875.19265.13.camel@localhost.localdomain> On Sat, 2009-08-15 at 15:55 -0700, Bryan Green wrote: > Hi, > > I'm using librdmacm for connection management (on Linux). > > In an attempt to get unexpected DISCONNECT notifications during > ib communication, I'm trying to use 'select()' on the cm channel's file > descriptor, testing it for readability. I've found that this works some of > the time, but not all of the time. > What I have done is the following: struct rdma_event_channel *mcast_channel; mcast_channel = rdma_create_event_channel(); then select/poll on mcast_channel->fd and call my connection manager event handler when there is a new event. My event handler looks like: res = rdma_get_cm_event (mcast_channel, &event); switch (event->event) { } rdma_ack_cm_event (event); This ack_cm_event removes the event from the file descriptor. It isn't clear from your message if this is what you're doing, but this works for me. Note I am using UD mode so I am not certain if there may be a bug in disconnect events (since UD doesn't generate these) for your application. > Is this a legitimate way to test for disconnections, or am I required to > either make the descriptor nonblocking and just poll, or use a background > thread for receiving cm events? I'd rather not use the nonblocking > approach, because I'd like to simultaneously select on the cm channel > descriptor and an ibv_comp_channel descriptor. I'm not sure if > selecting on the ibv_comp_channel descriptor is acceptable either, but it > appears to work. > Selecting on ibv_comp_channel created descriptors works for me for the completion queue events only. Regards -steve > > I'd appreciate it if anyone can enlighten me on this.
> > Thanks, > -bryan From rdreier at cisco.com Sun Aug 16 20:47:02 2009 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 16 Aug 2009 20:47:02 -0700 Subject: [ofa-general] librdmacm - okay to select on a cm channel's file descriptor? In-Reply-To: <20090815225538.ABA412391C7@ece06.nas.nasa.gov> (Bryan Green's message of "Sat, 15 Aug 2009 15:55:38 -0700") References: <20090815225538.ABA412391C7@ece06.nas.nasa.gov> Message-ID: > In an attempt to get unexpected DISCONNECT notifications during > ib communication, I'm trying to use 'select()' on the cm channel's file > descriptor, testing it for readability. I've found that this works some of > the time, but not all of the time. What happens when it doesn't work? select() doesn't give you an event but when you try to read there actually is an event there? I took a quick look at the ucma kernel code and the implementation of select() (the kernel uses poll() as the name but it all ends up in the same code) looks straightforwardly correct -- there's only one place where events are added to the queue for a file, and that place wakes up the poll wait queue. But maybe there is a funny bug somehow. - R.
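For readers following this thread, here is a rough sketch of waiting on two descriptors at once, which is the pattern being asked about. It is only an illustration under stated assumptions: `wait_channels`, `CM_READY` and `CQ_READY` are made-up names, and the code uses plain poll() on whatever descriptors it is handed rather than calling librdmacm or libibverbs itself.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define CM_READY 0x1	/* connection-manager descriptor is readable */
#define CQ_READY 0x2	/* completion-channel descriptor is readable */

/* Hypothetical helper: wait on both descriptors with one poll() call.
 * Returns a CM_READY/CQ_READY mask, 0 on timeout, -1 if poll() fails. */
static int wait_channels(int cm_fd, int cq_fd, int timeout_ms)
{
	struct pollfd fds[2] = {
		{ .fd = cm_fd, .events = POLLIN },
		{ .fd = cq_fd, .events = POLLIN },
	};
	int mask = 0;
	int n;

	assert(cm_fd >= 0 && cq_fd >= 0);

	n = poll(fds, 2, timeout_ms);
	if (n < 0)
		return -1;
	if (fds[0].revents & POLLIN)
		mask |= CM_READY;
	if (fds[1].revents & POLLIN)
		mask |= CQ_READY;
	return mask;
}
```

In an application, cm_fd would presumably be the fd member of the struct returned by rdma_create_event_channel() and cq_fd the fd member of the ibv_comp_channel. When CM_READY is reported, the event still has to be fetched with rdma_get_cm_event() and released with rdma_ack_cm_event(), as in Steven's reply; the descriptor stays readable until pending events are consumed.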
From niftyompi at niftyegg.com Sun Aug 16 21:44:37 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Sun, 16 Aug 2009 21:44:37 -0700 Subject: [ofa-general] Manipulating Credits in Infiniband In-Reply-To: References: <20090812023759.GA3060@tosh2egg.ca.sanfran.comcast.net> Message-ID: <20090817044437.GA3592@compegg> On Thu, Aug 13, 2009 at 02:41:37AM -0400, Ashwath Narasimhan wrote: > > Dear Tom/all > > I understand the end to end credit based flow control at the link > layer where we have a 32 bit Flow control packet being sent for each VL > (with FCCL and FCTBS fields) but I fail to understand where this scheme > is implemented in the driver. (OFED linux- 1.4 stack, hw-mthca) . I can > see a file with a credit table mapped to different credits counts and > another that computes the AETH based on this credit table. > > 1. Is this the place where the flow control packets are formulated? If you do some back of the envelope computations you will find that much of the low level flow control must be done in a firmware/hardware state machine. The maximum interrupt service rate and the maximum IB packet rates are not even close. Thus you will not find it in the driver. So as you scan the Mellanox driver you will discover a hand off from the driver to the firmware. In some cases the driver will initialize the link layer and you will see this. You might see it in the error recovery/reset part of the driver but for hw-mthca I think it is well hidden in firmware. Error recovery is one place to look because it might need to restore the credit balance so data can flow. Without credit data does not flow. > 2. If yes, I don't see them computing this for each VL. why? If no, is > it a mid layer flow control? VL's are interesting, the IB specification is full of may, might, optional, future, etc and as such most hardware does the minimum with VL. This is changing. One valuable thing to research is the other credit based link level interfaces on common modern hardware.
i.e. AMD uses this on their HT links. See also ATM links, Fibre Channel, PCIe... Also identify management packets, reliable and unreliable transport. > 3. And thats why I have this basic question--> is the link layer > implemented as part of OFED stack at all? or does it go into the > hardware HCA as firmware? As I understand the hardware vendor only > provides verbs to communicate with the HCA. Link layer is 99 and 44/100% hardware. > Pardon me if i am bundling you all with a lot with questions. I am new > to all this and I am trying my best to understand the stack. You might compare and contrast the QLogic drivers and the Mellanox drivers. The hardware design is very different. To that point the older QLogic hardware (Pathscale) has no firmware in the way that Mellanox does. This will let you see informative learning differences. In general there is no need to manipulate credits unless you are designing hardware or you are a hardware vendor. > Thank you, > > Ashwath > > On Tue, Aug 11, 2009 at 10:37 PM, Nifty Tom Mitchell > <[1]niftyompi at niftyegg.com> wrote: > > On Mon, Aug 10, 2009 at 12:11:22PM -0400, Ashwath Narasimhan wrote: > > > > I looked into the infiniband driver files. As I understand, in > order to > > limit the data rate we manipulate the credits on either ends. > Since the > > number of credits available depends on the receiver's work receive > > queue size, I decided to limit the queue size to say 5 instead of > 8192 > > (reference---> ipoib.h, IPOIB_MAX_QUEUE_SIZE to say 3 since my > higher > > layer protocol is ipoib). I just want to confirm if I am doing the > > right thing? > > Data rate is not manipulated by credits. > Credits and queue sizes are different and have different purposes. > Visit the Infiniband Trade Association web site and grab the IB > specifications to understand some of the hardware level parts. 
> [2]http://www.infinibandta.org/ > InfiniBand offers credit based flow control and given the nature of > modern IB switches and processors a very small credit count can > still > result in full data rate. Having said that flow control is the > lowest > level throttle in the system. Reducing the credit count forces the > higher levels in the protocol stack to source or sink the data > through > the hardware before any more can be delivered. Thus flow control > can > simplify the implementation of higher level protocols. It can also > be used > to cost reduce or simplify hardware design (smaller hardware > buffers). > The IB specifications are way too long. Start with this FAQ. > > [3]http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf > The IB specification is way too full of optional features. A vendor > may > have XYZ working fine and dandy on one card and since it is optional > not > at all on another. > The various queue sizes for the various protocols built on top of > IB establish transfer behavior in keeping with system interrupt, > system process time slice, system kernel activity loads and needs. > It is counter intuitive but in some cases small queues result in > more responsive and agile systems, especially in the presence of > errors. > Since there are often multiple protocols on the IB stack all > protocols > will be impacted by credit tinkering. Most vendors know their > hardware > so most drivers will have credit related code optimum. > In the case of TCP/IP the interaction between IB bandwidths&MTU > (IPoIB), > ethernet bandwidth&MTU and even localhost (127.0.0.1) bandwidth&MTU > can > be "interesting" depending on host names, subnets, routing etc. > TCP/IP > has lots of tuning flags well above the IB driver. I see 500+ > net.* > sysctl knobs on this system. > As you change things do make the changes on all the moving parts, > benchmark > and keep a log. 
Since there are multiple IB hardware vendors > it is important to track hardware specifics. "lspci" is a good tool > to gather chip info. With some cards you also need specifics about > the active firmware. > So go forth (RPN forever) and conquer. > -- > T o m M i t c h e l l > Found me a new hat, now what? > > -- > regards, > Ashwath > > References > > 1. mailto:niftyompi at niftyegg.com > 2. http://www.infinibandta.org/ > 3. http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf -- T o m M i t c h e l l Found me a new hat, now what? From vlad at lists.openfabrics.org Mon Aug 17 03:01:21 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 17 Aug 2009 03:01:21 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090817-0200 daily build status Message-ID: <20090817100122.21366E61C00@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 

Failed:

Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit'
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit':
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit'
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: ***
[/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport'
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport'
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090817-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From Robert at saq.co.uk Mon Aug 17 03:41:13 2009
From: Robert at saq.co.uk (Robert Dunkley)
Date: Mon, 17 Aug 2009 11:41:13 +0100
Subject: [ofa-general] OFED on 2.6.18 Xen Kernel
Message-ID:

Hi Everyone,

I haven't even got as far as trying this yet but does OFED work on the
Xen.org 2.6.18 kernel? Should all the Infiniband options be left
disabled in the menu config?

Thanks,

Rob

The SAQ Group
Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
SAQ is the trading name of SEMTEC Limited. Registered in England & Wales
Company Number: 06481952
http://www.saqnet.co.uk AS29219
SAQ Group Delivers high quality, honestly priced communication and I.T. services to UK Business.
Broadband : Domains : Email : Hosting : CoLo : Servers : Racks : Transit : Backups : Managed Networks : Remote Support.
ISPA Member

From mdidomenico4 at gmail.com Mon Aug 17 05:16:59 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 08:16:59 -0400
Subject: [ofa-general] change mtu
Message-ID:

How do I change the MTU of an MT23108 card? I have an AMD 8131 chipset
server that needs this turned down below 1500, or at least that's
what's suspected.

From dotanba at gmail.com Mon Aug 17 05:48:06 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Mon, 17 Aug 2009 15:48:06 +0300
Subject: [ofa-general] change mtu
In-Reply-To:
References:
Message-ID: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>

Hi.

Which MTU are you trying to change?
(of the IB link?)

Dotan

On Mon, Aug 17, 2009 at 3:16 PM, Michael Di Domenico wrote:
> How do I change the MTU of an MT23108 card? I have an AMD 8131
> chipset server that needs this turned down below 1500, or at least
> that's what's suspected.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From mdidomenico4 at gmail.com Mon Aug 17 05:52:24 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 08:52:24 -0400
Subject: [ofa-general] change mtu
In-Reply-To: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID:

Yes, for the IB card itself, not IPoIB

On Mon, Aug 17, 2009 at 8:48 AM, Dotan Barak wrote:
> Hi.
>
> Which MTU are you trying to change?
> (of the IB link?)
>
> Dotan
>
> On Mon, Aug 17, 2009 at 3:16 PM, Michael Di Domenico wrote:
>> How do I change the MTU of an MT23108 card? I have an AMD 8131
>> chipset server that needs this turned down below 1500, or at least
>> that's what's suspected.

From hal.rosenstock at gmail.com Mon Aug 17 05:58:12 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Aug 2009 08:58:12 -0400
Subject: [ofa-general] change mtu
In-Reply-To:
References:
Message-ID:

On Mon, Aug 17, 2009 at 8:16 AM, Michael Di Domenico wrote:
> How do I change the MTU of an MT23108 card?

Is this for 23108 <-> 23108 communication (and wanting a PathRecord with
performance optimized smaller MTU)? Are you using OpenSM?

-- Hal

> I have an AMD 8131
> chipset server that needs this turned down below 1500, or at least
> that's what's suspected.

From hal.rosenstock at gmail.com Mon Aug 17 06:00:13 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Aug 2009 09:00:13 -0400
Subject: [ofa-general] change mtu
In-Reply-To:
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID:

On Mon, Aug 17, 2009 at 8:52 AM, Michael Di Domenico wrote:
> Yes, for the IB card itself, not IPoIB

The MTUCap of the card can affect the NeighborMTU which in turn can
affect IPoIB MTU (for UD mode).

-- Hal

> On Mon, Aug 17, 2009 at 8:48 AM, Dotan Barak wrote:
> > Hi.
> >
> > Which MTU are you trying to change?
> > (of the IB link?)
> >
> > Dotan
> >
> > On Mon, Aug 17, 2009 at 3:16 PM, Michael Di Domenico wrote:
> >> How do I change the MTU of an MT23108 card? I have an AMD 8131
> >> chipset server that needs this turned down below 1500, or at least
> >> that's what's suspected.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From dotanba at gmail.com Mon Aug 17 06:02:15 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Mon, 17 Aug 2009 16:02:15 +0300
Subject: [ofa-general] change mtu
In-Reply-To:
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID: <2f3bf9a60908170602pa7ce57esbbdd1daf74cf373c@mail.gmail.com>

On Mon, Aug 17, 2009 at 3:52 PM, Michael Di Domenico wrote:
> Yes, for the IB card itself, not IPoIB
>

Why do you want to do it?
(Do you know that even if the MTU of the link is 2K you can connect the
QPs to use MTU of 1K between them?)

Dotan

From mdidomenico4 at gmail.com Mon Aug 17 06:02:25 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 09:02:25 -0400
Subject: [ofa-general] change mtu
In-Reply-To:
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID:

Yes, I have a group of machines with 8131 and MT23108 cards.
Supposedly, based on the AMD errata i need to turn the cards' MTU down
from 2048 to below 1500

No i'm not running OpenSM, i'd prefer to do this at the machine level
if possible rather than the fabric level.

On Mon, Aug 17, 2009 at 9:00 AM, Hal Rosenstock wrote:
> On Mon, Aug 17, 2009 at 8:52 AM, Michael Di Domenico wrote:
>>
>> Yes, for the IB card itself, not IPoIB
>
> The MTUCap of the card can affect the NeighborMTU which in turn can
> affect IPoIB MTU (for UD mode).
>
> -- Hal
>
>> On Mon, Aug 17, 2009 at 8:48 AM, Dotan Barak wrote:
>> > Hi.
>> >
>> > Which MTU are you trying to change?
>> > (of the IB link?)
>> >
>> > Dotan
>> >
>> > On Mon, Aug 17, 2009 at 3:16 PM, Michael Di Domenico wrote:
>> >> How do I change the MTU of an MT23108 card? I have an AMD 8131
>> >> chipset server that needs this turned down below 1500, or at least
>> >> that's what's suspected.

From dotanba at gmail.com Mon Aug 17 06:08:24 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Mon, 17 Aug 2009 16:08:24 +0300
Subject: [ofa-general] change mtu
In-Reply-To:
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID: <2f3bf9a60908170608jeeee0dcx2ed693eb0744c31@mail.gmail.com>

Are you sure that AMD wants you to lower the MTU of the IB fabric?
(IB doesn't support an MTU of 1500 bytes, only Ethernet does)

Dotan

From hal.rosenstock at gmail.com Mon Aug 17 06:09:50 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Aug 2009 09:09:50 -0400
Subject: [ofa-general] change mtu
In-Reply-To:
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID:

On Mon, Aug 17, 2009 at 9:02 AM, Michael Di Domenico wrote:
> Yes, I have a group of machines with 8131 and MT23108 cards.
> Supposedly, based on the AMD errata i need to turn the cards' MTU down
> from 2048 to below 1500

Then you need to use 1024 or smaller (512 or 256).

> No i'm not running OpenSM,

And that SM doesn't support cranking the PR MTU down for Tavor
communication?

> i'd prefer to do this at the machine level
> if possible rather than the fabric level.

I think the MTUCap of the .ini file would need to be changed in order to
get the SM to negotiate the link MTU (NeighborMTU) smaller. MTUCap 3 is
1024.
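
For readers following the numbers in this exchange, here is a minimal sketch of the MTU encoding being discussed, assuming the standard IB MTU codes (1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096 bytes). The helper names ib_mtu_code_to_bytes() and ib_mtu_code_below() are made up for illustration and are not part of any OFED library. It shows why "below 1500" ends up meaning code 3 (1024 bytes): there is no IB MTU between 1024 and 2048.

```c
#include <assert.h>	/* used by the sanity checks in the usage note */

/* IB MTU codes as carried in PortInfo (MTUCap, NeighborMTU) and in
 * PathRecords: 1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096 bytes. */
static const int ib_mtu_bytes[] = { 0, 256, 512, 1024, 2048, 4096 };

/* Hypothetical helper: convert an MTU code to bytes, -1 if invalid. */
int ib_mtu_code_to_bytes(int code)
{
    if (code < 1 || code > 5)
        return -1;
    return ib_mtu_bytes[code];
}

/* Hypothetical helper: largest MTU code whose size does not exceed
 * limit_bytes, or -1 if even 256 bytes is too large. */
int ib_mtu_code_below(int limit_bytes)
{
    int code, best = -1;

    for (code = 1; code <= 5; code++)
        if (ib_mtu_bytes[code] <= limit_bytes)
            best = code;
    return best;
}
```

For example, ib_mtu_code_below(1500) yields code 3, and ib_mtu_code_to_bytes(3) yields 1024, matching "MTUCap 3 is 1024" above.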

-- Hal

> On Mon, Aug 17, 2009 at 9:00 AM, Hal Rosenstock wrote:
> > On Mon, Aug 17, 2009 at 8:52 AM, Michael Di Domenico wrote:
> >>
> >> Yes, for the IB card itself, not IPoIB
> >
> > The MTUCap of the card can affect the NeighborMTU which in turn can
> > affect IPoIB MTU (for UD mode).
> >
> > -- Hal
> >
> >> On Mon, Aug 17, 2009 at 8:48 AM, Dotan Barak wrote:
> >> > Hi.
> >> >
> >> > Which MTU are you trying to change?
> >> > (of the IB link?)
> >> >
> >> > Dotan
> >> >
> >> > On Mon, Aug 17, 2009 at 3:16 PM, Michael Di Domenico wrote:
> >> >> How do I change the MTU of an MT23108 card? I have an AMD 8131
> >> >> chipset server that needs this turned down below 1500, or at least
> >> >> that's what's suspected.

From mdidomenico4 at gmail.com Mon Aug 17 06:40:19 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 09:40:19 -0400
Subject: [ofa-general] change mtu
In-Reply-To:
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID:

On Mon, Aug 17, 2009 at 9:09 AM, Hal Rosenstock wrote:
>> No i'm not running OpenSM,
>
> And that SM doesn't support cranking the PR MTU down for Tavor communication

Dunno, didn't check... But i shall. How do i do it in OpenSM?

>>
>> i'd prefer to do this at the machine level
>> if possible rather than the fabric level.
>
> I think the MTUCap of the .ini file would need to be changed in order to get
> the SM to negotiate the link MTU (NeighborMTU) smaller. MTUCap 3 is 1024.

From hal.rosenstock at gmail.com Mon Aug 17 06:53:27 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Aug 2009 09:53:27 -0400
Subject: [ofa-general] change mtu
In-Reply-To:
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID:

On Mon, Aug 17, 2009 at 9:40 AM, Michael Di Domenico wrote:
> On Mon, Aug 17, 2009 at 9:09 AM, Hal Rosenstock wrote:
> >> No i'm not running OpenSM,
> >
> > And that SM doesn't support cranking the PR MTU down for Tavor
> > communication
>
> Dunno, didn't check... But i shall. How do i do it in OpenSM?

enable_quirks option

> >> i'd prefer to do this at the machine level
> >> if possible rather than the fabric level.
> >
> > I think the MTUCap of the .ini file would need to be changed in order to get
> > the SM to negotiate the link MTU (NeighborMTU) smaller. MTUCap 3 is 1024.

From mdidomenico4 at gmail.com Mon Aug 17 07:18:12 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 10:18:12 -0400
Subject: [ofa-general] change mtu
In-Reply-To:
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID:

On Mon, Aug 17, 2009 at 9:53 AM, Hal Rosenstock wrote:
> On Mon, Aug 17, 2009 at 9:40 AM, Michael Di Domenico wrote:
>> On Mon, Aug 17, 2009 at 9:09 AM, Hal Rosenstock wrote:
>> >> No i'm not running OpenSM,
>> >
>> > And that SM doesn't support cranking the PR MTU down for Tavor
>> > communication
>>
>> Dunno, didn't check... But i shall. How do i do it in OpenSM?
>
> enable_quirks option

Did the option really have to be case sensitive? Come on...

I set the opensm config and the rdma_cm tavor_quirk=1

it's showing in the log file that it picked up the cached options, but
it's still showing 2048 MTU under ibv_devinfo

Did i miss a step?

From hal.rosenstock at gmail.com Mon Aug 17 07:59:19 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Aug 2009 10:59:19 -0400
Subject: [ofa-general] change mtu
In-Reply-To:
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID:

On Mon, Aug 17, 2009 at 10:18 AM, Michael Di Domenico <mdidomenico4 at gmail.com> wrote:
> On Mon, Aug 17, 2009 at 9:53 AM, Hal Rosenstock wrote:
> > On Mon, Aug 17, 2009 at 9:40 AM, Michael Di Domenico wrote:
> >> On Mon, Aug 17, 2009 at 9:09 AM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> >> >> No i'm not running OpenSM,
> >> >
> >> > And that SM doesn't support cranking the PR MTU down for Tavor
> >> > communication
> >>
> >> Dunno, didn't check... But i shall. How do i do it in OpenSM?
> >
> > enable_quirks option
>
> Did the option really have to be case sensitive? Come on...
>
> I set the opensm config and the rdma_cm tavor_quirk=1

I'm unfamiliar with this RDMA CM option.

> it's showing in the log file that it picked up the cached options, but
> it's still showing 2048 MTU under ibv_devinfo

It wouldn't show there; only in SA PR responses.

-- Hal

> Did i miss a step?

From mdidomenico4 at gmail.com Mon Aug 17 08:04:10 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 17 Aug 2009 11:04:10 -0400
Subject: [ofa-general] change mtu
In-Reply-To:
References: <2f3bf9a60908170548v6c62b01fy221e7ddd63bc6591@mail.gmail.com>
Message-ID:

>> it's showing in the log file that it picked up the cached options, but
>> it's still showing 2048 MTU under ibv_devinfo
>
> It wouldn't show there; only in SA PR responses.

Okay.

Here's what i've enabled so far, for those keeping track....

Bios = MTRR changed from Continuous to Discrete
options ib_mthca msi_x=1 tune_pci=1
options rdma_cm tavor_quirk=1

I'm now pushing 550MB/sec, which is a remarkable improvement.

From weiny2 at llnl.gov Mon Aug 17 08:30:23 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 17 Aug 2009 08:30:23 -0700
Subject: [ofa-general] Re: [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a context object.
In-Reply-To: <20090816110200.GS25501@me>
References: <20090813204306.dffc3237.weiny2@llnl.gov> <20090816110200.GS25501@me>
Message-ID: <20090817083023.da17378b.weiny2@llnl.gov>

On Sun, 16 Aug 2009 14:02:00 +0300 Sasha Khapyorsky wrote:

> On 20:43 Thu 13 Aug , Ira Weiny wrote:
> >
> > From: Ira Weiny
> > Date: Thu, 13 Aug 2009 20:16:01 -0700
> > Subject: [PATCH] infiniband-diags/libibnetdisc: Introduce a context object.
> >
> > This object must be created before query functions can be used and is
> > used to control the functionality of the queries.
>
> Why is it needed? I see that it complicates the API, but what are the benefits?

The immediate benefit is coming with the multi-threaded implementation
where I plan on adding the following function.[*]

ibnd_set_num_threads(ibnd_ctx_t *ctx, int num);

Set/get functions can be added for anything which we need to pass to
discover without changing the discover (or other query) functionality
and breaking the API.

This also allows us to keep some state for the library private. For
example, I might persist threads across calls to discover and only
destroy them on an ibnd_destroy_ctx call.

Ira

[*] and the reason behind this function is that I feel the proper number
of threads is going to be variable depending on the size and layout of
the fabric being processed as well as the number of CPUs available on
the node.
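
A rough sketch of the context-object pattern described above may help make the argument concrete. This is not the libibnetdisc code: the demo_* names, the single num_threads field and its default of 1 are all assumptions made for illustration. The point it tries to show is that new tunables can be added through set/get functions on the context without changing the signature of the query functions.

```c
#include <assert.h>	/* used by the sanity checks in the usage note */
#include <stdlib.h>

/* Hypothetical context type. In a real library only a forward
 * declaration would be visible in the public header, so fields can be
 * added later without breaking callers. */
typedef struct demo_ctx {
    int num_threads;	/* tunable consulted by later queries */
} demo_ctx_t;

/* Allocate a context with an assumed default of one thread. */
demo_ctx_t *demo_create_ctx(void)
{
    demo_ctx_t *ctx = calloc(1, sizeof(*ctx));

    if (ctx)
        ctx->num_threads = 1;
    return ctx;
}

/* Setter in the spirit of the proposed ibnd_set_num_threads(). */
int demo_set_num_threads(demo_ctx_t *ctx, int num)
{
    if (!ctx || num < 1)
        return -1;
    ctx->num_threads = num;
    return 0;
}

int demo_get_num_threads(const demo_ctx_t *ctx)
{
    return ctx ? ctx->num_threads : -1;
}

/* Private state (for example worker threads) would be released here. */
void demo_destroy_ctx(demo_ctx_t *ctx)
{
    free(ctx);
}
```

A caller would create the context once, adjust it with the setters, pass it to each query, and destroy it at the end; adding another tunable later means one more set/get pair rather than a new parameter on every query function.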

>
> Sasha

--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov

From taylor at hpc.ufl.edu Mon Aug 17 09:10:25 2009
From: taylor at hpc.ufl.edu (Charles A. Taylor)
Date: Mon, 17 Aug 2009 12:10:25 -0400
Subject: [ofa-general] IPoIB Transmit Timeouts
Message-ID: <1250525425.20238.37.camel@hotshot.phys.ufl.edu>

We upgraded our file servers to OFED 1.4.1 last Thursday and have since
been hit with a daily ration of the following across all eight of our
servers...

Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647

The difference between the head/tail is always 123. The send queue size
is 128 according to...

cat /sys/module/ib_ipoib/parameters/send_queue_size
128

From the post below, others seem to have encountered this but we have
not seen any patches or work-arounds. Has anyone solved this problem?
The servers were very stable under OFED 1.2. We are running the
Lustre-patched kernel but we did that under OFED 1.2 + lustre 1.6.4.2 as
well and I'm pretty sure they don't touch the IB modules.

Relevant information:
=====================
CentOS 5.3
Lustre 1.8.0.1
2.6.18-128.1.6.el5_lustre.1.8.0.1smp
X86_64 (Opteron 275s)

hca_id: mthca0
        fw_ver:                 4.8.200
        node_guid:              0005:ad00:0004:668c
        sys_image_guid:         0002:c900:0100:d050
        vendor_id:              0x02c9
        vendor_part_id:         25208
        hw_ver:                 0xA0
        board_id:               MT_00A0000001
        phys_port_cnt:          2
                port:   1
                        state:          active (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       49
                        port_lmc:       0x00

                port:   2
                        state:          active (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       98
                        port_lmc:       0x00

Charlie Taylor
UF HPC Center

> On Wed, Jul 29, 2009 at 2:14 PM, Pradeep Satyanarayana <
> prade...
at linux.vnet.ibm.com> wrote:
>
> > Hal Rosenstock wrote:
> > > Hi,
> > >
> > > I'm seeing the following messages from IPoIB:
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > NETDEV WATCHDOG: ib0: transmit timed out
> > > ib0: transmit timeout: latency 1374 msecs
> > > ib0: queue stopped 1, tx_head 140245691, tx_tail 140245565
> > >
> > > What are the possible (and most likely) causes of post_send failures ? I
> > > went through the code for all the errors (some at the driver level) but
> > > none popped out at me.
> > >
> >
> > Is it possible that the receiver is overwhelmed and hence the tx_ring is
> > full?
>

From bryan.d.green at nasa.gov Mon Aug 17 13:20:36 2009
From: bryan.d.green at nasa.gov (Bryan Green)
Date: Mon, 17 Aug 2009 13:20:36 -0700
Subject: [ofa-general] librdmacm - okay to select on a cm channel's file descriptor?
In-Reply-To: Your message of "Sun, 16 Aug 2009 22:47:02 CDT."
Message-ID: <20090817202036.46FA42391C7@ece06.nas.nasa.gov>

Roland Dreier writes:
>
> > In an attempt to get unexpected DISCONNECT notifications during
> > ib communication, I'm trying to use 'select()' on the cm channel's file
> > descriptor, testing it for readability. I've found that this works some of
> > the time, but not all of the time.
>
> What happens when it doesn't work? select() doesn't give you an event
> but when you try to read there actually is an event there?

Yes.

> I took a quick look at the ucma kernel code and the implementation of
> select() (the kernel uses poll() as the name but it all ends up in the
> same code) looks straightforwardly correct -- there's only one place
> where events are added to the queue for a file, and that place wakes up
> the poll wait queue. But maybe there is a funny bug somehow.
>
> - R.

Thanks for confirming that use of select() should work.
I wasn't sure because I couldn't find any reference to select() in the
docs or examples.

After further investigation, I found that I was neglecting to
reinitialize the fd_set structure after each call to select(). It's
working great now.

Thanks,
-bryan

From weiny2 at llnl.gov Mon Aug 17 14:03:38 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 17 Aug 2009 14:03:38 -0700
Subject: [ofa-general] Re: [PATCH 3/5 V2] libibnetdisc: make all fields of ibnd_fabric_t public
In-Reply-To: <20090816114127.GW25501@me>
References: <20090813204251.df6446c1.weiny2@llnl.gov> <20090816114127.GW25501@me>
Message-ID: <20090817140338.edd83fe0.weiny2@llnl.gov>

On Sun, 16 Aug 2009 14:41:27 +0300 Sasha Khapyorsky wrote:

> On 20:42 Thu 13 Aug , Ira Weiny wrote:
> >
> > @@ -108,8 +107,8 @@ typedef struct ibnd_port {
> >  /** =========================================================================
> >   * Chassis
> >   */
> > -typedef struct chassis {
> > -    struct chassis *next;
> > +typedef struct ibnd_chassis {
> > +    struct ibnd_chassis *next;
> >      uint64_t chassisguid;
> >      unsigned char chassisnum;
> >
> > @@ -124,11 +123,17 @@ typedef struct chassis {
> >      ibnd_node_t *linenode[LINES_MAX_NUM + 1];
> >  } ibnd_chassis_t;
> >
> > +/* HASH table defines */
> > +#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
>
> Why should this macro be published (by moving from internal.h to
> ibnetdisc.h)?
>
> As far as I can see it is only used in ibnetdisc.c, so actually we can
> keep it internal and move it to this file.
>

You are right, good catch. I just copied it blindly with HTSZ which must
be there.

git am is not working now on the last two patches [4/5 and 5/5] so I am
sending new versions of them so that they apply cleanly.
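
For reference, a small standalone sketch of what the macro under discussion computes. HASHGUID is copied from the quoted hunk and HTSZ (137) from the V2 patch; the guid_bucket() wrapper is a made-up name for illustration only. The macro multiplies the low and high 32-bit halves of the GUID by two small primes, XORs them, and the result modulo HTSZ selects a hash-table bucket.

```c
#include <assert.h>	/* used by the sanity checks in the usage note */
#include <stdint.h>

#define HTSZ 137
#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))

/* Hypothetical wrapper: bucket index for a node or port GUID. */
unsigned int guid_bucket(uint64_t guid)
{
    return HASHGUID(guid) % HTSZ;
}
```

For a GUID of 1 only the low half contributes (1 * 101), so the bucket is 101; for 0x0000000100000001 both halves contribute and 101 ^ 103 gives bucket 2.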
V2 below, Ira From: Ira Weiny Date: Thu, 13 Aug 2009 20:08:51 -0700 Subject: [PATCH] libibnetdisc: make all fields of ibnd_fabric_t public In addition clean up the name of the chassis struct Signed-off-by: Ira Weiny --- .../libibnetdisc/include/infiniband/ibnetdisc.h | 38 ++++++++---- infiniband-diags/libibnetdisc/src/chassis.c | 23 ++++---- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 63 +++++++++----------- infiniband-diags/libibnetdisc/src/internal.h | 21 ------- 4 files changed, 66 insertions(+), 79 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index 4a57855..c55ce00 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -38,8 +38,7 @@ #include #include -struct ib_fabric; /* forward declare */ -struct chassis; /* forward declare */ +struct ibnd_chassis; /* forward declare */ struct ibnd_port; /* forward declare */ /** ========================================================================= @@ -67,13 +66,13 @@ typedef struct ibnd_node { char nodedesc[IB_SMP_DATA_SIZE]; - struct ibnd_port **ports; /* in order array of port pointers - the size of this array is info.numports + 1 - items MAY BE NULL! (ie 0 == switches only) */ + struct ibnd_port **ports; /* in order array of port pointers + the size of this array is info.numports + 1 + items MAY BE NULL! 
(ie 0 == switches only) */ /* chassis info */ struct ibnd_node *next_chassis_node; /* next node in ibnd_chassis_t->nodes */ - struct chassis *chassis; /* if != NULL the chassis this node belongs to */ + struct ibnd_chassis *chassis; /* if != NULL the chassis this node belongs to */ unsigned char ch_type; unsigned char ch_anafanum; unsigned char ch_slotnum; @@ -92,9 +91,9 @@ typedef struct ibnd_node { typedef struct ibnd_port { uint64_t guid; int portnum; - int ext_portnum; /* optional if != 0 external port num */ - ibnd_node_t *node; /* node this port belongs to */ - struct ibnd_port *remoteport; /* null if SMA, or does not exist */ + int ext_portnum; /* optional if != 0 external port num */ + ibnd_node_t *node; /* node this port belongs to */ + struct ibnd_port *remoteport; /* null if SMA, or does not exist */ /* quick cache of info below */ uint16_t base_lid; uint8_t lmc; @@ -108,8 +107,8 @@ typedef struct ibnd_port { /** ========================================================================= * Chassis */ -typedef struct chassis { - struct chassis *next; +typedef struct ibnd_chassis { + struct ibnd_chassis *next; uint64_t chassisguid; unsigned char chassisnum; @@ -124,11 +123,14 @@ typedef struct chassis { ibnd_node_t *linenode[LINES_MAX_NUM + 1]; } ibnd_chassis_t; +#define HTSZ 137 +#define MAXHOPS 63 + /** ========================================================================= * Fabric * Main fabric object which is returned and represents the data discovered */ -typedef struct ib_fabric { +typedef struct ibnd_fabric { /* the node the discover was initiated from * "from" parameter in ibnd_discover_fabric * or by default the node you ar running on @@ -139,6 +141,18 @@ typedef struct ib_fabric { /* NULL terminated list of all chassis found in the fabric */ ibnd_chassis_t *chassis; int maxhops_discovered; + + /* internal use only */ + ibnd_node_t *nodestbl[HTSZ]; + ibnd_port_t *portstbl[HTSZ]; + ibnd_node_t *nodesdist[MAXHOPS + 1]; + ibnd_chassis_t 
*first_chassis; + ibnd_chassis_t *current_chassis; + ibnd_chassis_t *last_chassis; + ibnd_node_t *switches; + ibnd_node_t *ch_adapters; + ibnd_node_t *routers; + ib_portid_t selfportid; } ibnd_fabric_t; /** ========================================================================= diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c index 0dd259a..4886cfc 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.c +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -91,7 +91,7 @@ char *ibnd_get_chassis_slot_str(ibnd_node_t * node, char *str, size_t size) return (str); } -static ibnd_chassis_t *find_chassisnum(struct ibnd_fabric *fabric, +static ibnd_chassis_t *find_chassisnum(ibnd_fabric_t * fabric, unsigned char chassisnum) { ibnd_chassis_t *current; @@ -207,14 +207,14 @@ static uint64_t get_chassisguid(ibnd_node_t * node) return sysimgguid; } -static ibnd_chassis_t *find_chassisguid(struct ibnd_fabric *f, +static ibnd_chassis_t *find_chassisguid(ibnd_fabric_t * fabric, ibnd_node_t * node) { ibnd_chassis_t *current; uint64_t chguid; chguid = get_chassisguid(node); - for (current = f->first_chassis; current; current = current->next) { + for (current = fabric->first_chassis; current; current = current->next) { if (current->chassisguid == chguid) return current; } @@ -224,7 +224,6 @@ static ibnd_chassis_t *find_chassisguid(struct ibnd_fabric *f, uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum) { - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); ibnd_chassis_t *chassis; if (!fabric) { @@ -232,7 +231,7 @@ uint64_t ibnd_get_chassis_guid(ibnd_fabric_t * fabric, unsigned char chassisnum) return 0; } - chassis = find_chassisnum(f, chassisnum); + chassis = find_chassisnum(fabric, chassisnum); if (chassis) return chassis->chassisguid; else @@ -783,7 +782,7 @@ static void voltaire_portmap(ibnd_port_t * port) port->ext_portnum = int2ext_map_slb8[chipnum][portnum]; } -static int 
add_chassis(struct ibnd_fabric *fabric) +static int add_chassis(ibnd_fabric_t * fabric) { if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) { IBND_ERROR("OOM: failed to allocate chassis object\n"); @@ -819,7 +818,7 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node) Returns: 0 on success, -1 on failure */ -int group_nodes(struct ibnd_fabric *fabric) +int group_nodes(ibnd_fabric_t * fabric) { ibnd_node_t *node; int dist; @@ -833,7 +832,7 @@ int group_nodes(struct ibnd_fabric *fabric) /* an appropriate chassis record (slotnum and position) */ /* according to internal connectivity */ /* not very efficient but clear code so... */ - for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { for (node = fabric->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) @@ -844,7 +843,7 @@ int group_nodes(struct ibnd_fabric *fabric) /* separate every Voltaire chassis from each other and build linked list of them */ /* algorithm: catch spine and find all surrounding nodes */ - for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { for (node = fabric->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) != VTR_VENDOR_ID) @@ -863,7 +862,7 @@ int group_nodes(struct ibnd_fabric *fabric) /* now make pass on nodes for chassis which are not Voltaire */ /* grouped by common SystemImageGUID */ - for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { for (node = fabric->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) @@ -913,12 +912,12 @@ int group_nodes(struct ibnd_fabric *fabric) } } } - if (dist == fabric->fabric.maxhops_discovered) + if (dist == 
fabric->maxhops_discovered) dist = MAXHOPS; /* skip to CAs */ else dist++; } - fabric->fabric.chassis = fabric->first_chassis; + fabric->chassis = fabric->first_chassis; return (0); } diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 5d506ee..c69467e 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -67,7 +67,7 @@ void decode_port_info(ibnd_port_t * port) } static int get_port_info(struct ibmad_port *ibmad_port, - struct ibnd_fabric *fabric, ibnd_port_t * port, + ibnd_fabric_t * fabric, ibnd_port_t * port, int portnum, ib_portid_t * portid) { char width[64], speed[64]; @@ -98,7 +98,7 @@ static int get_port_info(struct ibmad_port *ibmad_port, * Returns -1 if error. */ static int query_node_info(struct ibmad_port *ibmad_port, - struct ibnd_fabric *fabric, ibnd_node_t * node, + ibnd_fabric_t * fabric, ibnd_node_t * node, ib_portid_t * portid) { if (!smp_query_via(&(node->info), portid, IB_ATTR_NODE_INFO, 0, 0, @@ -116,7 +116,7 @@ static int query_node_info(struct ibmad_port *ibmad_port, /* * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. 
*/ -static int query_node(struct ibmad_port *ibmad_port, struct ibnd_fabric *fabric, +static int query_node(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric, ibnd_node_t * node, ibnd_port_t * port, ib_portid_t * portid) { @@ -175,28 +175,28 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport) return path->cnt; } -static int extend_dpath(struct ibmad_port *ibmad_port, struct ibnd_fabric *f, +static int extend_dpath(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric, ib_portid_t * portid, int nextport) { int rc = 0; if (portid->lid) { /* If we were LID routed we need to set up the drslid */ - if (!f->selfportid.lid) - if (ib_resolve_self_via(&f->selfportid, NULL, NULL, + if (!fabric->selfportid.lid) + if (ib_resolve_self_via(&fabric->selfportid, NULL, NULL, ibmad_port) < 0) { IBND_ERROR("Failed to resolve self\n"); return -1; } - portid->drpath.drslid = (uint16_t) f->selfportid.lid; + portid->drpath.drslid = (uint16_t) fabric->selfportid.lid; portid->drpath.drdlid = 0xFFFF; } rc = add_port_to_dpath(&portid->drpath, nextport); - if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered)) - f->fabric.maxhops_discovered = portid->drpath.cnt; + if ((rc != -1) && (portid->drpath.cnt > fabric->maxhops_discovered)) + fabric->maxhops_discovered = portid->drpath.cnt; return (rc); } @@ -215,7 +215,7 @@ static void dump_endnode(ib_portid_t * path, char *prompt, port->base_lid + (1 << port->lmc) - 1, node->nodedesc); } -static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric, +static ibnd_node_t *find_existing_node(ibnd_fabric_t * fabric, ibnd_node_t * new) { int hash = HASHGUID(new->guid) % HTSZ; @@ -230,7 +230,6 @@ static ibnd_node_t *find_existing_node(struct ibnd_fabric *fabric, ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid) { - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); int hash = HASHGUID(guid) % HTSZ; ibnd_node_t *node; @@ -239,7 +238,7 @@ ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * 
fabric, uint64_t guid) return (NULL); } - for (node = f->nodestbl[hash]; node; node = node->htnext) + for (node = fabric->nodestbl[hash]; node; node = node->htnext) if (node->guid == guid) return (ibnd_node_t *) node; @@ -267,7 +266,6 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, char portinfo_port0[IB_SMP_DATA_SIZE]; void *nd = node->nodedesc; int p = 0; - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); if (_check_ibmad_port(ibmad_port) < 0) return (NULL); @@ -282,7 +280,7 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, return (NULL); } - if (query_node_info(ibmad_port, f, node, &(node->path_portid))) + if (query_node_info(ibmad_port, fabric, node, &(node->path_portid))) return (NULL); if (!smp_query_via(nd, &(node->path_portid), IB_ATTR_NODE_DESC, 0, 0, @@ -291,7 +289,7 @@ ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, /* update all the port info's */ for (p = 1; p >= node->numports; p++) { - get_port_info(ibmad_port, f, node->ports[p], + get_port_info(ibmad_port, fabric, node->ports[p], p, &(node->path_portid)); } @@ -319,7 +317,6 @@ done: ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str) { - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); int i = 0; ibnd_node_t *rc; ib_dr_path_t path; @@ -329,7 +326,7 @@ ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str) return (NULL); } - rc = f->fabric.from_node; + rc = fabric->from_node; if (str2drpath(&path, dr_str, 0, 0) == -1) { return (NULL); @@ -368,7 +365,7 @@ static void add_to_portguid_hash(ibnd_port_t * port, ibnd_port_t * hash[]) hash[hash_idx] = port; } -static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric) +static void add_to_type_list(ibnd_node_t * node, ibnd_fabric_t * fabric) { switch (node->type) { case IB_NODE_CA: @@ -386,7 +383,7 @@ static void add_to_type_list(ibnd_node_t * node, struct ibnd_fabric *fabric) } } -static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric 
*fabric) +static void add_to_nodedist(ibnd_node_t * node, ibnd_fabric_t * fabric) { int dist = node->dist; if (node->type != IB_NODE_SWITCH) @@ -396,7 +393,7 @@ static void add_to_nodedist(ibnd_node_t * node, struct ibnd_fabric *fabric) fabric->nodesdist[dist] = node; } -static ibnd_node_t *create_node(struct ibnd_fabric *fabric, +static ibnd_node_t *create_node(ibnd_fabric_t * fabric, ibnd_node_t * temp, ib_portid_t * path, int dist) { @@ -415,8 +412,8 @@ static ibnd_node_t *create_node(struct ibnd_fabric *fabric, add_to_nodeguid_hash(node, fabric->nodestbl); /* add this to the all nodes list */ - node->next = fabric->fabric.nodes; - fabric->fabric.nodes = (ibnd_node_t *) node; + node->next = fabric->nodes; + fabric->nodes = (ibnd_node_t *) node; add_to_type_list(node, fabric); add_to_nodedist(node, fabric); @@ -433,7 +430,7 @@ static struct ibnd_port *find_existing_port_node(ibnd_node_t * node, return (node->ports[port->portnum]); } -static struct ibnd_port *add_port_to_node(struct ibnd_fabric *fabric, +static struct ibnd_port *add_port_to_node(ibnd_fabric_t * fabric, ibnd_node_t * node, ibnd_port_t * temp) { @@ -479,7 +476,7 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port, } static int get_remote_node(struct ibmad_port *ibmad_port, - struct ibnd_fabric *fabric, ibnd_node_t * node, + ibnd_fabric_t * fabric, ibnd_node_t * node, ibnd_port_t * port, ib_portid_t * path, int portnum, int dist) { @@ -541,7 +538,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, ib_portid_t * from, int hops) { int rc = 0; - struct ibnd_fabric *fabric = NULL; + ibnd_fabric_t *fabric = NULL; ib_portid_t my_portid = { 0 }; ibnd_node_t node_buf; ibnd_port_t port_buf; @@ -587,7 +584,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, if (!node) goto error; - fabric->fabric.from_node = (ibnd_node_t *) node; + fabric->from_node = (ibnd_node_t *) node; port = add_port_to_node(fabric, node, &port_buf); if (!port) @@ -668,7 +665,6 @@ 
static void destroy_node(ibnd_node_t * node) void ibnd_destroy_fabric(ibnd_fabric_t * fabric) { - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); int dist = 0; ibnd_node_t *node = NULL; ibnd_node_t *next = NULL; @@ -677,21 +673,21 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric) if (!fabric) return; - ch = f->first_chassis; + ch = fabric->first_chassis; while (ch) { ch_next = ch->next; free(ch); ch = ch_next; } for (dist = 0; dist <= MAXHOPS; dist++) { - node = f->nodesdist[dist]; + node = fabric->nodesdist[dist]; while (node) { next = node->dnext; destroy_node(node); node = next; } } - free(f); + free(fabric); } void ibnd_debug(int i) @@ -735,7 +731,6 @@ void ibnd_iter_nodes(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func, void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func, int node_type, void *user_data) { - struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); ibnd_node_t *list = NULL; ibnd_node_t *cur = NULL; @@ -751,13 +746,13 @@ void ibnd_iter_nodes_type(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func, switch (node_type) { case IB_NODE_SWITCH: - list = f->switches; + list = fabric->switches; break; case IB_NODE_CA: - list = f->ch_adapters; + list = fabric->ch_adapters; break; case IB_NODE_ROUTER: - list = f->routers; + list = fabric->routers; break; default: IBND_DEBUG("Invalid node_type specified %d\n", node_type); diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index f06d2c3..21ff476 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -40,8 +40,6 @@ #include -#define MAXHOPS 63 - #define IBND_DEBUG(fmt, ...) 
\ if (ibdebug) { \ printf("%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__); \ @@ -51,24 +49,5 @@ /* HASH table defines */ #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) -#define HTSZ 137 - -struct ibnd_fabric { - /* This member MUST BE FIRST */ - ibnd_fabric_t fabric; - - /* internal use only */ - ibnd_node_t *nodestbl[HTSZ]; - ibnd_port_t *portstbl[HTSZ]; - ibnd_node_t *nodesdist[MAXHOPS + 1]; - ibnd_chassis_t *first_chassis; - ibnd_chassis_t *current_chassis; - ibnd_chassis_t *last_chassis; - ibnd_node_t *switches; - ibnd_node_t *ch_adapters; - ibnd_node_t *routers; - ib_portid_t selfportid; -}; -#define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric) #endif /* _INTERNAL_H_ */
-- 1.5.4.5

From weiny2 at llnl.gov Mon Aug 17 14:03:41 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 17 Aug 2009 14:03:41 -0700
Subject: [ofa-general] Re: [PATCH 4/5 v2] infiniband-diags/libibnetdisc: Introduce a context object.
In-Reply-To: <20090817083023.da17378b.weiny2@llnl.gov>
References: <20090813204306.dffc3237.weiny2@llnl.gov> <20090816110200.GS25501@me> <20090817083023.da17378b.weiny2@llnl.gov>
Message-ID: <20090817140341.3dcccc10.weiny2@llnl.gov>

From: Ira Weiny
Date: Mon, 17 Aug 2009 13:10:45 -0700
Subject: [PATCH] infiniband-diags/libibnetdisc: Introduce a context object.

This object must be created before "query" functions can be used. The purpose of this is to allow for future data to be passed to query functions (i.e. ibnd_discover_fabric) without having to change the API of those functions.
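For anyone not familiar with the pattern, here is a minimal standalone sketch of what a context object buys you. Every demo_* name and DEMO_MAX_HOPS is hypothetical and made up for illustration; none of it is libibnetdisc code, and the actual changes are in the diff that follows.

```c
#include <assert.h>
#include <stdlib.h>

/* All demo_* names are hypothetical; none of this is libibnetdisc code. */
#define DEMO_MAX_HOPS 63

struct demo_ctx {
	int port;		/* stands in for the transport handle */
	int show_progress;	/* an option added without touching demo_query() */
};

struct demo_ctx *demo_create_ctx(int port)
{
	struct demo_ctx *ctx = calloc(1, sizeof *ctx);

	if (!ctx)
		return NULL;
	ctx->port = port;
	return ctx;
}

void demo_destroy_ctx(struct demo_ctx *ctx)
{
	free(ctx);
}

/* Set the option; returns the previous setting, or -1 if ctx is NULL. */
int demo_show_progress(struct demo_ctx *ctx, int on)
{
	int prev;

	if (!ctx)
		return -1;
	prev = ctx->show_progress;
	ctx->show_progress = on;
	return prev;
}

/*
 * A "query" function: its signature depends only on the context, so new
 * per-query settings can be added to struct demo_ctx later.
 * Returns the hop limit it would use, or -1 if ctx is NULL.
 */
int demo_query(const struct demo_ctx *ctx, int hops)
{
	if (!ctx)
		return -1;
	/* a negative hop count means "no limit" in this sketch */
	return hops < 0 ? DEMO_MAX_HOPS : hops;
}

/* Typical call sequence; returns 0 on success, -1 on allocation failure. */
int demo_usage(void)
{
	struct demo_ctx *ctx = demo_create_ctx(1);
	int prev, limit;

	if (!ctx)
		return -1;
	prev = demo_show_progress(ctx, 1);
	assert(prev == 0);	/* calloc left the option off */
	limit = demo_query(ctx, 2);
	assert(limit == 2);
	demo_destroy_ctx(ctx);
	return 0;
}
```

The point of the sketch is that adding a second option to struct demo_ctx would need a new setter but no change to the demo_query() prototype or its callers.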
Adjusted to apply to v2 of "libibnetdisc: make all fields of ibnd_fabric_t public" Signed-off-by: Ira Weiny --- infiniband-diags/libibnetdisc/Makefile.am | 4 +- .../libibnetdisc/include/infiniband/ibnetdisc.h | 23 ++++-- .../libibnetdisc/man/ibnd_create_ctx.3 | 2 + .../libibnetdisc/man/ibnd_destroy_ctx.3 | 2 + .../libibnetdisc/man/ibnd_discover_fabric.3 | 41 ++++++++--- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 74 ++++++++++++++------ infiniband-diags/libibnetdisc/src/internal.h | 5 ++ infiniband-diags/libibnetdisc/src/libibnetdisc.map | 2 + infiniband-diags/libibnetdisc/test/testleaks.c | 7 ++- infiniband-diags/src/iblinkinfo.c | 14 ++-- infiniband-diags/src/ibnetdiscover.c | 13 +++- infiniband-diags/src/ibqueryerrors.c | 18 +++-- 12 files changed, 147 insertions(+), 58 deletions(-) create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3 diff --git a/infiniband-diags/libibnetdisc/Makefile.am b/infiniband-diags/libibnetdisc/Makefile.am index 7085f14..5619aad 100644 --- a/infiniband-diags/libibnetdisc/Makefile.am +++ b/infiniband-diags/libibnetdisc/Makefile.am @@ -45,7 +45,9 @@ man_MANS = man/ibnd_debug.3 \ man/ibnd_iter_nodes.3 \ man/ibnd_iter_nodes_type.3 \ man/ibnd_show_progress.3 \ - man/ibnd_update_node.3 + man/ibnd_update_node.3 \ + man/ibnd_create_ctx.3 \ + man/ibnd_destroy_ctx.3 EXTRA_DIST = $(srcdir)/src/libibnetdisc.map libibnetdisc.ver $(man_MANS) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index c55ce00..ce1c74f 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -38,8 +38,11 @@ #include #include -struct ibnd_chassis; /* forward declare */ -struct ibnd_port; /* forward declare */ +typedef struct ibnd_ctx ibnd_ctx_t; + +/* forward declares */ +struct ibnd_chassis; +struct ibnd_port; 
/** ========================================================================= * Node @@ -156,15 +159,21 @@ typedef struct ibnd_fabric { } ibnd_fabric_t; /** ========================================================================= - * Initialization (fabric operations) + * Initialization */ MAD_EXPORT void ibnd_debug(int i); -MAD_EXPORT void ibnd_show_progress(int i); -MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, +MAD_EXPORT ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port); +MAD_EXPORT void ibnd_destroy_ctx(ibnd_ctx_t * ctx); +MAD_EXPORT int ibnd_show_progress(ibnd_ctx_t * ctx, int i); + +/** ========================================================================= + * Fabric Operations + */ +MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, ib_portid_t * from, int hops); /** - * open: (required) ibmad_port object from libibmad + * ctx : (required) context created by ibnd_create_ctx. * from: (optional) specify the node to start scanning from. * If NULL start from the node we are running on. * hops: (optional) Specify how much of the fabric to traverse. 
@@ -178,7 +187,7 @@ MAD_EXPORT void ibnd_destroy_fabric(ibnd_fabric_t * fabric); MAD_EXPORT ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t * fabric, uint64_t guid); MAD_EXPORT ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t * fabric, char *dr_str); -MAD_EXPORT ibnd_node_t *ibnd_update_node(struct ibmad_port *ibmad_port, +MAD_EXPORT ibnd_node_t *ibnd_update_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, ibnd_node_t * node); diff --git a/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3 b/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3 new file mode 100644 index 0000000..8b321b0 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_create_ctx.3 @@ -0,0 +1,2 @@ +.\".TH IBND_CREATE_CTX 3 "Aug 12, 2009" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_discover_fabric.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3 b/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3 new file mode 100644 index 0000000..bb9d96a --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_destroy_ctx.3 @@ -0,0 +1,2 @@ +.\".TH IBND_DESTROY_CTX 3 "Aug 12, 2009" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_discover_fabric.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 index dfeaf47..f014977 100644 --- a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 +++ b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 @@ -1,46 +1,65 @@ .TH IBND_DISCOVER_FABRIC 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" .SH "NAME" -ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug ibnd_show_progress \- initialize ibnetdiscover library. +ibnd_create_ctx, ibnd_destroy_ctx, +ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug, ibnd_show_progress \- +initialize ibnetdiscover library and query the fabric. 
.SH "SYNOPSIS" .nf .B #include .sp -.bi "ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops)" +.BI "ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port)" +.BI "void ibnd_destroy_ctx(ibnd_ctx_t *ctx)" +.BI "ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t *ctx, ib_portid_t *from, int hops)" .BI "void ibnd_destroy_fabric(ibnd_fabric_t *fabric)" .BI "void ibnd_debug(int i)" -.BI "void ibnd_show_progress(int i)" +.BI "int ibnd_show_progress(ibnd_ctx_t *ctx, int i)" .SH "DESCRIPTION" -.B ibnd_discover_fabric() -Discover the fabric connected to the port specified by ibmad_port, using a timeout specified. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they chose. +.B ibnd_create_ctx() +Create a context for the ibnetdiscover library to be used in query operations. ibmad_port must be opened with at least IB_SMI_CLASS and IB_SMI_DIRECT_CLASS -classes for ibnd_discover_fabric to work. +classes for queries to work. + +.B ibnd_discover_fabric() +Discover the fabric using the context specified. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they choose. .B ibnd_destroy_fabric() free all memory and resources associated with the fabric. +.B ibnd_destroy_ctx() +free all memory and resources associated with the context. + .B ibnd_debug() Set the debug level to be printed as library operations take place. -.B ibnd_debug() -Indicate that the library should print debug output which shows it's progress +.B ibnd_show_progress() +Indicate that the library should print output which shows its progress through the fabric.
.SH "RETURN VALUE" +.B ibnd_create_ctx() +return NULL on failure, otherwise a valid ibnd_ctx_t object. + .B ibnd_discover_fabric() return NULL on failure, otherwise a valid ibnd_fabric_t object. -.B ibnd_destory_fabric(), ibnd_debug() +.B ibnd_show_progress() +Returns the previous setting for this value. + +.B ibnd_destroy_fabric(), ibnd_debug(), ibnd_destroy_ctx() NONE + .SH "EXAMPLES" .B Discover the entire fabric connected to device "mthca0", port 1. int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; struct ibmad_port *ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); - ibnd_fabric_t *fabric = ibnd_discover_fabric(ibmad_port, 100, NULL, 0); + ibnd_ctx_t *ctx = ibnd_create_ctx(ibmad_port); + ibnd_fabric_t *fabric = ibnd_discover_fabric(ctx, NULL, 0); ... ibnd_destroy_fabric(fabric); + ibnd_destroy_ctx(ctx); mad_rpc_close_port(ibmad_port); .B Discover only a single node and those nodes connected to it. @@ -48,7 +67,7 @@ NONE ... str2drpath(&(port_id.drpath), from, 0, 0); ... - ibnd_discover_fabric(ibmad_port, 100, &port_id, 1); + ibnd_discover_fabric(ctx, &port_id, 1); ...
.SH "SEE ALSO" libibmad, mad_rpc_open_port diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index c69467e..7295189 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -57,9 +57,23 @@ #include "internal.h" #include "chassis.h" -static int show_progress = 0; int ibdebug; +ibnd_ctx_t *ibnd_create_ctx(struct ibmad_port *ibmad_port) +{ + ibnd_ctx_t *rc = calloc(1, sizeof *rc); + if (!rc) + return (NULL); + + rc->ibmad_port = ibmad_port; + return (rc); +} + +void ibnd_destroy_ctx(ibnd_ctx_t * ctx) +{ + free(ctx); +} + void decode_port_info(ibnd_port_t * port) { port->base_lid = (uint16_t) mad_get_field(port->info, 0, IB_PORT_LID_F); @@ -204,8 +218,6 @@ static void dump_endnode(ib_portid_t * path, char *prompt, ibnd_node_t * node, ibnd_port_t * port) { char type[64]; - if (!show_progress) - return; mad_dump_node_type(type, 64, &(node->type), sizeof(int)); printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n", @@ -260,16 +272,29 @@ static int _check_ibmad_port(struct ibmad_port *ibmad_port) return (0); } -ibnd_node_t *ibnd_update_node(struct ibmad_port * ibmad_port, - ibnd_fabric_t * fabric, ibnd_node_t * node) +static int check_ctx(ibnd_ctx_t * ctx) +{ + if (!ctx) { + IBND_DEBUG("ctx must be specified\n"); + return (-1); + } + + return (_check_ibmad_port(ctx->ibmad_port)); +} + +ibnd_node_t *ibnd_update_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, + ibnd_node_t * node) { char portinfo_port0[IB_SMP_DATA_SIZE]; void *nd = node->nodedesc; int p = 0; + struct ibmad_port *ibmad_port; - if (_check_ibmad_port(ibmad_port) < 0) + if (check_ctx(ctx) < 0) return (NULL); + ibmad_port = ctx->ibmad_port; + if (!fabric) { IBND_DEBUG("fabric parameter NULL\n"); return (NULL); @@ -475,12 +500,12 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t * port, remoteport->remoteport = (ibnd_port_t *) port; } -static int get_remote_node(struct 
ibmad_port *ibmad_port, - ibnd_fabric_t * fabric, ibnd_node_t * node, - ibnd_port_t * port, ib_portid_t * path, - int portnum, int dist) +static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, + ibnd_node_t * node, ibnd_port_t * port, + ib_portid_t * path, int portnum, int dist) { int rc = 0; + struct ibmad_port *ibmad_port = ctx->ibmad_port; ibnd_node_t node_buf; ibnd_port_t port_buf; ibnd_node_t *remotenode, *oldnode; @@ -524,8 +549,9 @@ static int get_remote_node(struct ibmad_port *ibmad_port, goto error; } - dump_endnode(path, oldnode ? "known remote" : "new remote", - remotenode, remoteport); + if (ctx->show_progress) + dump_endnode(path, oldnode ? "known remote" : "new remote", + remotenode, remoteport); link_ports(node, port, remotenode, remoteport); @@ -534,7 +560,7 @@ error: return (rc); } -ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, +ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, ib_portid_t * from, int hops) { int rc = 0; @@ -549,7 +575,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, ib_portid_t *path; int max_hops = MAXHOPS - 1; /* default find everything */ - if (_check_ibmad_port(ibmad_port) < 0) + if (check_ctx(ctx) < 0) return (NULL); /* if not everything how much? 
*/ @@ -575,7 +601,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, memset(&node_buf, 0, sizeof(node_buf)); memset(&port_buf, 0, sizeof(port_buf)); - if (query_node(ibmad_port, fabric, &node_buf, &port_buf, from)) { + if (query_node(ctx->ibmad_port, fabric, &node_buf, &port_buf, from)) { IBND_DEBUG("can't reach node %s\n", portid2str(from)); goto error; } @@ -590,7 +616,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, if (!port) goto error; - rc = get_remote_node(ibmad_port, fabric, node, port, from, + rc = get_remote_node(ctx, fabric, node, port, from, mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F), 0); if (rc < 0) @@ -605,14 +631,15 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, path = &node->path_portid; IBND_DEBUG("dist %d node %p\n", dist, node); - dump_endnode(path, "processing", node, port); + if (ctx->show_progress) + dump_endnode(path, "processing", node, port); for (i = 1; i <= node->numports; i++) { if (i == mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F)) continue; - if (get_port_info(ibmad_port, fabric, + if (get_port_info(ctx->ibmad_port, fabric, &port_buf, i, path)) { IBND_ERROR ("can't reach node %s port %d", @@ -636,7 +663,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, IB_NODE_PORT_GUID_F); } - if (get_remote_node(ibmad_port, fabric, node, + if (get_remote_node(ctx, fabric, node, port, path, i, dist) < 0) goto error; } @@ -703,9 +730,14 @@ void ibnd_debug(int i) } } -void ibnd_show_progress(int i) +int ibnd_show_progress(ibnd_ctx_t * ctx, int i) { - show_progress = i; + int rc = 0; + if (check_ctx(ctx)) + return (-1); + rc = ctx->show_progress; + ctx->show_progress = i; + return (rc); } void ibnd_iter_nodes(ibnd_fabric_t * fabric, ibnd_iter_node_func_t func, diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index 21ff476..b989b68 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ 
b/infiniband-diags/libibnetdisc/src/internal.h @@ -50,4 +50,9 @@ /* HASH table defines */ #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) +struct ibnd_ctx { + struct ibmad_port *ibmad_port; + int show_progress; +}; + #endif /* _INTERNAL_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/libibnetdisc.map b/infiniband-diags/libibnetdisc/src/libibnetdisc.map index bd108ab..56560ec 100644 --- a/infiniband-diags/libibnetdisc/src/libibnetdisc.map +++ b/infiniband-diags/libibnetdisc/src/libibnetdisc.map @@ -2,6 +2,8 @@ IBNETDISC_1.0 { global: ibnd_debug; ibnd_show_progress; + ibnd_create_ctx; + ibnd_destroy_ctx; ibnd_discover_fabric; ibnd_destroy_fabric; ibnd_find_node_guid; diff --git a/infiniband-diags/libibnetdisc/test/testleaks.c b/infiniband-diags/libibnetdisc/test/testleaks.c index cb5651e..b121bdd 100644 --- a/infiniband-diags/libibnetdisc/test/testleaks.c +++ b/infiniband-diags/libibnetdisc/test/testleaks.c @@ -87,6 +87,7 @@ int main(int argc, char **argv) int hops = 0; ib_portid_t port_id; int iters = -1; + ibnd_ctx_t *ctx = NULL; struct ibmad_port *ibmad_port; int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS }; @@ -156,11 +157,12 @@ int main(int argc, char **argv) mad_rpc_set_timeout(ibmad_port, timeout_ms); + ctx = ibnd_create_ctx(ibmad_port); while (iters == -1 || iters-- > 0) { if (from) { /* only scan part of the fabric */ str2drpath(&(port_id.drpath), from, 0, 0); - if ((fabric = ibnd_discover_fabric(ibmad_port, + if ((fabric = ibnd_discover_fabric(ctx, &port_id, hops)) == NULL) { fprintf(stderr, "discover failed\n"); @@ -170,7 +172,7 @@ int main(int argc, char **argv) guid = 0; } else { if ((fabric = - ibnd_discover_fabric(ibmad_port, NULL, + ibnd_discover_fabric(ctx, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); rc = 1; @@ -182,6 +184,7 @@ int main(int argc, char **argv) } close_port: + ibnd_destroy_ctx(ctx); mad_rpc_close_port(ibmad_port); exit(rc); } diff --git 
a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index 29c4352..f14c6c3 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -274,6 +274,7 @@ int main(int argc, char **argv) int rc = 0; int resolved = -1; ibnd_fabric_t *fabric = NULL; + ibnd_ctx_t *ctx = NULL; struct ibmad_port *ibmad_port; ib_portid_t port_id = { 0 }; int mgmt_classes[3] = @@ -323,6 +324,8 @@ int main(int argc, char **argv) node_name_map = open_node_name_map(node_name_map_file); + ctx = ibnd_create_ctx(ibmad_port); + if (dr_path) { /* only scan part of the fabric */ if ((resolved = @@ -340,14 +343,12 @@ int main(int argc, char **argv) } if (resolved >= 0) - if ((fabric = ibnd_discover_fabric(ibmad_port, &port_id, - hops)) == NULL) - IBWARN - ("Single node discover failed; attempting full scan\n"); + if ((fabric = ibnd_discover_fabric(ctx, &port_id, + hops)) == NULL) + IBWARN("Single node discover failed; attempting full scan\n"); if (!fabric) - if ((fabric = - ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL) { + if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); rc = 1; goto close_port; @@ -381,6 +382,7 @@ int main(int argc, char **argv) ibnd_destroy_fabric(fabric); close_port: + ibnd_destroy_ctx(ctx); close_node_name_map(node_name_map); mad_rpc_close_port(ibmad_port); exit(rc); diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 2aa29c8..7811976 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -65,6 +65,7 @@ static char *node_name_map_file = NULL; static nn_map_t *node_name_map = NULL; static int report_max_hops = 0; +static int show_progress = 0; /** * Define our own conversion functions to maintain compatibility with the old @@ -616,7 +617,7 @@ static int process_opt(void *context, int ch, char *optarg) node_name_map_file = strdup(optarg); break; case 's': - ibnd_show_progress(1); + 
show_progress = 1; break; case 'l': list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE; @@ -649,6 +650,7 @@ static int process_opt(void *context, int ch, char *optarg) int main(int argc, char **argv) { ibnd_fabric_t *fabric = NULL; + ibnd_ctx_t *ctx = NULL; struct ibmad_port *ibmad_port; int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS }; @@ -690,8 +692,14 @@ int main(int argc, char **argv) IBERROR("can't open file %s for writing", argv[0]); node_name_map = open_node_name_map(node_name_map_file); + ctx = ibnd_create_ctx(ibmad_port); - if ((fabric = ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL) + if (!ctx) + IBERROR("failed to create libibnetdisc context\n"); + + ibnd_show_progress(ctx, show_progress); + + if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL) IBERROR("discover failed\n"); if (ports_report) @@ -702,6 +710,7 @@ int main(int argc, char **argv) dump_topology(group, fabric); ibnd_destroy_fabric(fabric); + ibnd_destroy_ctx(ctx); close_node_name_map(node_name_map); mad_rpc_close_port(ibmad_port); exit(0); diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c index f73ca6f..0e4747c 100644 --- a/infiniband-diags/src/ibqueryerrors.c +++ b/infiniband-diags/src/ibqueryerrors.c @@ -388,6 +388,7 @@ int main(int argc, char **argv) ib_portid_t portid = { 0 }; int rc = 0; ibnd_fabric_t *fabric = NULL; + ibnd_ctx_t *ctx = NULL; int mgmt_classes[4] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS @@ -431,6 +432,8 @@ int main(int argc, char **argv) node_name_map = open_node_name_map(node_name_map_file); + ctx = ibnd_create_ctx(ibmad_port); + /* limit the scan the fabric around the target */ if (dr_path) { if ((resolved = @@ -448,14 +451,12 @@ int main(int argc, char **argv) } if (resolved >= 0) - if ((fabric = ibnd_discover_fabric(ibmad_port, &portid, - 0)) == NULL) - IBWARN - ("Single node discover failed; attempting full scan\n"); - - if (!fabric) /* do a full scan */ - if 
((fabric = - ibnd_discover_fabric(ibmad_port, NULL, -1)) == NULL) { + if ((fabric = ibnd_discover_fabric(ctx, &portid, + 0)) == NULL) + IBWARN("Single node discover failed; attempting full scan\n"); + + if (!fabric) /* do a full scan */ + if ((fabric = ibnd_discover_fabric(ctx, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); rc = 1; goto close_port; @@ -490,6 +491,7 @@ int main(int argc, char **argv) ibnd_destroy_fabric(fabric); close_port: + ibnd_destroy_ctx(ctx); mad_rpc_close_port(ibmad_port); close_node_name_map(node_name_map); exit(rc);
-- 1.5.4.5

From weiny2 at llnl.gov Mon Aug 17 14:03:44 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 17 Aug 2009 14:03:44 -0700
Subject: [ofa-general] Re: [PATCH 5/5 v2] infiniband-diags/libibnetdisc: remove members of the fabric struct which are used in the scan only.
In-Reply-To: <20090813204316.c6ce0de3.weiny2@llnl.gov>
References: <20090813204316.c6ce0de3.weiny2@llnl.gov>
Message-ID: <20090817140344.3ce003c4.weiny2@llnl.gov>

From: Ira Weiny
Date: Mon, 17 Aug 2009 13:16:51 -0700
Subject: [PATCH] infiniband-diags/libibnetdisc: remove members of the fabric struct which are used in the scan only.

There is no need to have these be in the public interface. They can cause confusion about which variables present the information to the user.
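As a rough illustration of the idea (not the code in this patch), the sketch below keeps bookkeeping that is only needed during a scan in a private struct, so the struct handed back to the caller carries results only. All demo_* names and DEMO_MAXHOPS are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* All demo_* names are hypothetical; none of this is libibnetdisc code. */
#define DEMO_MAXHOPS 4

struct demo_node {
	int id;
	struct demo_node *next;		/* public "all nodes" list */
	struct demo_node *dnext;	/* per-distance list, used while scanning */
};

/* Public result: only what the caller should read. */
struct demo_fabric {
	struct demo_node *nodes;
	int maxhops_discovered;
};

/* Private scan state: lives only for the duration of the scan. */
struct demo_scan {
	struct demo_node *nodesdist[DEMO_MAXHOPS + 1];
};

void demo_destroy(struct demo_fabric *fabric)
{
	struct demo_node *node, *next;

	if (!fabric)
		return;
	for (node = fabric->nodes; node; node = next) {
		next = node->next;
		free(node);
	}
	free(fabric);
}

/* Record a node in both the public list and the private distance table. */
static int demo_add_node(struct demo_scan *scan, struct demo_fabric *fabric,
			 int id, int dist)
{
	struct demo_node *node;

	assert(scan && fabric);
	if (dist < 0 || dist > DEMO_MAXHOPS)
		return -1;
	node = calloc(1, sizeof *node);
	if (!node)
		return -1;
	node->id = id;
	node->next = fabric->nodes;
	fabric->nodes = node;
	node->dnext = scan->nodesdist[dist];
	scan->nodesdist[dist] = node;
	if (dist > fabric->maxhops_discovered)
		fabric->maxhops_discovered = dist;
	return 0;
}

/* Pretend scan: records node 1 at distance 0 and node 2 at distance 1. */
struct demo_fabric *demo_discover(void)
{
	struct demo_scan scan;	/* on the stack, never handed to the caller */
	struct demo_fabric *fabric = calloc(1, sizeof *fabric);

	if (!fabric)
		return NULL;
	memset(&scan, 0, sizeof scan);
	if (demo_add_node(&scan, fabric, 1, 0) < 0 ||
	    demo_add_node(&scan, fabric, 2, 1) < 0) {
		demo_destroy(fabric);
		return NULL;
	}
	return fabric;
}
```

With this split, a caller of demo_discover() cannot mistake the distance table for a result, since it is gone by the time the function returns.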
Adjusted to apply to v2 of "libibnetdisc: make all fields of ibnd_fabric_t public" Signed-off-by: Ira Weiny --- .../libibnetdisc/include/infiniband/ibnetdisc.h | 6 -- infiniband-diags/libibnetdisc/src/chassis.c | 52 +++++++------- infiniband-diags/libibnetdisc/src/chassis.h | 2 +- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 80 ++++++++++++-------- infiniband-diags/libibnetdisc/src/internal.h | 13 +++ 5 files changed, 88 insertions(+), 65 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index ce1c74f..51a35a3 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -127,7 +127,6 @@ typedef struct ibnd_chassis { } ibnd_chassis_t; #define HTSZ 137 -#define MAXHOPS 63 /** ========================================================================= * Fabric @@ -148,14 +147,9 @@ typedef struct ibnd_fabric { /* internal use only */ ibnd_node_t *nodestbl[HTSZ]; ibnd_port_t *portstbl[HTSZ]; - ibnd_node_t *nodesdist[MAXHOPS + 1]; - ibnd_chassis_t *first_chassis; - ibnd_chassis_t *current_chassis; - ibnd_chassis_t *last_chassis; ibnd_node_t *switches; ibnd_node_t *ch_adapters; ibnd_node_t *routers; - ib_portid_t selfportid; } ibnd_fabric_t; /** ========================================================================= diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c index 4886cfc..d11d7df 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.c +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -96,7 +96,7 @@ static ibnd_chassis_t *find_chassisnum(ibnd_fabric_t * fabric, { ibnd_chassis_t *current; - for (current = fabric->first_chassis; current; current = current->next) { + for (current = fabric->chassis; current; current = current->next) { if (current->chassisnum == chassisnum) return current; } @@ -207,14 +207,14 @@ static uint64_t 
get_chassisguid(ibnd_node_t * node) return sysimgguid; } -static ibnd_chassis_t *find_chassisguid(ibnd_fabric_t * fabric, +static ibnd_chassis_t *find_chassisguid(struct ibnd_chassis_ctx *ch_ctx, ibnd_node_t * node) { ibnd_chassis_t *current; uint64_t chguid; chguid = get_chassisguid(node); - for (current = fabric->first_chassis; current; current = current->next) { + for (current = ch_ctx->first_chassis; current; current = current->next) { if (current->chassisguid == chguid) return current; } @@ -782,19 +782,19 @@ static void voltaire_portmap(ibnd_port_t * port) port->ext_portnum = int2ext_map_slb8[chipnum][portnum]; } -static int add_chassis(ibnd_fabric_t * fabric) +static int add_chassis(struct ibnd_chassis_ctx *ch_ctx) { - if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) { + if (!(ch_ctx->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) { IBND_ERROR("OOM: failed to allocate chassis object\n"); return (-1); } - if (fabric->first_chassis == NULL) { - fabric->first_chassis = fabric->current_chassis; - fabric->last_chassis = fabric->current_chassis; + if (ch_ctx->first_chassis == NULL) { + ch_ctx->first_chassis = ch_ctx->current_chassis; + ch_ctx->last_chassis = ch_ctx->current_chassis; } else { - fabric->last_chassis->next = fabric->current_chassis; - fabric->last_chassis = fabric->current_chassis; + ch_ctx->last_chassis->next = ch_ctx->current_chassis; + ch_ctx->last_chassis = ch_ctx->current_chassis; } return (0); } @@ -818,22 +818,22 @@ static void add_node_to_chassis(ibnd_chassis_t * chassis, ibnd_node_t * node) Returns: 0 on success, -1 on failure */ -int group_nodes(ibnd_fabric_t * fabric) +int group_nodes(struct ibnd_scan_ctx *scan_ctx, ibnd_fabric_t * fabric) { ibnd_node_t *node; int dist; int chassisnum = 0; ibnd_chassis_t *chassis; + struct ibnd_chassis_ctx ch_ctx; - fabric->first_chassis = NULL; - fabric->current_chassis = NULL; + memset(&ch_ctx, 0, sizeof ch_ctx); /* first pass on switches and build for every Voltaire node */ /* 
an appropriate chassis record (slotnum and position) */ /* according to internal connectivity */ /* not very efficient but clear code so... */ for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { - for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) if (fill_voltaire_chassis_record(node)) @@ -844,7 +844,7 @@ int group_nodes(ibnd_fabric_t * fabric) /* separate every Voltaire chassis from each other and build linked list of them */ /* algorithm: catch spine and find all surrounding nodes */ for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { - for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) != VTR_VENDOR_ID) continue; @@ -852,10 +852,10 @@ int group_nodes(ibnd_fabric_t * fabric) || (node->chassis && node->chassis->chassisnum) || !is_spine(node)) continue; - if (add_chassis(fabric)) + if (add_chassis(&ch_ctx)) return (-1); - fabric->current_chassis->chassisnum = ++chassisnum; - if (build_chassis(node, fabric->current_chassis)) + ch_ctx.current_chassis->chassisnum = ++chassisnum; + if (build_chassis(node, ch_ctx.current_chassis)) return (-1); } } @@ -863,25 +863,25 @@ int group_nodes(ibnd_fabric_t * fabric) /* now make pass on nodes for chassis which are not Voltaire */ /* grouped by common SystemImageGUID */ for (dist = 0; dist <= fabric->maxhops_discovered; dist++) { - for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) continue; if (mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F)) { chassis = - find_chassisguid(fabric, + find_chassisguid(&ch_ctx, (ibnd_node_t *) node); if (chassis) chassis->nodecount++; 
else { /* Possible new chassis */ - if (add_chassis(fabric)) + if (add_chassis(&ch_ctx)) return (-1); - fabric->current_chassis->chassisguid = + ch_ctx.current_chassis->chassisguid = get_chassisguid((ibnd_node_t *) node); - fabric->current_chassis->nodecount = 1; + ch_ctx.current_chassis->nodecount = 1; } } } @@ -890,14 +890,14 @@ int group_nodes(ibnd_fabric_t * fabric) /* now, make another pass to see which nodes are part of chassis */ /* (defined as chassis->nodecount > 1) */ for (dist = 0; dist <= MAXHOPS;) { - for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + for (node = scan_ctx->nodesdist[dist]; node; node = node->dnext) { if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) continue; if (mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F)) { chassis = - find_chassisguid(fabric, + find_chassisguid(&ch_ctx, (ibnd_node_t *) node); if (chassis && chassis->nodecount > 1) { if (!chassis->chassisnum) @@ -918,6 +918,6 @@ int group_nodes(ibnd_fabric_t * fabric) dist++; } - fabric->chassis = fabric->first_chassis; + fabric->chassis = ch_ctx.first_chassis; return (0); } diff --git a/infiniband-diags/libibnetdisc/src/chassis.h b/infiniband-diags/libibnetdisc/src/chassis.h index 2191046..707140c 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.h +++ b/infiniband-diags/libibnetdisc/src/chassis.h @@ -82,6 +82,6 @@ enum ibnd_chassis_type { }; enum ibnd_chassis_slot_type { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS }; -int group_nodes(struct ibnd_fabric *fabric); +int group_nodes(struct ibnd_scan_ctx *scan_ctx, struct ibnd_fabric *fabric); #endif /* _CHASSIS_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 7295189..84aac0a 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -189,21 +189,27 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport) return path->cnt; } -static int 
extend_dpath(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric, +static int extend_dpath(struct ibnd_scan_ctx *scan_ctx, + struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric, ib_portid_t * portid, int nextport) { int rc = 0; if (portid->lid) { + if (!scan_ctx) { + IBND_ERROR("Invalid internal scan state"); + return (-1); + } /* If we were LID routed we need to set up the drslid */ - if (!fabric->selfportid.lid) - if (ib_resolve_self_via(&fabric->selfportid, NULL, NULL, - ibmad_port) < 0) { + if (!scan_ctx->selfportid.lid) + if (ib_resolve_self_via + (&scan_ctx->selfportid, NULL, NULL, + ibmad_port) < 0) { IBND_ERROR("Failed to resolve self\n"); return -1; } - portid->drpath.drslid = (uint16_t) fabric->selfportid.lid; + portid->drpath.drslid = (uint16_t) scan_ctx->selfportid.lid; portid->drpath.drdlid = 0xFFFF; } @@ -408,19 +414,25 @@ static void add_to_type_list(ibnd_node_t * node, ibnd_fabric_t * fabric) } } -static void add_to_nodedist(ibnd_node_t * node, ibnd_fabric_t * fabric) +static void add_to_nodedist(ibnd_node_t * node, struct ibnd_scan_ctx *scan_ctx) { int dist = node->dist; + + if (!scan_ctx) { + IBND_ERROR("Invalid internal scan state"); + return; + } + if (node->type != IB_NODE_SWITCH) dist = MAXHOPS; /* special Ca list */ - node->dnext = fabric->nodesdist[dist]; - fabric->nodesdist[dist] = node; + node->dnext = scan_ctx->nodesdist[dist]; + scan_ctx->nodesdist[dist] = node; } -static ibnd_node_t *create_node(ibnd_fabric_t * fabric, - ibnd_node_t * temp, ib_portid_t * path, - int dist) +static ibnd_node_t *create_node(struct ibnd_scan_ctx *scan_ctx, + ibnd_fabric_t * fabric, ibnd_node_t * temp, + ib_portid_t * path, int dist) { ibnd_node_t *node; @@ -441,7 +453,7 @@ static ibnd_node_t *create_node(ibnd_fabric_t * fabric, fabric->nodes = (ibnd_node_t *) node; add_to_type_list(node, fabric); - add_to_nodedist(node, fabric); + add_to_nodedist(node, scan_ctx); return node; } @@ -500,9 +512,10 @@ static void link_ports(ibnd_node_t * node, ibnd_port_t 
* port, remoteport->remoteport = (ibnd_port_t *) port; } -static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, - ibnd_node_t * node, ibnd_port_t * port, - ib_portid_t * path, int portnum, int dist) +static int get_remote_node(ibnd_ctx_t * ctx, struct ibnd_scan_ctx *scan_ctx, + ibnd_fabric_t * fabric, ibnd_node_t * node, + ibnd_port_t * port, ib_portid_t * path, + int portnum, int dist) { int rc = 0; struct ibmad_port *ibmad_port = ctx->ibmad_port; @@ -521,7 +534,7 @@ static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, != IB_PORT_PHYS_STATE_LINKUP) return 1; /* positive == non-fatal error */ - if (extend_dpath(ibmad_port, fabric, path, portnum) < 0) + if (extend_dpath(scan_ctx, ibmad_port, fabric, path, portnum) < 0) return -1; if (query_node(ibmad_port, fabric, &node_buf, &port_buf, path)) { @@ -534,7 +547,9 @@ static int get_remote_node(ibnd_ctx_t * ctx, ibnd_fabric_t * fabric, oldnode = find_existing_node(fabric, &node_buf); if (oldnode) remotenode = oldnode; - else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1))) { + else if (! + (remotenode = + create_node(scan_ctx, fabric, &node_buf, path, dist + 1))) { rc = -1; goto error; } @@ -574,10 +589,13 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, int dist = 0; ib_portid_t *path; int max_hops = MAXHOPS - 1; /* default find everything */ + struct ibnd_scan_ctx scan_ctx; if (check_ctx(ctx) < 0) return (NULL); + memset(&scan_ctx, 0, sizeof scan_ctx); + /* if not everything how much? 
*/ if (hops >= 0) { max_hops = hops; @@ -606,7 +624,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, goto error; } - node = create_node(fabric, &node_buf, from, 0); + node = create_node(&scan_ctx, fabric, &node_buf, from, 0); if (!node) goto error; @@ -616,7 +634,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, if (!port) goto error; - rc = get_remote_node(ctx, fabric, node, port, from, + rc = get_remote_node(ctx, &scan_ctx, fabric, node, port, from, mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F), 0); if (rc < 0) @@ -626,7 +644,7 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, for (dist = 0; dist <= max_hops; dist++) { - for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + for (node = scan_ctx.nodesdist[dist]; node; node = node->dnext) { path = &node->path_portid; @@ -663,14 +681,15 @@ ibnd_fabric_t *ibnd_discover_fabric(ibnd_ctx_t * ctx, IB_NODE_PORT_GUID_F); } - if (get_remote_node(ctx, fabric, node, - port, path, i, dist) < 0) + if (get_remote_node + (ctx, &scan_ctx, fabric, node, port, path, + i, dist) < 0) goto error; } } } - if (group_nodes(fabric)) + if (group_nodes(&scan_ctx, fabric)) goto error; return ((ibnd_fabric_t *) fabric); @@ -692,7 +711,6 @@ static void destroy_node(ibnd_node_t * node) void ibnd_destroy_fabric(ibnd_fabric_t * fabric) { - int dist = 0; ibnd_node_t *node = NULL; ibnd_node_t *next = NULL; ibnd_chassis_t *ch, *ch_next; @@ -700,19 +718,17 @@ void ibnd_destroy_fabric(ibnd_fabric_t * fabric) if (!fabric) return; - ch = fabric->first_chassis; + ch = fabric->chassis; while (ch) { ch_next = ch->next; free(ch); ch = ch_next; } - for (dist = 0; dist <= MAXHOPS; dist++) { - node = fabric->nodesdist[dist]; - while (node) { - next = node->dnext; - destroy_node(node); - node = next; - } + node = fabric->nodes; + while (node) { + next = node->next; + destroy_node(node); + node = next; } free(fabric); } diff --git a/infiniband-diags/libibnetdisc/src/internal.h 
b/infiniband-diags/libibnetdisc/src/internal.h index b989b68..c866b12 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -50,6 +50,19 @@ /* HASH table defines */ #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) +#define MAXHOPS 63 + +struct ibnd_chassis_ctx { + ibnd_chassis_t *first_chassis; + ibnd_chassis_t *current_chassis; + ibnd_chassis_t *last_chassis; +}; + +struct ibnd_scan_ctx { + ibnd_node_t *nodesdist[MAXHOPS + 1]; + ib_portid_t selfportid; +}; + struct ibnd_ctx { struct ibmad_port *ibmad_port; int show_progress; -- 1.5.4.5 From nmehrotra at riorey.com Mon Aug 17 15:44:23 2009 From: nmehrotra at riorey.com (Nitin Mehrotra) Date: Mon, 17 Aug 2009 18:44:23 -0400 Subject: [ofa-general] What does IBV_WC_REM_OP_ERR after a verb send indicate? In-Reply-To: <9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com> References: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com> <1048182029.3001250276541490.JavaMail.root@zmail.riorey.com> <9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com> Message-ID: <4A89DD47.5030902@riorey.com> Folks, I am getting this error on a verb send operation and I can't figure out what could be the cause; I searched for all instances of this error in the IB code and while I found 4, none was illuminating. As a background, we are developing an IB application that uses RDMA for connection set up and the verbs interface for data transfer. We have tested the two ends as user space applications and they work - they can connect and exchange data. We are now converting the server end into a kernel module and this error is being encountered on the client when it posts a send to the RDMA connected QP. I have verified that the connection is setup and that recv WR with buffers are posted on the QP. Could it be a protection domain problem? 
Because we have multiple clients that connect to the one server, we create the PD on the rdma_id that is used to connect to the server, not on the one that the connection event gives us. Could that be the problem? We assume that the PD is tied to the IB device, and there is only one physical IB port in the system. If that is the problem, then why does this work in the userspace version and fail in the module version? I would appreciate any pointers on this. Thanks, Nitin From vlad at lists.openfabrics.org Tue Aug 18 03:00:15 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 18 Aug 2009 03:00:15 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090818-0200 daily build status Message-ID: <20090818100015.79298E61B8A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log:
/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 
2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090818-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From juliav at voltaire.com Tue Aug 18 03:28:17 2009 From: juliav at voltaire.com (Julia Volynsky) Date: Tue, 18 Aug 2009 13:28:17 +0300 Subject: [ofa-general] question about management/libibmad/include/infiniband/mad.h Message-ID: <39C75744D164D948A170E9792AF8E7CA027F628C@exil.voltaire.com> 
Hello, all. I have a question about the management/libibmad/include/infiniband/mad.h file. Why is the variable ibdebug defined without extern in a header file? (MAD_EXPORT int ibdebug;) In doc/libibmad.txt, ibdebug is documented as extern int ibdebug; Thank you. Julia. From jon at opengridcomputing.com Tue Aug 18 12:12:03 2009 From: jon at opengridcomputing.com (Jon Mason) Date: Tue, 18 Aug 2009 14:12:03 -0500 Subject: [ofa-general] [PATCH] krping: Add support for fast_reg_mr with dma_local_lkey Message-ID: <20090818191203.GB20947@opengridcomputing.com> For devices that do not support reg_phys_mr (like mlx4), an alternative is needed to use krping over fast_reg_mr. In place of reg_phys_mr, use dma_local_lkey (previously called stag0). This patch renames the relevant pieces, adds support in the fastreg case for dma_local_lkey, and adds debug code in the completion queue handler for unexpected errors. Signed-Off-By: Jon Mason diff --git a/README b/README index cfdd771..b5f251f 100644 --- a/README +++ b/README @@ -1,5 +1,5 @@ Kernel Mode RDMA Ping Module - Steve Wise - 6/2008 + Steve Wise - 8/2009 ============ Introduction @@ -137,8 +137,8 @@ server_inv none Valid only in fastreg mode, this client's fastreg mr via SEND_WITH_INVALIDATE messages from the server. -stag0 none Use lkey 0 for source of writes and - sends, and in recvs +local_dma_lkey none Use the local dma lkey for the source + of writes and sends, and in recvs read_inv none Server will use READ_WITH_INV. Only valid in fastreg mem_mode. diff --git a/krping.c b/krping.c index 7f50cf5..5f6e893 100644 --- a/krping.c +++ b/krping.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Ammasso, Inc. All rights reserved. - * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * Copyright (c) 2006-2009 Open Grid Computing, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses.
You may choose to be licensed under the terms of the GNU @@ -89,7 +89,7 @@ static const struct krping_option krping_opts[] = { {"duplex", OPT_NOPARAM, 'd'}, {"txdepth", OPT_INT, 'T'}, {"poll", OPT_NOPARAM, 'P'}, - {"stag0", OPT_NOPARAM, 'Z'}, + {"local_dma_lkey", OPT_NOPARAM, 'Z'}, {"read_inv", OPT_NOPARAM, 'R'}, {NULL, 0, 0} }; @@ -239,11 +239,11 @@ struct krping_cb { int duplex; /* run bw full duplex test */ int poll; /* poll or block for rlat test */ int txdepth; /* SQ depth */ - int stag0; /* use 0 for lkey */ + int local_dma_lkey; /* use the device's local dma lkey */ /* CM stuff */ struct rdma_cm_id *cm_id; /* connection on client side,*/ - /* listener on service side. */ + /* listener on server side. */ struct rdma_cm_id *child_cm_id; /* connection on server side */ struct list_head list; }; @@ -376,9 +376,9 @@ static void krping_cq_event_handler(struct ib_cq *cq, void *ctx) DEBUG_LOG("cq flushed\n"); continue; } else { - printk(KERN_ERR PFX - "cq completion failed status %d\n", - wc.status); + printk(KERN_ERR PFX "cq completion failed with " + "wr_id %llx status %d opcode %d vendor_err %x\n", + (unsigned long long)wc.wr_id, wc.status, wc.opcode, wc.vendor_err); goto error; } } @@ -429,8 +429,16 @@ static void krping_cq_event_handler(struct ib_cq *cq, void *ctx) wake_up_interruptible(&cb->sem); break; + case IB_WC_LOCAL_INV: + case IB_WC_FAST_REG_MR: + printk(KERN_ERR PFX + "%s:%d Unexpected opcode %d, most likely unsignalled\n", + __func__, __LINE__, wc.opcode); + break; default: - DEBUG_LOG("unknown!!!!!
completion\n"); + printk(KERN_ERR PFX + "%s:%d Unexpected opcode %d, Shutting down\n", + __func__, __LINE__, wc.opcode); goto error; } } @@ -476,8 +484,8 @@ static void krping_setup_wr(struct krping_cb *cb) { cb->recv_sgl.addr = cb->recv_dma_addr; cb->recv_sgl.length = sizeof cb->recv_buf; - if (cb->stag0) - cb->recv_sgl.lkey = 0; + if (cb->local_dma_lkey) + cb->recv_sgl.lkey = cb->qp->device->local_dma_lkey; else if (cb->mem == DMA) cb->recv_sgl.lkey = cb->dma_mr->lkey; else @@ -487,8 +495,8 @@ static void krping_setup_wr(struct krping_cb *cb) cb->send_sgl.addr = cb->send_dma_addr; cb->send_sgl.length = sizeof cb->send_buf; - if (cb->stag0) - cb->send_sgl.lkey = 0; + if (cb->local_dma_lkey) + cb->send_sgl.lkey = cb->qp->device->local_dma_lkey; else if (cb->mem == DMA) cb->send_sgl.lkey = cb->dma_mr->lkey; else @@ -560,34 +568,35 @@ static int krping_setup_buffers(struct krping_cb *cb) goto bail; } } else { + if (!cb->local_dma_lkey) { + buf.addr = cb->recv_dma_addr; + buf.size = sizeof cb->recv_buf; + DEBUG_LOG(PFX "recv buf dma_addr %llx size %d\n", buf.addr, + (int)buf.size); + iovbase = cb->recv_dma_addr; + cb->recv_mr = ib_reg_phys_mr(cb->pd, &buf, 1, + IB_ACCESS_LOCAL_WRITE, + &iovbase); + + if (IS_ERR(cb->recv_mr)) { + DEBUG_LOG(PFX "recv_buf reg_mr failed\n"); + ret = PTR_ERR(cb->recv_mr); + goto bail; + } - buf.addr = cb->recv_dma_addr; - buf.size = sizeof cb->recv_buf; - DEBUG_LOG(PFX "recv buf dma_addr %llx size %d\n", buf.addr, - (int)buf.size); - iovbase = cb->recv_dma_addr; - cb->recv_mr = ib_reg_phys_mr(cb->pd, &buf, 1, - IB_ACCESS_LOCAL_WRITE, - &iovbase); - - if (IS_ERR(cb->recv_mr)) { - DEBUG_LOG(PFX "recv_buf reg_mr failed\n"); - ret = PTR_ERR(cb->recv_mr); - goto bail; - } - - buf.addr = cb->send_dma_addr; - buf.size = sizeof cb->send_buf; - DEBUG_LOG(PFX "send buf dma_addr %llx size %d\n", buf.addr, - (int)buf.size); - iovbase = cb->send_dma_addr; - cb->send_mr = ib_reg_phys_mr(cb->pd, &buf, 1, - 0, &iovbase); - - if (IS_ERR(cb->send_mr)) { -
DEBUG_LOG(PFX "send_buf reg_mr failed\n"); - ret = PTR_ERR(cb->send_mr); - goto bail; + buf.addr = cb->send_dma_addr; + buf.size = sizeof cb->send_buf; + DEBUG_LOG(PFX "send buf dma_addr %llx size %d\n", buf.addr, + (int)buf.size); + iovbase = cb->send_dma_addr; + cb->send_mr = ib_reg_phys_mr(cb->pd, &buf, 1, + 0, &iovbase); + + if (IS_ERR(cb->send_mr)) { + DEBUG_LOG(PFX "send_buf reg_mr failed\n"); + ret = PTR_ERR(cb->send_mr); + goto bail; + } } } @@ -921,6 +930,7 @@ static u32 krping_rdma_rkey(struct krping_cb *cb, u64 buf, int post_inv) rkey = cb->dma_mr->rkey; break; default: + printk(KERN_ERR PFX "%s:%d case ERROR\n", __func__, __LINE__); cb->state = ERROR; break; } @@ -1040,8 +1050,8 @@ static void krping_test_server(struct krping_cb *cb) cb->rdma_sq_wr.wr.rdma.rkey = cb->remote_rkey; cb->rdma_sq_wr.wr.rdma.remote_addr = cb->remote_addr; cb->rdma_sq_wr.sg_list->length = strlen(cb->rdma_buf) + 1; - if (cb->stag0) - cb->rdma_sgl.lkey = 0; + if (cb->local_dma_lkey) + cb->rdma_sgl.lkey = cb->qp->device->local_dma_lkey; else cb->rdma_sgl.lkey = krping_rdma_rkey(cb, cb->rdma_dma_addr, 0); @@ -2087,8 +2097,8 @@ int krping_doit(char *cmd) DEBUG_LOG("txdepth %d\n", (int) cb->txdepth); break; case 'Z': - cb->stag0 = 1; - DEBUG_LOG("using stag 0 for lkeys\n"); + cb->local_dma_lkey = 1; + DEBUG_LOG("using local dma lkey\n"); break; case 'R': cb->read_inv = 1; From arlin.r.davis at intel.com Tue Aug 18 12:33:59 2009 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 18 Aug 2009 12:33:59 -0700 Subject: [ofa-general] [PATCH] uDAPL v2 ucm: add new provider using a DAPL based IB-UD cm mechanism for MPI implementations. Message-ID: The new provider uses its own CM protocol on top of IB-UD queue pairs. During device open, this provider creates a UD queue pair and returns local address information via dat_ia_query. This 24-byte opaque address must be exchanged out-of-band before connecting to a server via dat_ep_connect.
This provider is targeted at MPI implementations that already exchange address information during the boot/init phase. dtest, dtestx, and dtestcm were modified to report the lid and qpn information on the server side so that you can provide the appropriate destination address information to the client test suite. Dapltest will not work with this provider. Signed-off-by: Arlin Davis --- Makefile.am | 148 +++- dapl/openib_cma/dapl_ib_util.h | 7 +- dapl/openib_common/dapl_ib_common.h | 136 ++- dapl/openib_common/qp.c | 336 ++++--- dapl/openib_scm/cm.c | 553 +++++------ dapl/openib_scm/dapl_ib_util.h | 17 +- dapl/openib_ucm/README | 40 + dapl/openib_ucm/SOURCES | 53 + dapl/openib_ucm/cm.c | 1837 ++++++++++++++++++++++++++++++++++ dapl/openib_ucm/dapl_ib_util.h | 119 +++ dapl/openib_ucm/device.c | 603 +++++++++++ dapl/openib_ucm/linux/openib_osd.h | 21 + dapl/openib_ucm/udapl.rc | 48 + dapl/openib_ucm/windows/openib_osd.h | 35 + test/dtest/dtest.c | 177 +++- test/dtest/dtestcm.c | 146 ++- test/dtest/dtestx.c | 88 ++- 17 files changed, 3782 insertions(+), 582 deletions(-) create mode 100644 dapl/openib_ucm/README create mode 100644 dapl/openib_ucm/SOURCES create mode 100644 dapl/openib_ucm/cm.c create mode 100644 dapl/openib_ucm/dapl_ib_util.h create mode 100644 dapl/openib_ucm/device.c create mode 100644 dapl/openib_ucm/linux/openib_osd.h create mode 100644 dapl/openib_ucm/udapl.rc create mode 100644 dapl/openib_ucm/windows/openib_osd.h diff --git a/Makefile.am b/Makefile.am index 6842c05..1fe71d9 100755 --- a/Makefile.am +++ b/Makefile.am @@ -17,12 +17,10 @@ endif if EXT_TYPE_IB XFLAGS = -DDAT_EXTENSIONS -XPROGRAMS_CMA = dapl/openib_common/ib_extensions.c -XPROGRAMS_SCM = dapl/openib_common/ib_extensions.c +XPROGRAMS = dapl/openib_common/ib_extensions.c else XFLAGS = -XPROGRAMS_CMA = -XPROGRAMS_SCM = +XPROGRAMS = endif if DEBUG @@ -34,10 +32,12 @@ endif datlibdir = $(libdir) dapllibofadir = $(libdir) daplliboscmdir = $(libdir) +daplliboucmdir = $(libdir) datlib_LTLIBRARIES =
dat/udat/libdat2.la dapllibofa_LTLIBRARIES = dapl/udapl/libdaplofa.la daplliboscm_LTLIBRARIES = dapl/udapl/libdaploscm.la +daplliboucm_LTLIBRARIES = dapl/udapl/libdaploucm.la dat_udat_libdat2_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \ -I$(srcdir)/dat/include/ -I$(srcdir)/dat/udat/ \ @@ -59,14 +59,24 @@ dapl_udapl_libdaploscm_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAG -I$(srcdir)/dapl/openib_scm \ -I$(srcdir)/dapl/openib_scm/linux +dapl_udapl_libdaploucm_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \ + -DOPENIB -DCQ_WAIT_OBJECT \ + -I$(srcdir)/dat/include/ -I$(srcdir)/dapl/include/ \ + -I$(srcdir)/dapl/common -I$(srcdir)/dapl/udapl/linux \ + -I$(srcdir)/dapl/openib_common \ + -I$(srcdir)/dapl/openib_ucm \ + -I$(srcdir)/dapl/openib_ucm/linux + if HAVE_LD_VERSION_SCRIPT dat_version_script = -Wl,--version-script=$(srcdir)/dat/udat/libdat2.map daplofa_version_script = -Wl,--version-script=$(srcdir)/dapl/udapl/libdaplofa.map daploscm_version_script = -Wl,--version-script=$(srcdir)/dapl/udapl/libdaploscm.map + daploucm_version_script = -Wl,--version-script=$(srcdir)/dapl/udapl/libdaploucm.map else dat_version_script = daplofa_version_script = daploscm_version_script = + daploucm_version_script = endif # @@ -192,14 +202,14 @@ dapl_udapl_libdaplofa_la_SOURCES = dapl/udapl/dapl_init.c \ dapl/openib_common/qp.c \ dapl/openib_common/util.c \ dapl/openib_cma/cm.c \ - dapl/openib_cma/device.c $(XPROGRAMS_CMA) + dapl/openib_cma/device.c $(XPROGRAMS) dapl_udapl_libdaplofa_la_LDFLAGS = -version-info 2:0:0 $(daplofa_version_script) \ -Wl,-init,dapl_init -Wl,-fini,dapl_fini \ -lpthread -libverbs -lrdmacm # -# uDAPL OpenFabrics Socket CM version: libdaplscm.so +# uDAPL OpenFabrics Socket CM version for IB: libdaplscm.so # dapl_udapl_libdaploscm_la_SOURCES = dapl/udapl/dapl_init.c \ dapl/udapl/dapl_evd_create.c \ @@ -306,11 +316,125 @@ dapl_udapl_libdaploscm_la_SOURCES = dapl/udapl/dapl_init.c \ dapl/openib_common/qp.c \ 
dapl/openib_common/util.c \ dapl/openib_scm/cm.c \ - dapl/openib_scm/device.c $(XPROGRAMS_SCM) + dapl/openib_scm/device.c $(XPROGRAMS) dapl_udapl_libdaploscm_la_LDFLAGS = -version-info 2:0:0 $(daploscm_version_script) \ -Wl,-init,dapl_init -Wl,-fini,dapl_fini \ -lpthread -libverbs + +# +# uDAPL OpenFabrics UD CM version for IB: libdaplucm.so +# +dapl_udapl_libdaploucm_la_SOURCES = dapl/udapl/dapl_init.c \ + dapl/udapl/dapl_evd_create.c \ + dapl/udapl/dapl_evd_query.c \ + dapl/udapl/dapl_cno_create.c \ + dapl/udapl/dapl_cno_modify_agent.c \ + dapl/udapl/dapl_cno_free.c \ + dapl/udapl/dapl_cno_wait.c \ + dapl/udapl/dapl_cno_query.c \ + dapl/udapl/dapl_lmr_create.c \ + dapl/udapl/dapl_evd_wait.c \ + dapl/udapl/dapl_evd_disable.c \ + dapl/udapl/dapl_evd_enable.c \ + dapl/udapl/dapl_evd_modify_cno.c \ + dapl/udapl/dapl_evd_set_unwaitable.c \ + dapl/udapl/dapl_evd_clear_unwaitable.c \ + dapl/udapl/linux/dapl_osd.c \ + dapl/common/dapl_cookie.c \ + dapl/common/dapl_cr_accept.c \ + dapl/common/dapl_cr_query.c \ + dapl/common/dapl_cr_reject.c \ + dapl/common/dapl_cr_util.c \ + dapl/common/dapl_cr_callback.c \ + dapl/common/dapl_cr_handoff.c \ + dapl/common/dapl_ep_connect.c \ + dapl/common/dapl_ep_create.c \ + dapl/common/dapl_ep_disconnect.c \ + dapl/common/dapl_ep_dup_connect.c \ + dapl/common/dapl_ep_free.c \ + dapl/common/dapl_ep_reset.c \ + dapl/common/dapl_ep_get_status.c \ + dapl/common/dapl_ep_modify.c \ + dapl/common/dapl_ep_post_rdma_read.c \ + dapl/common/dapl_ep_post_rdma_write.c \ + dapl/common/dapl_ep_post_recv.c \ + dapl/common/dapl_ep_post_send.c \ + dapl/common/dapl_ep_query.c \ + dapl/common/dapl_ep_util.c \ + dapl/common/dapl_evd_dequeue.c \ + dapl/common/dapl_evd_free.c \ + dapl/common/dapl_evd_post_se.c \ + dapl/common/dapl_evd_resize.c \ + dapl/common/dapl_evd_util.c \ + dapl/common/dapl_evd_cq_async_error_callb.c \ + dapl/common/dapl_evd_qp_async_error_callb.c \ + dapl/common/dapl_evd_un_async_error_callb.c \ + dapl/common/dapl_evd_connection_callb.c 
\ + dapl/common/dapl_evd_dto_callb.c \ + dapl/common/dapl_get_consumer_context.c \ + dapl/common/dapl_get_handle_type.c \ + dapl/common/dapl_hash.c \ + dapl/common/dapl_hca_util.c \ + dapl/common/dapl_ia_close.c \ + dapl/common/dapl_ia_open.c \ + dapl/common/dapl_ia_query.c \ + dapl/common/dapl_ia_util.c \ + dapl/common/dapl_llist.c \ + dapl/common/dapl_lmr_free.c \ + dapl/common/dapl_lmr_query.c \ + dapl/common/dapl_lmr_util.c \ + dapl/common/dapl_lmr_sync_rdma_read.c \ + dapl/common/dapl_lmr_sync_rdma_write.c \ + dapl/common/dapl_mr_util.c \ + dapl/common/dapl_provider.c \ + dapl/common/dapl_sp_util.c \ + dapl/common/dapl_psp_create.c \ + dapl/common/dapl_psp_create_any.c \ + dapl/common/dapl_psp_free.c \ + dapl/common/dapl_psp_query.c \ + dapl/common/dapl_pz_create.c \ + dapl/common/dapl_pz_free.c \ + dapl/common/dapl_pz_query.c \ + dapl/common/dapl_pz_util.c \ + dapl/common/dapl_rmr_create.c \ + dapl/common/dapl_rmr_free.c \ + dapl/common/dapl_rmr_bind.c \ + dapl/common/dapl_rmr_query.c \ + dapl/common/dapl_rmr_util.c \ + dapl/common/dapl_rsp_create.c \ + dapl/common/dapl_rsp_free.c \ + dapl/common/dapl_rsp_query.c \ + dapl/common/dapl_cno_util.c \ + dapl/common/dapl_set_consumer_context.c \ + dapl/common/dapl_ring_buffer_util.c \ + dapl/common/dapl_name_service.c \ + dapl/common/dapl_timer_util.c \ + dapl/common/dapl_ep_create_with_srq.c \ + dapl/common/dapl_ep_recv_query.c \ + dapl/common/dapl_ep_set_watermark.c \ + dapl/common/dapl_srq_create.c \ + dapl/common/dapl_srq_free.c \ + dapl/common/dapl_srq_query.c \ + dapl/common/dapl_srq_resize.c \ + dapl/common/dapl_srq_post_recv.c \ + dapl/common/dapl_srq_set_lw.c \ + dapl/common/dapl_srq_util.c \ + dapl/common/dapl_debug.c \ + dapl/common/dapl_ia_ha.c \ + dapl/common/dapl_csp.c \ + dapl/common/dapl_ep_post_send_invalidate.c \ + dapl/common/dapl_ep_post_rdma_read_to_rmr.c \ + dapl/openib_common/mem.c \ + dapl/openib_common/cq.c \ + dapl/openib_common/qp.c \ + dapl/openib_common/util.c \ + dapl/openib_ucm/cm.c \ 
+ dapl/openib_ucm/device.c $(XPROGRAMS) + +dapl_udapl_libdaploucm_la_LDFLAGS = -version-info 2:0:0 $(daploscm_version_script) \ + -Wl,-init,dapl_init -Wl,-fini,dapl_fini \ + -lpthread -libverbs libdatincludedir = $(includedir)/dat2 @@ -375,9 +499,12 @@ EXTRA_DIST = dat/common/dat_dictionary.h \ dapl/openib_cma/linux/openib_osd.h \ dapl/openib_scm/dapl_ib_util.h \ dapl/openib_scm/linux/openib_osd.h \ + dapl/openib_ucm/dapl_ib_util.h \ + dapl/openib_ucm/linux/openib_osd.h \ dat/udat/libdat2.map \ dapl/udapl/libdaplofa.map \ dapl/udapl/libdaploscm.map \ + dapl/udapl/libdaploucm.map \ dapl.spec.in \ $(man_MANS) \ test/dapltest/include/dapl_bpool.h \ @@ -419,12 +546,14 @@ install-exec-hook: sed -e '/ofa-v2-.* u2/d' < $(DESTDIR)$(sysconfdir)/dat.conf > /tmp/$$$$ofadapl; \ cp /tmp/$$$$ofadapl $(DESTDIR)$(sysconfdir)/dat.conf; \ fi; \ - echo ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ - echo ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ echo ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 '"ib0 0" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ echo ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 '"ib1 0" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ echo ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mthca0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ echo ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mthca0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ + echo ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ + echo ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ + echo ucm-mlx4-1 u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 '"mlx4_0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ + echo 
ucm-mlx4-2 u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 '"mlx4_0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ echo ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ipath0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ echo ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ipath0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ echo ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ehca0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \ @@ -433,6 +562,7 @@ install-exec-hook: uninstall-hook: if test -e $(DESTDIR)$(sysconfdir)/dat.conf; then \ sed -e '/ofa-v2-.* u2/d' < $(DESTDIR)$(sysconfdir)/dat.conf > /tmp/$$$$ofadapl; \ + sed -e '/ofa-v2-.* u2/d' -e '/ucm-.* u2/d' < $(DESTDIR)$(sysconfdir)/dat.conf > /tmp/$$$$ofadapl; \ cp /tmp/$$$$ofadapl $(DESTDIR)$(sysconfdir)/dat.conf; \ fi; diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h index c9ab4d6..35900e7 100755 --- a/dapl/openib_cma/dapl_ib_util.h +++ b/dapl/openib_cma/dapl_ib_util.h @@ -72,8 +72,8 @@ struct dapl_cm_id { DAT_SOCK_ADDR6 r_addr; int p_len; unsigned char p_data[256]; /* dapl max private data size */ - ib_qp_cm_t dst; /* dapls_modify_qp_state */ - struct ibv_ah *ah; /* dapls_modify_qp_state */ + ib_cm_msg_t dst; + struct ibv_ah *ah; }; typedef struct dapl_cm_id *dp_ib_cm_handle_t; @@ -123,9 +123,6 @@ void dapli_async_event_cb(struct _ib_hca_transport *tp); void dapli_cq_event_cb(struct _ib_hca_transport *tp); dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep); void dapls_ib_cm_free(dp_ib_cm_handle_t cm, DAPL_EP *ep); -DAT_RETURN dapls_modify_qp_state(IN ib_qp_handle_t qp_handle, - IN ib_qp_state_t qp_state, - IN dp_ib_cm_handle_t cm); STATIC _INLINE_ void dapls_print_cm_list(IN DAPL_IA * ia_ptr) { diff --git a/dapl/openib_common/dapl_ib_common.h b/dapl/openib_common/dapl_ib_common.h index 2195767..3cd8885 100644 --- a/dapl/openib_common/dapl_ib_common.h +++ b/dapl/openib_common/dapl_ib_common.h @@ -50,25 +50,56 @@ typedef struct ibv_pd
*ib_pd_handle_t; typedef struct ibv_mr *ib_mr_handle_t; typedef struct ibv_mw *ib_mw_handle_t; typedef struct ibv_wc ib_work_completion_t; +typedef struct ibv_ah *ib_ah_handle_t; +typedef union ibv_gid *ib_gid_handle_t; /* HCA context type maps to IB verbs */ typedef struct ibv_context *ib_hca_handle_t; typedef ib_hca_handle_t dapl_ibal_ca_t; /* QP info to exchange, wire protocol version for these CM's */ -#define DCM_VER 4 -typedef struct _ib_qp_cm -{ +#define DCM_VER 5 + +/* CM private data areas, same for all operations */ +#define DCM_MAX_PDATA_SIZE 128 + +/* + * DAPL IB/QP address (type, port, lid, qp_num, gid) mapping to + * DAT_IA_ADDRESS_PTR, DAT_SOCK_ADDR2 (24 bytes) + * For applications, like MPI, that exchange IA_ADDRESS + * across the fabric before connecting, it eliminates the + * overhead of name and address resolution to the destination's + * CM services. UCM provider uses this for DAT_IA_ADDRESS. + */ +union dcm_addr { + DAT_SOCK_ADDR6 so; + struct { + uint8_t qp_type; + uint8_t port_num; + uint16_t lid; + uint32_t qpn; + union ibv_gid gid; + } ib; +}; + +/* 256 bytes total; default max_inline_send, min IB MTU size */ +typedef struct _ib_cm_msg +{ uint16_t ver; - uint16_t rej; - uint16_t lid; - uint16_t port; - uint32_t qpn; - uint32_t p_size; - union ibv_gid gid; - DAT_SOCK_ADDR6 ia_address; - uint16_t qp_type; -} ib_qp_cm_t; + uint16_t op; + uint16_t sport; /* src cm port */ + uint16_t dport; /* dst cm port */ + uint32_t sqpn; /* src cm qpn */ + uint32_t dqpn; /* dst cm qpn */ + uint16_t p_size; + uint8_t resv[14]; + union dcm_addr saddr; + union dcm_addr daddr; + union dcm_addr saddr_alt; + union dcm_addr daddr_alt; + uint8_t p_data[DCM_MAX_PDATA_SIZE]; + +} ib_cm_msg_t; /* CM events */ typedef enum { @@ -113,11 +144,27 @@ typedef uint16_t ib_hca_port_t; /* inline send rdma threshold */ #define INLINE_SEND_IWARP_DEFAULT 64 -#define INLINE_SEND_IB_DEFAULT 200 +#define INLINE_SEND_IB_DEFAULT 256 /* qkey for UD QP's */ #define DAT_UD_QKEY 0x78654321 
+/* RC timer - retry count defaults */ +#define DCM_ACK_TIMER 16 /* 5 bits, 4.096us*2^ack_timer. 16== 268ms */ +#define DCM_ACK_RETRY 7 /* 3 bits, 7 * 268ms = 1.8 seconds */ +#define DCM_RNR_TIMER 12 /* 5 bits, 12 =.64ms, 28 =163ms, 31 =491ms */ +#define DCM_RNR_RETRY 7 /* 3 bits, 7 == infinite */ +#define DCM_IB_MTU 2048 + +/* Global routing defaults */ +#define DCM_GLOBAL 0 /* global routing is disabled */ +#define DCM_HOP_LIMIT 0xff +#define DCM_TCLASS 0 + +/* DAPL uCM timers */ +#define DCM_RETRY_CNT 7 +#define DCM_RETRY_TIME_MS 1000 + /* DTO OPs, ordered for DAPL ENUM definitions */ #define OP_RDMA_WRITE IBV_WR_RDMA_WRITE #define OP_RDMA_WRITE_IMM IBV_WR_RDMA_WRITE_WITH_IMM @@ -201,6 +248,36 @@ typedef enum } ib_thread_state_t; +typedef enum dapl_cm_op +{ + DCM_REQ, + DCM_REP, + DCM_REJ_USER, /* user reject */ + DCM_REJ_CM, /* cm reject, no SID */ + DCM_RTU, + DCM_DREQ, + DCM_DREP + +} DAPL_CM_OP; + +typedef enum dapl_cm_state +{ + DCM_INIT, + DCM_LISTEN, + DCM_CONN_PENDING, + DCM_RTU_PENDING, + DCM_ACCEPTING, + DCM_ACCEPTING_DATA, + DCM_ACCEPTED, + DCM_REJECTING, + DCM_REJECTED, + DCM_CONNECTED, + DCM_RELEASED, + DCM_DISC_PENDING, + DCM_DISCONNECTED, + DCM_DESTROY + +} DAPL_CM_STATE; /* provider specfic fields for shared memory support */ typedef uint32_t ib_shm_transport_t; @@ -214,6 +291,19 @@ enum ibv_mtu dapl_ib_mtu(int mtu); char *dapl_ib_mtu_str(enum ibv_mtu mtu); DAT_RETURN getlocalipaddr(DAT_SOCK_ADDR *addr, int addr_len); +/* qp.c */ +DAT_RETURN dapls_modify_qp_ud(IN DAPL_HCA *hca, IN ib_qp_handle_t qp); +DAT_RETURN dapls_modify_qp_state(IN ib_qp_handle_t qp_handle, + IN ib_qp_state_t qp_state, + IN uint32_t qpn, + IN uint16_t lid, + IN ib_gid_handle_t gid); +ib_ah_handle_t dapls_create_ah( IN DAPL_HCA *hca, + IN ib_pd_handle_t pd, + IN ib_qp_handle_t qp, + IN uint16_t lid, + IN ib_gid_handle_t gid); + /* inline functions */ STATIC _INLINE_ IB_HCA_NAME dapl_ib_convert_name (IN char *name) { @@ -260,22 +350,6 @@ dapl_convert_errno( IN int err, IN 
const char *str ) } } -typedef enum dapl_cm_state -{ - DCM_INIT, - DCM_LISTEN, - DCM_CONN_PENDING, - DCM_RTU_PENDING, - DCM_ACCEPTING, - DCM_ACCEPTING_DATA, - DCM_ACCEPTED, - DCM_REJECTED, - DCM_CONNECTED, - DCM_RELEASED, - DCM_DISCONNECTED, - DCM_DESTROY -} DAPL_CM_STATE; - STATIC _INLINE_ char * dapl_cm_state_str(IN int st) { static char *state[] = { @@ -286,13 +360,15 @@ STATIC _INLINE_ char * dapl_cm_state_str(IN int st) "CM_ACCEPTING", "CM_ACCEPTING_DATA", "CM_ACCEPTED", + "CM_REJECTING", "CM_REJECTED", "CM_CONNECTED", "CM_RELEASED", + "CM_DISC_PENDING", "CM_DISCONNECTED", "CM_DESTROY" }; - return ((st < 0 || st > 11) ? "Invalid CM state?" : state[st]); + return ((st < 0 || st > 13) ? "Invalid CM state?" : state[st]); } #endif /* _DAPL_IB_COMMON_H_ */ diff --git a/dapl/openib_common/qp.c b/dapl/openib_common/qp.c index 9aa0594..73d2c3f 100644 --- a/dapl/openib_common/qp.c +++ b/dapl/openib_common/qp.c @@ -176,7 +176,7 @@ dapls_ib_qp_alloc(IN DAPL_IA * ia_ptr, /* Setup QP attributes for INIT state on the way out */ if (dapls_modify_qp_state(ep_ptr->qp_handle, - IBV_QPS_INIT, NULL) != DAT_SUCCESS) { + IBV_QPS_INIT, 0, 0, 0) != DAT_SUCCESS) { ibv_destroy_qp(ep_ptr->qp_handle); ep_ptr->qp_handle = IB_INVALID_HANDLE; return DAT_INTERNAL_ERROR; @@ -219,7 +219,7 @@ DAT_RETURN dapls_ib_qp_free(IN DAPL_IA * ia_ptr, IN DAPL_EP * ep_ptr) if (ep_ptr->qp_handle != NULL) { /* force error state to flush queue, then destroy */ - dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, NULL); + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0,0,0); if (ibv_destroy_qp(ep_ptr->qp_handle)) return (dapl_convert_errno(errno, "destroy_qp")); @@ -280,8 +280,8 @@ dapls_ib_qp_modify(IN DAPL_IA * ia_ptr, /* move to error state if necessary */ if ((ep_ptr->qp_state == IB_QP_STATE_ERROR) && (ep_ptr->qp_handle->state != IBV_QPS_ERR)) { - return (dapls_modify_qp_state(ep_ptr->qp_handle, - IBV_QPS_ERR, NULL)); + return (dapls_modify_qp_state(ep_ptr->qp_handle, + IBV_QPS_ERR, 0, 0, 0)); } 
/* @@ -345,8 +345,8 @@ void dapls_ib_reinit_ep(IN DAPL_EP * ep_ptr) if (ep_ptr->qp_handle != IB_INVALID_HANDLE && ep_ptr->qp_handle->qp_type != IBV_QPT_UD) { /* move to RESET state and then to INIT */ - dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_RESET, 0); - dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_INIT, 0); + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_RESET,0,0,0); + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_INIT,0,0,0); } } #endif // _WIN32 || _WIN64 @@ -354,152 +354,137 @@ void dapls_ib_reinit_ep(IN DAPL_EP * ep_ptr) /* * Generic QP modify for init, reset, error, RTS, RTR * For UD, create_ah on RTR, qkey on INIT + * CM msg provides QP attributes, info in network order */ DAT_RETURN -dapls_modify_qp_state(IN ib_qp_handle_t qp_handle, - IN ib_qp_state_t qp_state, - IN dp_ib_cm_handle_t cm_ptr) +dapls_modify_qp_state(IN ib_qp_handle_t qp_handle, + IN ib_qp_state_t qp_state, + IN uint32_t qpn, + IN uint16_t lid, + IN ib_gid_handle_t gid) { struct ibv_qp_attr qp_attr; enum ibv_qp_attr_mask mask = IBV_QP_STATE; DAPL_EP *ep_ptr = (DAPL_EP *) qp_handle->qp_context; DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; - ib_qp_cm_t *qp_cm = &cm_ptr->dst; int ret; dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr)); qp_attr.qp_state = qp_state; + switch (qp_state) { - /* additional attributes with RTR and RTS */ case IBV_QPS_RTR: - { - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " QPS_RTR: type %d state %d qpn %x lid %x" - " port %x ep %p qp_state %d\n", - qp_handle->qp_type, qp_handle->qp_type, - qp_cm->qpn, qp_cm->lid, qp_cm->port, - ep_ptr, ep_ptr->qp_state); - - mask |= IBV_QP_AV | - IBV_QP_PATH_MTU | - IBV_QP_DEST_QPN | - IBV_QP_RQ_PSN | - IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER; - - qp_attr.dest_qp_num = qp_cm->qpn; - qp_attr.rq_psn = 1; - qp_attr.path_mtu = ia_ptr->hca_ptr->ib_trans.mtu; - qp_attr.max_dest_rd_atomic = - ep_ptr->param.ep_attr.max_rdma_read_out; - qp_attr.min_rnr_timer = - ia_ptr->hca_ptr->ib_trans.rnr_timer; - - /* address handle. 
RC and UD */ - qp_attr.ah_attr.dlid = qp_cm->lid; - if (ia_ptr->hca_ptr->ib_trans.global) { - qp_attr.ah_attr.is_global = 1; - qp_attr.ah_attr.grh.dgid = qp_cm->gid; - qp_attr.ah_attr.grh.hop_limit = - ia_ptr->hca_ptr->ib_trans.hop_limit; - qp_attr.ah_attr.grh.traffic_class = - ia_ptr->hca_ptr->ib_trans.tclass; - } - qp_attr.ah_attr.sl = 0; - qp_attr.ah_attr.src_path_bits = 0; - qp_attr.ah_attr.port_num = ia_ptr->hca_ptr->port_num; -#ifdef DAT_EXTENSIONS - /* UD: create AH for remote side */ - if (qp_handle->qp_type == IBV_QPT_UD) { - ib_pd_handle_t pz; - pz = ((DAPL_PZ *) - ep_ptr->param.pz_handle)->pd_handle; - mask = IBV_QP_STATE; - cm_ptr->ah = ibv_create_ah(pz, - &qp_attr.ah_attr); - if (!cm_ptr->ah) - return (dapl_convert_errno(errno, - "ibv_ah")); - - /* already RTR, multi remote AH's on QP */ - if (ep_ptr->qp_state == IBV_QPS_RTR || - ep_ptr->qp_state == IBV_QPS_RTS) - return DAT_SUCCESS; - } -#endif - break; + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " QPS_RTR: type %d qpn 0x%x lid 0x%x" + " port %d ep %p qp_state %d \n", + qp_handle->qp_type, + ntohl(qpn), ntohs(lid), + ia_ptr->hca_ptr->port_num, + ep_ptr, ep_ptr->qp_state); + + mask |= IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER; + + qp_attr.dest_qp_num = ntohl(qpn); + qp_attr.rq_psn = 1; + qp_attr.path_mtu = ia_ptr->hca_ptr->ib_trans.mtu; + qp_attr.max_dest_rd_atomic = + ep_ptr->param.ep_attr.max_rdma_read_out; + qp_attr.min_rnr_timer = + ia_ptr->hca_ptr->ib_trans.rnr_timer; + + /* address handle. 
RC and UD */ + qp_attr.ah_attr.dlid = ntohs(lid); + if (ia_ptr->hca_ptr->ib_trans.global) { + qp_attr.ah_attr.is_global = 1; + qp_attr.ah_attr.grh.dgid.global.subnet_prefix = + ntohll(gid->global.subnet_prefix); + qp_attr.ah_attr.grh.dgid.global.interface_id = + ntohll(gid->global.interface_id); + qp_attr.ah_attr.grh.hop_limit = + ia_ptr->hca_ptr->ib_trans.hop_limit; + qp_attr.ah_attr.grh.traffic_class = + ia_ptr->hca_ptr->ib_trans.tclass; } + qp_attr.ah_attr.sl = 0; + qp_attr.ah_attr.src_path_bits = 0; + qp_attr.ah_attr.port_num = ia_ptr->hca_ptr->port_num; + + /* UD: already in RTR, RTS state */ + if (qp_handle->qp_type == IBV_QPT_UD) { + if (ep_ptr->qp_state == IBV_QPS_RTR || + ep_ptr->qp_state == IBV_QPS_RTS) + return DAT_SUCCESS; + } + break; case IBV_QPS_RTS: - { - /* RC only */ - if (qp_handle->qp_type == IBV_QPT_RC) { - mask |= IBV_QP_SQ_PSN | - IBV_QP_TIMEOUT | - IBV_QP_RETRY_CNT | - IBV_QP_RNR_RETRY | IBV_QP_MAX_QP_RD_ATOMIC; - qp_attr.timeout = - ia_ptr->hca_ptr->ib_trans.ack_timer; - qp_attr.retry_cnt = - ia_ptr->hca_ptr->ib_trans.ack_retry; - qp_attr.rnr_retry = - ia_ptr->hca_ptr->ib_trans.rnr_retry; - qp_attr.max_rd_atomic = - ep_ptr->param.ep_attr.max_rdma_read_out; - } - /* RC and UD */ - qp_attr.qp_state = IBV_QPS_RTS; - qp_attr.sq_psn = 1; - - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " QPS_RTS: psn %x rd_atomic %d ack %d " - " retry %d rnr_retry %d ep %p qp_state %d\n", - qp_attr.sq_psn, qp_attr.max_rd_atomic, - qp_attr.timeout, qp_attr.retry_cnt, - qp_attr.rnr_retry, ep_ptr, - ep_ptr->qp_state); -#ifdef DAT_EXTENSIONS - if (qp_handle->qp_type == IBV_QPT_UD) { - /* already RTS, multi remote AH's on QP */ - if (ep_ptr->qp_state == IBV_QPS_RTS) - return DAT_SUCCESS; - else - mask = IBV_QP_STATE | IBV_QP_SQ_PSN; - } -#endif - break; + if (qp_handle->qp_type == IBV_QPT_RC) { + mask |= IBV_QP_SQ_PSN | + IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | IBV_QP_MAX_QP_RD_ATOMIC; + qp_attr.timeout = + ia_ptr->hca_ptr->ib_trans.ack_timer; + 
qp_attr.retry_cnt = + ia_ptr->hca_ptr->ib_trans.ack_retry; + qp_attr.rnr_retry = + ia_ptr->hca_ptr->ib_trans.rnr_retry; + qp_attr.max_rd_atomic = + ep_ptr->param.ep_attr.max_rdma_read_out; + } + /* RC and UD */ + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.sq_psn = 1; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " QPS_RTS: psn %x rd_atomic %d ack %d " + " retry %d rnr_retry %d ep %p qp_state %d\n", + qp_attr.sq_psn, qp_attr.max_rd_atomic, + qp_attr.timeout, qp_attr.retry_cnt, + qp_attr.rnr_retry, ep_ptr, + ep_ptr->qp_state); + + if (qp_handle->qp_type == IBV_QPT_UD) { + /* already RTS, multi remote AH's on QP */ + if (ep_ptr->qp_state == IBV_QPS_RTS) + return DAT_SUCCESS; + else + mask = IBV_QP_STATE | IBV_QP_SQ_PSN; } + break; case IBV_QPS_INIT: - { - mask |= IBV_QP_PKEY_INDEX | IBV_QP_PORT; - if (qp_handle->qp_type == IBV_QPT_RC) { - mask |= IBV_QP_ACCESS_FLAGS; - qp_attr.qp_access_flags = - IBV_ACCESS_LOCAL_WRITE | - IBV_ACCESS_REMOTE_WRITE | - IBV_ACCESS_REMOTE_READ | - IBV_ACCESS_REMOTE_ATOMIC | - IBV_ACCESS_MW_BIND; - } -#ifdef DAT_EXTENSIONS - if (qp_handle->qp_type == IBV_QPT_UD) { - /* already INIT, multi remote AH's on QP */ - if (ep_ptr->qp_state == IBV_QPS_INIT) - return DAT_SUCCESS; - mask |= IBV_QP_QKEY; - qp_attr.qkey = DAT_UD_QKEY; - } -#endif - qp_attr.pkey_index = 0; - qp_attr.port_num = ia_ptr->hca_ptr->port_num; - - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " QPS_INIT: pi %x port %x acc %x qkey 0x%x\n", - qp_attr.pkey_index, qp_attr.port_num, - qp_attr.qp_access_flags, qp_attr.qkey); - break; + mask |= IBV_QP_PKEY_INDEX | IBV_QP_PORT; + if (qp_handle->qp_type == IBV_QPT_RC) { + mask |= IBV_QP_ACCESS_FLAGS; + qp_attr.qp_access_flags = + IBV_ACCESS_LOCAL_WRITE | + IBV_ACCESS_REMOTE_WRITE | + IBV_ACCESS_REMOTE_READ | + IBV_ACCESS_REMOTE_ATOMIC | + IBV_ACCESS_MW_BIND; + } + + if (qp_handle->qp_type == IBV_QPT_UD) { + /* already INIT, multi remote AH's on QP */ + if (ep_ptr->qp_state == IBV_QPS_INIT) + return DAT_SUCCESS; + mask |= IBV_QP_QKEY; + qp_attr.qkey = 
DAT_UD_QKEY; } + + qp_attr.pkey_index = 0; + qp_attr.port_num = ia_ptr->hca_ptr->port_num; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " QPS_INIT: pi %x port %x acc %x qkey 0x%x\n", + qp_attr.pkey_index, qp_attr.port_num, + qp_attr.qp_access_flags, qp_attr.qkey); + break; default: break; - } ret = ibv_modify_qp(qp_handle, &qp_attr, mask); @@ -511,6 +496,93 @@ dapls_modify_qp_state(IN ib_qp_handle_t qp_handle, } } +/* Modify UD type QP from init, rtr, rts, info network order */ +DAT_RETURN +dapls_modify_qp_ud(IN DAPL_HCA *hca, IN ib_qp_handle_t qp) +{ + struct ibv_qp_attr qp_attr; + + /* modify QP, setup and prepost buffers */ + dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr)); + qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.pkey_index = 0; + qp_attr.port_num = hca->port_num; + qp_attr.qkey = DAT_UD_QKEY; + if (ibv_modify_qp(qp, &qp_attr, + IBV_QP_STATE | + IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_QKEY)) { + dapl_log(DAPL_DBG_TYPE_ERR, + " modify_ud_qp INIT: ERR %s\n", strerror(errno)); + return (dapl_convert_errno(errno, "modify_qp")); + } + dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr)); + qp_attr.qp_state = IBV_QPS_RTR; + if (ibv_modify_qp(qp, &qp_attr,IBV_QP_STATE)) { + dapl_log(DAPL_DBG_TYPE_ERR, + " modify_ud_qp RTR: ERR %s\n", strerror(errno)); + return (dapl_convert_errno(errno, "modify_qp")); + } + dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr)); + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.sq_psn = 1; + if (ibv_modify_qp(qp, &qp_attr, + IBV_QP_STATE | IBV_QP_SQ_PSN)) { + dapl_log(DAPL_DBG_TYPE_ERR, + " modify_ud_qp RTS: ERR %s\n", strerror(errno)); + return (dapl_convert_errno(errno, "modify_qp")); + } + return DAT_SUCCESS; +} + +/* Create address handle for remote QP, info in network order */ +ib_ah_handle_t +dapls_create_ah(IN DAPL_HCA *hca, + IN ib_pd_handle_t pd, + IN ib_qp_handle_t qp, + IN uint16_t lid, + IN ib_gid_handle_t gid) +{ + struct ibv_qp_attr qp_attr; + ib_ah_handle_t ah; + + if (qp->qp_type != IBV_QPT_UD) + return NULL; + + 
dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr)); + qp_attr.qp_state = IBV_QP_STATE; + + /* address handle. RC and UD */ + qp_attr.ah_attr.dlid = ntohs(lid); + if (gid != NULL) { + qp_attr.ah_attr.is_global = 1; + qp_attr.ah_attr.grh.dgid.global.subnet_prefix = + ntohll(gid->global.subnet_prefix); + qp_attr.ah_attr.grh.dgid.global.interface_id = + ntohll(gid->global.interface_id); + qp_attr.ah_attr.grh.hop_limit = hca->ib_trans.hop_limit; + qp_attr.ah_attr.grh.traffic_class = hca->ib_trans.tclass; + } + qp_attr.ah_attr.sl = 0; + qp_attr.ah_attr.src_path_bits = 0; + qp_attr.ah_attr.port_num = hca->port_num; + + /* UD: create AH for remote side */ + ah = ibv_create_ah(pd, &qp_attr.ah_attr); + if (!ah) { + dapl_log(DAPL_DBG_TYPE_ERR, + " create_ah: ERR %s\n", strerror(errno)); + return NULL; + } + + dapl_log(DAPL_DBG_TYPE_CM, + " dapls_create_ah: AH %p for lid %x\n", + ah, qp_attr.ah_attr.dlid); + + return ah; +} + /* * Local variables: * c-indent-level: 4 diff --git a/dapl/openib_scm/cm.c b/dapl/openib_scm/cm.c index 416ee71..e779d41 100644 --- a/dapl/openib_scm/cm.c +++ b/dapl/openib_scm/cm.c @@ -46,11 +46,6 @@ * **************************************************************************/ -#if defined(_WIN32) -#define FD_SETSIZE 1024 -#define DAPL_FD_SETSIZE FD_SETSIZE -#endif - #include "dapl.h" #include "dapl_adapter_util.h" #include "dapl_evd_util.h" @@ -252,7 +247,7 @@ dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep) if (dapl_os_lock_init(&cm_ptr->lock)) goto bail; - cm_ptr->dst.ver = htons(DCM_VER); + cm_ptr->msg.ver = htons(DCM_VER); cm_ptr->socket = DAPL_INVALID_SOCKET; cm_ptr->ep = ep; return cm_ptr; @@ -437,7 +432,7 @@ DAT_RETURN dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr) */ static void dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err) { - int ret, len, opt = 1; + int ret, len, exp, opt = 1; struct iovec iov[2]; struct dapl_ep *ep_ptr = cm_ptr->ep; @@ -450,56 +445,60 @@ static void dapli_socket_connected(dp_ib_cm_handle_t cm_ptr, int err) 
ep_ptr->param. remote_ia_address_ptr)->sin_addr), ntohs(((struct sockaddr_in *) - &cm_ptr->dst.ia_address)->sin_port)); + &cm_ptr->msg.daddr.so)->sin_port)); goto bail; } - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " socket connected, write QP and private data\n"); /* no delay for small packets */ ret = setsockopt(cm_ptr->socket, IPPROTO_TCP, TCP_NODELAY, (char *)&opt, sizeof(opt)); if (ret) dapl_log(DAPL_DBG_TYPE_WARN, - " connected: NODELAY setsockopt: %s\n", + " CONN_PENDING: NODELAY setsockopt: %s\n", strerror(errno)); /* send qp info and pdata to remote peer */ - iov[0].iov_base = (void *)&cm_ptr->dst; - iov[0].iov_len = sizeof(ib_qp_cm_t); - if (cm_ptr->dst.p_size) { - iov[1].iov_base = cm_ptr->p_data; - iov[1].iov_len = ntohl(cm_ptr->dst.p_size); + exp = sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE; + iov[0].iov_base = (void *)&cm_ptr->msg; + iov[0].iov_len = exp; + if (cm_ptr->msg.p_size) { + iov[1].iov_base = cm_ptr->msg.p_data; + iov[1].iov_len = ntohs(cm_ptr->msg.p_size); len = writev(cm_ptr->socket, iov, 2); } else { len = writev(cm_ptr->socket, iov, 1); } - if (len != (ntohl(cm_ptr->dst.p_size) + sizeof(ib_qp_cm_t))) { + if (len != (exp + ntohs(cm_ptr->msg.p_size))) { dapl_log(DAPL_DBG_TYPE_ERR, - " CONN_PENDING write: ERR %s, wcnt=%d -> %s\n", - strerror(errno), len, inet_ntoa(((struct sockaddr_in *) - ep_ptr->param. - remote_ia_address_ptr)-> - sin_addr)); + " CONN_PENDING len ERR %s, wcnt=%d(%d) -> %s\n", + strerror(errno), len, + exp + ntohs(cm_ptr->msg.p_size), + inet_ntoa(((struct sockaddr_in *) + ep_ptr->param. 
+ remote_ia_address_ptr)->sin_addr)); goto bail; } - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " connected: sending SRC port=0x%x lid=0x%x," + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " CONN_PENDING: sending SRC port=%d lid=0x%x," " qpn=0x%x, psize=%d\n", - ntohs(cm_ptr->dst.port), ntohs(cm_ptr->dst.lid), - ntohl(cm_ptr->dst.qpn), ntohl(cm_ptr->dst.p_size)); + cm_ptr->msg.saddr.ib.port_num, + ntohs(cm_ptr->msg.saddr.ib.lid), + ntohl(cm_ptr->msg.saddr.ib.qpn), + ntohs(cm_ptr->msg.p_size)); dapl_dbg_log(DAPL_DBG_TYPE_CM, - " connected: sending SRC GID subnet %016llx id %016llx\n", + " CONN_PENDING: SRC GID subnet %016llx id %016llx\n", (unsigned long long) - htonll(cm_ptr->dst.gid.global.subnet_prefix), + htonll(cm_ptr->msg.saddr.ib.gid.global.subnet_prefix), (unsigned long long) - htonll(cm_ptr->dst.gid.global.interface_id)); + htonll(cm_ptr->msg.saddr.ib.gid.global.interface_id)); /* queue up to work thread to avoid blocking consumer */ cm_ptr->state = DCM_RTU_PENDING; return; - bail: + +bail: /* close socket, free cm structure and post error event */ dapls_ib_cm_free(cm_ptr, cm_ptr->ep); dapl_evd_connection_callback(NULL, IB_CME_LOCAL_FAILURE, NULL, ep_ptr); @@ -554,25 +553,24 @@ dapli_socket_connect(DAPL_EP * ep_ptr, return DAT_INVALID_ADDRESS; } - /* Send QP info, IA address, and private data */ - cm_ptr->dst.qpn = htonl(ep_ptr->qp_handle->qp_num); -#ifdef DAT_EXTENSIONS - cm_ptr->dst.qp_type = htons(ep_ptr->qp_handle->qp_type); -#endif - cm_ptr->dst.port = htons(ia_ptr->hca_ptr->port_num); - cm_ptr->dst.lid = ia_ptr->hca_ptr->ib_trans.lid; - cm_ptr->dst.gid = ia_ptr->hca_ptr->ib_trans.gid; + /* REQ: QP info in msg.saddr, IA address in msg.daddr, and pdata */ + cm_ptr->msg.op = ntohs(DCM_REQ); + cm_ptr->msg.saddr.ib.qpn = htonl(ep_ptr->qp_handle->qp_num); + cm_ptr->msg.saddr.ib.qp_type = ep_ptr->qp_handle->qp_type; + cm_ptr->msg.saddr.ib.port_num = ia_ptr->hca_ptr->port_num; + cm_ptr->msg.saddr.ib.lid = ia_ptr->hca_ptr->ib_trans.lid; + cm_ptr->msg.saddr.ib.gid = 
ia_ptr->hca_ptr->ib_trans.gid; /* save references */ cm_ptr->hca = ia_ptr->hca_ptr; cm_ptr->ep = ep_ptr; - cm_ptr->dst.ia_address = ia_ptr->hca_ptr->hca_address; + cm_ptr->msg.daddr.so = ia_ptr->hca_ptr->hca_address; ((struct sockaddr_in *) - &cm_ptr->dst.ia_address)->sin_port = ntohs(r_qual); + &cm_ptr->msg.daddr.so)->sin_port = ntohs((uint16_t)r_qual); if (p_size) { - cm_ptr->dst.p_size = htonl(p_size); - dapl_os_memcpy(cm_ptr->p_data, p_data, p_size); + cm_ptr->msg.p_size = htons(p_size); + dapl_os_memcpy(cm_ptr->msg.p_data, p_data, p_size); } /* connected or pending, either way results via async event */ @@ -581,18 +579,22 @@ dapli_socket_connect(DAPL_EP * ep_ptr, else cm_ptr->state = DCM_CONN_PENDING; + dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: p_data=%p %p\n", + cm_ptr->msg.p_data, cm_ptr->msg.p_data); + dapl_dbg_log(DAPL_DBG_TYPE_EP, - " connect: socket %d to %s r_qual %d pending\n", - cm_ptr->socket, - inet_ntoa(addr.sin_addr), (unsigned int)r_qual); + " connect: %s r_qual %d pending, p_sz=%d, %d %d ...\n", + inet_ntoa(addr.sin_addr), (unsigned int)r_qual, + ntohs(cm_ptr->msg.p_size), + cm_ptr->msg.p_data[0], cm_ptr->msg.p_data[1]); dapli_cm_queue(cm_ptr); return DAT_SUCCESS; - bail: + +bail: dapl_log(DAPL_DBG_TYPE_ERR, - " socket connect ERROR: %s query lid(0x%x)/gid" - " -> %s r_qual %d\n", - strerror(errno), ntohs(cm_ptr->dst.lid), + " connect ERROR: %s -> %s r_qual %d\n", + strerror(errno), inet_ntoa(((struct sockaddr_in *)r_addr)->sin_addr), (unsigned int)r_qual); @@ -607,64 +609,60 @@ dapli_socket_connect(DAPL_EP * ep_ptr, static void dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) { DAPL_EP *ep_ptr = cm_ptr->ep; - int len; - short rtu_data = htons(0x0E0F); - ib_cm_events_t event = IB_CME_DESTINATION_REJECT; + int len, exp = sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE; + ib_cm_events_t event = IB_CME_LOCAL_FAILURE; /* read DST information into cm_ptr, overwrite SRC info */ dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect_rtu: recv peer QP data\n"); - len = 
recv(cm_ptr->socket, (char *)&cm_ptr->dst, sizeof(ib_qp_cm_t), 0); - if (len != sizeof(ib_qp_cm_t) || ntohs(cm_ptr->dst.ver) != DCM_VER) { + len = recv(cm_ptr->socket, (char *)&cm_ptr->msg, exp, 0); + if (len != exp || ntohs(cm_ptr->msg.ver) != DCM_VER) { dapl_log(DAPL_DBG_TYPE_ERR, " CONN_RTU read: ERR %s, rcnt=%d, ver=%d -> %s\n", - strerror(errno), len, cm_ptr->dst.ver, + strerror(errno), len, cm_ptr->msg.ver, inet_ntoa(((struct sockaddr_in *) ep_ptr->param.remote_ia_address_ptr)-> sin_addr)); goto bail; } - /* convert peer response values to host order */ - cm_ptr->dst.port = ntohs(cm_ptr->dst.port); - cm_ptr->dst.lid = ntohs(cm_ptr->dst.lid); - cm_ptr->dst.qpn = ntohl(cm_ptr->dst.qpn); -#ifdef DAT_EXTENSIONS - cm_ptr->dst.qp_type = ntohs(cm_ptr->dst.qp_type); -#endif - cm_ptr->dst.p_size = ntohl(cm_ptr->dst.p_size); - - /* save remote address information */ + /* keep the QP, address info in network order */ + + /* save remote address information, in msg.daddr */ dapl_os_memcpy(&ep_ptr->remote_ia_address, - &cm_ptr->dst.ia_address, - sizeof(ep_ptr->remote_ia_address)); + &cm_ptr->msg.daddr.so, + sizeof(union dcm_addr)); dapl_dbg_log(DAPL_DBG_TYPE_EP, - " CONN_RTU: DST %s port=0x%x lid=0x%x," + " CONN_RTU: DST %s %d port=0x%x lid=0x%x," " qpn=0x%x, qp_type=%d, psize=%d\n", inet_ntoa(((struct sockaddr_in *) - &cm_ptr->dst.ia_address)->sin_addr), - cm_ptr->dst.port, cm_ptr->dst.lid, - cm_ptr->dst.qpn, cm_ptr->dst.qp_type, cm_ptr->dst.p_size); + &cm_ptr->msg.daddr.so)->sin_addr), + ntohs(((struct sockaddr_in *) + &cm_ptr->msg.daddr.so)->sin_port), + cm_ptr->msg.saddr.ib.port_num, + ntohs(cm_ptr->msg.saddr.ib.lid), + ntohl(cm_ptr->msg.saddr.ib.qpn), + cm_ptr->msg.saddr.ib.qp_type, + ntohs(cm_ptr->msg.p_size)); /* validate private data size before reading */ - if (cm_ptr->dst.p_size > IB_MAX_REP_PDATA_SIZE) { + if (ntohs(cm_ptr->msg.p_size) > DCM_MAX_PDATA_SIZE) { dapl_log(DAPL_DBG_TYPE_ERR, " CONN_RTU read: psize (%d) wrong -> %s\n", - cm_ptr->dst.p_size, 
inet_ntoa(((struct sockaddr_in *) - ep_ptr->param. - remote_ia_address_ptr)-> - sin_addr)); + ntohs(cm_ptr->msg.p_size), + inet_ntoa(((struct sockaddr_in *) + ep_ptr->param. + remote_ia_address_ptr)->sin_addr)); goto bail; } /* read private data into cm_handle if any present */ - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " socket connected, read private data\n"); - if (cm_ptr->dst.p_size) { - len = - recv(cm_ptr->socket, cm_ptr->p_data, cm_ptr->dst.p_size, 0); - if (len != cm_ptr->dst.p_size) { + dapl_dbg_log(DAPL_DBG_TYPE_EP," CONN_RTU: read private data\n"); + exp = ntohs(cm_ptr->msg.p_size); + if (exp) { + len = recv(cm_ptr->socket, cm_ptr->msg.p_data, exp, 0); + if (len != exp) { dapl_log(DAPL_DBG_TYPE_ERR, " CONN_RTU read pdata: ERR %s, rcnt=%d -> %s\n", strerror(errno), len, @@ -675,17 +673,22 @@ static void dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) } } - /* check for consumer reject */ - if (cm_ptr->dst.rej) { + /* check for consumer or protocol stack reject */ + if (ntohs(cm_ptr->msg.op) == DCM_REP) + event = IB_CME_CONNECTED; + else if (ntohs(cm_ptr->msg.op) == DCM_REJ_USER) + event = IB_CME_DESTINATION_REJECT_PRIVATE_DATA; + else + event = IB_CME_DESTINATION_REJECT; + + if (event != IB_CME_CONNECTED) { dapl_log(DAPL_DBG_TYPE_CM, - " CONN_RTU read: PEER REJ reason=0x%x -> %s\n", - ntohs(cm_ptr->dst.rej), + " CONN_RTU: reject from %s\n", inet_ntoa(((struct sockaddr_in *) - ep_ptr->param.remote_ia_address_ptr)-> - sin_addr)); - event = IB_CME_DESTINATION_REJECT_PRIVATE_DATA; + ep_ptr->param. 
+ remote_ia_address_ptr)->sin_addr)); #ifdef DAT_EXTENSIONS - if (cm_ptr->dst.qp_type == IBV_QPT_UD) + if (cm_ptr->msg.saddr.ib.qp_type == IBV_QPT_UD) goto ud_bail; else #endif @@ -695,32 +698,39 @@ static void dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) /* modify QP to RTR and then to RTS with remote info */ dapl_os_lock(&ep_ptr->header.lock); if (dapls_modify_qp_state(ep_ptr->qp_handle, - IBV_QPS_RTR, cm_ptr) != DAT_SUCCESS) { + IBV_QPS_RTR, + cm_ptr->msg.saddr.ib.qpn, + cm_ptr->msg.saddr.ib.lid, + NULL) != DAT_SUCCESS) { dapl_log(DAPL_DBG_TYPE_ERR, " CONN_RTU: QPS_RTR ERR %s -> %s\n", - strerror(errno), inet_ntoa(((struct sockaddr_in *) - ep_ptr->param. - remote_ia_address_ptr)-> - sin_addr)); + strerror(errno), + inet_ntoa(((struct sockaddr_in *) + ep_ptr->param. + remote_ia_address_ptr)->sin_addr)); dapl_os_unlock(&ep_ptr->header.lock); goto bail; } if (dapls_modify_qp_state(ep_ptr->qp_handle, - IBV_QPS_RTS, cm_ptr) != DAT_SUCCESS) { + IBV_QPS_RTS, + cm_ptr->msg.saddr.ib.qpn, + cm_ptr->msg.saddr.ib.lid, + NULL) != DAT_SUCCESS) { dapl_log(DAPL_DBG_TYPE_ERR, " CONN_RTU: QPS_RTS ERR %s -> %s\n", - strerror(errno), inet_ntoa(((struct sockaddr_in *) - ep_ptr->param. - remote_ia_address_ptr)-> - sin_addr)); + strerror(errno), + inet_ntoa(((struct sockaddr_in *) + ep_ptr->param. 
+ remote_ia_address_ptr)->sin_addr)); dapl_os_unlock(&ep_ptr->header.lock); goto bail; } dapl_os_unlock(&ep_ptr->header.lock); dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect_rtu: send RTU\n"); - /* complete handshake after final QP state change */ - if (send(cm_ptr->socket, (char *)&rtu_data, sizeof(rtu_data), 0) == -1) { + /* complete handshake after final QP state change, Just ver+op */ + cm_ptr->msg.op = ntohs(DCM_RTU); + if (send(cm_ptr->socket, (char *)&cm_ptr->msg, 4, 0) == -1) { dapl_log(DAPL_DBG_TYPE_ERR, " CONN_RTU: write error = %s\n", strerror(errno)); goto bail; @@ -732,30 +742,41 @@ static void dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) #ifdef DAT_EXTENSIONS ud_bail: - if (cm_ptr->dst.qp_type == IBV_QPT_UD) { + if (cm_ptr->msg.saddr.ib.qp_type == IBV_QPT_UD) { DAT_IB_EXTENSION_EVENT_DATA xevent; + ib_pd_handle_t pd_handle = + ((DAPL_PZ *)ep_ptr->param.pz_handle)->pd_handle; + + cm_ptr->ah = dapls_create_ah(cm_ptr->hca, pd_handle, + ep_ptr->qp_handle, + cm_ptr->msg.saddr.ib.lid, + NULL); + if (!cm_ptr->ah) { + event = IB_CME_LOCAL_FAILURE; + goto bail; + } /* post EVENT, modify_qp created ah */ xevent.status = 0; xevent.type = DAT_IB_UD_REMOTE_AH; xevent.remote_ah.ah = cm_ptr->ah; - xevent.remote_ah.qpn = cm_ptr->dst.qpn; + xevent.remote_ah.qpn = cm_ptr->msg.saddr.ib.qpn; dapl_os_memcpy(&xevent.remote_ah.ia_addr, - &cm_ptr->dst.ia_address, - sizeof(cm_ptr->dst.ia_address)); + &ep_ptr->remote_ia_address, + sizeof(union dcm_addr)); if (event == IB_CME_CONNECTED) event = DAT_IB_UD_CONNECTION_EVENT_ESTABLISHED; else event = DAT_IB_UD_CONNECTION_REJECT_EVENT; - dapls_evd_post_connection_event_ext((DAPL_EVD *) ep_ptr->param. 
- connect_evd_handle, - event, - (DAT_EP_HANDLE) ep_ptr, - (DAT_COUNT) cm_ptr->dst.p_size, - (DAT_PVOID *) cm_ptr->p_data, - (DAT_PVOID *) &xevent); + dapls_evd_post_connection_event_ext( + (DAPL_EVD *) ep_ptr->param.connect_evd_handle, + event, + (DAT_EP_HANDLE) ep_ptr, + (DAT_COUNT) cm_ptr->msg.p_size, + (DAT_PVOID *) cm_ptr->msg.p_data, + (DAT_PVOID *) &xevent); /* done with socket, don't destroy cm_ptr, need pdata */ closesocket(cm_ptr->socket); @@ -766,17 +787,17 @@ ud_bail: { ep_ptr->cm_handle = cm_ptr; /* only RC, multi CR's on UD */ dapl_evd_connection_callback(cm_ptr, - IB_CME_CONNECTED, - cm_ptr->p_data, ep_ptr); + event, + cm_ptr->msg.p_data, ep_ptr); } return; bail: /* close socket, and post error event */ - dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0); + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0, 0, 0); closesocket(cm_ptr->socket); cm_ptr->socket = DAPL_INVALID_SOCKET; - dapl_evd_connection_callback(NULL, event, cm_ptr->p_data, ep_ptr); + dapl_evd_connection_callback(NULL, event, cm_ptr->msg.p_data, ep_ptr); } /* @@ -856,8 +877,6 @@ static void dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) dp_ib_cm_handle_t acm_ptr; int ret, len, opt = 1; - dapl_dbg_log(DAPL_DBG_TYPE_EP, " socket_accept\n"); - /* * Accept all CR's on this port to avoid half-connection (SYN_RCV) * stalls with many to one connection storms @@ -870,25 +889,28 @@ static void dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) acm_ptr->sp = cm_ptr->sp; acm_ptr->hca = cm_ptr->hca; - len = sizeof(acm_ptr->dst.ia_address); + len = sizeof(union dcm_addr); acm_ptr->socket = accept(cm_ptr->socket, (struct sockaddr *) - &acm_ptr->dst.ia_address, - (socklen_t *) & len); + &acm_ptr->msg.daddr.so, + (socklen_t *) &len); if (acm_ptr->socket == DAPL_INVALID_SOCKET) { dapl_log(DAPL_DBG_TYPE_ERR, - " accept: ERR %s on FD %d l_cr %p\n", + " ACCEPT: ERR %s on FD %d l_cr %p\n", strerror(errno), cm_ptr->socket, cm_ptr); dapls_ib_cm_free(acm_ptr, acm_ptr->ep); return; } + 
dapl_dbg_log(DAPL_DBG_TYPE_CM, " accepting from %s\n", + inet_ntoa(((struct sockaddr_in *) + &acm_ptr->msg.daddr.so)->sin_addr)); /* no delay for small packets */ ret = setsockopt(acm_ptr->socket, IPPROTO_TCP, TCP_NODELAY, (char *)&opt, sizeof(opt)); if (ret) dapl_log(DAPL_DBG_TYPE_WARN, - " accept: NODELAY setsockopt: %s\n", + " ACCEPT: NODELAY setsockopt: %s\n", strerror(errno)); acm_ptr->state = DCM_ACCEPTING; @@ -902,65 +924,57 @@ static void dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) */ static void dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr) { - int len; + int len, exp = sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE; void *p_data = NULL; dapl_dbg_log(DAPL_DBG_TYPE_EP, " socket accepted, read QP data\n"); /* read in DST QP info, IA address. check for private data */ - len = - recv(acm_ptr->socket, (char *)&acm_ptr->dst, sizeof(ib_qp_cm_t), 0); - if (len != sizeof(ib_qp_cm_t) || ntohs(acm_ptr->dst.ver) != DCM_VER) { + len = recv(acm_ptr->socket, (char *)&acm_ptr->msg, exp, 0); + if (len != exp || ntohs(acm_ptr->msg.ver) != DCM_VER) { dapl_log(DAPL_DBG_TYPE_ERR, - " accept read: ERR %s, rcnt=%d, ver=%d\n", - strerror(errno), len, ntohs(acm_ptr->dst.ver)); + " ACCEPT read: ERR %s, rcnt=%d, ver=%d\n", + strerror(errno), len, ntohs(acm_ptr->msg.ver)); goto bail; } - /* convert accepted values to host order */ - acm_ptr->dst.port = ntohs(acm_ptr->dst.port); - acm_ptr->dst.lid = ntohs(acm_ptr->dst.lid); - acm_ptr->dst.qpn = ntohl(acm_ptr->dst.qpn); -#ifdef DAT_EXTENSIONS - acm_ptr->dst.qp_type = ntohs(acm_ptr->dst.qp_type); -#endif - acm_ptr->dst.p_size = ntohl(acm_ptr->dst.p_size); - - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " accept: DST %s port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", - inet_ntoa(((struct sockaddr_in *)&acm_ptr->dst. 
- ia_address)->sin_addr), acm_ptr->dst.port, - acm_ptr->dst.lid, acm_ptr->dst.qpn, acm_ptr->dst.p_size); + /* keep the QP, address info in network order */ /* validate private data size before reading */ - if (acm_ptr->dst.p_size > IB_MAX_REQ_PDATA_SIZE) { + exp = ntohs(acm_ptr->msg.p_size); + if (exp > DCM_MAX_PDATA_SIZE) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " accept read: psize (%d) wrong\n", - acm_ptr->dst.p_size); + acm_ptr->msg.p_size); goto bail; } - dapl_dbg_log(DAPL_DBG_TYPE_EP, " socket accepted, read private data\n"); - /* read private data into cm_handle if any present */ - if (acm_ptr->dst.p_size) { - len = - recv(acm_ptr->socket, acm_ptr->p_data, acm_ptr->dst.p_size, - 0); - if (len != acm_ptr->dst.p_size) { + if (exp) { + len = recv(acm_ptr->socket, acm_ptr->msg.p_data, exp, 0); + if (len != exp) { dapl_log(DAPL_DBG_TYPE_ERR, " accept read pdata: ERR %s, rcnt=%d\n", strerror(errno), len); goto bail; } - dapl_dbg_log(DAPL_DBG_TYPE_EP, " accept: psize=%d read\n", len); - p_data = acm_ptr->p_data; + p_data = acm_ptr->msg.p_data; } acm_ptr->state = DCM_ACCEPTING_DATA; + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " ACCEPT: DST %s %d port=%d lid=0x%x, qpn=0x%x, psz=%d\n", + inet_ntoa(((struct sockaddr_in *) + &acm_ptr->msg.daddr.so)->sin_addr), + ntohs(((struct sockaddr_in *) + &acm_ptr->msg.daddr.so)->sin_port), + acm_ptr->msg.saddr.ib.port_num, + ntohs(acm_ptr->msg.saddr.ib.lid), + ntohl(acm_ptr->msg.saddr.ib.qpn), exp); + #ifdef DAT_EXTENSIONS - if (acm_ptr->dst.qp_type == IBV_QPT_UD) { + if (acm_ptr->msg.saddr.ib.qp_type == IBV_QPT_UD) { DAT_IB_EXTENSION_EVENT_DATA xevent; /* post EVENT, modify_qp created ah */ @@ -970,9 +984,9 @@ static void dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr) dapls_evd_post_cr_event_ext(acm_ptr->sp, DAT_IB_UD_CONNECTION_REQUEST_EVENT, acm_ptr, - (DAT_COUNT) acm_ptr->dst.p_size, - (DAT_PVOID *) acm_ptr->p_data, - (DAT_PVOID *) & xevent); + (DAT_COUNT) exp, + (DAT_PVOID *) acm_ptr->msg.p_data, + (DAT_PVOID *) &xevent); } else 
#endif /* trigger CR event and return SUCCESS */ @@ -980,8 +994,8 @@ static void dapli_socket_accept_data(ib_cm_srvc_handle_t acm_ptr) IB_CME_CONNECTION_REQUEST_PENDING, p_data, acm_ptr->sp); return; - bail: - /* close socket, free cm structure, active will see socket close as reject */ +bail: + /* close socket, free cm structure, active will see close as rej */ dapls_ib_cm_free(acm_ptr, acm_ptr->ep); return; } @@ -997,11 +1011,11 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr, { DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; dp_ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; - ib_qp_cm_t local; + ib_cm_msg_t local; struct iovec iov[2]; - int len; + int len, exp = sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE; - if (p_size > IB_MAX_REP_PDATA_SIZE) + if (p_size > DCM_MAX_PDATA_SIZE) return DAT_LENGTH_ERROR; /* must have a accepted socket */ @@ -1009,13 +1023,16 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr, return DAT_INTERNAL_ERROR; dapl_dbg_log(DAPL_DBG_TYPE_EP, - " ACCEPT_USR: remote port=0x%x lid=0x%x" + " ACCEPT_USR: remote port=%d lid=0x%x" " qpn=0x%x qp_type %d, psize=%d\n", - cm_ptr->dst.port, cm_ptr->dst.lid, - cm_ptr->dst.qpn, cm_ptr->dst.qp_type, cm_ptr->dst.p_size); + cm_ptr->msg.saddr.ib.port_num, + ntohs(cm_ptr->msg.saddr.ib.lid), + ntohl(cm_ptr->msg.saddr.ib.qpn), + cm_ptr->msg.saddr.ib.qp_type, + ntohs(cm_ptr->msg.p_size)); #ifdef DAT_EXTENSIONS - if (cm_ptr->dst.qp_type == IBV_QPT_UD && + if (cm_ptr->msg.saddr.ib.qp_type == IBV_QPT_UD && ep_ptr->qp_handle->qp_type != IBV_QPT_UD) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " ACCEPT_USR: ERR remote QP is UD," @@ -1027,22 +1044,28 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr, /* modify QP to RTR and then to RTS with remote info already read */ dapl_os_lock(&ep_ptr->header.lock); if (dapls_modify_qp_state(ep_ptr->qp_handle, - IBV_QPS_RTR, cm_ptr) != DAT_SUCCESS) { + IBV_QPS_RTR, + cm_ptr->msg.saddr.ib.qpn, + cm_ptr->msg.saddr.ib.lid, + NULL) != DAT_SUCCESS) { dapl_log(DAPL_DBG_TYPE_ERR, " ACCEPT_USR: QPS_RTR ERR %s -> %s\n", - 
strerror(errno), inet_ntoa(((struct sockaddr_in *) - &cm_ptr->dst.ia_address)-> - sin_addr)); + strerror(errno), + inet_ntoa(((struct sockaddr_in *) + &cm_ptr->msg.daddr.so)->sin_addr)); dapl_os_unlock(&ep_ptr->header.lock); goto bail; } if (dapls_modify_qp_state(ep_ptr->qp_handle, - IBV_QPS_RTS, cm_ptr) != DAT_SUCCESS) { + IBV_QPS_RTS, + cm_ptr->msg.saddr.ib.qpn, + cm_ptr->msg.saddr.ib.lid, + NULL) != DAT_SUCCESS) { dapl_log(DAPL_DBG_TYPE_ERR, " ACCEPT_USR: QPS_RTS ERR %s -> %s\n", - strerror(errno), inet_ntoa(((struct sockaddr_in *) - &cm_ptr->dst.ia_address)-> - sin_addr)); + strerror(errno), + inet_ntoa(((struct sockaddr_in *) + &cm_ptr->msg.daddr.so)->sin_addr)); dapl_os_unlock(&ep_ptr->header.lock); goto bail; } @@ -1050,53 +1073,50 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr, /* save remote address information */ dapl_os_memcpy(&ep_ptr->remote_ia_address, - &cm_ptr->dst.ia_address, - sizeof(ep_ptr->remote_ia_address)); + &cm_ptr->msg.daddr.so, + sizeof(union dcm_addr)); /* send our QP info, IA address, pdata. 
Don't overwrite dst data */ local.ver = htons(DCM_VER); - local.rej = 0; - local.qpn = htonl(ep_ptr->qp_handle->qp_num); - local.qp_type = htons(ep_ptr->qp_handle->qp_type); - local.port = htons(ia_ptr->hca_ptr->port_num); - local.lid = ia_ptr->hca_ptr->ib_trans.lid; - local.gid = ia_ptr->hca_ptr->ib_trans.gid; - local.ia_address = ia_ptr->hca_ptr->hca_address; - ((struct sockaddr_in *)&local.ia_address)->sin_port = - ntohs(cm_ptr->sp->conn_qual); - - local.p_size = htonl(p_size); + local.op = htons(DCM_REP); + local.saddr.ib.qpn = htonl(ep_ptr->qp_handle->qp_num); + local.saddr.ib.qp_type = ep_ptr->qp_handle->qp_type; + local.saddr.ib.port_num = ia_ptr->hca_ptr->port_num; + local.saddr.ib.lid = ia_ptr->hca_ptr->ib_trans.lid; + local.saddr.ib.gid = ia_ptr->hca_ptr->ib_trans.gid; + local.daddr.so = ia_ptr->hca_ptr->hca_address; + ((struct sockaddr_in *)&local.daddr.so)->sin_port = + htons((uint16_t)cm_ptr->sp->conn_qual); + + local.p_size = htons(p_size); iov[0].iov_base = (void *)&local; - iov[0].iov_len = sizeof(ib_qp_cm_t); + iov[0].iov_len = exp; if (p_size) { iov[1].iov_base = p_data; iov[1].iov_len = p_size; len = writev(cm_ptr->socket, iov, 2); - } else { + } else len = writev(cm_ptr->socket, iov, 1); - } - - if (len != (p_size + sizeof(ib_qp_cm_t))) { + + if (len != (p_size + exp)) { dapl_log(DAPL_DBG_TYPE_ERR, " ACCEPT_USR: ERR %s, wcnt=%d -> %s\n", - strerror(errno), len, inet_ntoa(((struct sockaddr_in *) - &cm_ptr->dst. 
- ia_address)-> - sin_addr)); + strerror(errno), len, + inet_ntoa(((struct sockaddr_in *) + &cm_ptr->msg.daddr.so)->sin_addr)); goto bail; } dapl_dbg_log(DAPL_DBG_TYPE_CM, - " ACCEPT_USR: local port=0x%x lid=0x%x" - " qpn=0x%x psize=%d\n", - ntohs(local.port), ntohs(local.lid), - ntohl(local.qpn), ntohl(local.p_size)); + " ACCEPT_USR: local port=%d lid=0x%x qpn=0x%x psz=%d\n", + local.saddr.ib.port_num, ntohs(local.saddr.ib.lid), + ntohl(local.saddr.ib.qpn), ntohs(local.p_size)); dapl_dbg_log(DAPL_DBG_TYPE_CM, - " ACCEPT_USR SRC GID subnet %016llx id %016llx\n", + " ACCEPT_USR: SRC GID subnet %016llx id %016llx\n", (unsigned long long) - htonll(local.gid.global.subnet_prefix), + htonll(local.saddr.ib.gid.global.subnet_prefix), (unsigned long long) - htonll(local.gid.global.interface_id)); + htonll(local.saddr.ib.gid.global.interface_id)); /* save state and reference to EP, queue for RTU data */ cm_ptr->ep = ep_ptr; @@ -1107,7 +1127,7 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr, return DAT_SUCCESS; bail: dapls_ib_cm_free(cm_ptr, cm_ptr->ep); - dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0); + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0, 0, 0); return DAT_INTERNAL_ERROR; } @@ -1117,16 +1137,15 @@ dapli_socket_accept_usr(DAPL_EP * ep_ptr, void dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr) { int len; - short rtu_data = 0; - /* complete handshake after final QP state change */ - len = recv(cm_ptr->socket, (char *)&rtu_data, sizeof(rtu_data), 0); - if (len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f) { + /* complete handshake after final QP state change, VER and OP */ + len = recv(cm_ptr->socket, (char *)&cm_ptr->msg, 4, 0); + if (len != 4 || ntohs(cm_ptr->msg.op) != DCM_RTU) { dapl_log(DAPL_DBG_TYPE_ERR, - " ACCEPT_RTU: ERR %s, rcnt=%d rdata=%x\n", - strerror(errno), len, ntohs(rtu_data), + " ACCEPT_RTU: rcv ERR, rcnt=%d op=%x\n", + len, ntohs(cm_ptr->msg.op), inet_ntoa(((struct sockaddr_in *) - &cm_ptr->dst.ia_address)->sin_addr)); + 
&cm_ptr->msg.daddr.so)->sin_addr)); goto bail; } @@ -1137,25 +1156,26 @@ void dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr) dapl_dbg_log(DAPL_DBG_TYPE_EP, " PASSIVE: connected!\n"); #ifdef DAT_EXTENSIONS - if (cm_ptr->dst.qp_type == IBV_QPT_UD) { + if (cm_ptr->msg.saddr.ib.qp_type == IBV_QPT_UD) { DAT_IB_EXTENSION_EVENT_DATA xevent; /* post EVENT, modify_qp created ah */ xevent.status = 0; xevent.type = DAT_IB_UD_PASSIVE_REMOTE_AH; xevent.remote_ah.ah = cm_ptr->ah; - xevent.remote_ah.qpn = cm_ptr->dst.qpn; + xevent.remote_ah.qpn = cm_ptr->msg.saddr.ib.qpn; dapl_os_memcpy(&xevent.remote_ah.ia_addr, - &cm_ptr->dst.ia_address, - sizeof(cm_ptr->dst.ia_address)); - - dapls_evd_post_connection_event_ext((DAPL_EVD *) cm_ptr->ep-> - param.connect_evd_handle, - DAT_IB_UD_CONNECTION_EVENT_ESTABLISHED, - (DAT_EP_HANDLE) cm_ptr->ep, - (DAT_COUNT) cm_ptr->dst.p_size, - (DAT_PVOID *) cm_ptr->p_data, - (DAT_PVOID *) &xevent); + &cm_ptr->msg.daddr.so, + sizeof(union dcm_addr)); + + dapls_evd_post_connection_event_ext( + (DAPL_EVD *) + cm_ptr->ep->param.connect_evd_handle, + DAT_IB_UD_CONNECTION_EVENT_ESTABLISHED, + (DAT_EP_HANDLE) cm_ptr->ep, + (DAT_COUNT) cm_ptr->msg.p_size, + (DAT_PVOID *) cm_ptr->msg.p_data, + (DAT_PVOID *) &xevent); /* done with socket, don't destroy cm_ptr, need pdata */ closesocket(cm_ptr->socket); @@ -1169,7 +1189,7 @@ void dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr) return; bail: - dapls_modify_qp_state(cm_ptr->ep->qp_handle, IBV_QPS_ERR, 0); + dapls_modify_qp_state(cm_ptr->ep->qp_handle, IBV_QPS_ERR, 0, 0, 0); dapls_ib_cm_free(cm_ptr, cm_ptr->ep); dapls_cr_callback(cm_ptr, IB_CME_DESTINATION_REJECT, NULL, cm_ptr->sp); } @@ -1237,7 +1257,7 @@ dapls_ib_disconnect(IN DAPL_EP * ep_ptr, IN DAT_CLOSE_FLAGS close_flags) "dapls_ib_disconnect(ep_handle %p ....)\n", ep_ptr); /* Transition to error state to flush queue */ - dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0); + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, 0, 0, 0); if 
(ep_ptr->cm_handle == NULL || ep_ptr->param.ep_state == DAT_EP_STATE_DISCONNECTED) @@ -1429,19 +1449,16 @@ dapls_ib_reject_connection(IN dp_ib_cm_handle_t cm_ptr, " reject(cm %p reason %x, pdata %p, psize %d)\n", cm_ptr, reason, pdata, psize); - if (psize > IB_MAX_REJ_PDATA_SIZE) + if (psize > DCM_MAX_PDATA_SIZE) return DAT_LENGTH_ERROR; /* write reject data to indicate reject */ if (cm_ptr->socket != DAPL_INVALID_SOCKET) { - cm_ptr->dst.rej = (uint16_t) reason; - cm_ptr->dst.rej = htons(cm_ptr->dst.rej); - cm_ptr->dst.p_size = htonl(psize); - /* get qp_type from request */ - cm_ptr->dst.qp_type = ntohs(cm_ptr->dst.qp_type); - - iov[0].iov_base = (void *)&cm_ptr->dst; - iov[0].iov_len = sizeof(ib_qp_cm_t); + cm_ptr->msg.op = htons(DCM_REJ_USER); + cm_ptr->msg.p_size = htons(psize); + + iov[0].iov_base = (void *)&cm_ptr->msg; + iov[0].iov_len = sizeof(ib_cm_msg_t) - DCM_MAX_PDATA_SIZE; if (psize) { iov[1].iov_base = pdata; iov[1].iov_len = psize; @@ -1457,10 +1474,7 @@ dapls_ib_reject_connection(IN dp_ib_cm_handle_t cm_ptr, /* cr_thread will destroy CR */ cm_ptr->state = DCM_DESTROY; - if (send(cm_ptr->hca->ib_trans.scm[1], "w", sizeof "w", 0) == -1) - dapl_log(DAPL_DBG_TYPE_CM, - " cm_destroy: thread wakeup error = %s\n", - strerror(errno)); + send(cm_ptr->hca->ib_trans.scm[1], "w", sizeof "w", 0); return DAT_SUCCESS; } @@ -1501,7 +1515,7 @@ dapls_ib_cm_remote_addr(IN DAT_HANDLE dat_handle, return DAT_INVALID_HANDLE; dapl_os_memcpy(remote_ia_address, - &ib_cm_handle->dst.ia_address, sizeof(DAT_SOCK_ADDR6)); + &ib_cm_handle->msg.daddr.so, sizeof(DAT_SOCK_ADDR6)); return DAT_SUCCESS; } @@ -1533,38 +1547,16 @@ int dapls_ib_private_data_size(IN DAPL_PRIVATE * prd_ptr, int size; switch (conn_op) { - case DAPL_PDATA_CONN_REQ: - { - size = IB_MAX_REQ_PDATA_SIZE; - break; - } - case DAPL_PDATA_CONN_REP: - { - size = IB_MAX_REP_PDATA_SIZE; - break; - } - case DAPL_PDATA_CONN_REJ: - { - size = IB_MAX_REJ_PDATA_SIZE; + case DAPL_PDATA_CONN_REQ: + case DAPL_PDATA_CONN_REP: + 
case DAPL_PDATA_CONN_REJ: + case DAPL_PDATA_CONN_DREQ: + case DAPL_PDATA_CONN_DREP: + size = DCM_MAX_PDATA_SIZE; break; - } - case DAPL_PDATA_CONN_DREQ: - { - size = IB_MAX_DREQ_PDATA_SIZE; - break; - } - case DAPL_PDATA_CONN_DREP: - { - size = IB_MAX_DREP_PDATA_SIZE; - break; - } - default: - { + default: size = 0; - } - - } /* end case */ - + } return size; } @@ -1717,27 +1709,26 @@ void cr_thread(void *arg) continue; event = (cr->state == DCM_CONN_PENDING) ? - DAPL_FD_WRITE : DAPL_FD_READ; + DAPL_FD_WRITE : DAPL_FD_READ; + if (dapl_fd_set(cr->socket, set, event)) { dapl_log(DAPL_DBG_TYPE_ERR, " cr_thread: DESTROY CR st=%d fd %d" " -> %s\n", cr->state, cr->socket, inet_ntoa(((struct sockaddr_in *) - &cr->dst.ia_address)-> - sin_addr)); + &cr->msg.daddr.so)->sin_addr)); dapls_ib_cm_free(cr, cr->ep); continue; } dapl_dbg_log(DAPL_DBG_TYPE_CM, - " poll cr=%p, socket=%d\n", cr, - cr->socket); + " poll cr=%p, sck=%d\n", cr, cr->socket); dapl_os_unlock(&hca_ptr->ib_trans.lock); ret = dapl_poll(cr->socket, event); dapl_dbg_log(DAPL_DBG_TYPE_CM, - " poll ret=0x%x cr->state=%d socket=%d\n", + " poll ret=0x%x cr->state=%d sck=%d\n", ret, cr->state, cr->socket); /* data on listen, qp exchange, and on disc req */ @@ -1783,7 +1774,7 @@ void cr_thread(void *arg) " poll=%d cr->st=%s sk=%d ep %p, %d\n", ret, dapl_cm_state_str(cr->state), cr->socket, cr->ep, - cr->ep ? cr->ep->param.ep_state:0); + cr->ep ? cr->ep->param.ep_state : 0); dapli_socket_disconnect(cr); } dapl_os_lock(&hca_ptr->ib_trans.lock); @@ -1846,17 +1837,17 @@ void dapls_print_cm_list(IN DAPL_IA *ia_ptr) printf( " CONN[%d]: sp %p ep %p sock %d %s %s %s %s %d\n", i, cr->sp, cr->ep, cr->socket, - cr->dst.qp_type == IBV_QPT_RC ? "RC" : "UD", + cr->msg.saddr.ib.qp_type == IBV_QPT_RC ? "RC" : "UD", dapl_cm_state_str(cr->state), cr->sp ? "<-" : "->", cr->state == DCM_LISTEN ? 
inet_ntoa(((struct sockaddr_in *) &ia_ptr->hca_ptr->hca_address)->sin_addr) : inet_ntoa(((struct sockaddr_in *) - &cr->dst.ia_address)->sin_addr), + &cr->msg.daddr.so)->sin_addr), cr->sp ? (int)cr->sp->conn_qual : ntohs(((struct sockaddr_in *) - &cr->dst.ia_address)->sin_port)); + &cr->msg.daddr.so)->sin_port)); i++; } printf("\n"); diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h index 933364c..d6950fa 100644 --- a/dapl/openib_scm/dapl_ib_util.h +++ b/dapl/openib_scm/dapl_ib_util.h @@ -40,8 +40,7 @@ struct ib_cm_handle struct dapl_hca *hca; struct dapl_sp *sp; struct dapl_ep *ep; - ib_qp_cm_t dst; - unsigned char p_data[256]; /* must follow ib_qp_cm_t */ + ib_cm_msg_t msg; struct ibv_ah *ah; }; @@ -66,15 +65,6 @@ typedef dp_ib_cm_handle_t ib_cm_srvc_handle_t; #define SCM_HOP_LIMIT 0xff #define SCM_TCLASS 0 -/* CM private data areas */ -#define IB_MAX_REQ_PDATA_SIZE 92 -#define IB_MAX_REP_PDATA_SIZE 196 -#define IB_MAX_REJ_PDATA_SIZE 148 -#define IB_MAX_DREQ_PDATA_SIZE 220 -#define IB_MAX_DREP_PDATA_SIZE 224 -#define IB_MAX_RTU_PDATA_SIZE 224 - - /* ib_hca_transport_t, specific to this implementation */ typedef struct _ib_hca_transport { @@ -120,11 +110,8 @@ void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr); void dapli_async_event_cb(struct _ib_hca_transport *tp); void dapli_cq_event_cb(struct _ib_hca_transport *tp); DAT_RETURN dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr); -void dapls_print_cm_list(IN DAPL_IA *ia_ptr); dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep); void dapls_ib_cm_free(dp_ib_cm_handle_t cm, DAPL_EP *ep); -DAT_RETURN dapls_modify_qp_state(IN ib_qp_handle_t qp_handle, - IN ib_qp_state_t qp_state, - IN dp_ib_cm_handle_t cm); +void dapls_print_cm_list(IN DAPL_IA *ia_ptr); #endif /* _DAPL_IB_UTIL_H_ */ diff --git a/dapl/openib_ucm/README b/dapl/openib_ucm/README new file mode 100644 index 0000000..239dfe6 --- /dev/null +++ b/dapl/openib_ucm/README @@ -0,0 +1,40 @@ + +OpenIB uDAPL provider using socket-based CM, 
in lieu of uCM/uAT, to set up QP/channels. + +to build: + +cd dapl/udapl +make VERBS=openib_scm clean +make VERBS=openib_scm + + +Modifications to common code: + +- added dapl/openib_scm directory + + dapl/udapl/Makefile + +New files for openib_scm provider + + dapl/openib/dapl_ib_cq.c + dapl/openib/dapl_ib_dto.h + dapl/openib/dapl_ib_mem.c + dapl/openib/dapl_ib_qp.c + dapl/openib/dapl_ib_util.c + dapl/openib/dapl_ib_util.h + dapl/openib/dapl_ib_cm.c + +A simple dapl test just for openib_scm testing... + + test/dtest/dtest.c + test/dtest/makefile + + server: dtest -s + client: dtest -h hostname + +known issues: + + no memory windows support in ibverbs, dat_create_rmr fails. + + + diff --git a/dapl/openib_ucm/SOURCES b/dapl/openib_ucm/SOURCES new file mode 100644 index 0000000..dfe956f --- /dev/null +++ b/dapl/openib_ucm/SOURCES @@ -0,0 +1,53 @@ +!if $(FREEBUILD) +TARGETNAME=dapl2-ofa-ucm +!else +TARGETNAME=dapl2-ofa-ucmd +!endif + +TARGETPATH = ..\..\..\..\bin\user\obj$(BUILD_ALT_DIR) +TARGETTYPE = DYNLINK +DLLENTRY = _DllMainCRTStartup + +!if $(_NT_TOOLS_VERSION) == 0x700 +DLLDEF=$O\udapl_ofa_ucm_exports.def +!else +DLLDEF=$(OBJ_PATH)\$O\udapl_ofa_ucm_exports.def +!endif + +USE_MSVCRT = 1 + +SOURCES = \ + udapl.rc \ + ..\dapl_common_src.c \ + ..\dapl_udapl_src.c \ + dapl_ib_cq.c \ + dapl_ib_extensions.c \ + dapl_ib_mem.c \ + dapl_ib_qp.c \ + dapl_ib_util.c \ + dapl_ib_cm.c + +INCLUDES = ..\include;..\common;windows;..\..\dat\include;\ + ..\..\dat\udat\windows;..\udapl\windows;\ + ..\..\..\..\inc;..\..\..\..\inc\user;..\..\..\libibverbs\include + +DAPL_OPTS = -DEXPORT_DAPL_SYMBOLS -DDAT_EXTENSIONS -DSOCK_CM -DOPENIB -DCQ_WAIT_OBJECT + +USER_C_FLAGS = $(USER_C_FLAGS) $(DAPL_OPTS) + +!if !$(FREEBUILD) +USER_C_FLAGS = $(USER_C_FLAGS) -DDAPL_DBG +!endif + +TARGETLIBS= \ + $(SDK_LIB_PATH)\kernel32.lib \ + $(SDK_LIB_PATH)\ws2_32.lib \ +!if $(FREEBUILD) + $(TARGETPATH)\*\dat2.lib \ + $(TARGETPATH)\*\libibverbs.lib +!else + $(TARGETPATH)\*\dat2d.lib \ + 
$(TARGETPATH)\*\libibverbsd.lib +!endif + +MSC_WARNING_LEVEL = /W1 /wd4113 diff --git a/dapl/openib_ucm/cm.c b/dapl/openib_ucm/cm.c new file mode 100644 index 0000000..ab3823e --- /dev/null +++ b/dapl/openib_ucm/cm.c @@ -0,0 +1,1837 @@ +/* + * Copyright (c) 2009 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_evd_util.h" +#include "dapl_cr_util.h" +#include "dapl_name_service.h" +#include "dapl_ib_util.h" +#include "dapl_osd.h" + + +#if defined(_WIN32) || defined(_WIN64) +enum DAPL_FD_EVENTS { + DAPL_FD_READ = 0x1, + DAPL_FD_WRITE = 0x2, + DAPL_FD_ERROR = 0x4 +}; + +struct dapl_fd_set { + struct fd_set set[3]; +}; + +static struct dapl_fd_set *dapl_alloc_fd_set(void) +{ + return dapl_os_alloc(sizeof(struct dapl_fd_set)); +} + +static void dapl_fd_zero(struct dapl_fd_set *set) +{ + FD_ZERO(&set->set[0]); + FD_ZERO(&set->set[1]); + FD_ZERO(&set->set[2]); +} + +static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set, + enum DAPL_FD_EVENTS event) +{ + FD_SET(s, &set->set[(event == DAPL_FD_READ) ? 0 : 1]); + FD_SET(s, &set->set[2]); + return 0; +} + +static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum DAPL_FD_EVENTS event) +{ + struct fd_set rw_fds; + struct fd_set err_fds; + struct timeval tv; + int ret; + + FD_ZERO(&rw_fds); + FD_ZERO(&err_fds); + FD_SET(s, &rw_fds); + FD_SET(s, &err_fds); + + tv.tv_sec = 0; + tv.tv_usec = 0; + + if (event == DAPL_FD_READ) + ret = select(1, &rw_fds, NULL, &err_fds, &tv); + else + ret = select(1, NULL, &rw_fds, &err_fds, &tv); + + if (ret == 0) + return 0; + else if (ret == SOCKET_ERROR) + return WSAGetLastError(); + else if (FD_ISSET(s, &rw_fds)) + return event; + else + return DAPL_FD_ERROR; +} + +static int dapl_select(struct dapl_fd_set *set) +{ + int ret; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, " dapl_select: sleep\n"); + ret = select(0, &set->set[0], &set->set[1], &set->set[2], NULL); + dapl_dbg_log(DAPL_DBG_TYPE_CM, " dapl_select: wakeup\n"); + + if (ret == SOCKET_ERROR) + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " dapl_select: error 0x%x\n", WSAGetLastError()); + + return ret; +} +#else // _WIN32 || _WIN64 +enum DAPL_FD_EVENTS { + DAPL_FD_READ = POLLIN, + DAPL_FD_WRITE = POLLOUT, + DAPL_FD_ERROR = POLLERR +}; + +struct dapl_fd_set { + int index; + struct pollfd 
set[DAPL_FD_SETSIZE]; +}; + +static struct dapl_fd_set *dapl_alloc_fd_set(void) +{ + return dapl_os_alloc(sizeof(struct dapl_fd_set)); +} + +static void dapl_fd_zero(struct dapl_fd_set *set) +{ + set->index = 0; +} + +static int dapl_fd_set(DAPL_SOCKET s, struct dapl_fd_set *set, + enum DAPL_FD_EVENTS event) +{ + if (set->index == DAPL_FD_SETSIZE - 1) { + dapl_log(DAPL_DBG_TYPE_ERR, + "SCM ERR: cm_thread exceeded FD_SETSIZE %d\n", + set->index + 1); + return -1; + } + + set->set[set->index].fd = s; + set->set[set->index].revents = 0; + set->set[set->index++].events = event; + return 0; +} + +static enum DAPL_FD_EVENTS dapl_poll(DAPL_SOCKET s, enum DAPL_FD_EVENTS event) +{ + struct pollfd fds; + int ret; + + fds.fd = s; + fds.events = event; + fds.revents = 0; + ret = poll(&fds, 1, 0); + dapl_log(DAPL_DBG_TYPE_CM, " dapl_poll: fd=%d ret=%d, evnts=0x%x\n", + s, ret, fds.revents); + if (ret == 0) + return 0; + else if (fds.revents & (POLLERR | POLLHUP | POLLNVAL)) + return DAPL_FD_ERROR; + else + return fds.revents; +} + +static int dapl_select(struct dapl_fd_set *set) +{ + int ret; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, " dapl_select: sleep, fds=%d\n", + set->index); + ret = poll(set->set, set->index, -1); + dapl_dbg_log(DAPL_DBG_TYPE_CM, " dapl_select: wakeup, ret=0x%x\n", ret); + return ret; +} +#endif + +/* forward declarations */ +static void ucm_accept(ib_cm_srvc_handle_t cm, ib_cm_msg_t *msg); +static void ucm_connect_rtu(dp_ib_cm_handle_t cm, ib_cm_msg_t *msg); +static void ucm_accept_rtu(dp_ib_cm_handle_t cm, ib_cm_msg_t *msg); +static int ucm_send(ib_hca_transport_t *tp, ib_cm_msg_t *msg); +DAT_RETURN dapli_cm_disconnect(dp_ib_cm_handle_t cm); + +#define UCM_SND_BURST 100 + +/* Service ids - port space */ +static uint16_t ucm_get_port(ib_hca_transport_t *tp, uint16_t port) +{ + int i = 0; + + dapl_os_lock(&tp->plock); + /* get specific ID */ + if (port) { + if (tp->sid[port] == 0) { + tp->sid[port] = 1; + i = port; + } + goto done; + } + + /* get any free ID */ 
+ for (i = 0xffff; i > 0; i--) { + if (tp->sid[i] == 0) { + tp->sid[i] = 1; + break; + } + } +done: + dapl_os_unlock(&tp->plock); + return i; +} + +static void ucm_free_port(ib_hca_transport_t *tp, uint16_t port) +{ + dapl_os_lock(&tp->plock); + tp->sid[port] = 0; + dapl_os_unlock(&tp->plock); +} + +/* SEND CM MESSAGE PROCESSING */ + +/* Get CM UD message from send queue, called with s_lock held */ +static ib_cm_msg_t *ucm_get_smsg(ib_hca_transport_t *tp) +{ + ib_cm_msg_t *msg = NULL; + int ret, polled = 0, hd = tp->s_hd; + + hd++; +retry: + if (hd == tp->qpe) + hd = 0; + + if (hd == tp->s_tl) + msg = NULL; + else { + msg = &tp->sbuf[hd]; + tp->s_hd = hd; /* new hd */ + } + + /* if empty, process some completions */ + if ((msg == NULL) && (!polled)) { + struct ibv_wc wc; + + /* process completions, based on UCM_SND_BURST */ + ret = ibv_poll_cq(tp->scq, 1, &wc); + if (ret < 0) { + dapl_log(DAPL_DBG_TYPE_WARN, + " get_smsg: cq %p %s\n", + tp->scq, strerror(errno)); + } + /* free up completed sends, update tail */ + if (ret > 0) { + tp->s_tl = (int)wc.wr_id; + dapl_log(DAPL_DBG_TYPE_CM, + " get_smsg: wr_cmp (%d) s_tl=%d\n", + wc.status, tp->s_tl); + } + polled++; + goto retry; + } + return msg; +} + +/* RECEIVE CM MESSAGE PROCESSING */ + +static int ucm_post_rmsg(ib_hca_transport_t *tp, ib_cm_msg_t *msg) +{ + struct ibv_recv_wr recv_wr, *recv_err; + struct ibv_sge sge; + + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + recv_wr.wr_id = (uint64_t)(uintptr_t) msg; + sge.length = sizeof(ib_cm_msg_t) + sizeof(struct ibv_grh); + sge.lkey = tp->mr_rbuf->lkey; + sge.addr = (uintptr_t)((char *)msg - sizeof(struct ibv_grh)); + + return (ibv_post_recv(tp->qp, &recv_wr, &recv_err)); +} + +static int ucm_reject(ib_hca_transport_t *tp, ib_cm_msg_t *msg) +{ + ib_cm_msg_t smsg; + + /* setup op, rearrange the src, dst cm and addr info */ + (void)dapl_os_memzero(&smsg, sizeof(smsg)); + smsg.ver = htons(DCM_VER); + smsg.op = htons(DCM_REJ_CM); + smsg.dport = 
msg->sport; + smsg.dqpn = msg->sqpn; + smsg.sport = msg->dport; + smsg.sqpn = msg->dqpn; + + dapl_os_memcpy(&smsg.daddr, &msg->saddr, sizeof(union dcm_addr)); + dapl_os_memcpy(&smsg.saddr, &msg->daddr, sizeof(union dcm_addr)); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " CM reject -> LID %x, QPN %x PORT %d\n", + ntohs(smsg.daddr.ib.lid), + ntohl(smsg.dqpn), ntohs(smsg.dport)); + + return (ucm_send(tp, &smsg)); +} + +static void ucm_process_recv(ib_hca_transport_t *tp, + ib_cm_msg_t *msg, + dp_ib_cm_handle_t cm) +{ + dapl_os_lock(&cm->lock); + switch (cm->state) { + case DCM_LISTEN: + dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: LISTEN\n"); + dapl_os_unlock(&cm->lock); + ucm_accept(cm, msg); + break; + case DCM_ACCEPTED: + dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: ACCEPT_RTU\n"); + dapl_os_unlock(&cm->lock); + ucm_accept_rtu(cm, msg); + break; + case DCM_CONN_PENDING: + dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: CONN_RTU\n"); + dapl_os_unlock(&cm->lock); + ucm_connect_rtu(cm, msg); + break; + case DCM_CONNECTED: + dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: DREQ connect\n"); + dapl_os_unlock(&cm->lock); + if (ntohs(msg->op) == DCM_DREQ) + dapli_cm_disconnect(cm); + break; + case DCM_DISC_PENDING: + case DCM_DESTROY: + dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: DREQ toss\n"); + dapl_os_unlock(&cm->lock); /* release cm->lock on every path */ + break; + default: + dapl_log(DAPL_DBG_TYPE_WARN, + " process_recv: UNKNOWN state" + " <- op %d, st %d spsp %d sqpn %d\n", + ntohs(msg->op), cm->state, + ntohs(msg->sport), ntohl(msg->sqpn)); + dapl_os_unlock(&cm->lock); + break; + } +} + +/* Find matching CM object for this receive message, return CM reference */ +dp_ib_cm_handle_t ucm_cm_find(ib_hca_transport_t *tp, ib_cm_msg_t *msg) +{ + dp_ib_cm_handle_t cm, next, found = NULL; + struct dapl_llist_entry *list; + DAPL_OS_LOCK lock; + + /* connect request - listen list, otherwise conn list */ + if (ntohs(msg->op) == DCM_REQ) { + dapl_dbg_log(DAPL_DBG_TYPE_CM," search - listenQ\n"); + list = tp->llist; + lock = tp->llock; + } else { + 
dapl_dbg_log(DAPL_DBG_TYPE_CM," search - connectQ\n"); + list = tp->list; + lock = tp->lock; + } + + dapl_os_lock(&lock); + if (!dapl_llist_is_empty(&list)) + next = dapl_llist_peek_head(&list); + else + next = NULL; + + while (next) { + cm = next; + next = dapl_llist_next_entry(&list, + (DAPL_LLIST_ENTRY *)&cm->entry); + if (cm->state == DCM_DESTROY) + continue; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " MATCH? cm %p st %s sport %x sqpn %x lid %x\n", + cm, dapl_cm_state_str(cm->state), + ntohs(cm->msg.sport), ntohl(cm->msg.sqpn), + ntohs(cm->msg.saddr.ib.lid)); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " src port %d=%d, sqp %x=%x slid %x=%x, iqp %x=%x\n", + ntohs(cm->msg.sport), ntohs(msg->dport), + ntohl(cm->msg.sqpn), ntohl(msg->dqpn), + ntohs(cm->msg.saddr.ib.lid), + ntohs(msg->daddr.ib.lid), + ntohl(cm->msg.saddr.ib.qpn), + ntohl(msg->daddr.ib.qpn)); + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " dst port %d=%d, sqp %x=%x slid %x=%x, iqp %x=%x\n", + ntohs(cm->msg.dport), ntohs(msg->sport), + ntohl(cm->msg.dqpn), ntohl(msg->sqpn), + ntohs(cm->msg.daddr.ib.lid), + ntohs(msg->saddr.ib.lid), + ntohl(cm->msg.daddr.ib.qpn), + ntohl(msg->saddr.ib.qpn)); + + /* REQ: CM sPORT + QPN, match is good enough */ + if ((cm->msg.sport == msg->dport) && + (cm->msg.sqpn == msg->dqpn)) { + if (ntohs(msg->op) == DCM_REQ) { + found = cm; + break; + /* NOT REQ: add remote CM sPORT, QPN, LID match */ + } else if ((cm->msg.dport == msg->sport) && + (cm->msg.dqpn == msg->sqpn) && + (cm->msg.daddr.ib.lid == + msg->saddr.ib.lid)) { + found = cm; + break; + } + } + } + dapl_os_unlock(&lock); + return found; +} + +/* Get rmsgs from CM completion queue, 10 at a time */ +static void ucm_recv(ib_hca_transport_t *tp) +{ + struct ibv_wc wc[10]; + ib_cm_msg_t *msg; + dp_ib_cm_handle_t cm; + int i, ret, notify = 0; + struct ibv_cq *ibv_cq = NULL; + DAPL_HCA *hca; + + /* POLLIN on channel FD */ + ret = ibv_get_cq_event(tp->rch, &ibv_cq, (void *)&hca); + if (ret == 0) { + ibv_ack_cq_events(ibv_cq, 1); + } +retry: + 
ret = ibv_poll_cq(tp->rcq, 10, wc); + if (ret <= 0) { + if (!ret && !notify) { + ibv_req_notify_cq(tp->rcq, 0); + notify = 1; + goto retry; + } + return; + } else + notify = 0; + + for (i = 0; i < ret; i++) { + msg = (ib_cm_msg_t*)wc[i].wr_id; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " ucm_recv: wc status=%d, ln=%d id=%p sqp=%x\n", + wc[i].status, wc[i].byte_len, + (void*)wc[i].wr_id, wc[i].src_qp); + + /* validate CM message, version */ + if (ntohs(msg->ver) != DCM_VER) { + dapl_log(DAPL_DBG_TYPE_WARN, + " ucm_recv: UNKNOWN msg %p, ver %d\n", + msg, msg->ver); + ucm_post_rmsg(tp, msg); + continue; + } + if (!(cm = ucm_cm_find(tp, msg))) { + dapl_log(DAPL_DBG_TYPE_CM, + " ucm_recv: NO MATCH op %d port %d cqp %x\n", + ntohs(msg->op), ntohs(msg->dport), + ntohl(msg->dqpn)); + if (ntohs(msg->op) == DCM_REQ) + ucm_reject(tp, msg); + ucm_post_rmsg(tp, msg); + continue; + } + dapl_dbg_log(DAPL_DBG_TYPE_CM, " ucm_recv: match %p\n",cm); + + /* match, process it */ + ucm_process_recv(tp, msg, cm); + ucm_post_rmsg(tp, msg); + } + + /* finished this batch of WC's, poll and rearm */ + goto retry; + +} + +/* ACTIVE/PASSIVE: build and send CM message out of CM object */ +static int ucm_send(ib_hca_transport_t *tp, ib_cm_msg_t *msg) +{ + ib_cm_msg_t *smsg = NULL; + struct ibv_send_wr wr, *bad_wr; + struct ibv_sge sge; + int len, ret = -1; + uint16_t dlid = ntohs(msg->daddr.ib.lid); + + /* Get message from send queue, copy data, and send */ + dapl_os_lock(&tp->slock); + if ((smsg = ucm_get_smsg(tp)) == NULL) + goto bail; + + len = ((sizeof(*msg) - DCM_MAX_PDATA_SIZE) + ntohs(msg->p_size)); + dapl_os_memcpy(smsg, msg, len); + + wr.next = NULL; + wr.sg_list = &sge; + wr.num_sge = 1; + wr.opcode = IBV_WR_SEND; + wr.wr_id = (unsigned long)tp->s_hd; + wr.send_flags = (wr.wr_id % UCM_SND_BURST) ? 
0 : IBV_SEND_SIGNALED; + if (len <= tp->max_inline_send) + wr.send_flags |= IBV_SEND_INLINE; + + sge.length = len; + sge.lkey = tp->mr_sbuf->lkey; + sge.addr = (uintptr_t)smsg; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " ucm_send: op %d ln %d lid %x c_qpn %x rport %d\n", + ntohs(smsg->op), len, htons(smsg->daddr.ib.lid), + htonl(smsg->dqpn), htons(smsg->dport)); + + /* empty slot, then create AH */ + if (!tp->ah[dlid]) { + tp->ah[dlid] = + dapls_create_ah(tp->hca, tp->pd, tp->qp, + htons(dlid), NULL); + if (!tp->ah[dlid]) + goto bail; + } + + wr.wr.ud.ah = tp->ah[dlid]; + wr.wr.ud.remote_qpn = ntohl(smsg->dqpn); + wr.wr.ud.remote_qkey = DAT_UD_QKEY; + + ret = ibv_post_send(tp->qp, &wr, &bad_wr); +bail: + dapl_os_unlock(&tp->slock); + return ret; +} + +/* ACTIVE/PASSIVE: CM objects */ +dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep) +{ + dp_ib_cm_handle_t cm; + + /* Allocate CM, init lock, and initialize */ + if ((cm = dapl_os_alloc(sizeof(*cm))) == NULL) + return NULL; + + (void)dapl_os_memzero(cm, sizeof(*cm)); + if (dapl_os_lock_init(&cm->lock)) + goto bail; + + cm->msg.ver = htons(DCM_VER); + + /* ACTIVE: init source address QP info from local EP */ + if (ep) { + DAPL_HCA *hca = ep->header.owner_ia->hca_ptr; + + cm->msg.sport = htons(ucm_get_port(&hca->ib_trans, 0)); + if (!cm->msg.sport) + goto bail; + + /* IB info in network order */ + cm->ep = ep; + cm->hca = hca; + cm->msg.sqpn = htonl(hca->ib_trans.qp->qp_num); /* ucm */ + cm->msg.saddr.ib.qpn = htonl(ep->qp_handle->qp_num); /* ep */ + cm->msg.saddr.ib.qp_type = ep->qp_handle->qp_type; + cm->msg.saddr.ib.port_num = hca->port_num; + cm->msg.saddr.ib.lid = hca->ib_trans.addr.ib.lid; + cm->msg.saddr.ib.gid = hca->ib_trans.addr.ib.gid; + } + return cm; +bail: + dapl_os_free(cm, sizeof(*cm)); + return NULL; +} + +/* + * UD CR objects are kept active because of direct private data references + * from CONN events. 
The cr->socket is closed and marked inactive but the + * object remains allocated and queued on the CR resource list. There can + * be multiple CR's associated with a given EP. There is no way to determine + * when consumer is finished with event until the dat_ep_free. + * + * Schedule destruction for all CR's associated with this EP, cr_thread will + * complete the cleanup with state == DCM_DESTROY. + */ +static void ucm_ud_free(DAPL_EP *ep) +{ + DAPL_IA *ia = ep->header.owner_ia; + DAPL_HCA *hca = NULL; + ib_hca_transport_t *tp = &ia->hca_ptr->ib_trans; + dp_ib_cm_handle_t cr, next; + + dapl_os_lock(&tp->lock); + if (!dapl_llist_is_empty((DAPL_LLIST_HEAD*)&tp->list)) + next = dapl_llist_peek_head((DAPL_LLIST_HEAD*)&tp->list); + else + next = NULL; + + while (next) { + cr = next; + next = dapl_llist_next_entry((DAPL_LLIST_HEAD*)&tp->list, + (DAPL_LLIST_ENTRY*)&cr->entry); + if (cr->ep == ep) { + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " qp_free CR: ep %p cr %p\n", ep, cr); + dapl_os_lock(&cr->lock); + hca = cr->hca; + cr->ep = NULL; + cr->state = DCM_DESTROY; + dapl_os_unlock(&cr->lock); + } + } + dapl_os_unlock(&tp->lock); + + /* wakeup work thread if necessary */ + if (hca) + send(tp->scm[1], "w", sizeof "w", 0); +} + +/* mark for destroy, remove all references, schedule cleanup */ +/* cm_ptr == NULL (UD), then multi CR's, kill all associated with EP */ +void dapls_ib_cm_free(dp_ib_cm_handle_t cm, DAPL_EP *ep) +{ + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " cm_destroy: cm %p ep %p\n", cm, ep); + + if (!cm && ep) + return (ucm_ud_free(ep)); + + dapl_os_lock(&cm->lock); + + /* client, release local conn id port (sport is in network order) */ + if (!cm->sp && cm->msg.sport) + ucm_free_port(&cm->hca->ib_trans, ntohs(cm->msg.sport)); + + /* cleanup, never made it to work queue */ + if (cm->state == DCM_INIT) { + dapl_os_unlock(&cm->lock); + dapl_os_free(cm, sizeof(*cm)); + return; + } + + /* free could be called before disconnect, disc_clean will destroy */ + if (cm->state == DCM_CONNECTED) { + 
dapl_os_unlock(&cm->lock); + dapli_cm_disconnect(cm); + return; + } + + cm->state = DCM_DESTROY; + if ((cm->ep) && (cm->ep->cm_handle == cm)) { + cm->ep->cm_handle = IB_INVALID_HANDLE; + cm->ep = NULL; + } + + dapl_os_unlock(&cm->lock); + + /* wakeup work thread */ + send(cm->hca->ib_trans.scm[1], "w", sizeof "w", 0); +} + +/* ACTIVE/PASSIVE: queue up connection object on CM list */ +static void ucm_queue_conn(dp_ib_cm_handle_t cm) +{ + /* add to work queue, list, for cm thread processing */ + dapl_llist_init_entry((DAPL_LLIST_ENTRY *)&cm->entry); + dapl_os_lock(&cm->hca->ib_trans.lock); + dapl_llist_add_tail(&cm->hca->ib_trans.list, + (DAPL_LLIST_ENTRY *)&cm->entry, cm); + dapl_os_unlock(&cm->hca->ib_trans.lock); +} + +/* PASSIVE: queue up listen object on listen list */ +static void ucm_queue_listen(dp_ib_cm_handle_t cm) +{ + /* add to work queue, llist, for cm thread processing */ + dapl_llist_init_entry((DAPL_LLIST_ENTRY *)&cm->entry); + dapl_os_lock(&cm->hca->ib_trans.llock); + dapl_llist_add_tail(&cm->hca->ib_trans.llist, + (DAPL_LLIST_ENTRY *)&cm->entry, cm); + dapl_os_unlock(&cm->hca->ib_trans.llock); +} + +static void ucm_dequeue_listen(dp_ib_cm_handle_t cm) { + dapl_os_lock(&cm->hca->ib_trans.llock); + dapl_llist_remove_entry(&cm->hca->ib_trans.llist, + (DAPL_LLIST_ENTRY *)&cm->entry); + dapl_os_unlock(&cm->hca->ib_trans.llock); +} + +/* + * ACTIVE/PASSIVE: called from CR thread or consumer via ep_disconnect + * or from ep_free + */ +DAT_RETURN dapli_cm_disconnect(dp_ib_cm_handle_t cm) +{ + DAPL_EP *ep = cm->ep; + + if (ep == NULL) + return DAT_SUCCESS; + + dapl_os_lock(&cm->lock); + if ((cm->state == DCM_INIT) || + (cm->state == DCM_DISC_PENDING) || + (cm->state == DCM_DISCONNECTED) || + (cm->state == DCM_DESTROY)) { + dapl_os_unlock(&cm->lock); + return DAT_SUCCESS; + } else { + /* send disc, schedule destroy */ + cm->msg.op = htons(DCM_DREQ); + if (ucm_send(&cm->hca->ib_trans, &cm->msg)) { + dapl_log(DAPL_DBG_TYPE_WARN, + " disc_req: ERR-> %s lid %d 
qpn %d" + " r_psp %d \n", strerror(errno), + htons(cm->msg.saddr.ib.lid), + htonl(cm->msg.saddr.ib.qpn), + htons(cm->msg.sport)); + } + cm->state = DCM_DISC_PENDING; + } + dapl_os_unlock(&cm->lock); + + /* disconnect events for RC's only */ + if (ep->param.ep_attr.service_type == DAT_SERVICE_TYPE_RC) { + if (ep->cr_ptr) { + dapls_cr_callback(cm, + IB_CME_DISCONNECTED, + NULL, + ((DAPL_CR *)ep->cr_ptr)->sp_ptr); + } else { + dapl_evd_connection_callback(ep->cm_handle, + IB_CME_DISCONNECTED, + NULL, ep); + } + } + + /* scheduled destroy via disconnect clean in callback */ + return DAT_SUCCESS; +} + +/* + * ACTIVE: get remote CM SID server info from r_addr. + * send, or resend CM msg via UD CM QP + */ +DAT_RETURN +dapli_cm_connect(DAPL_EP *ep, dp_ib_cm_handle_t cm) +{ + dapl_log(DAPL_DBG_TYPE_EP, + " connect: lid %x qpn %x lport %d p_sz=%d -> " + " lid %x c_qpn %x rport %d\n", + htons(cm->msg.saddr.ib.lid), htonl(cm->msg.saddr.ib.qpn), + htons(cm->msg.sport), htons(cm->msg.p_size), + htons(cm->msg.daddr.ib.lid), htonl(cm->msg.dqpn), + htons(cm->msg.dport)); + + dapl_os_lock(&cm->lock); + if (cm->state == DCM_INIT) + cm->state = DCM_CONN_PENDING; + else if (++cm->retries == DCM_RETRY_CNT) { + dapl_log(DAPL_DBG_TYPE_WARN, + " connect: RETRIES EXHAUSTED -> lid %d qpn %d r_psp" + " %d p_sz=%d\n", + htons(cm->msg.daddr.ib.lid), + htonl(cm->msg.dqpn), htons(cm->msg.dport), + htons(cm->msg.p_size)); + + /* update ep->cm reference so we get cleaned up on callback */ + if (cm->msg.saddr.ib.qp_type == IBV_QPT_RC) + ep->cm_handle = cm; + + dapl_os_unlock(&cm->lock); + dapl_evd_connection_callback(cm, + IB_CME_DESTINATION_UNREACHABLE, + NULL, ep); + + return DAT_ERROR(DAT_INVALID_ADDRESS, + DAT_INVALID_ADDRESS_UNREACHABLE); + } + dapl_os_unlock(&cm->lock); + + cm->msg.op = htons(DCM_REQ); + if (ucm_send(&cm->hca->ib_trans, &cm->msg)) + goto bail; + + /* first time through, put on work queue */ + if (!cm->retries) + ucm_queue_conn(cm); + + return DAT_SUCCESS; + 
+bail: + dapl_log(DAPL_DBG_TYPE_ERR, + " connect: ERR %s -> cm_lid %d cm_qpn %d r_psp %d p_sz=%d\n", + strerror(errno), htons(cm->msg.daddr.ib.lid), + htonl(cm->msg.dqpn), htons(cm->msg.dport), + htons(cm->msg.p_size)); + + /* free cm structure */ + dapls_ib_cm_free(cm, cm->ep); + return DAT_INSUFFICIENT_RESOURCES; +} + +/* + * ACTIVE: exchange QP information, called from CR thread + */ +static void ucm_connect_rtu(dp_ib_cm_handle_t cm, ib_cm_msg_t *msg) +{ + DAPL_EP *ep = cm->ep; + ib_cm_events_t event = IB_CME_CONNECTED; + + dapl_os_lock(&cm->lock); + if (cm->state != DCM_CONN_PENDING) { + dapl_log(DAPL_DBG_TYPE_WARN, + " CONN_RTU: UNEXPECTED state:" + " op %d, st %s <- lid %d sqpn %d sport %d\n", + ntohs(msg->op), dapl_cm_state_str(cm->state), + ntohs(msg->saddr.ib.lid), ntohl(msg->saddr.ib.qpn), + ntohs(msg->sport)); + dapl_os_unlock(&cm->lock); + return; + } + dapl_os_unlock(&cm->lock); + + /* save remote address information to EP and CM */ + dapl_os_memcpy(&ep->remote_ia_address, + &msg->saddr, sizeof(union dcm_addr)); + dapl_os_memcpy(&cm->msg.daddr, + &msg->saddr, sizeof(union dcm_addr)); + + /* validate private data size, and copy if necessary */ + if (msg->p_size) { + if (ntohs(msg->p_size) > DCM_MAX_PDATA_SIZE) { + dapl_log(DAPL_DBG_TYPE_WARN, + " CONN_RTU: invalid p_size %d:" + " st %s <- lid %d sqpn %d spsp %d\n", + ntohs(msg->p_size), + dapl_cm_state_str(cm->state), + ntohs(msg->saddr.ib.lid), + ntohl(msg->saddr.ib.qpn), + ntohs(msg->sport)); + goto bail; + } + dapl_os_memcpy(cm->msg.p_data, + msg->p_data, ntohs(msg->p_size)); + } + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " CONN_RTU: DST port=%d lid=%x," + " iqp=%x, qp_type=%d, port=%d psize=%d\n", + cm->msg.daddr.ib.port_num, ntohs(cm->msg.daddr.ib.lid), + ntohl(cm->msg.daddr.ib.qpn), cm->msg.daddr.ib.qp_type, + ntohs(msg->sport), ntohs(msg->p_size)); + + if (ntohs(msg->op) == DCM_REP) + event = IB_CME_CONNECTED; + else if (ntohs(msg->op) == DCM_REJ_USER) + event = 
IB_CME_DESTINATION_REJECT_PRIVATE_DATA; + else + event = IB_CME_DESTINATION_REJECT; + + if (event != IB_CME_CONNECTED) { + dapl_log(DAPL_DBG_TYPE_CM, + " CONN_RTU: REJ op=%d <- lid %x, iqp %x, psp %d\n", + ntohs(msg->op), ntohs(msg->saddr.ib.lid), + ntohl(msg->saddr.ib.qpn), ntohs(msg->sport)); +#ifdef DAT_EXTENSIONS + if (cm->msg.daddr.ib.qp_type == IBV_QPT_UD) + goto ud_bail; + else +#endif + goto bail; + } + + /* modify QP to RTR and then to RTS with remote info */ + dapl_os_lock(&cm->ep->header.lock); + if (dapls_modify_qp_state(cm->ep->qp_handle, + IBV_QPS_RTR, + cm->msg.daddr.ib.qpn, + cm->msg.daddr.ib.lid, + NULL) != DAT_SUCCESS) { + dapl_log(DAPL_DBG_TYPE_ERR, + " CONN_RTU: QPS_RTR ERR %s <- lid %x iqp %x\n", + strerror(errno), ntohs(cm->msg.daddr.ib.lid), + ntohl(cm->msg.daddr.ib.qpn)); + dapl_os_unlock(&cm->ep->header.lock); + event = IB_CME_LOCAL_FAILURE; + goto bail; + } + if (dapls_modify_qp_state(cm->ep->qp_handle, + IBV_QPS_RTS, + cm->msg.daddr.ib.qpn, + cm->msg.daddr.ib.lid, + NULL) != DAT_SUCCESS) { + dapl_log(DAPL_DBG_TYPE_ERR, + " CONN_RTU: QPS_RTS ERR %s <- lid %x iqp %x\n", + strerror(errno), ntohs(cm->msg.daddr.ib.lid), + ntohl(cm->msg.daddr.ib.qpn)); + dapl_os_unlock(&cm->ep->header.lock); + event = IB_CME_LOCAL_FAILURE; + goto bail; + } + dapl_os_unlock(&cm->ep->header.lock); + + /* Send RTU */ + cm->msg.op = htons(DCM_RTU); + + if (ucm_send(&cm->hca->ib_trans, &cm->msg)) + goto bail; + + /* init cm_handle and post the event with private data */ + cm->state = DCM_CONNECTED; + dapl_dbg_log(DAPL_DBG_TYPE_EP, " ACTIVE: connected!\n"); + +#ifdef DAT_EXTENSIONS +ud_bail: + if (cm->msg.daddr.ib.qp_type == IBV_QPT_UD) { + DAT_IB_EXTENSION_EVENT_DATA xevent; + uint16_t lid = ntohs(cm->msg.daddr.ib.lid); + + /* post EVENT, modify_qp, AH already created, ucm msg */ + xevent.status = 0; + xevent.type = DAT_IB_UD_REMOTE_AH; + xevent.remote_ah.ah = cm->hca->ib_trans.ah[lid]; + xevent.remote_ah.qpn = cm->msg.daddr.ib.qpn; + 
dapl_os_memcpy(&xevent.remote_ah.ia_addr, + &cm->msg.daddr, + sizeof(union dcm_addr)); + + if (event == IB_CME_CONNECTED) + event = DAT_IB_UD_CONNECTION_EVENT_ESTABLISHED; + else + event = DAT_IB_UD_CONNECTION_REJECT_EVENT; + + dapls_evd_post_connection_event_ext( + (DAPL_EVD *)cm->ep->param.connect_evd_handle, + event, + (DAT_EP_HANDLE)ep, + (DAT_COUNT)cm->msg.p_size, + (DAT_PVOID *)cm->msg.p_data, + (DAT_PVOID *)&xevent); + + /* we are done, don't destroy cm_ptr, need pdata */ + cm->state = DCM_RELEASED; + } else +#endif + { + cm->ep->cm_handle = cm; /* only RC, multi CR's on UD */ + dapl_evd_connection_callback(cm, + IB_CME_CONNECTED, + cm->msg.p_data, cm->ep); + } + return; + +bail: + if (cm->msg.saddr.ib.qp_type != IBV_QPT_UD) + dapls_ib_reinit_ep(cm->ep); /* reset QP state */ + dapl_evd_connection_callback(NULL, event, cm->msg.p_data, cm->ep); +} + +/* + * PASSIVE: Accept on listen CM PSP. + * create new CM object for this CR, + * receive peer QP information, private data, + * and post cr_event + */ +static void ucm_accept(ib_cm_srvc_handle_t cm, ib_cm_msg_t *msg) +{ + dp_ib_cm_handle_t acm; + + /* Allocate accept CM and setup passive references */ + if ((acm = dapls_ib_cm_create(NULL)) == NULL) { + dapl_log(DAPL_DBG_TYPE_WARN, " accept: ERR cm_create\n"); + return; + } + + /* dest CM info from CR msg, source CM info from listen */ + acm->sp = cm->sp; + acm->hca = cm->hca; + acm->state = DCM_ACCEPTING; + acm->msg.dport = msg->sport; + acm->msg.dqpn = msg->sqpn; + acm->msg.sport = cm->msg.sport; + acm->msg.sqpn = cm->msg.sqpn; + acm->msg.p_size = msg->p_size; + + /* CR saddr is CM daddr info, need EP for local saddr */ + dapl_os_memcpy(&acm->msg.daddr, &msg->saddr, sizeof(union dcm_addr)); + + dapl_log(DAPL_DBG_TYPE_CM, + " accept: DST port=%d lid=%x, iqp=%x, psize=%d\n", + ntohs(acm->msg.dport), ntohs(acm->msg.daddr.ib.lid), + htonl(acm->msg.daddr.ib.qpn), htons(acm->msg.p_size)); + + /* validate private data size before reading */ + if (ntohs(msg->p_size) > 
DCM_MAX_PDATA_SIZE) { + dapl_log(DAPL_DBG_TYPE_WARN, " accept: psize (%d) wrong\n", + ntohs(msg->p_size)); + goto bail; + } + + /* read private data into cm_handle if any present */ + if (msg->p_size) + dapl_os_memcpy(acm->msg.p_data, + msg->p_data, ntohs(msg->p_size)); + + acm->state = DCM_ACCEPTING_DATA; + ucm_queue_conn(acm); + +#ifdef DAT_EXTENSIONS + if (acm->msg.daddr.ib.qp_type == IBV_QPT_UD) { + DAT_IB_EXTENSION_EVENT_DATA xevent; + + /* post EVENT, modify_qp created ah */ + xevent.status = 0; + xevent.type = DAT_IB_UD_CONNECT_REQUEST; + + dapls_evd_post_cr_event_ext(acm->sp, + DAT_IB_UD_CONNECTION_REQUEST_EVENT, + acm, + (DAT_COUNT)acm->msg.p_size, + (DAT_PVOID *)acm->msg.p_data, + (DAT_PVOID *)&xevent); + } else +#endif + /* trigger CR event and return SUCCESS */ + dapls_cr_callback(acm, + IB_CME_CONNECTION_REQUEST_PENDING, + acm->msg.p_data, acm->sp); + return; + +bail: + /* free cm object */ + dapls_ib_cm_free(acm, NULL); + return; +} + +/* + * PASSIVE: read RTU from active peer, post CONN event + */ +static void ucm_accept_rtu(dp_ib_cm_handle_t cm, ib_cm_msg_t *msg) +{ + dapl_os_lock(&cm->lock); + if ((ntohs(msg->op) != DCM_RTU) || (cm->state != DCM_ACCEPTED)) { + dapl_log(DAPL_DBG_TYPE_WARN, + " accept_rtu: UNEXPECTED op, state:" + " op %d, st %s <- lid %x iqp %x sport %d\n", + ntohs(msg->op), dapl_cm_state_str(cm->state), + ntohs(msg->saddr.ib.lid), ntohl(msg->saddr.ib.qpn), + ntohs(msg->sport)); + dapl_os_unlock(&cm->lock); + goto bail; + } + cm->state = DCM_CONNECTED; + dapl_os_unlock(&cm->lock); + + if (msg->p_size) + dapl_os_memcpy(cm->msg.p_data, + msg->p_data, ntohs(msg->p_size)); + + /* final data exchange if remote QP state is good to go */ + dapl_dbg_log(DAPL_DBG_TYPE_CM, " PASSIVE: connected!\n"); + +#ifdef DAT_EXTENSIONS + if (cm->msg.saddr.ib.qp_type == IBV_QPT_UD) { + DAT_IB_EXTENSION_EVENT_DATA xevent; + uint16_t lid = ntohs(cm->msg.daddr.ib.lid); + + /* post EVENT, modify_qp, AH already created, ucm msg */ + xevent.status = 0; + 
xevent.type = DAT_IB_UD_PASSIVE_REMOTE_AH; + xevent.remote_ah.ah = cm->hca->ib_trans.ah[lid]; + xevent.remote_ah.qpn = cm->msg.daddr.ib.qpn; + dapl_os_memcpy(&xevent.remote_ah.ia_addr, + &cm->msg.daddr, + sizeof(cm->msg.daddr)); + + dapls_evd_post_connection_event_ext( + (DAPL_EVD *)cm->ep->param.connect_evd_handle, + DAT_IB_UD_CONNECTION_EVENT_ESTABLISHED, + (DAT_EP_HANDLE)cm->ep, + (DAT_COUNT)cm->msg.p_size, + (DAT_PVOID *)cm->msg.p_data, + (DAT_PVOID *)&xevent); + + /* done with CM object, don't destroy cm, need pdata */ + cm->state = DCM_RELEASED; + } else +#endif + { + cm->ep->cm_handle = cm; /* only RC, multi CR's on UD */ + dapls_cr_callback(cm, IB_CME_CONNECTED, NULL, cm->sp); + } + return; +bail: + if (cm->msg.saddr.ib.qp_type != IBV_QPT_UD) + dapls_ib_reinit_ep(cm->ep); /* reset QP state */ + dapls_ib_cm_free(cm, cm->ep); + dapls_cr_callback(cm, IB_CME_LOCAL_FAILURE, NULL, cm->sp); +} + +/* + * PASSIVE: consumer accept, send local QP information, private data, + * queue on work thread to receive RTU information to avoid blocking + * user thread. 
+ */ +DAT_RETURN +dapli_accept_usr(DAPL_EP *ep, DAPL_CR *cr, DAT_COUNT p_size, DAT_PVOID p_data) +{ + DAPL_IA *ia = ep->header.owner_ia; + dp_ib_cm_handle_t cm = cr->ib_cm_handle; + + if (p_size > DCM_MAX_PDATA_SIZE) + return DAT_LENGTH_ERROR; + + dapl_os_lock(&cm->lock); + if (cm->state != DCM_ACCEPTING_DATA) { + dapl_os_unlock(&cm->lock); + return DAT_INVALID_STATE; + } + dapl_os_unlock(&cm->lock); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " ACCEPT_USR: remote port_num=%d lid=%x" + " iqp=%x qp_type %d, psize=%d\n", + cm->msg.daddr.ib.port_num, cm->msg.daddr.ib.lid, + cm->msg.daddr.ib.qpn, cm->msg.daddr.ib.qp_type, + cm->msg.p_size); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " ACCEPT_USR: remote GID subnet %016llx id %016llx\n", + (unsigned long long) + htonll(cm->msg.daddr.ib.gid.global.subnet_prefix), + (unsigned long long) + htonll(cm->msg.daddr.ib.gid.global.interface_id)); + +#ifdef DAT_EXTENSIONS + if (cm->msg.daddr.ib.qp_type == IBV_QPT_UD && + ep->qp_handle->qp_type != IBV_QPT_UD) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " ACCEPT_USR: ERR remote QP is UD," + " but local QP is not\n"); + return (DAT_INVALID_HANDLE | DAT_INVALID_HANDLE_EP); + } +#endif + + /* modify QP to RTR and then to RTS with remote info already read */ + dapl_os_lock(&ep->header.lock); + if (dapls_modify_qp_state(ep->qp_handle, + IBV_QPS_RTR, + cm->msg.daddr.ib.qpn, + cm->msg.daddr.ib.lid, + NULL) != DAT_SUCCESS) { + dapl_log(DAPL_DBG_TYPE_ERR, + " ACCEPT_USR: QPS_RTR ERR %s -> lid %x qpn %x\n", + strerror(errno), ntohs(cm->msg.daddr.ib.lid), + ntohl(cm->msg.daddr.ib.qpn)); + dapl_os_unlock(&ep->header.lock); + goto bail; + } + if (dapls_modify_qp_state(ep->qp_handle, + IBV_QPS_RTS, + cm->msg.daddr.ib.qpn, + cm->msg.daddr.ib.lid, + NULL) != DAT_SUCCESS) { + dapl_log(DAPL_DBG_TYPE_ERR, + " ACCEPT_USR: QPS_RTS ERR %s -> lid %x qpn %x\n", + strerror(errno), ntohs(cm->msg.daddr.ib.lid), + ntohl(cm->msg.daddr.ib.qpn)); + dapl_os_unlock(&ep->header.lock); + goto bail; + } + 
dapl_os_unlock(&ep->header.lock); + + /* save remote address information */ + dapl_os_memcpy(&ep->remote_ia_address, + &cm->msg.saddr, sizeof(union dcm_addr)); + + /* setup local QP info and type from EP, copy pdata, for reply */ + cm->msg.op = htons(DCM_REP); + cm->msg.saddr.ib.qpn = htonl(ep->qp_handle->qp_num); + cm->msg.saddr.ib.qp_type = htons(ep->qp_handle->qp_type); + cm->msg.saddr.ib.port_num = cm->hca->port_num; + cm->msg.saddr.ib.lid = cm->hca->ib_trans.addr.ib.lid; + cm->msg.saddr.ib.gid = cm->hca->ib_trans.addr.ib.gid; + dapl_os_memcpy(&cm->msg.p_data, p_data, p_size); + + if (ucm_send(&cm->hca->ib_trans, &cm->msg)) + goto bail; + + /* save state and setup valid reference to EP, HCA */ + dapl_os_lock(&cm->lock); + cm->ep = ep; + cm->hca = ia->hca_ptr; + cm->state = DCM_ACCEPTED; + dapl_os_unlock(&cm->lock); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, " PASSIVE: accepted!\n"); + return DAT_SUCCESS; + +bail: + if (cm->msg.saddr.ib.qp_type != IBV_QPT_UD) + dapls_ib_reinit_ep(ep); + dapls_ib_cm_free(cm, ep); + return DAT_INTERNAL_ERROR; +} + + +/* + * dapls_ib_connect + * + * Initiate a connection with the passive listener on another node + * + * Input: + * ep_handle, + * remote_ia_address, + * remote_conn_qual, + * prd_size size of private data and structure + * prd_prt pointer to private data structure + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN +dapls_ib_connect(IN DAT_EP_HANDLE ep_handle, + IN DAT_IA_ADDRESS_PTR r_addr, + IN DAT_CONN_QUAL r_psp, + IN DAT_COUNT p_size, IN void *p_data) +{ + DAPL_EP *ep = (DAPL_EP *)ep_handle; + dp_ib_cm_handle_t cm; + + /* create CM object, initialize SRC info from EP */ + cm = dapls_ib_cm_create(ep); + if (cm == NULL) + return DAT_INSUFFICIENT_RESOURCES; + + /* remote hca and port: lid, gid, port_num, network order */ + dapl_os_memcpy(&cm->msg.daddr, r_addr, sizeof(union dcm_addr)); + + /* remote uCM information, comes from consumer 
provider r_addr */ + cm->msg.dport = htons((uint16_t)r_psp); + cm->msg.dqpn = cm->msg.daddr.ib.qpn; + + if (p_size) { + cm->msg.p_size = htons(p_size); + dapl_os_memcpy(&cm->msg.p_data, p_data, p_size); + } + + /* build connect request, send to remote CM based on r_addr info */ + return(dapli_cm_connect(ep, cm)); +} + +/* + * dapls_ib_disconnect + * + * Disconnect an EP + * + * Input: + * ep_handle, + * disconnect_flags + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + */ +DAT_RETURN +dapls_ib_disconnect(IN DAPL_EP *ep, IN DAT_CLOSE_FLAGS close_flags) +{ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + "dapls_ib_disconnect(ep_handle %p ....)\n", ep); + + /* reinit to modify QP state, if not UD */ + if (ep->qp_handle->qp_type != IBV_QPT_UD) + dapls_ib_reinit_ep(ep); + + if (ep->cm_handle == NULL || + ep->param.ep_state == DAT_EP_STATE_DISCONNECTED) + return DAT_SUCCESS; + else + return (dapli_cm_disconnect(ep->cm_handle)); +} + +/* + * dapls_ib_disconnect_clean + * + * Clean up outstanding connection data. This routine is invoked + * after the final disconnect callback has occurred. Only on the + * ACTIVE side of a connection. It is also called if dat_ep_connect + * times out using the consumer supplied timeout value. + * + * Input: + * ep_ptr DAPL_EP + * active Indicates active side of connection + * + * Output: + * none + * + * Returns: + * void + * + */ +void +dapls_ib_disconnect_clean(IN DAPL_EP *ep, + IN DAT_BOOLEAN active, + IN const ib_cm_events_t ib_cm_event) +{ + /* NOTE: SCM will only initialize cm_handle with RC type + * + * For UD there can many in-flight CR's so you + * cannot cleanup timed out CR's with EP reference + * alone since they share the same EP. The common + * code that handles connection timeout logic needs + * updated for UD support. + */ + if (ep->cm_handle) + dapls_ib_cm_free(ep->cm_handle, ep); + + return; +} + +/* + * dapl_ib_setup_conn_listener + * + * Have the CM set up a connection listener. 
+ * + * Input: + * ibm_hca_handle HCA handle + * qp_handle QP handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * DAT_CONN_QUAL_UNAVAILABLE + * DAT_CONN_QUAL_IN_USE + * + */ +DAT_RETURN +dapls_ib_setup_conn_listener(IN DAPL_IA *ia, + IN DAT_UINT64 sid, + IN DAPL_SP *sp) +{ + ib_cm_srvc_handle_t cm = NULL; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " listen(ia %p ServiceID %d sp %p)\n", + ia, sid, sp); + + /* reserve local port, then allocate CM object */ + if (!ucm_get_port(&ia->hca_ptr->ib_trans, (uint16_t)sid)) { + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " listen: ERROR %s on conn_qual 0x%x\n", + strerror(errno), sid); + return DAT_CONN_QUAL_IN_USE; + } + + /* cm_create will setup saddr for listen server */ + if ((cm = dapls_ib_cm_create(NULL)) == NULL) { + /* release the port reserved above before failing */ + ucm_free_port(&ia->hca_ptr->ib_trans, (uint16_t)sid); + return DAT_INSUFFICIENT_RESOURCES; + } + + /* LISTEN: init DST address and QP info to local CM server info */ + cm->sp = sp; + cm->hca = ia->hca_ptr; + cm->msg.sport = htons((uint16_t)sid); + cm->msg.sqpn = htonl(ia->hca_ptr->ib_trans.qp->qp_num); + cm->msg.saddr.ib.qp_type = IBV_QPT_UD; + cm->msg.saddr.ib.port_num = ia->hca_ptr->port_num; + cm->msg.saddr.ib.lid = ia->hca_ptr->ib_trans.addr.ib.lid; + cm->msg.saddr.ib.gid = ia->hca_ptr->ib_trans.addr.ib.gid; + + /* save cm_handle reference in service point */ + sp->cm_srvc_handle = cm; + + /* queue up listen socket to process inbound CR's */ + cm->state = DCM_LISTEN; + ucm_queue_listen(cm); + + return DAT_SUCCESS; +} + + +/* + * dapl_ib_remove_conn_listener + * + * Have the CM remove a connection listener. 
+ * + * Input: + * ia_handle IA handle + * ServiceID IB Channel Service ID + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_remove_conn_listener(IN DAPL_IA *ia, IN DAPL_SP *sp) +{ + ib_cm_srvc_handle_t cm = sp->cm_srvc_handle; + ib_hca_transport_t *tp = &ia->hca_ptr->ib_trans; + + /* free cm_srvc_handle and port, and mark CM for cleanup */ + if (cm) { + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " remove_listener(ia %p sp %p cm %p psp=%d)\n", + ia, sp, cm, ntohs(cm->msg.sport)); + + sp->cm_srvc_handle = NULL; + dapl_os_lock(&cm->lock); + /* listener reserved its service id in msg.sport, not dport */ + ucm_free_port(tp, ntohs(cm->msg.sport)); + cm->msg.sport = 0; + cm->state = DCM_DESTROY; + dapl_os_unlock(&cm->lock); + ucm_dequeue_listen(cm); + dapl_os_free(cm, sizeof(*cm)); + } + return DAT_SUCCESS; +} + +/* + * dapls_ib_accept_connection + * + * Perform necessary steps to accept a connection + * + * Input: + * cr_handle + * ep_handle + * private_data_size + * private_data + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN +dapls_ib_accept_connection(IN DAT_CR_HANDLE cr_handle, + IN DAT_EP_HANDLE ep_handle, + IN DAT_COUNT p_size, + IN const DAT_PVOID p_data) +{ + DAPL_CR *cr = (DAPL_CR *)cr_handle; + DAPL_EP *ep = (DAPL_EP *)ep_handle; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " accept_connection(cr %p ep %p prd %p,%d)\n", + cr, ep, p_data, p_size); + + /* allocate and attach a QP if necessary */ + if (ep->qp_state == DAPL_QP_STATE_UNATTACHED) { + DAT_RETURN status; + status = dapls_ib_qp_alloc(ep->header.owner_ia, + ep, ep); + if (status != DAT_SUCCESS) + return status; + } + return (dapli_accept_usr(ep, cr, p_size, p_data)); +} + +/* + * dapls_ib_reject_connection + * + * Reject a connection + * + * Input: + * cr_handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN +dapls_ib_reject_connection(IN dp_ib_cm_handle_t cm, + IN int reason, + IN 
DAT_COUNT psize, IN const DAT_PVOID pdata) +{ + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " reject(cm %p reason %x, pdata %p, psize %d)\n", + cm, reason, pdata, psize); + + if (psize > DCM_MAX_PDATA_SIZE) + return DAT_LENGTH_ERROR; + + cm->msg.op = htons(DCM_REJ_USER); + if (psize) + dapl_os_memcpy(&cm->msg.p_data, pdata, psize); + + if (ucm_send(&cm->hca->ib_trans, &cm->msg)) { + dapl_log(DAPL_DBG_TYPE_WARN, + " cm_reject: ERR: %s\n", strerror(errno)); + return DAT_INTERNAL_ERROR; + } + + /* cr_thread will destroy CR */ + cm->state = DCM_REJECTING; + send(cm->hca->ib_trans.scm[1], "w", sizeof "w", 0); + return DAT_SUCCESS; +} + +/* + * dapls_ib_cm_remote_addr + * + * Obtain the remote IP address given a connection + * + * Input: + * cr_handle + * + * Output: + * remote_ia_address: where to place the remote address + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_HANDLE + * + */ +DAT_RETURN +dapls_ib_cm_remote_addr(IN DAT_HANDLE dat_handle, + OUT DAT_SOCK_ADDR6 * remote_ia_address) +{ + DAPL_HEADER *header; + dp_ib_cm_handle_t ib_cm_handle; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + "dapls_ib_cm_remote_addr(dat_handle %p, ....)\n", + dat_handle); + + header = (DAPL_HEADER *) dat_handle; + + if (header->magic == DAPL_MAGIC_EP) + ib_cm_handle = ((DAPL_EP *) dat_handle)->cm_handle; + else if (header->magic == DAPL_MAGIC_CR) + ib_cm_handle = ((DAPL_CR *) dat_handle)->ib_cm_handle; + else + return DAT_INVALID_HANDLE; + + dapl_os_memcpy(remote_ia_address, + &ib_cm_handle->msg.daddr, sizeof(DAT_SOCK_ADDR6)); + + return DAT_SUCCESS; +} + +/* + * dapls_ib_private_data_size + * + * Return the size of private data given a connection op type + * + * Input: + * prd_ptr private data pointer + * conn_op connection operation type + * + * If prd_ptr is NULL, this is a query for the max size supported by + * the provider, otherwise it is the actual size of the private data + * contained in prd_ptr. 
+ * + * + * Output: + * None + * + * Returns: + * length of private data + * + */ +int dapls_ib_private_data_size(IN DAPL_PRIVATE * prd_ptr, + IN DAPL_PDATA_OP conn_op, IN DAPL_HCA * hca_ptr) +{ + int size; + + switch (conn_op) { + case DAPL_PDATA_CONN_REQ: + case DAPL_PDATA_CONN_REP: + case DAPL_PDATA_CONN_REJ: + case DAPL_PDATA_CONN_DREQ: + case DAPL_PDATA_CONN_DREP: + size = DCM_MAX_PDATA_SIZE; + break; + default: + size = 0; + } + return size; +} + +/* + * Map all socket CM event codes to the DAT equivalent. + */ +#define DAPL_IB_EVENT_CNT 10 + +static struct ib_cm_event_map { + const ib_cm_events_t ib_cm_event; + DAT_EVENT_NUMBER dat_event_num; +} ib_cm_event_map[DAPL_IB_EVENT_CNT] = { +/* 00 */ {IB_CME_CONNECTED, + DAT_CONNECTION_EVENT_ESTABLISHED}, +/* 01 */ {IB_CME_DISCONNECTED, + DAT_CONNECTION_EVENT_DISCONNECTED}, +/* 02 */ {IB_CME_DISCONNECTED_ON_LINK_DOWN, + DAT_CONNECTION_EVENT_DISCONNECTED}, +/* 03 */ {IB_CME_CONNECTION_REQUEST_PENDING, + DAT_CONNECTION_REQUEST_EVENT}, +/* 04 */ {IB_CME_CONNECTION_REQUEST_PENDING_PRIVATE_DATA, + DAT_CONNECTION_REQUEST_EVENT}, +/* 05 */ {IB_CME_DESTINATION_REJECT, + DAT_CONNECTION_EVENT_NON_PEER_REJECTED}, +/* 06 */ {IB_CME_DESTINATION_REJECT_PRIVATE_DATA, + DAT_CONNECTION_EVENT_PEER_REJECTED}, +/* 07 */ {IB_CME_DESTINATION_UNREACHABLE, + DAT_CONNECTION_EVENT_UNREACHABLE}, +/* 08 */ {IB_CME_TOO_MANY_CONNECTION_REQUESTS, + DAT_CONNECTION_EVENT_NON_PEER_REJECTED}, +/* 09 */ {IB_CME_LOCAL_FAILURE, + DAT_CONNECTION_EVENT_BROKEN} +}; + +/* + * dapls_ib_get_cm_event + * + * Return a DAT connection event given a provider CM event. 
+ * + * Input: + * dat_event_num DAT event we need an equivelent CM event for + * + * Output: + * none + * + * Returns: + * ib_cm_event of translated DAPL value + */ +DAT_EVENT_NUMBER +dapls_ib_get_dat_event(IN const ib_cm_events_t ib_cm_event, + IN DAT_BOOLEAN active) +{ + DAT_EVENT_NUMBER dat_event_num; + int i; + + if (ib_cm_event > IB_CME_LOCAL_FAILURE) + return (DAT_EVENT_NUMBER) 0; + + dat_event_num = 0; + for (i = 0; i < DAPL_IB_EVENT_CNT; i++) { + if (ib_cm_event == ib_cm_event_map[i].ib_cm_event) { + dat_event_num = ib_cm_event_map[i].dat_event_num; + break; + } + } + dapl_dbg_log(DAPL_DBG_TYPE_CALLBACK, + "dapls_ib_get_dat_event: event translate(%s) ib=0x%x dat=0x%x\n", + active ? "active" : "passive", ib_cm_event, dat_event_num); + + return dat_event_num; +} + +/* + * dapls_ib_get_dat_event + * + * Return a DAT connection event given a provider CM event. + * + * Input: + * ib_cm_event event provided to the dapl callback routine + * active switch indicating active or passive connection + * + * Output: + * none + * + * Returns: + * DAT_EVENT_NUMBER of translated provider value + */ +ib_cm_events_t dapls_ib_get_cm_event(IN DAT_EVENT_NUMBER dat_event_num) +{ + ib_cm_events_t ib_cm_event; + int i; + + ib_cm_event = 0; + for (i = 0; i < DAPL_IB_EVENT_CNT; i++) { + if (dat_event_num == ib_cm_event_map[i].dat_event_num) { + ib_cm_event = ib_cm_event_map[i].ib_cm_event; + break; + } + } + return ib_cm_event; +} + +/* work thread for uAT, uCM, CQ, and async events */ +void cm_thread(void *arg) +{ + struct dapl_hca *hca = arg; + dp_ib_cm_handle_t cm, next; + struct dapl_fd_set *set; + char rbuf[2]; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_thread: ENTER hca %p\n", hca); + set = dapl_alloc_fd_set(); + if (!set) + goto out; + + dapl_os_lock(&hca->ib_trans.lock); + hca->ib_trans.cm_state = IB_THREAD_RUN; + + while (1) { + dapl_fd_zero(set); + dapl_fd_set(hca->ib_trans.scm[0], set, DAPL_FD_READ); + dapl_fd_set(hca->ib_hca_handle->async_fd, set, DAPL_FD_READ); + 
dapl_fd_set(hca->ib_trans.rch->fd, set, DAPL_FD_READ); + + if (!dapl_llist_is_empty(&hca->ib_trans.list)) + next = dapl_llist_peek_head(&hca->ib_trans.list); + else + next = NULL; + + while (next) { + cm = next; + next = dapl_llist_next_entry( + &hca->ib_trans.list, + (DAPL_LLIST_ENTRY *)&cm->entry); + + if (cm->state == DCM_DESTROY || + hca->ib_trans.cm_state != IB_THREAD_RUN) { + dapl_llist_remove_entry( + &hca->ib_trans.list, + (DAPL_LLIST_ENTRY *)&cm->entry); + dapl_os_free(cm, sizeof(*cm)); + continue; + } + + /* TODO: Check and process retries here */ + + continue; + } + + /* set to exit and all resources destroyed */ + if ((hca->ib_trans.cm_state != IB_THREAD_RUN) && + (dapl_llist_is_empty(&hca->ib_trans.list))) + break; + + dapl_os_unlock(&hca->ib_trans.lock); + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_thread: select sleep\n"); + dapl_select(set); + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_thread: select wake\n"); + + /* Process events: CM, ASYNC, NOTIFY THREAD */ + if (dapl_poll(hca->ib_trans.rch->fd, + DAPL_FD_READ) == DAPL_FD_READ) { + ucm_recv(&hca->ib_trans); + } + if (dapl_poll(hca->ib_hca_handle->async_fd, + DAPL_FD_READ) == DAPL_FD_READ) { + ucm_async_event(hca); + } + while (dapl_poll(hca->ib_trans.scm[0], + DAPL_FD_READ) == DAPL_FD_READ) { + recv(hca->ib_trans.scm[0], rbuf, 2, 0); + } + + dapl_os_lock(&hca->ib_trans.lock); + + /* set to exit and all resources destroyed */ + if ((hca->ib_trans.cm_state != IB_THREAD_RUN) && + (dapl_llist_is_empty(&hca->ib_trans.list))) + break; + } + + dapl_os_unlock(&hca->ib_trans.lock); + free(set); +out: + hca->ib_trans.cm_state = IB_THREAD_EXIT; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_thread(hca %p) exit\n", hca); +} + + +#ifdef DAPL_COUNTERS +/* Debug aid: List all Connections in process and state */ +void dapls_print_cm_list(IN DAPL_IA *ia_ptr) +{ + /* Print in process CR's for this IA, if debug type set */ + int i = 0; + dp_ib_cm_handle_t cr, next_cr; + + dapl_os_lock(&ia_ptr->hca_ptr->ib_trans.lock); + if 
(!dapl_llist_is_empty((DAPL_LLIST_HEAD*) + &ia_ptr->hca_ptr->ib_trans.list)) + next_cr = dapl_llist_peek_head((DAPL_LLIST_HEAD*) + &ia_ptr->hca_ptr->ib_trans.list); + else + next_cr = NULL; + + printf("\n DAPL IA CONNECTIONS IN PROCESS:\n"); + while (next_cr) { + cr = next_cr; + next_cr = dapl_llist_next_entry((DAPL_LLIST_HEAD*) + &ia_ptr->hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry); + + printf( " CONN[%d]: sp %p ep %p %s %s %s" + " dst lid %x iqp %x port %d\n", + i, cr->sp, cr->ep, + cr->msg.saddr.ib.qp_type == IBV_QPT_RC ? "RC" : "UD", + dapl_cm_state_str(cr->state), + cr->sp ? "<-" : "->", + ntohs(cr->msg.daddr.ib.lid), + ntohl(cr->msg.daddr.ib.qpn), + cr->sp ? + (int)cr->sp->conn_qual : ntohs(cr->msg.dport) ); + i++; + } + printf("\n"); + dapl_os_unlock(&ia_ptr->hca_ptr->ib_trans.lock); +} +#endif diff --git a/dapl/openib_ucm/dapl_ib_util.h b/dapl/openib_ucm/dapl_ib_util.h new file mode 100644 index 0000000..dfee2b9 --- /dev/null +++ b/dapl/openib_ucm/dapl_ib_util.h @@ -0,0 +1,119 @@ +/* + * Copyright (c) 2009 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. 
+ * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +#ifndef _DAPL_IB_UTIL_H_ +#define _DAPL_IB_UTIL_H_ +#define _OPENIB_SCM_ + +#include +#include "openib_osd.h" +#include "dapl_ib_common.h" + +#define UCM_DEFAULT_CQE 500 +#define UCM_DEFAULT_QPE 500 + +struct ib_cm_handle +{ + struct dapl_llist_entry entry; + DAPL_OS_LOCK lock; + int state; + int retries; + struct dapl_hca *hca; + struct dapl_sp *sp; + struct dapl_ep *ep; + ib_cm_msg_t msg; +}; + +typedef struct ib_cm_handle *dp_ib_cm_handle_t; +typedef dp_ib_cm_handle_t ib_cm_srvc_handle_t; + +/* Definitions */ +#define IB_INVALID_HANDLE NULL + +/* ib_hca_transport_t, specific to this implementation */ +typedef struct _ib_hca_transport +{ + struct ibv_device *ib_dev; + struct dapl_hca *hca; + struct ibv_context *ib_ctx; + struct ibv_comp_channel *ib_cq; + ib_cq_handle_t ib_cq_empty; + int destroy; + int cm_state; + DAPL_OS_THREAD thread; + DAPL_OS_LOCK lock; /* connect list */ + struct dapl_llist_entry *list; + DAPL_OS_LOCK llock; /* listen list */ + struct dapl_llist_entry *llist; + ib_async_handler_t async_unafiliated; + void *async_un_ctx; + ib_async_cq_handler_t async_cq_error; + ib_async_dto_handler_t async_cq; + ib_async_qp_handler_t async_qp_error; + union dcm_addr addr; /* lid, port, qp_num, gid */ + int max_inline_send; + int rd_atom_in; + int rd_atom_out; + uint8_t ack_timer; + uint8_t ack_retry; + uint8_t rnr_timer; + uint8_t rnr_retry; + uint8_t global; + uint8_t hop_limit; + uint8_t tclass; + uint8_t mtu; + DAT_NAMED_ATTR named_attr; + DAPL_SOCKET scm[2]; + int cqe; + int qpe; + DAPL_OS_LOCK slock; + int s_hd; + int s_tl; + struct ibv_pd *pd; + struct ibv_cq *scq; + struct ibv_cq *rcq; + struct ibv_qp *qp; + struct ibv_mr *mr_rbuf; + struct ibv_mr *mr_sbuf; + ib_cm_msg_t *sbuf; + ib_cm_msg_t *rbuf; + struct ibv_comp_channel *rch; + struct ibv_ah 
**ah; + DAPL_OS_LOCK plock; + uint8_t *sid; /* Service IDs, port space, bitarray? */ + +} ib_hca_transport_t; + +/* prototypes */ +void cm_thread(void *arg); +void ucm_async_event(struct dapl_hca *hca); +dp_ib_cm_handle_t dapls_ib_cm_create(DAPL_EP *ep); +void dapls_ib_cm_free(dp_ib_cm_handle_t cm, DAPL_EP *ep); +void dapls_print_cm_list(IN DAPL_IA *ia_ptr); + +#endif /* _DAPL_IB_UTIL_H_ */ + diff --git a/dapl/openib_ucm/device.c b/dapl/openib_ucm/device.c new file mode 100644 index 0000000..329b050 --- /dev/null +++ b/dapl/openib_ucm/device.c @@ -0,0 +1,603 @@ +/* + * Copyright (c) 2009 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +#include "openib_osd.h" +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_ib_util.h" +#include "dapl_osd.h" + +#include + +static void ucm_service_destroy(IN DAPL_HCA *hca); +static int ucm_service_create(IN DAPL_HCA *hca); + +static int32_t create_cr_pipe(IN DAPL_HCA * hca_ptr) +{ + DAPL_SOCKET listen_socket; + struct sockaddr_in addr; + socklen_t addrlen = sizeof(addr); + int ret; + + listen_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (listen_socket == DAPL_INVALID_SOCKET) + return 1; + + memset(&addr, 0, sizeof addr); + addr.sin_family = AF_INET; + addr.sin_addr.s_addr = htonl(0x7f000001); + ret = bind(listen_socket, (struct sockaddr *)&addr, sizeof addr); + if (ret) + goto err1; + + ret = getsockname(listen_socket, (struct sockaddr *)&addr, &addrlen); + if (ret) + goto err1; + + ret = listen(listen_socket, 0); + if (ret) + goto err1; + + hca_ptr->ib_trans.scm[1] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (hca_ptr->ib_trans.scm[1] == DAPL_INVALID_SOCKET) + goto err1; + + ret = connect(hca_ptr->ib_trans.scm[1], + (struct sockaddr *)&addr, sizeof(addr)); + if (ret) + goto err2; + + hca_ptr->ib_trans.scm[0] = accept(listen_socket, NULL, NULL); + if (hca_ptr->ib_trans.scm[0] == DAPL_INVALID_SOCKET) + goto err2; + + closesocket(listen_socket); + return 0; + + err2: + closesocket(hca_ptr->ib_trans.scm[1]); + err1: + closesocket(listen_socket); + return 1; +} + +static void destroy_cr_pipe(IN DAPL_HCA * hca_ptr) +{ + closesocket(hca_ptr->ib_trans.scm[0]); + closesocket(hca_ptr->ib_trans.scm[1]); +} + + +/* + * dapls_ib_init, dapls_ib_release + * + * Initialize Verb related items for device open + * + * Input: + * none + * + * Output: + * none + * + * Returns: + * 0 success, -1 error + * + */ +int32_t dapls_ib_init(void) +{ + return 0; +} + +int32_t dapls_ib_release(void) +{ + return 0; +} + +#if defined(_WIN64) || defined(_WIN32) +int dapls_config_comp_channel(struct ibv_comp_channel *channel) +{ + return 0; +} +#else // 
_WIN64 || WIN32 +int dapls_config_comp_channel(struct ibv_comp_channel *channel) +{ + int opts; + + opts = fcntl(channel->fd, F_GETFL); /* uCQ */ + if (opts < 0 || fcntl(channel->fd, F_SETFL, opts | O_NONBLOCK) < 0) { + dapl_log(DAPL_DBG_TYPE_ERR, + " dapls_create_comp_channel: fcntl on ib_cq->fd %d ERR %d %s\n", + channel->fd, opts, strerror(errno)); + return errno; + } + + return 0; +} +#endif + +/* + * dapls_ib_open_hca + * + * Open HCA + * + * Input: + * *hca_name pointer to provider device name + * *ib_hca_handle_p pointer to provide HCA handle + * + * Output: + * none + * + * Return: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) +{ + struct ibv_device **dev_list; + struct ibv_port_attr port_attr; + int i; + DAT_RETURN dat_status; + + /* Get list of all IB devices, find match, open */ + dev_list = ibv_get_device_list(NULL); + if (!dev_list) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " open_hca: ibv_get_device_list() failed\n", + hca_name); + return DAT_INTERNAL_ERROR; + } + + for (i = 0; dev_list[i]; ++i) { + hca_ptr->ib_trans.ib_dev = dev_list[i]; + if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + hca_name)) + goto found; + } + + dapl_log(DAPL_DBG_TYPE_ERR, + " open_hca: device %s not found\n", hca_name); + goto err; + +found: + + hca_ptr->ib_hca_handle = ibv_open_device(hca_ptr->ib_trans.ib_dev); + if (!hca_ptr->ib_hca_handle) { + dapl_log(DAPL_DBG_TYPE_ERR, + " open_hca: dev open failed for %s, err=%s\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + strerror(errno)); + goto err; + } + hca_ptr->ib_trans.ib_ctx = hca_ptr->ib_hca_handle; + + /* get lid for this hca-port, network order */ + if (ibv_query_port(hca_ptr->ib_hca_handle, + (uint8_t)hca_ptr->port_num, &port_attr)) { + dapl_log(DAPL_DBG_TYPE_ERR, + " open_hca: get lid ERR for %s, err=%s\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + strerror(errno)); + goto err; + } else { + 
hca_ptr->ib_trans.addr.ib.lid = htons(port_attr.lid); + hca_ptr->ib_trans.addr.ib.port_num = hca_ptr->port_num; + } + + /* get gid for this hca-port, network order */ + if (ibv_query_gid(hca_ptr->ib_hca_handle, + (uint8_t) hca_ptr->port_num, + 0, &hca_ptr->ib_trans.addr.ib.gid)) { + dapl_log(DAPL_DBG_TYPE_ERR, + " open_hca: query GID ERR for %s, err=%s\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + strerror(errno)); + goto err; + } + + /* set RC tunables via enviroment or default */ + hca_ptr->ib_trans.max_inline_send = + dapl_os_get_env_val("DAPL_MAX_INLINE", INLINE_SEND_IB_DEFAULT); + hca_ptr->ib_trans.ack_retry = + dapl_os_get_env_val("DAPL_ACK_RETRY", DCM_ACK_RETRY); + hca_ptr->ib_trans.ack_timer = + dapl_os_get_env_val("DAPL_ACK_TIMER", DCM_ACK_TIMER); + hca_ptr->ib_trans.rnr_retry = + dapl_os_get_env_val("DAPL_RNR_RETRY", DCM_RNR_RETRY); + hca_ptr->ib_trans.rnr_timer = + dapl_os_get_env_val("DAPL_RNR_TIMER", DCM_RNR_TIMER); + hca_ptr->ib_trans.global = + dapl_os_get_env_val("DAPL_GLOBAL_ROUTING", DCM_GLOBAL); + hca_ptr->ib_trans.hop_limit = + dapl_os_get_env_val("DAPL_HOP_LIMIT", DCM_HOP_LIMIT); + hca_ptr->ib_trans.tclass = + dapl_os_get_env_val("DAPL_TCLASS", DCM_TCLASS); + hca_ptr->ib_trans.mtu = + dapl_ib_mtu(dapl_os_get_env_val("DAPL_IB_MTU", DCM_IB_MTU)); + + /* initialize CM list, LISTEN, SND queue, PSP array, locks */ + if ((dapl_os_lock_init(&hca_ptr->ib_trans.lock)) != DAT_SUCCESS) + goto err; + + if ((dapl_os_lock_init(&hca_ptr->ib_trans.llock)) != DAT_SUCCESS) + goto err; + + if ((dapl_os_lock_init(&hca_ptr->ib_trans.slock)) != DAT_SUCCESS) + goto err; + + if ((dapl_os_lock_init(&hca_ptr->ib_trans.plock)) != DAT_SUCCESS) + goto err; + + + /* initialize CM and listen lists on this HCA uCM QP */ + dapl_llist_init_head(&hca_ptr->ib_trans.list); + dapl_llist_init_head(&hca_ptr->ib_trans.llist); + + /* create uCM qp services */ + if (ucm_service_create(hca_ptr)) + goto bail; + + /* initialize pipe, user level wakeup on select */ + if 
(create_cr_pipe(hca_ptr)) { + dapl_log(DAPL_DBG_TYPE_ERR, + " open_hca: failed to init cr pipe - %s\n", + strerror(errno)); + goto bail; + } + + /* create thread to process inbound connect request */ + hca_ptr->ib_trans.cm_state = IB_THREAD_INIT; + dat_status = dapl_os_thread_create(cm_thread, + (void *)hca_ptr, + &hca_ptr->ib_trans.thread); + if (dat_status != DAT_SUCCESS) { + dapl_log(DAPL_DBG_TYPE_ERR, + " open_hca: failed to create thread\n"); + goto bail; + } + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " open_hca: devname %s, port %d, hostname_IP %s\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + hca_ptr->ib_trans.addr.ib.port_num, + inet_ntoa(((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_addr)); + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " open_hca: QPN 0x%x LID 0x%x GID Subnet 0x" F64x "" + "ID 0x" F64x "\n", + ntohl(hca_ptr->ib_trans.addr.ib.qpn), + ntohs(hca_ptr->ib_trans.addr.ib.lid), + (unsigned long long) + htonll(hca_ptr->ib_trans.addr.ib.gid.global.subnet_prefix), + (unsigned long long) + htonll(hca_ptr->ib_trans.addr.ib.gid.global.interface_id)); + + /* save LID, GID, QPN, PORT address information, for ia_queries */ + hca_ptr->ib_trans.hca = hca_ptr; + hca_ptr->ib_trans.addr.ib.qp_type = IBV_QPT_UD; + memcpy(&hca_ptr->hca_address, + &hca_ptr->ib_trans.addr, + sizeof(union dcm_addr)); + + ibv_free_device_list(dev_list); + + /* wait for cm_thread */ + while (hca_ptr->ib_trans.cm_state != IB_THREAD_RUN) + dapl_os_sleep_usec(1000); + + return dat_status; + +bail: + ucm_service_destroy(hca_ptr); + ibv_close_device(hca_ptr->ib_hca_handle); + hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; + +err: + ibv_free_device_list(dev_list); + return DAT_INTERNAL_ERROR; +} + +/* + * dapls_ib_close_hca + * + * Open HCA + * + * Input: + * DAPL_HCA provide CA handle + * + * Output: + * none + * + * Return: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA * hca_ptr) +{ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: %p\n", 
hca_ptr); + + if (hca_ptr->ib_trans.cm_state == IB_THREAD_RUN) { + hca_ptr->ib_trans.cm_state = IB_THREAD_CANCEL; + send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0); + while (hca_ptr->ib_trans.cm_state != IB_THREAD_EXIT) { + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " close_hca: waiting for cr_thread\n"); + send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0); + dapl_os_sleep_usec(1000); + } + } + + if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) { + if (ibv_close_device(hca_ptr->ib_hca_handle)) + return (dapl_convert_errno(errno, "ib_close_device")); + hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; + } + + dapl_os_lock_destroy(&hca_ptr->ib_trans.lock); + dapl_os_lock_destroy(&hca_ptr->ib_trans.llock); + destroy_cr_pipe(hca_ptr); /* no longer need pipe */ + ucm_service_destroy(hca_ptr); + return (DAT_SUCCESS); +} + +/* Destroy uCM endpoint services, free remote_ah's array */ +static void ucm_service_destroy(IN DAPL_HCA *hca) +{ + ib_hca_transport_t *tp = &hca->ib_trans; + int msg_size = sizeof(ib_cm_msg_t); + + if (tp->pd) + ibv_dealloc_pd(tp->pd); + + if (tp->rch) + ibv_destroy_comp_channel(tp->rch); + + if (tp->scq) + ibv_destroy_cq(tp->scq); + + if (tp->rcq) + ibv_destroy_cq(tp->rcq); + + if (tp->qp) + ibv_destroy_qp(tp->qp); + + if (tp->mr_sbuf) + ibv_dereg_mr(tp->mr_sbuf); + + if (tp->mr_rbuf) + ibv_dereg_mr(tp->mr_rbuf); + + if (tp->ah) + dapl_os_free(tp->ah, (sizeof(*tp->ah) * 0xffff)); + + if (tp->sid) + dapl_os_free(tp->sid, (sizeof(*tp->sid) * 0xffff)); + + if (tp->rbuf) + dapl_os_free(tp->rbuf, (msg_size * tp->qpe)); + + if (tp->sbuf) + dapl_os_free(tp->sbuf, (msg_size * tp->qpe)); +} + +static int ucm_service_create(IN DAPL_HCA *hca) +{ + struct ibv_qp_init_attr qp_create; + ib_hca_transport_t *tp = &hca->ib_trans; + struct ibv_recv_wr recv_wr, *recv_err; + struct ibv_sge sge; + int i, mlen = sizeof(ib_cm_msg_t); + int hlen = sizeof(struct ibv_grh); /* hdr included with UD recv */ + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ucm_create: \n"); + + /* get queue sizes 
*/ + tp->qpe = dapl_os_get_env_val("DAPL_UCM_QPE", UCM_DEFAULT_QPE); + tp->cqe = dapl_os_get_env_val("DAPL_UCM_CQE", UCM_DEFAULT_CQE); + tp->pd = ibv_alloc_pd(hca->ib_hca_handle); + if (!tp->pd) + goto bail; + + tp->rch = ibv_create_comp_channel(hca->ib_hca_handle); + if (!tp->rch) + goto bail; + + tp->scq = ibv_create_cq(hca->ib_hca_handle, tp->cqe, hca, NULL, 0); + if (!tp->scq) + goto bail; + + tp->rcq = ibv_create_cq(hca->ib_hca_handle, tp->cqe, hca, tp->rch, 0); + if (!tp->rcq) + goto bail; + + if(ibv_req_notify_cq(tp->rcq, 0)) + goto bail; + + dapl_os_memzero((void *)&qp_create, sizeof(qp_create)); + qp_create.qp_type = IBV_QPT_UD; + qp_create.send_cq = tp->scq; + qp_create.recv_cq = tp->rcq; + qp_create.cap.max_send_wr = qp_create.cap.max_recv_wr = tp->qpe; + qp_create.cap.max_send_sge = qp_create.cap.max_recv_sge = 1; + qp_create.cap.max_inline_data = tp->max_inline_send; + qp_create.qp_context = (void *)hca; + + tp->qp = ibv_create_qp(tp->pd, &qp_create); + if (!tp->qp) + goto bail; + + tp->ah = (ib_ah_handle_t*) dapl_os_alloc(sizeof(ib_ah_handle_t) * 0xffff); + tp->sid = (uint8_t*) dapl_os_alloc(sizeof(uint8_t) * 0xffff); + tp->rbuf = (void*) dapl_os_alloc((mlen + hlen) * tp->qpe); + tp->sbuf = (void*) dapl_os_alloc(mlen * tp->qpe); + + if (!tp->ah || !tp->rbuf || !tp->sbuf || !tp->sid) + goto bail; + + (void)dapl_os_memzero(tp->ah, (sizeof(ib_ah_handle_t) * 0xffff)); + (void)dapl_os_memzero(tp->sid, (sizeof(uint8_t) * 0xffff)); + tp->sid[0] = 1; /* resv slot 0, 0 == no ports available */ + (void)dapl_os_memzero(tp->rbuf, ((mlen + hlen) * tp->qpe)); + (void)dapl_os_memzero(tp->sbuf, (mlen * tp->qpe)); + + tp->mr_sbuf = ibv_reg_mr(tp->pd, tp->sbuf, + (mlen * tp->qpe), + IBV_ACCESS_LOCAL_WRITE); + if (!tp->mr_sbuf) + goto bail; + + tp->mr_rbuf = ibv_reg_mr(tp->pd, tp->rbuf, + ((mlen + hlen) * tp->qpe), + IBV_ACCESS_LOCAL_WRITE); + if (!tp->mr_rbuf) + goto bail; + + /* modify UD QP: init, rtr, rts */ + if ((dapls_modify_qp_ud(hca, tp->qp)) != DAT_SUCCESS) + 
goto bail; + + /* post receive buffers, setup head, tail pointers */ + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + sge.length = mlen + hlen; + sge.lkey = tp->mr_rbuf->lkey; + + for (i = 0; i < tp->qpe; i++) { + recv_wr.wr_id = + (uintptr_t)((char *)&tp->rbuf[i] + + sizeof(struct ibv_grh)); + sge.addr = (uintptr_t) &tp->rbuf[i]; + if (ibv_post_recv(tp->qp, &recv_wr, &recv_err)) + goto bail; + } + + /* save qp_num as part of ia_address, network order */ + tp->addr.ib.qpn = htonl(tp->qp->qp_num); + return 0; +bail: + dapl_log(DAPL_DBG_TYPE_ERR, + " ucm_create_services: ERR %s\n", strerror(errno)); + ucm_service_destroy(hca); + return -1; +} + +void ucm_async_event(struct dapl_hca *hca) +{ + struct ibv_async_event event; + struct _ib_hca_transport *tp = &hca->ib_trans; + + dapl_log(DAPL_DBG_TYPE_WARN, " async_event(%p)\n", hca); + + if (!ibv_get_async_event(hca->ib_hca_handle, &event)) { + + switch (event.event_type) { + case IBV_EVENT_CQ_ERR: + { + struct dapl_ep *evd_ptr = + event.element.cq->cq_context; + + dapl_log(DAPL_DBG_TYPE_ERR, + "dapl async_event CQ (%p) ERR %d\n", + evd_ptr, event.event_type); + + /* report up if async callback still setup */ + if (tp->async_cq_error) + tp->async_cq_error(hca->ib_hca_handle, + event.element.cq, + &event, (void *)evd_ptr); + break; + } + case IBV_EVENT_COMM_EST: + { + /* Received msgs on connected QP before RTU */ + dapl_log(DAPL_DBG_TYPE_UTIL, + " async_event COMM_EST(%p) rdata beat RTU\n", + event.element.qp); + + break; + } + case IBV_EVENT_QP_FATAL: + case IBV_EVENT_QP_REQ_ERR: + case IBV_EVENT_QP_ACCESS_ERR: + case IBV_EVENT_QP_LAST_WQE_REACHED: + case IBV_EVENT_SRQ_ERR: + case IBV_EVENT_SRQ_LIMIT_REACHED: + case IBV_EVENT_SQ_DRAINED: + { + struct dapl_ep *ep_ptr = + event.element.qp->qp_context; + + dapl_log(DAPL_DBG_TYPE_ERR, + "dapl async_event QP (%p) ERR %d\n", + ep_ptr, event.event_type); + + /* report up if async callback still setup */ + if (tp->async_qp_error) + 
tp->async_qp_error(hca->ib_hca_handle, + ep_ptr->qp_handle, + &event, (void *)ep_ptr); + break; + } + case IBV_EVENT_PATH_MIG: + case IBV_EVENT_PATH_MIG_ERR: + case IBV_EVENT_DEVICE_FATAL: + case IBV_EVENT_PORT_ACTIVE: + case IBV_EVENT_PORT_ERR: + case IBV_EVENT_LID_CHANGE: + case IBV_EVENT_PKEY_CHANGE: + case IBV_EVENT_SM_CHANGE: + { + dapl_log(DAPL_DBG_TYPE_WARN, + "dapl async_event: DEV ERR %d\n", + event.event_type); + + /* report up if async callback still setup */ + if (tp->async_unafiliated) + tp->async_unafiliated(hca->ib_hca_handle, + &event, + tp->async_un_ctx); + break; + } + case IBV_EVENT_CLIENT_REREGISTER: + /* no need to report this event this time */ + dapl_log(DAPL_DBG_TYPE_UTIL, + " async_event: IBV_CLIENT_REREGISTER\n"); + break; + + default: + dapl_log(DAPL_DBG_TYPE_WARN, + "dapl async_event: %d UNKNOWN\n", + event.event_type); + break; + + } + ibv_ack_async_event(&event); + } +} + diff --git a/dapl/openib_ucm/linux/openib_osd.h b/dapl/openib_ucm/linux/openib_osd.h new file mode 100644 index 0000000..191a55b --- /dev/null +++ b/dapl/openib_ucm/linux/openib_osd.h @@ -0,0 +1,21 @@ +#ifndef OPENIB_OSD_H +#define OPENIB_OSD_H + +#include +#include + +#if __BYTE_ORDER == __BIG_ENDIAN +#define htonll(x) (x) +#define ntohll(x) (x) +#elif __BYTE_ORDER == __LITTLE_ENDIAN +#define htonll(x) bswap_64(x) +#define ntohll(x) bswap_64(x) +#endif + +#define DAPL_SOCKET int +#define DAPL_INVALID_SOCKET -1 +#define DAPL_FD_SETSIZE 16 + +#define closesocket close + +#endif // OPENIB_OSD_H diff --git a/dapl/openib_ucm/udapl.rc b/dapl/openib_ucm/udapl.rc new file mode 100644 index 0000000..8550256 --- /dev/null +++ b/dapl/openib_ucm/udapl.rc @@ -0,0 +1,48 @@ +/* + * Copyright (c) 2007, 2009 Intel Corporation. All rights reserved. 
+ * + * This software is available to you under the OpenIB.org BSD license + * below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + + +#include + +#define VER_FILETYPE VFT_DLL +#define VER_FILESUBTYPE VFT2_UNKNOWN + +#if DBG +#define VER_FILEDESCRIPTION_STR "Direct Access Provider Library v2.0 (OFA socket-cm) (Debug)" +#define VER_INTERNALNAME_STR "dapl2-ofa-scmd.dll" +#define VER_ORIGINALFILENAME_STR "dapl2-ofa-scmd.dll" +#else +#define VER_FILEDESCRIPTION_STR "Direct Access Provider Library v2.0 (OFA socket-cm)" +#define VER_INTERNALNAME_STR "dapl2-ofa-scm.dll" +#define VER_ORIGINALFILENAME_STR "dapl2-ofa-scm.dll" +#endif + +#include diff --git a/dapl/openib_ucm/windows/openib_osd.h b/dapl/openib_ucm/windows/openib_osd.h new file mode 100644 index 0000000..7eb3df3 --- /dev/null +++ b/dapl/openib_ucm/windows/openib_osd.h @@ -0,0 +1,35 @@ +#ifndef OPENIB_OSD_H +#define OPENIB_OSD_H + +#ifndef FD_SETSIZE +#define FD_SETSIZE 1024 /* Set before including winsock2 - see select help */ +#define DAPL_FD_SETSIZE FD_SETSIZE +#endif + +#include +#include +#include +#include + +#define ntohll _byteswap_uint64 +#define htonll _byteswap_uint64 + +#define DAPL_SOCKET SOCKET +#define DAPL_INVALID_SOCKET INVALID_SOCKET + +/* allow casting to WSABUF */ +struct iovec +{ + u_long iov_len; + char FAR* iov_base; +}; + +static int writev(DAPL_SOCKET s, struct iovec *vector, int count) +{ + int len, ret; + + ret = WSASend(s, (WSABUF *) vector, count, &len, 0, NULL, NULL); + return ret ? 
ret : len; +} + +#endif // OPENIB_OSD_H diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index d868490..2f418fe 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -76,6 +76,7 @@ #include #include #include +#include #define DAPL_PROVIDER "ofa-v2-ib0" @@ -99,14 +100,15 @@ #define MAX_PROCS 1000 /* Header files needed for DAT/uDAPL */ -#include "dat2/udat.h" +#include "infiniband/verbs.h" +#include "dat2/udat.h" /* definitions */ #define SERVER_CONN_QUAL 45248 #define DTO_TIMEOUT (1000*1000*5) #define CNO_TIMEOUT (1000*1000*1) #define DTO_FLUSH_TIMEOUT (1000*1000*2) -#define CONN_TIMEOUT (1000*1000*10) +#define CONN_TIMEOUT (1000*1000*100) #define SERVER_TIMEOUT DAT_TIMEOUT_INFINITE #define RDMA_BUFFER_SIZE (64) @@ -187,7 +189,7 @@ struct dt_time { double conn; }; -struct dt_time time; +struct dt_time ts; /* defaults */ static int failed = 0; @@ -207,6 +209,22 @@ static int use_cno = 0; static int recv_msg_index = 0; static int burst_msg_posted = 0; static int burst_msg_index = 0; +static int ucm = 0; + +/* IB address structure used by DAPL uCM provider */ +union dcm_addr { + DAT_SOCK_ADDR6 so; + struct { + uint8_t qp_type; + uint8_t port_num; + uint16_t lid; + uint32_t qpn; + union ibv_gid gid; + } ib; +}; + +static union dcm_addr remote; +static union dcm_addr local; /* forward prototypes */ const char *DT_RetToStr(DAT_RETURN ret_value); @@ -313,9 +331,10 @@ int main(int argc, char **argv) int i, c; DAT_RETURN ret; DAT_EP_PARAM ep_param; + DAT_IA_ATTR ia_attr; /* parse arguments */ - while ((c = getopt(argc, argv, "tscvpb:d:B:h:P:")) != -1) { + while ((c = getopt(argc, argv, "tscvpq:l:b:d:B:h:P:")) != -1) { switch (c) { case 't': performance_times = 1; @@ -340,6 +359,16 @@ int main(int argc, char **argv) printf("%d Polling\n", getpid()); fflush(stdout); break; + case 'q': + remote.ib.qpn = htonl(strtol(optarg,NULL,0)); + ucm = 1; + server = 0; + break; + case 'l': + remote.ib.lid = htons(strtol(optarg,NULL,0)); + ucm = 1; + server = 0; + break; case 
'B': burst = atoi(optarg); break; @@ -389,7 +418,7 @@ int main(int argc, char **argv) perror("malloc"); exit(1); } - memset(&time, 0, sizeof(struct dt_time)); + memset(&ts, 0, sizeof(struct dt_time)); LOGPRINTF("%d Allocated RDMA buffers (r:%p,s:%p) len %d \n", getpid(), rbuf, sbuf, buf_len); @@ -398,7 +427,7 @@ int main(int argc, char **argv) start = get_time(); ret = dat_ia_open(provider, 8, &h_async_evd, &h_ia); stop = get_time(); - time.open += ((stop - start) * 1.0e6); + ts.open += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d: Error Adaptor open: %s\n", getpid(), DT_RetToStr(ret)); @@ -406,12 +435,34 @@ int main(int argc, char **argv) } else LOGPRINTF("%d Opened Interface Adaptor\n", getpid()); + printf("%d query \n", getpid()); + + ret = dat_ia_query(h_ia, 0, DAT_IA_FIELD_ALL, &ia_attr, 0, 0); + if (ret != DAT_SUCCESS) { + fprintf(stderr, "%d: Error Adaptor query: %s\n", + getpid(), DT_RetToStr(ret)); + exit(1); + } + memcpy((void*)&local, + (void*)ia_attr.ia_address_ptr, + sizeof(DAT_SOCK_ADDR6)); + + if (local.ib.qp_type == IBV_QPT_UD) { + ucm = 1; + printf("%d Local uCM Address = QPN=0x%x, LID=0x%x\n", + getpid(), ntohl(local.ib.qpn), + ntohs(local.ib.lid)); + printf("%d Remote uCM Address = QPN=0x%x, LID=0x%x\n", + getpid(), ntohl(remote.ib.qpn), + ntohs(remote.ib.lid)); + } + /* Create Protection Zone */ start = get_time(); LOGPRINTF("%d Create Protection Zone\n", getpid()); ret = dat_pz_create(h_ia, &h_pz); stop = get_time(); - time.pzc += ((stop - start) * 1.0e6); + ts.pzc += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error creating Protection Zone: %s\n", getpid(), DT_RetToStr(ret)); @@ -461,8 +512,8 @@ int main(int argc, char **argv) ret = dat_ep_create(h_ia, h_pz, h_dto_rcv_evd, h_dto_req_evd, h_conn_evd, &ep_attr, &h_ep); stop = get_time(); - time.epc += ((stop - start) * 1.0e6); - time.total += time.epc; + ts.epc += ((stop - start) * 1.0e6); + ts.total += ts.epc; if (ret != DAT_SUCCESS) { 
fprintf(stderr, "%d Error dat_ep_create: %s\n", getpid(), DT_RetToStr(ret)); @@ -570,8 +621,8 @@ complete: start = get_time(); ret = dat_ep_free(h_ep); stop = get_time(); - time.epf += ((stop - start) * 1.0e6); - time.total += time.epf; + ts.epf += ((stop - start) * 1.0e6); + ts.total += ts.epf; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing EP: %s\n", getpid(), DT_RetToStr(ret)); @@ -603,7 +654,7 @@ complete: start = get_time(); ret = dat_pz_free(h_pz); stop = get_time(); - time.pzf += ((stop - start) * 1.0e6); + ts.pzf += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing PZ: %s\n", getpid(), DT_RetToStr(ret)); @@ -617,7 +668,7 @@ complete: start = get_time(); ret = dat_ia_close(h_ia, DAT_CLOSE_ABRUPT_FLAG); stop = get_time(); - time.close += ((stop - start) * 1.0e6); + ts.close += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d: Error Adaptor close: %s\n", getpid(), DT_RetToStr(ret)); @@ -640,35 +691,35 @@ complete: printf("\n%d: DAPL Test Complete.\n\n", getpid()); printf("%d: Message RTT: Total=%10.2lf usec, %d bursts, itime=%10.2lf" " usec, pc=%d\n", - getpid(), time.rtt, burst, time.rtt / burst, poll_count); + getpid(), ts.rtt, burst, ts.rtt / burst, poll_count); printf("%d: RDMA write: Total=%10.2lf usec, %d bursts, itime=%10.2lf" " usec, pc=%d\n", - getpid(), time.rdma_wr, burst, - time.rdma_wr / burst, rdma_wr_poll_count); + getpid(), ts.rdma_wr, burst, + ts.rdma_wr / burst, rdma_wr_poll_count); for (i = 0; i < MAX_RDMA_RD; i++) { printf("%d: RDMA read: Total=%10.2lf usec, %d bursts, " "itime=%10.2lf usec, pc=%d\n", - getpid(), time.rdma_rd_total, MAX_RDMA_RD, - time.rdma_rd[i], rdma_rd_poll_count[i]); + getpid(), ts.rdma_rd_total, MAX_RDMA_RD, + ts.rdma_rd[i], rdma_rd_poll_count[i]); } - printf("%d: open: %10.2lf usec\n", getpid(), time.open); - printf("%d: close: %10.2lf usec\n", getpid(), time.close); - printf("%d: PZ create: %10.2lf usec\n", getpid(), time.pzc); - printf("%d: PZ 
free: %10.2lf usec\n", getpid(), time.pzf); - printf("%d: LMR create:%10.2lf usec\n", getpid(), time.reg); - printf("%d: LMR free: %10.2lf usec\n", getpid(), time.unreg); - printf("%d: EVD create:%10.2lf usec\n", getpid(), time.evdc); - printf("%d: EVD free: %10.2lf usec\n", getpid(), time.evdf); + printf("%d: open: %10.2lf usec\n", getpid(), ts.open); + printf("%d: close: %10.2lf usec\n", getpid(), ts.close); + printf("%d: PZ create: %10.2lf usec\n", getpid(), ts.pzc); + printf("%d: PZ free: %10.2lf usec\n", getpid(), ts.pzf); + printf("%d: LMR create:%10.2lf usec\n", getpid(), ts.reg); + printf("%d: LMR free: %10.2lf usec\n", getpid(), ts.unreg); + printf("%d: EVD create:%10.2lf usec\n", getpid(), ts.evdc); + printf("%d: EVD free: %10.2lf usec\n", getpid(), ts.evdf); if (use_cno) { - printf("%d: CNO create: %10.2lf usec\n", getpid(), time.cnoc); - printf("%d: CNO free: %10.2lf usec\n", getpid(), time.cnof); + printf("%d: CNO create: %10.2lf usec\n", getpid(), ts.cnoc); + printf("%d: CNO free: %10.2lf usec\n", getpid(), ts.cnof); } - printf("%d: EP create: %10.2lf usec\n", getpid(), time.epc); - printf("%d: EP free: %10.2lf usec\n", getpid(), time.epf); + printf("%d: EP create: %10.2lf usec\n", getpid(), ts.epc); + printf("%d: EP free: %10.2lf usec\n", getpid(), ts.epf); if (!server) printf("%d: connect: %10.2lf usec, poll_cnt=%d\n", - getpid(), time.conn, conn_poll_count); - printf("%d: TOTAL: %10.2lf usec\n", getpid(), time.total); + getpid(), ts.conn, conn_poll_count); + printf("%d: TOTAL: %10.2lf usec\n", getpid(), ts.total); #if defined(_WIN32) || defined(_WIN64) WSACleanup(); @@ -676,6 +727,17 @@ complete: return (0); } +#if defined(_WIN32) || defined(_WIN64) +void gettimeofday(struct timeval *t, char *jnk) +{ + SYSTEMTIME now; + GetLocalTime(&now); + t->tv_sec = now.wMinute * 60; + t->tv_sec += now.wSecond; + t->tv_usec = now.wMilliseconds; +} +#endif + double get_time(void) { struct timeval tp; @@ -761,7 +823,7 @@ send_msg(void *data, DAT_RETURN 
connect_ep(char *hostname, DAT_CONN_QUAL conn_id) { - DAT_SOCK_ADDR remote_addr; + DAT_IA_ADDRESS_PTR remote_addr = (DAT_IA_ADDRESS_PTR)&remote; DAT_RETURN ret; DAT_REGION_DESCRIPTION region; DAT_EVENT event; @@ -953,6 +1015,9 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) struct addrinfo *target; int rval; + if (ucm) + goto no_resolution; + #if defined(_WIN32) || defined(_WIN64) if ((rval = getaddrinfo(hostname, "ftp", NULL, &target)) != 0) { printf("\n remote name resolution failed! %s\n", @@ -972,16 +1037,15 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) getpid(), (rval >> 0) & 0xff, (rval >> 8) & 0xff, (rval >> 16) & 0xff, (rval >> 24) & 0xff, conn_id); - remote_addr = *((DAT_IA_ADDRESS_PTR) target->ai_addr); - freeaddrinfo(target); - + remote_addr = (DAT_IA_ADDRESS_PTR)&target->ai_addr; /* IP */ +no_resolution: for (i = 0; i < 48; i++) /* simple pattern in private data */ pdata[i] = i + 1; LOGPRINTF("%d Connecting to server\n", getpid()); start = get_time(); ret = dat_ep_connect(h_ep, - &remote_addr, + remote_addr, conn_id, CONN_TIMEOUT, 48, @@ -993,6 +1057,9 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) return (ret); } else LOGPRINTF("%d dat_ep_connect completed\n", getpid()); + + if (!ucm) + freeaddrinfo(target); } printf("%d Waiting for connect response\n", getpid()); @@ -1007,7 +1074,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) if (!server) { stop = get_time(); - time.conn += ((stop - start) * 1.0e6); + ts.conn += ((stop - start) * 1.0e6); } #ifdef TEST_REJECT_WITH_PRIVATE_DATA @@ -1307,7 +1374,7 @@ DAT_RETURN do_rdma_write_with_msg(void) return (DAT_ABORT); stop = get_time(); - time.rdma_wr = ((stop - start) * 1.0e6); + ts.rdma_wr = ((stop - start) * 1.0e6); /* validate event number and status */ printf("%d inbound rdma_write; send message arrived!\n", getpid()); @@ -1436,8 +1503,8 @@ DAT_RETURN do_rdma_read_with_msg(void) return (DAT_ABORT); } stop = get_time(); - time.rdma_rd[i] = 
((stop - start) * 1.0e6); - time.rdma_rd_total += time.rdma_rd[i]; + ts.rdma_rd[i] = ((stop - start) * 1.0e6); + ts.rdma_rd_total += ts.rdma_rd[i]; LOGPRINTF("%d rdma_read # %d completed\n", getpid(), i + 1); } @@ -1675,7 +1742,7 @@ DAT_RETURN do_ping_pong_msg() snd_buf += buf_len; } stop = get_time(); - time.rtt = ((stop - start) * 1.0e6); + ts.rtt = ((stop - start) * 1.0e6); return (DAT_SUCCESS); } @@ -1700,8 +1767,8 @@ DAT_RETURN register_rdma_memory(void) &rmr_context_recv, &registered_size_recv, &registered_addr_recv); stop = get_time(); - time.reg += ((stop - start) * 1.0e6); - time.total += time.reg; + ts.reg += ((stop - start) * 1.0e6); + ts.total += ts.reg; if (ret != DAT_SUCCESS) { fprintf(stderr, @@ -1751,8 +1818,8 @@ DAT_RETURN unregister_rdma_memory(void) start = get_time(); ret = dat_lmr_free(h_lmr_recv); stop = get_time(); - time.unreg += ((stop - start) * 1.0e6); - time.total += time.unreg; + ts.unreg += ((stop - start) * 1.0e6); + ts.total += ts.unreg; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error deregistering recv mr: %s\n", getpid(), DT_RetToStr(ret)); @@ -1801,8 +1868,8 @@ DAT_RETURN create_events(void) &h_dto_cno); #endif stop = get_time(); - time.cnoc += ((stop - start) * 1.0e6); - time.total += time.cnoc; + ts.cnoc += ((stop - start) * 1.0e6); + ts.total += ts.cnoc; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_cno_create: %s\n", getpid(), DT_RetToStr(ret)); @@ -1819,8 +1886,8 @@ DAT_RETURN create_events(void) dat_evd_create(h_ia, 10, DAT_HANDLE_NULL, DAT_EVD_CR_FLAG, &h_cr_evd); stop = get_time(); - time.evdc += ((stop - start) * 1.0e6); - time.total += time.evdc; + ts.evdc += ((stop - start) * 1.0e6); + ts.total += ts.evdc; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_evd_create: %s\n", getpid(), DT_RetToStr(ret)); @@ -1930,8 +1997,8 @@ DAT_RETURN destroy_events(void) start = get_time(); ret = dat_evd_free(h_dto_rcv_evd); stop = get_time(); - time.evdf += ((stop - start) * 1.0e6); - time.total += time.evdf; + ts.evdf 
+= ((stop - start) * 1.0e6); + ts.total += ts.evdf; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing dto EVD: %s\n", getpid(), DT_RetToStr(ret)); @@ -1962,8 +2029,8 @@ DAT_RETURN destroy_events(void) start = get_time(); ret = dat_cno_free(h_dto_cno); stop = get_time(); - time.cnof += ((stop - start) * 1.0e6); - time.total += time.cnof; + ts.cnof += ((stop - start) * 1.0e6); + ts.total += ts.cnof; if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error freeing dto CNO: %s\n", getpid(), DT_RetToStr(ret)); @@ -2048,6 +2115,8 @@ void print_usage(void) printf("B: burst count, rdma and msgs \n"); printf("h: hostname/address of server, specified on client\n"); printf("P: provider name (default = OpenIB-cma)\n"); + printf("l: server lid (required ucm provider)\n"); + printf("q: server qpn (required ucm provider)\n"); printf("\n"); } diff --git a/test/dtest/dtestcm.c b/test/dtest/dtestcm.c index 71d9350..5b0272a 100644 --- a/test/dtest/dtestcm.c +++ b/test/dtest/dtestcm.c @@ -76,6 +76,7 @@ #include #include #include +#include #define DAPL_PROVIDER "ofa-v2-mlx4_0-1" @@ -96,8 +97,24 @@ #define MAX_POLLING_CNT 50000 /* Header files needed for DAT/uDAPL */ -#include "dat2/udat.h" -#include "dat2/dat_ib_extensions.h" +#include "infiniband/verbs.h" +#include "dat2/udat.h" +#include "dat2/dat_ib_extensions.h" + +/* IB address structure used by DAPL uCM provider */ +union dcm_addr { + DAT_SOCK_ADDR6 so; + struct { + uint8_t qp_type; + uint8_t port_num; + uint16_t lid; + uint32_t qpn; + union ibv_gid gid; + } ib; +}; + +static union dcm_addr remote; +static union dcm_addr local; /* definitions */ #define SERVER_CONN_QUAL 45248 @@ -145,7 +162,7 @@ struct dt_time { double conn; }; -struct dt_time time; +struct dt_time ts; /* defaults */ static int connected = 0; @@ -160,6 +177,7 @@ static int delay = 0; static int connections = 1000; static int burst = 100; static int port_id = SERVER_CONN_QUAL; +static int ucm = 0; /* forward prototypes */ const char *DT_RetToString(DAT_RETURN 
ret_value); @@ -191,9 +209,10 @@ int main(int argc, char **argv) { int i, c, len; DAT_RETURN ret; + DAT_IA_ATTR ia_attr; /* parse arguments */ - while ((c = getopt(argc, argv, "smwvub:c:d:h:P:p:")) != -1) { + while ((c = getopt(argc, argv, "smwvub:c:d:h:P:p:q:l:")) != -1) { switch (c) { case 's': server = 1; @@ -230,6 +249,16 @@ int main(int argc, char **argv) case 'P': strcpy(provider, optarg); break; + case 'q': + remote.ib.qpn = htonl(strtol(optarg,NULL,0)); + ucm = 1; + server = 0; + break; + case 'l': + remote.ib.lid = htons(strtol(optarg,NULL,0)); + ucm = 1; + server = 0; + break; default: print_usage(); exit(-12); @@ -283,14 +312,14 @@ int main(int argc, char **argv) exit(1); } memset(h_psp, 0, len); - memset(&time, 0, sizeof(struct dt_time)); + memset(&ts, 0, sizeof(struct dt_time)); /* dat_ia_open, dat_pz_create */ h_async_evd = DAT_HANDLE_NULL; start = get_time(); ret = dat_ia_open(provider, 8, &h_async_evd, &h_ia); stop = get_time(); - time.open += ((stop - start) * 1.0e6); + ts.open += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, " Error Adaptor open: %s\n", DT_RetToString(ret)); @@ -298,12 +327,33 @@ int main(int argc, char **argv) } else LOGPRINTF(" Opened Interface Adaptor\n"); + /* query for UCM addressing */ + ret = dat_ia_query(h_ia, 0, DAT_IA_FIELD_ALL, &ia_attr, 0, 0); + if (ret != DAT_SUCCESS) { + fprintf(stderr, "%d: Error Adaptor query: %s\n", + getpid(), DT_RetToString(ret)); + exit(1); + } + memcpy((void*)&local, + (void*)ia_attr.ia_address_ptr, + sizeof(DAT_SOCK_ADDR6)); + + if (local.ib.qp_type == IBV_QPT_UD) { + ucm = 1; + printf("%d Local uCM Address = QPN=0x%x, LID=0x%x\n", + getpid(), ntohl(local.ib.qpn), + ntohs(local.ib.lid)); + printf("%d Remote uCM Address = QPN=0x%x, LID=0x%x\n", + getpid(), ntohl(remote.ib.qpn), + ntohs(remote.ib.lid)); + } + /* Create Protection Zone */ start = get_time(); LOGPRINTF(" Create Protection Zone\n"); ret = dat_pz_create(h_ia, &h_pz); stop = get_time(); - time.pzc += ((stop - 
start) * 1.0e6); + ts.pzc += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, " Error creating Protection Zone: %s\n", DT_RetToString(ret)); @@ -345,8 +395,8 @@ int main(int argc, char **argv) &ep_attr, &h_ep[i]); } stop = get_time(); - time.epc += ((stop - start) * 1.0e6); - time.total += time.epc; + ts.epc += ((stop - start) * 1.0e6); + ts.total += ts.epc; if (ret != DAT_SUCCESS) { fprintf(stderr, " Error dat_ep_create: %s\n", DT_RetToString(ret)); @@ -447,7 +497,7 @@ complete: start = get_time(); ret = dat_pz_free(h_pz); stop = get_time(); - time.pzf += ((stop - start) * 1.0e6); + ts.pzf += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, " Error freeing PZ: %s\n", DT_RetToString(ret)); @@ -462,7 +512,7 @@ complete: start = get_time(); ret = dat_ia_close(h_ia, DAT_CLOSE_ABRUPT_FLAG); stop = get_time(); - time.close += ((stop - start) * 1.0e6); + ts.close += ((stop - start) * 1.0e6); if (ret != DAT_SUCCESS) { fprintf(stderr, " Error Adaptor close: %s\n", DT_RetToString(ret)); @@ -471,25 +521,25 @@ complete: LOGPRINTF(" Closed Interface Adaptor\n"); printf(" DAPL Connection Test Complete.\n"); - printf(" open: %10.2lf usec\n", time.open); - printf(" close: %10.2lf usec\n", time.close); - printf(" PZ create: %10.2lf usec\n", time.pzc); - printf(" PZ free: %10.2lf usec\n", time.pzf); - printf(" LMR create:%10.2lf usec\n", time.reg); - printf(" LMR free: %10.2lf usec\n", time.unreg); - printf(" EVD create:%10.2lf usec\n", time.evdc); - printf(" EVD free: %10.2lf usec\n", time.evdf); - printf(" EP create: %10.2lf usec avg\n", time.epc/connections); - printf(" EP free: %10.2lf usec avg\n", time.epf/connections); + printf(" open: %10.2lf usec\n", ts.open); + printf(" close: %10.2lf usec\n", ts.close); + printf(" PZ create: %10.2lf usec\n", ts.pzc); + printf(" PZ free: %10.2lf usec\n", ts.pzf); + printf(" LMR create:%10.2lf usec\n", ts.reg); + printf(" LMR free: %10.2lf usec\n", ts.unreg); + printf(" EVD create:%10.2lf usec\n", 
ts.evdc); + printf(" EVD free: %10.2lf usec\n", ts.evdf); + printf(" EP create: %10.2lf usec avg\n", ts.epc/connections); + printf(" EP free: %10.2lf usec avg\n", ts.epf/connections); if (!server) { printf(" Connections: %8.2lf usec, CPS %7.2lf " "Total %4.2lf secs, poll_cnt=%u, Num=%d\n", - (double)(time.conn/connections), - (double)(1/(time.conn/1000000/connections)), - (double)(time.conn/1000000), + (double)(ts.conn/connections), + (double)(1/(ts.conn/1000000/connections)), + (double)(ts.conn/1000000), conn_poll_count, connections); } - printf(" TOTAL: %4.2lf sec\n", time.total/1000000); + printf(" TOTAL: %4.2lf sec\n", ts.total/1000000); fflush(stderr); fflush(stdout); bail: free(h_ep); @@ -501,6 +551,19 @@ bail: return (0); } +#if defined(_WIN32) || defined(_WIN64) + +void gettimeofday(struct timeval *t, char *jnk) +{ + SYSTEMTIME now; + GetLocalTime(&now); + t->tv_sec = now.wMinute * 60; + t->tv_sec += now.wSecond; + t->tv_usec = now.wMilliseconds; +} + +#endif + double get_time(void) { struct timeval tp; @@ -644,7 +707,7 @@ DAT_RETURN conn_server() DAT_RETURN conn_client() { - DAT_SOCK_ADDR raddr; + DAT_IA_ADDRESS_PTR raddr = (DAT_IA_ADDRESS_PTR)&remote; DAT_RETURN ret; DAT_EVENT event; DAT_COUNT nmore; @@ -657,6 +720,9 @@ DAT_RETURN conn_client() struct addrinfo *target; int rval; + if (ucm) + goto no_resolution; + #if defined(_WIN32) || defined(_WIN64) if ((rval = getaddrinfo(hostname, "ftp", NULL, &target)) != 0) { printf("\n remote name resolution failed! 
%s\n", @@ -677,8 +743,9 @@ DAT_RETURN conn_client() (rval >> 16) & 0xff, (rval >> 24) & 0xff, port_id); - raddr = *((DAT_IA_ADDRESS_PTR)target->ai_addr); - freeaddrinfo(target); + raddr = (DAT_IA_ADDRESS_PTR)target->ai_addr; + +no_resolution: for (i = 0; i < 48; i++) /* simple pattern in private data */ pdata[i] = i + 1; @@ -692,7 +759,7 @@ DAT_RETURN conn_client() else conn_id = port_id; - ret = dat_ep_connect(h_ep[i+ii], &raddr, + ret = dat_ep_connect(h_ep[i+ii], raddr, conn_id, CONN_TIMEOUT, 48, (DAT_PVOID) pdata, 0, DAT_CONNECT_DEFAULT_FLAG); @@ -790,7 +857,10 @@ DAT_RETURN conn_client() } stop = get_time(); - time.conn += ((stop - start) * 1.0e6); + ts.conn += ((stop - start) * 1.0e6); + + if (!ucm) + freeaddrinfo(target); printf("\n ALL %d CONNECTED on Client!\n\n", connections); @@ -825,8 +895,8 @@ DAT_RETURN disconnect_eps(void) } } stop = get_time(); - time.epf += ((stop - start) * 1.0e6); - time.total += time.epf; + ts.epf += ((stop - start) * 1.0e6); + ts.total += ts.epf; return DAT_SUCCESS; } @@ -900,8 +970,8 @@ DAT_RETURN disconnect_eps(void) } /* free EPs */ stop = get_time(); - time.epf += ((stop - start) * 1.0e6); - time.total += time.epf; + ts.epf += ((stop - start) * 1.0e6); + ts.total += ts.epf; return DAT_SUCCESS; } @@ -918,8 +988,8 @@ DAT_RETURN create_events(void) ret = dat_evd_create(h_ia, connections, DAT_HANDLE_NULL, DAT_EVD_CR_FLAG, &h_cr_evd); stop = get_time(); - time.evdc += ((stop - start) * 1.0e6); - time.total += time.evdc; + ts.evdc += ((stop - start) * 1.0e6); + ts.total += ts.evdc; if (ret != DAT_SUCCESS) { fprintf(stderr, " Error dat_evd_create: %s\n", DT_RetToString(ret)); @@ -1009,8 +1079,8 @@ DAT_RETURN destroy_events(void) start = get_time(); ret = dat_evd_free(h_dto_rcv_evd); stop = get_time(); - time.evdf += ((stop - start) * 1.0e6); - time.total += time.evdf; + ts.evdf += ((stop - start) * 1.0e6); + ts.total += ts.evdf; if (ret != DAT_SUCCESS) { fprintf(stderr, " Error freeing dto EVD: %s\n", DT_RetToString(ret)); diff 
--git a/test/dtest/dtestx.c b/test/dtest/dtestx.c index a14785b..af87af0 100755 --- a/test/dtest/dtestx.c +++ b/test/dtest/dtestx.c @@ -65,6 +65,7 @@ #endif +#include "infiniband/verbs.h" #include "dat2/udat.h" #include "dat2/dat_ib_extensions.h" @@ -178,6 +179,22 @@ int eps = 1; int verbose = 0; int counters = 0; int counters_ok = 0; +static int ucm = 0; + +/* IB address structure used by DAPL uCM provider */ +union dcm_addr { + DAT_SOCK_ADDR6 so; + struct { + uint8_t qp_type; + uint8_t port_num; + uint16_t lid; + uint32_t qpn; + union ibv_gid gid; + } ib; +}; + +static union dcm_addr remote; +static union dcm_addr local; #define LOGPRINTF if (verbose) printf @@ -392,8 +409,9 @@ void process_conn(int idx) int connect_ep(char *hostname) { - DAT_SOCK_ADDR remote_addr; + DAT_IA_ADDRESS_PTR remote_addr = (DAT_IA_ADDRESS_PTR)&remote; DAT_EP_ATTR ep_attr; + DAT_IA_ATTR ia_attr; DAT_RETURN status; DAT_REGION_DESCRIPTION region; DAT_EVENT event; @@ -412,10 +430,26 @@ int connect_ep(char *hostname) _OK(status, "dat_ia_open"); memset(&prov_attrs, 0, sizeof(prov_attrs)); - status = dat_ia_query(ia, NULL, 0, NULL, + status = dat_ia_query(ia, NULL, + DAT_IA_FIELD_ALL, &ia_attr, DAT_PROVIDER_FIELD_ALL, &prov_attrs); _OK(status, "dat_ia_query"); + memcpy((void*)&local, + (void*)ia_attr.ia_address_ptr, + sizeof(DAT_SOCK_ADDR6)); + + if (local.ib.qp_type == IBV_QPT_UD) { + ucm = 1; + printf("%d Local uCM Address = QPN=0x%x, LID=0x%x\n", + getpid(), ntohl(local.ib.qpn), + ntohs(local.ib.lid)); + printf("%d Remote uCM Address = QPN=0x%x, LID=0x%x\n", + getpid(), ntohl(remote.ib.qpn), + ntohs(remote.ib.lid)); + } + + /* Print provider specific attributes */ for (i = 0; i < prov_attrs.num_provider_specific_attr; i++) { LOGPRINTF(" Provider Specific Attribute[%d] %s=%s\n", @@ -567,6 +601,9 @@ int connect_ep(char *hostname) if (!server || (server && ud_test)) { struct addrinfo *target; + if (ucm) + goto no_resolution; + if (getaddrinfo(hostname, NULL, NULL, &target) != 0) { 
printf("Error getting remote address.\n"); exit(1); @@ -579,10 +616,11 @@ int connect_ep(char *hostname) inet_ntoa(((struct sockaddr_in *) target->ai_addr)->sin_addr)); - remote_addr = *((DAT_IA_ADDRESS_PTR) target->ai_addr); - freeaddrinfo(target); strcpy((char *)buf[SND_RDMA_BUF_INDEX], "Client written data"); - + + remote_addr = (DAT_IA_ADDRESS_PTR)&target->ai_addr; /* IP */ +no_resolution: + /* one Client EP, multiple Server EPs, same conn_qual * use private data to select EP on Server */ @@ -596,13 +634,16 @@ int connect_ep(char *hostname) pdata = 0; /* just use first EP */ status = dat_ep_connect(ep[0], - &remote_addr, + remote_addr, (server ? CLIENT_ID : SERVER_ID), CONN_TIMEOUT, 4, (DAT_PVOID) & pdata, 0, DAT_CONNECT_DEFAULT_FLAG); _OK(status, "dat_ep_connect"); } + + if (!ucm) + freeaddrinfo(target); } /* UD: process CR's starting with 2nd on server, 1st for client */ @@ -721,7 +762,19 @@ int disconnect_ep(void) DAT_EVENT event; DAT_COUNT nmore; int i; + + if (counters) { /* examples of query and print */ + int ii; + DAT_UINT64 ia_cntrs[DCNT_IA_ALL_COUNTERS]; + dat_query_counters(ia, DCNT_IA_ALL_COUNTERS, ia_cntrs, 0); + printf(" IA Cntrs:"); + for (ii = 0; ii < DCNT_IA_ALL_COUNTERS; ii++) + printf(" " F64u "", ia_cntrs[ii]); + printf("\n"); + dat_print_counters(ia, DCNT_IA_ALL_COUNTERS, 0); + } + if (!ud_test) { status = dat_ep_disconnect(ep[0], DAT_CLOSE_DEFAULT); _OK2(status, "dat_ep_disconnect"); @@ -797,17 +850,6 @@ int disconnect_ep(void) status = dat_pz_free(pz); _OK2(status, "dat_pz_free"); - if (counters) { /* examples of query and print */ - int ii; - DAT_UINT64 ia_cntrs[DCNT_IA_ALL_COUNTERS]; - - dat_query_counters(ia, DCNT_IA_ALL_COUNTERS, ia_cntrs, 0); - printf(" IA Cntrs:"); - for (ii = 0; ii < DCNT_IA_ALL_COUNTERS; ii++) - printf(" " F64u "", ia_cntrs[ii]); - printf("\n"); - dat_print_counters(ia, DCNT_IA_ALL_COUNTERS, 0); - } status = dat_ia_close(ia, DAT_CLOSE_DEFAULT); _OK2(status, "dat_ia_close"); @@ -1200,7 +1242,7 @@ int main(int argc, 
char **argv) int rc; /* parse arguments */ - while ((rc = getopt(argc, argv, "csvumpU:h:b:P:")) != -1) { + while ((rc = getopt(argc, argv, "csvumpU:h:b:P:q:l:")) != -1) { switch (rc) { case 'u': ud_test = 1; @@ -1235,6 +1277,16 @@ int main(int argc, char **argv) case 'v': verbose = 1; break; + case 'q': + remote.ib.qpn = htonl(strtol(optarg,NULL,0)); + ucm = 1; + server = 0; + break; + case 'l': + remote.ib.lid = htons(strtol(optarg,NULL,0)); + ucm = 1; + server = 0; + break; default: print_usage(); exit(-12); -- 1.5.2.5 From rdreier at cisco.com Tue Aug 18 12:39:21 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Aug 2009 12:39:21 -0700 Subject: [ofa-general] What does IBV_WC_REM_OP_ERR after a verb send indicate? In-Reply-To: <4A89DD47.5030902@riorey.com> (Nitin Mehrotra's message of "Mon, 17 Aug 2009 18:44:23 -0400") References: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com> <1048182029.3001250276541490.JavaMail.root@zmail.riorey.com> <9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com> <4A89DD47.5030902@riorey.com> Message-ID: > I am getting this error on a verb send operation and I can't figure > out what could be the cause; I searched for all instances of this > error in the IB code and while I found 4, none was illuminating. IBV_WC_REM_OP_ERR corresponds to "Remote Operation Error," which the IB spec describes as: The operation could not be completed successfully by the responder. Possible causes include a responder QP related error that prevented the responder from completing the request or a malformed WQE on the Receive Queue. Usually means a memory protection problem on the remote end. - R. 
From swise at opengridcomputing.com Tue Aug 18 13:01:26 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 18 Aug 2009 15:01:26 -0500 Subject: [ofa-general] Re: [PATCH] krping: Add support for fast_reg_mr with dma_local_lkey In-Reply-To: <20090818191203.GB20947@opengridcomputing.com> References: <20090818191203.GB20947@opengridcomputing.com> Message-ID: <4A8B0896.2050107@opengridcomputing.com> Applied. Thanks, Steve. From nmehrotra at riorey.com Tue Aug 18 13:34:43 2009 From: nmehrotra at riorey.com (Nitin Mehrotra) Date: Tue, 18 Aug 2009 16:34:43 -0400 Subject: [ofa-general] What does IBV_WC_REM_OP_ERR after a verb send indicate? In-Reply-To: References: <1770580407.2911250276033741.JavaMail.root@zmail.riorey.com> <1048182029.3001250276541490.JavaMail.root@zmail.riorey.com> <9A7396C9CD4746EA9474428B1BB6F0EA@amr.corp.intel.com> <4A89DD47.5030902@riorey.com> Message-ID: <4A8B1063.9020605@riorey.com> Roland, Thanks for your response; we suspected that because of the way we allocate buffers and are reworking the code to change it. Good to see your response because that means it's probably not a wasted effort. Nitin Roland Dreier wrote: > > I am getting this error on a verb send operation and I can't figure > > out what could be the cause; I searched for all instances of this > > error in the IB code and while I found 4, none was illuminating. > > IBV_WC_REM_OP_ERR corresponds to "Remote Operation Error," which the IB > spec describes as: > > The operation could not be completed successfully by the > responder. Possible causes include a responder QP related error that > prevented the responder from completing the request or a malformed > WQE on the Receive Queue. > > Usually means a memory protection problem on the remote end. > > - R. 
From vlad at lists.openfabrics.org Wed Aug 19 03:04:08 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 19 Aug 2009 03:04:08 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090819-0200 daily build status Message-ID: <20090819100409.3CEBCE2822E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 
'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory 
`/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090819-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From eli at mellanox.co.il Wed Aug 19 07:38:43 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 19 Aug 2009 17:38:43 +0300 Subject: [ofa-general] [PATCHv5 01/10] ib_core: Refine device personality from node type to port type Message-ID: <20090819143843.GA8675@mtls03> As a preparation to devices that, in general, support a different transport protocol for each port, specifically RDMAoE, this patch defines a transport type for each of a device's ports. 
As a result, rdma_node_get_transport() has been unexported and is used internally by the implementation of the new API, rdma_port_get_transport(), which gives the transport protocol of the queried port. rdma_is_transport_supported() is also added to be used for verifying if a given device supports a given protocol on any of its ports. All references to rdma_node_get_transport() are changed to use the new APIs. Also, ib_port_attr is extended to contain enum rdma_transport_type. Signed-off-by: Eli Cohen --- Changes from previous version: Define and make use of rdma_is_transport_supported(), an API that allows the caller to check if a given device supports a given transport protocol on any of its ports. drivers/infiniband/core/cm.c | 25 +++++++++---- drivers/infiniband/core/cma.c | 54 +++++++++++++++-------------- drivers/infiniband/core/mad.c | 41 ++++++++++++++-------- drivers/infiniband/core/multicast.c | 4 +- drivers/infiniband/core/sa_query.c | 39 +++++++++++++-------- drivers/infiniband/core/ucm.c | 8 +++- drivers/infiniband/core/ucma.c | 2 +- drivers/infiniband/core/user_mad.c | 6 +++- drivers/infiniband/core/verbs.c | 25 ++++++++++++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 12 +++--- include/rdma/ib_verbs.h | 11 ++++-- net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3 +- net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +- 13 files changed, 148 insertions(+), 84 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 5130fc5..d082f59 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -3678,8 +3678,9 @@ static void cm_add_one(struct ib_device *ib_device) unsigned long flags; int ret; u8 i; + enum rdma_transport_type tt; - if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB) + if (!rdma_is_transport_supported(ib_device, RDMA_TRANSPORT_IB)) return; cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) * @@ -3700,6 +3701,10 @@ static void cm_add_one(struct ib_device *ib_device)
set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); for (i = 1; i <= ib_device->phys_port_cnt; i++) { + tt = rdma_port_get_transport(ib_device, i); + if (tt != RDMA_TRANSPORT_IB) + continue; + port = kzalloc(sizeof *port, GFP_KERNEL); if (!port) goto error1; @@ -3742,9 +3747,11 @@ error1: port_modify.clr_port_cap_mask = IB_PORT_CM_SUP; while (--i) { port = cm_dev->port[i-1]; - ib_modify_port(ib_device, port->port_num, 0, &port_modify); - ib_unregister_mad_agent(port->mad_agent); - cm_remove_port_fs(port); + if (port) { + ib_modify_port(ib_device, port->port_num, 0, &port_modify); + ib_unregister_mad_agent(port->mad_agent); + cm_remove_port_fs(port); + } } device_unregister(cm_dev->device); kfree(cm_dev); @@ -3770,10 +3777,12 @@ static void cm_remove_one(struct ib_device *ib_device) for (i = 1; i <= ib_device->phys_port_cnt; i++) { port = cm_dev->port[i-1]; - ib_modify_port(ib_device, port->port_num, 0, &port_modify); - ib_unregister_mad_agent(port->mad_agent); - flush_workqueue(cm.wq); - cm_remove_port_fs(port); + if (port) { + ib_modify_port(ib_device, port->port_num, 0, &port_modify); + ib_unregister_mad_agent(port->mad_agent); + flush_workqueue(cm.wq); + cm_remove_port_fs(port); + } } device_unregister(cm_dev->device); kfree(cm_dev); diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 851de83..02fd045 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -329,24 +329,26 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv) struct cma_device *cma_dev; union ib_gid gid; int ret = -ENODEV; - - switch (rdma_node_get_transport(dev_addr->dev_type)) { - case RDMA_TRANSPORT_IB: - ib_addr_get_sgid(dev_addr, &gid); - break; - case RDMA_TRANSPORT_IWARP: - iw_addr_get_sgid(dev_addr, &gid); - break; - default: - return -ENODEV; - } + int port; list_for_each_entry(cma_dev, &dev_list, list) { - ret = ib_find_cached_gid(cma_dev->device, &gid, - &id_priv->id.port_num, NULL); - if (!ret) { - 
cma_attach_to_dev(id_priv, cma_dev); - break; + for (port = 1; port <= cma_dev->device->phys_port_cnt; ++port) { + switch (rdma_port_get_transport(cma_dev->device, port)) { + case RDMA_TRANSPORT_IB: + ib_addr_get_sgid(dev_addr, &gid); + break; + case RDMA_TRANSPORT_IWARP: + iw_addr_get_sgid(dev_addr, &gid); + break; + default: + return -ENODEV; + } + ret = ib_find_cached_gid(cma_dev->device, &gid, + &id_priv->id.port_num, NULL); + if (!ret) { + cma_attach_to_dev(id_priv, cma_dev); + return ret; + } } } return ret; @@ -597,7 +599,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr, int ret = 0; id_priv = container_of(id, struct rdma_id_private, id); - switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) { case RDMA_TRANSPORT_IB: if (!id_priv->cm_id.ib || cma_is_ud_ps(id_priv->id.ps)) ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask); @@ -747,7 +749,7 @@ static inline int cma_user_data_offset(enum rdma_port_space ps) static void cma_cancel_route(struct rdma_id_private *id_priv) { - switch (rdma_node_get_transport(id_priv->id.device->node_type)) { + switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) { case RDMA_TRANSPORT_IB: if (id_priv->query) ib_sa_cancel_query(id_priv->query_id, id_priv->query); @@ -843,7 +845,7 @@ void rdma_destroy_id(struct rdma_cm_id *id) mutex_lock(&lock); if (id_priv->cma_dev) { mutex_unlock(&lock); - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) { case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); @@ -1500,7 +1502,7 @@ int rdma_listen(struct rdma_cm_id *id, int backlog) id_priv->backlog = backlog; if (id->device) { - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case 
RDMA_TRANSPORT_IB: ret = cma_ib_listen(id_priv); if (ret) @@ -1727,7 +1729,7 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) return -EINVAL; atomic_inc(&id_priv->refcount); - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; @@ -2407,7 +2409,7 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) id_priv->srq = conn_param->srq; } - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: if (cma_is_ud_ps(id->ps)) ret = cma_resolve_ib_udp(id_priv, conn_param); @@ -2520,7 +2522,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) id_priv->srq = conn_param->srq; } - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: if (cma_is_ud_ps(id->ps)) ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS, @@ -2581,7 +2583,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data, if (!cma_has_cm_dev(id_priv)) return -EINVAL; - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: if (cma_is_ud_ps(id->ps)) ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, @@ -2612,7 +2614,7 @@ int rdma_disconnect(struct rdma_cm_id *id) if (!cma_has_cm_dev(id_priv)) return -EINVAL; - switch (rdma_node_get_transport(id->device->node_type)) { + switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: ret = cma_modify_qp_err(id_priv); if (ret) @@ -2764,7 +2766,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, list_add(&mc->list, &id_priv->mc_list); spin_unlock(&id_priv->lock); - switch (rdma_node_get_transport(id->device->node_type)) { + switch 
(rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: ret = cma_join_ib_multicast(id_priv, mc); break; diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index de922a0..c06117c 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2905,8 +2905,9 @@ static int ib_mad_port_close(struct ib_device *device, int port_num) static void ib_mad_init_device(struct ib_device *device) { int start, end, i; + enum rdma_transport_type tt; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB)) return; if (device->node_type == RDMA_NODE_IB_SWITCH) { @@ -2918,6 +2919,10 @@ static void ib_mad_init_device(struct ib_device *device) } for (i = start; i <= end; i++) { + tt = rdma_port_get_transport(device, i); + if (tt != RDMA_TRANSPORT_IB) + continue; + if (ib_mad_port_open(device, i)) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, i); @@ -2941,13 +2946,15 @@ error: i--; while (i >= start) { - if (ib_agent_port_close(device, i)) - printk(KERN_ERR PFX "Couldn't close %s port %d " - "for agents\n", - device->name, i); - if (ib_mad_port_close(device, i)) - printk(KERN_ERR PFX "Couldn't close %s port %d\n", - device->name, i); + if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) { + if (ib_agent_port_close(device, i)) + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, i); + if (ib_mad_port_close(device, i)) + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, i); + } i--; } } @@ -2955,6 +2962,7 @@ error: static void ib_mad_remove_device(struct ib_device *device) { int i, num_ports, cur_port; + enum rdma_transport_type tt; if (device->node_type == RDMA_NODE_IB_SWITCH) { num_ports = 1; @@ -2964,13 +2972,16 @@ static void ib_mad_remove_device(struct ib_device *device) cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - if (ib_agent_port_close(device, 
cur_port)) - printk(KERN_ERR PFX "Couldn't close %s port %d " - "for agents\n", - device->name, cur_port); - if (ib_mad_port_close(device, cur_port)) - printk(KERN_ERR PFX "Couldn't close %s port %d\n", - device->name, cur_port); + tt = rdma_port_get_transport(device, i); + if (tt == RDMA_TRANSPORT_IB) { + if (ib_agent_port_close(device, cur_port)) + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + if (ib_mad_port_close(device, cur_port)) + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + } } } diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c index 107f170..e6c98e7 100644 --- a/drivers/infiniband/core/multicast.c +++ b/drivers/infiniband/core/multicast.c @@ -788,10 +788,10 @@ static void mcast_add_one(struct ib_device *device) struct mcast_port *port; int i; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB)) return; - dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, + dev = kzalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, GFP_KERNEL); if (!dev) return; diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 1865049..46899de 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -416,14 +416,16 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event struct ib_sa_port *port = &sa_dev->port[event->element.port_num - sa_dev->start_port]; - spin_lock_irqsave(&port->ah_lock, flags); - if (port->sm_ah) - kref_put(&port->sm_ah->ref, free_sm_ah); - port->sm_ah = NULL; - spin_unlock_irqrestore(&port->ah_lock, flags); - - schedule_work(&sa_dev->port[event->element.port_num - - sa_dev->start_port].update_task); + if (rdma_port_get_transport(handler->device, port->port_num) == RDMA_TRANSPORT_IB) { + spin_lock_irqsave(&port->ah_lock, flags); + if 
(port->sm_ah) + kref_put(&port->sm_ah->ref, free_sm_ah); + port->sm_ah = NULL; + spin_unlock_irqrestore(&port->ah_lock, flags); + + schedule_work(&sa_dev->port[event->element.port_num - + sa_dev->start_port].update_task); + } } } @@ -991,7 +993,7 @@ static void ib_sa_add_one(struct ib_device *device) struct ib_sa_device *sa_dev; int s, e, i; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB)) return; if (device->node_type == RDMA_NODE_IB_SWITCH) @@ -1001,7 +1003,7 @@ static void ib_sa_add_one(struct ib_device *device) e = device->phys_port_cnt; } - sa_dev = kmalloc(sizeof *sa_dev + + sa_dev = kzalloc(sizeof *sa_dev + (e - s + 1) * sizeof (struct ib_sa_port), GFP_KERNEL); if (!sa_dev) @@ -1011,6 +1013,9 @@ static void ib_sa_add_one(struct ib_device *device) sa_dev->end_port = e; for (i = 0; i <= e - s; ++i) { + if (rdma_port_get_transport(device, i + 1) != RDMA_TRANSPORT_IB) + continue; + sa_dev->port[i].sm_ah = NULL; sa_dev->port[i].port_num = i + s; spin_lock_init(&sa_dev->port[i].ah_lock); @@ -1039,13 +1044,15 @@ static void ib_sa_add_one(struct ib_device *device) goto err; for (i = 0; i <= e - s; ++i) - update_sm_ah(&sa_dev->port[i].update_task); + if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) + update_sm_ah(&sa_dev->port[i].update_task); return; err: while (--i >= 0) - ib_unregister_mad_agent(sa_dev->port[i].agent); + if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB) + ib_unregister_mad_agent(sa_dev->port[i].agent); kfree(sa_dev); @@ -1065,9 +1072,11 @@ static void ib_sa_remove_one(struct ib_device *device) flush_scheduled_work(); for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { - ib_unregister_mad_agent(sa_dev->port[i].agent); - if (sa_dev->port[i].sm_ah) - kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + if 
(sa_dev->port[i].sm_ah) + kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + } } kfree(sa_dev); diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index 51bd966..b508020 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -1239,11 +1239,15 @@ static DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); static void ib_ucm_add_one(struct ib_device *device) { struct ib_ucm_device *ucm_dev; + int i; - if (!device->alloc_ucontext || - rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + if (!device->alloc_ucontext) return; + for (i = 1; i <= device->phys_port_cnt; ++i) + if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB) + return; + ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); if (!ucm_dev) return; diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index 4346a24..24d9510 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -614,7 +614,7 @@ static ssize_t ucma_query_route(struct ucma_file *file, resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid; resp.port_num = ctx->cm_id->port_num; - switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) { + switch (rdma_port_get_transport(ctx->cm_id->device, ctx->cm_id->port_num)) { case RDMA_TRANSPORT_IB: ucma_copy_ib_route(&resp, &ctx->cm_id->route); break; diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 8c46f22..aa4eeb3 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -1113,7 +1113,7 @@ static void ib_umad_add_one(struct ib_device *device) struct ib_umad_device *umad_dev; int s, e, i; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB)) return; if (device->node_type == RDMA_NODE_IB_SWITCH) @@ -1123,6 +1123,10 @@ static void ib_umad_add_one(struct ib_device *device) e = device->phys_port_cnt; } + for (i = s; i <= e; 
++i) + if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB) + return; + umad_dev = kzalloc(sizeof *umad_dev + (e - s + 1) * sizeof (struct ib_umad_port), GFP_KERNEL); diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index a7da9be..d81e217 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -77,7 +77,7 @@ enum ib_rate mult_to_ib_rate(int mult) } EXPORT_SYMBOL(mult_to_ib_rate); -enum rdma_transport_type +static enum rdma_transport_type rdma_node_get_transport(enum rdma_node_type node_type) { switch (node_type) { @@ -92,7 +92,28 @@ rdma_node_get_transport(enum rdma_node_type node_type) return 0; } } -EXPORT_SYMBOL(rdma_node_get_transport); + +enum rdma_transport_type rdma_port_get_transport(struct ib_device *device, + u8 port_num) +{ + return device->get_port_transport ? + device->get_port_transport(device, port_num) : + rdma_node_get_transport(device->node_type); +} +EXPORT_SYMBOL(rdma_port_get_transport); + +int rdma_is_transport_supported(struct ib_device *device, + enum rdma_transport_type transport) +{ + int i; + + for (i = 1; i <= device->phys_port_cnt; ++i) + if (rdma_port_get_transport(device, i) == transport) + return 1; + + return 0; +} +EXPORT_SYMBOL(rdma_is_transport_supported); /* Protection domains */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index ab2c192..39df0f7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1337,9 +1337,6 @@ static void ipoib_add_one(struct ib_device *device) struct ipoib_dev_priv *priv; int s, e, p; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) - return; - dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; @@ -1355,6 +1352,9 @@ static void ipoib_add_one(struct ib_device *device) } for (p = s; p <= e; ++p) { + if (rdma_port_get_transport(device, p) != RDMA_TRANSPORT_IB) + continue; + dev = 
ipoib_add_port("ib%d", device, p); if (!IS_ERR(dev)) { priv = netdev_priv(dev); @@ -1370,12 +1370,12 @@ static void ipoib_remove_one(struct ib_device *device) struct ipoib_dev_priv *priv, *tmp; struct list_head *dev_list; - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) - return; - dev_list = ib_get_client_data(device, &ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { + if (rdma_port_get_transport(device, priv->port) != RDMA_TRANSPORT_IB) + continue; + ib_unregister_event_handler(&priv->event_handler); rtnl_lock(); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index c179318..4cf42f3 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -72,9 +72,6 @@ enum rdma_transport_type { RDMA_TRANSPORT_IWARP }; -enum rdma_transport_type -rdma_node_get_transport(enum rdma_node_type node_type) __attribute_const__; - enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), @@ -298,6 +295,7 @@ struct ib_port_attr { u8 active_width; u8 active_speed; u8 phys_state; + enum rdma_transport_type transport; }; enum ib_device_modify_flags { @@ -1003,6 +1001,8 @@ struct ib_device { int (*query_port)(struct ib_device *device, u8 port_num, struct ib_port_attr *port_attr); + enum rdma_transport_type (*get_port_transport)(struct ib_device *device, + u8 port_num); int (*query_gid)(struct ib_device *device, u8 port_num, int index, union ib_gid *gid); @@ -1213,6 +1213,11 @@ int ib_query_device(struct ib_device *device, int ib_query_port(struct ib_device *device, u8 port_num, struct ib_port_attr *port_attr); +enum rdma_transport_type rdma_port_get_transport(struct ib_device *device, + u8 port_num); +int rdma_is_transport_supported(struct ib_device *device, + enum rdma_transport_type transport); + int ib_query_gid(struct ib_device *device, u8 port_num, int index, union ib_gid *gid); diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index 
42a6f9f..769dc18 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -338,8 +338,7 @@ static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt, static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count) { if ((RDMA_TRANSPORT_IWARP == - rdma_node_get_transport(xprt->sc_cm_id-> - device->node_type)) + rdma_port_get_transport(xprt->sc_cm_id->device, xprt->sc_cm_id->port_num)) && sge_count > 1) return 1; else diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 5151f9f..a5a4162 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -976,7 +976,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt) /* * Determine if a DMA MR is required and if so, what privs are required */ - switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) { + switch (rdma_port_get_transport(newxprt->sc_cm_id->device, newxprt->sc_cm_id->port_num)) { case RDMA_TRANSPORT_IWARP: newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV; if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) { -- 1.6.4 From eli at mellanox.co.il Wed Aug 19 07:38:59 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 19 Aug 2009 17:38:59 +0300 Subject: [ofa-general] [PATCHv5 03/10] ib_core: RDMAoE support only QP1 Message-ID: <20090819143859.GB8675@mtls03> Since RDMAoE uses Ethernet as its link layer, there is no need for QP0. QP1 is still needed since it handles communications between CM agents. This patch will create only QP1 for RDMAoE ports. Signed-off-by: Eli Cohen --- Changes from previous version: 1. Instead of returning NULL for unsupported ports (which is not considered an error), now callers of ib_register_mad_agent() must verify that the port/special QP is supported before calling this function. 2. Make use of rdma_is_transport_supported() where appropriate.
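The rule in this patch description can be sketched in plain userspace C. This is only an illustration, not the patch's code: an IB port gets both special QPs (QP0/SMI and QP1/GSI), an RDMAoE port gets QP1 only, and the completion queue is sized to match. The queue depths and the helper names port_uses_mad(), port_has_smi(), mad_qp_count() and mad_cq_size() are assumptions made up for this example; only the transport names and the "one QP instead of two" sizing come from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel enum; the numeric values are not significant. */
enum rdma_transport_type {
	RDMA_TRANSPORT_IB,
	RDMA_TRANSPORT_IWARP,
	RDMA_TRANSPORT_RDMAOE
};

/* Placeholder queue depths, assumed for this example only. */
#define SKETCH_QP_SEND_SIZE 128
#define SKETCH_QP_RECV_SIZE 512

/* MAD services are opened on IB and RDMAoE ports, not on iWARP ports. */
bool port_uses_mad(enum rdma_transport_type tt)
{
	return tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE;
}

/* Only IB ports have a subnet management interface (QP0). */
bool port_has_smi(enum rdma_transport_type tt)
{
	return tt == RDMA_TRANSPORT_IB;
}

/*
 * Number of special QPs to create for a port. The caller must have
 * checked that the port supports MADs at all, mirroring the rule that
 * callers verify support before registering an agent.
 */
int mad_qp_count(enum rdma_transport_type tt)
{
	assert(port_uses_mad(tt));
	return port_has_smi(tt) ? 2 : 1;
}

/* One send queue plus one receive queue worth of CQ entries per QP. */
int mad_cq_size(enum rdma_transport_type tt)
{
	return (SKETCH_QP_SEND_SIZE + SKETCH_QP_RECV_SIZE) * mad_qp_count(tt);
}
```

Keeping the "does this port have QP0" decision in one predicate means the open, start and teardown paths can all skip the missing QP in the same way, which is roughly what the NULL checks on qp_info->qp do in the diff that follows.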
drivers/infiniband/core/agent.c | 38 +++++++++++++++++++++++++------------- drivers/infiniband/core/mad.c | 37 +++++++++++++++++++++++++++++-------- 2 files changed, 54 insertions(+), 21 deletions(-) diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c index ae7c288..c130a4a 100644 --- a/drivers/infiniband/core/agent.c +++ b/drivers/infiniband/core/agent.c @@ -48,6 +48,8 @@ struct ib_agent_port_private { struct list_head port_list; struct ib_mad_agent *agent[2]; + struct ib_device *device; + u8 port_num; }; static DEFINE_SPINLOCK(ib_agent_port_list_lock); @@ -58,11 +60,10 @@ __ib_get_agent_port(struct ib_device *device, int port_num) { struct ib_agent_port_private *entry; - list_for_each_entry(entry, &ib_agent_port_list, port_list) { - if (entry->agent[0]->device == device && - entry->agent[0]->port_num == port_num) + list_for_each_entry(entry, &ib_agent_port_list, port_list) + if (entry->device == device && entry->port_num == port_num) return entry; - } + return NULL; } @@ -146,6 +147,7 @@ int ib_agent_port_open(struct ib_device *device, int port_num) struct ib_agent_port_private *port_priv; unsigned long flags; int ret; + enum rdma_transport_type tt; /* Create new device info */ port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL); @@ -155,14 +157,17 @@ int ib_agent_port_open(struct ib_device *device, int port_num) goto error1; } - /* Obtain send only MAD agent for SMI QP */ - port_priv->agent[0] = ib_register_mad_agent(device, port_num, - IB_QPT_SMI, NULL, 0, - &agent_send_handler, - NULL, NULL); - if (IS_ERR(port_priv->agent[0])) { - ret = PTR_ERR(port_priv->agent[0]); - goto error2; + tt = rdma_port_get_transport(device, port_num); + if (tt == RDMA_TRANSPORT_IB) { + /* Obtain send only MAD agent for SMI QP */ + port_priv->agent[0] = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->agent[0])) { + ret = PTR_ERR(port_priv->agent[0]); + goto error2; + } } /* Obtain 
send only MAD agent for GSI QP */ @@ -175,6 +180,9 @@ int ib_agent_port_open(struct ib_device *device, int port_num) goto error3; } + port_priv->device = device; + port_priv->port_num = port_num; + spin_lock_irqsave(&ib_agent_port_list_lock, flags); list_add_tail(&port_priv->port_list, &ib_agent_port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); @@ -182,7 +190,8 @@ int ib_agent_port_open(struct ib_device *device, int port_num) return 0; error3: - ib_unregister_mad_agent(port_priv->agent[0]); + if (tt == RDMA_TRANSPORT_IB) + ib_unregister_mad_agent(port_priv->agent[0]); error2: kfree(port_priv); error1: @@ -194,6 +203,9 @@ int ib_agent_port_close(struct ib_device *device, int port_num) struct ib_agent_port_private *port_priv; unsigned long flags; + if (rdma_port_get_transport(device, port_num) != RDMA_TRANSPORT_IB) + return 0; + spin_lock_irqsave(&ib_agent_port_list_lock, flags); port_priv = __ib_get_agent_port(device, port_num); if (port_priv == NULL) { diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index c06117c..aceae79 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2602,6 +2602,9 @@ static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) struct ib_mad_private *recv; struct ib_mad_list_head *mad_list; + if (!qp_info->qp) + return; + while (!list_empty(&qp_info->recv_queue.list)) { mad_list = list_entry(qp_info->recv_queue.list.next, @@ -2643,6 +2646,9 @@ static int ib_mad_port_start(struct ib_mad_port_private *port_priv) for (i = 0; i < IB_MAD_QPS_CORE; i++) { qp = port_priv->qp_info[i].qp; + if (!qp) + continue; + /* * PKey index for QP1 is irrelevant but * one is needed for the Reset to Init transition @@ -2684,6 +2690,9 @@ static int ib_mad_port_start(struct ib_mad_port_private *port_priv) } for (i = 0; i < IB_MAD_QPS_CORE; i++) { + if (!port_priv->qp_info[i].qp) + continue; + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); if (ret) { printk(KERN_ERR PFX 
"Couldn't post receive WRs\n"); @@ -2762,6 +2771,9 @@ error: static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) { + if (!qp_info->qp) + return; + ib_destroy_qp(qp_info->qp); kfree(qp_info->snoop_table); } @@ -2777,6 +2789,7 @@ static int ib_mad_port_open(struct ib_device *device, struct ib_mad_port_private *port_priv; unsigned long flags; char name[sizeof "ib_mad123"]; + int has_smi; /* Create new device info */ port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL); @@ -2792,7 +2805,11 @@ static int ib_mad_port_open(struct ib_device *device, init_mad_qp(port_priv, &port_priv->qp_info[0]); init_mad_qp(port_priv, &port_priv->qp_info[1]); - cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + cq_size = IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE; + has_smi = rdma_port_get_transport(device, port_num) == RDMA_TRANSPORT_IB; + if (has_smi) + cq_size *= 2; + port_priv->cq = ib_create_cq(port_priv->device, ib_mad_thread_completion_handler, NULL, port_priv, cq_size, 0); @@ -2816,9 +2833,11 @@ static int ib_mad_port_open(struct ib_device *device, goto error5; } - ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); - if (ret) - goto error6; + if (has_smi) { + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + } ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); if (ret) goto error7; @@ -2907,7 +2926,8 @@ static void ib_mad_init_device(struct ib_device *device) int start, end, i; enum rdma_transport_type tt; - if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB)) + if (!rdma_is_transport_supported(device, RDMA_TRANSPORT_IB) && + !rdma_is_transport_supported(device, RDMA_TRANSPORT_RDMAOE)) return; if (device->node_type == RDMA_NODE_IB_SWITCH) { @@ -2920,7 +2940,7 @@ static void ib_mad_init_device(struct ib_device *device) for (i = start; i <= end; i++) { tt = rdma_port_get_transport(device, i); - if (tt != RDMA_TRANSPORT_IB) + if (tt != RDMA_TRANSPORT_IB && tt != RDMA_TRANSPORT_RDMAOE) continue; if 
(ib_mad_port_open(device, i)) { @@ -2946,7 +2966,8 @@ error: i--; while (i >= start) { - if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) { + tt = rdma_port_get_transport(device, i); + if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) { if (ib_agent_port_close(device, i)) printk(KERN_ERR PFX "Couldn't close %s port %d " "for agents\n", @@ -2973,7 +2994,7 @@ static void ib_mad_remove_device(struct ib_device *device) } for (i = 0; i < num_ports; i++, cur_port++) { tt = rdma_port_get_transport(device, i); - if (tt == RDMA_TRANSPORT_IB) { + if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) { if (ib_agent_port_close(device, cur_port)) printk(KERN_ERR PFX "Couldn't close %s port %d " "for agents\n", -- 1.6.4 From eli at mellanox.co.il Wed Aug 19 07:39:12 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 19 Aug 2009 17:39:12 +0300 Subject: [ofa-general] [PATCHv5 04/10] IB/umad: Enable support only for IB ports Message-ID: <20090819143912.GC8675@mtls03> Initialize umad context for devices that have at least one IB port. Since devices may have ports of two different protocols (for example, RDMA_TRANSPORT_IB and RDMA_TRANSPORT_RDMAOE), ib_umad_add_one() needs to succeed if any of the ports is IB, but ib_umad_init_port() is called only for IB ports. Signed-off-by: Eli Cohen --- Changes from last version: 1. Patch title changed from "Enable support for RDMAoE ports" to "Enable support only for IB ports". 2. Do not allow userspace MADs to RDMAoE ports.
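The behaviour described for mixed-transport devices can be sketched with a small userspace mock. It is an illustration under stated assumptions, not the kernel code: struct mock_device, mock_port_transport(), mock_is_transport_supported() and mock_umad_add_one() are hypothetical names, and "initialising" a port is reduced to setting a flag. The point is that the device-level add succeeds as soon as one port is IB, while per-port setup is skipped for every non-IB port.

```c
#include <assert.h>

/* Stand-in for the kernel enum; the numeric values are not significant. */
enum rdma_transport_type {
	RDMA_TRANSPORT_IB,
	RDMA_TRANSPORT_IWARP,
	RDMA_TRANSPORT_RDMAOE
};

/* Mock device: port_tt[0] describes port 1, port_tt[1] port 2, and so on. */
struct mock_device {
	int phys_port_cnt;
	const enum rdma_transport_type *port_tt;
};

/* Per-port transport query; ports are numbered from 1 as in the verbs API. */
enum rdma_transport_type mock_port_transport(const struct mock_device *dev,
					     int port)
{
	assert(port >= 1 && port <= dev->phys_port_cnt);
	return dev->port_tt[port - 1];
}

/* Returns 1 if any port of the device runs the given transport. */
int mock_is_transport_supported(const struct mock_device *dev,
				enum rdma_transport_type tt)
{
	int i;

	for (i = 1; i <= dev->phys_port_cnt; ++i)
		if (mock_port_transport(dev, i) == tt)
			return 1;
	return 0;
}

/*
 * Device-level add: bail out only when no port is IB, otherwise mark
 * exactly the IB ports as initialised. Returns the number of ports set
 * up; port_ready[] is left untouched when the device is skipped.
 */
int mock_umad_add_one(const struct mock_device *dev, int *port_ready)
{
	int i, n = 0;

	if (!mock_is_transport_supported(dev, RDMA_TRANSPORT_IB))
		return 0;

	for (i = 1; i <= dev->phys_port_cnt; ++i) {
		port_ready[i - 1] =
			mock_port_transport(dev, i) == RDMA_TRANSPORT_IB;
		n += port_ready[i - 1];
	}
	return n;
}
```

The same per-port test has to guard the teardown path, since a port that was never initialised must not be killed; the diff below applies that check in both the error unwind and ib_umad_remove_one().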
drivers/infiniband/core/user_mad.c | 15 +++++++-------- 1 files changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index aa4eeb3..51888eb 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -1123,10 +1123,6 @@ static void ib_umad_add_one(struct ib_device *device) e = device->phys_port_cnt; } - for (i = s; i <= e; ++i) - if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB) - return; - umad_dev = kzalloc(sizeof *umad_dev + (e - s + 1) * sizeof (struct ib_umad_port), GFP_KERNEL); @@ -1141,8 +1137,9 @@ static void ib_umad_add_one(struct ib_device *device) for (i = s; i <= e; ++i) { umad_dev->port[i - s].umad_dev = umad_dev; - if (ib_umad_init_port(device, i, &umad_dev->port[i - s])) - goto err; + if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) + if (ib_umad_init_port(device, i, &umad_dev->port[i - s])) + goto err; } ib_set_client_data(device, &umad_client, umad_dev); @@ -1151,7 +1148,8 @@ static void ib_umad_add_one(struct ib_device *device) err: while (--i >= s) - ib_umad_kill_port(&umad_dev->port[i - s]); + if (rdma_port_get_transport(device, i) == RDMA_TRANSPORT_IB) + ib_umad_kill_port(&umad_dev->port[i - s]); kref_put(&umad_dev->ref, ib_umad_release_dev); } @@ -1165,7 +1163,8 @@ static void ib_umad_remove_one(struct ib_device *device) return; for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) - ib_umad_kill_port(&umad_dev->port[i]); + if (rdma_port_get_transport(device, i + 1) == RDMA_TRANSPORT_IB) + ib_umad_kill_port(&umad_dev->port[i]); kref_put(&umad_dev->ref, ib_umad_release_dev); } -- 1.6.4 From eli at mellanox.co.il Wed Aug 19 07:39:28 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 19 Aug 2009 17:39:28 +0300 Subject: [ofa-general] [PATCHv5 05/10] ib/cm: Enable CM support for RDMAoE Message-ID: <20090819143928.GD8675@mtls03> CM messages can be transported on RDMAoE protocol ports so they are enabled 
here. Signed-off-by: Eli Cohen --- Changes from last version: Make use of rdma_is_transport_supported() drivers/infiniband/core/cm.c | 5 +++-- drivers/infiniband/core/ucm.c | 12 +++++++++--- 2 files changed, 12 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index d082f59..c9f9122 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -3680,7 +3680,8 @@ static void cm_add_one(struct ib_device *ib_device) u8 i; enum rdma_transport_type tt; - if (!rdma_is_transport_supported(ib_device, RDMA_TRANSPORT_IB)) + if (!rdma_is_transport_supported(ib_device, RDMA_TRANSPORT_IB) && + !rdma_is_transport_supported(ib_device, RDMA_TRANSPORT_RDMAOE)) return; cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) * @@ -3702,7 +3703,7 @@ static void cm_add_one(struct ib_device *ib_device) set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); for (i = 1; i <= ib_device->phys_port_cnt; i++) { tt = rdma_port_get_transport(ib_device, i); - if (tt != RDMA_TRANSPORT_IB) + if (tt != RDMA_TRANSPORT_IB && tt != RDMA_TRANSPORT_RDMAOE) continue; port = kzalloc(sizeof *port, GFP_KERNEL); diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index b508020..3ce5df2 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -1240,13 +1240,19 @@ static void ib_ucm_add_one(struct ib_device *device) { struct ib_ucm_device *ucm_dev; int i; + enum rdma_transport_type tt; if (!device->alloc_ucontext) return; - for (i = 1; i <= device->phys_port_cnt; ++i) - if (rdma_port_get_transport(device, i) != RDMA_TRANSPORT_IB) - return; + for (i = 1; i <= device->phys_port_cnt; ++i) { + tt = rdma_port_get_transport(device, i); + if (tt == RDMA_TRANSPORT_IB || tt == RDMA_TRANSPORT_RDMAOE) + break; + } + + if (i > device->phys_port_cnt) + return; ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); if (!ucm_dev) -- 1.6.4 From eli at mellanox.co.il Wed Aug 19 07:39:37 2009 From: eli at mellanox.co.il (Eli 
Cohen) Date: Wed, 19 Aug 2009 17:39:37 +0300 Subject: [ofa-general] [PATCHv5 06/10] ib_core: CMA device binding Message-ID: <20090819143937.GE8675@mtls03> Add support for RDMAoE device binding and IP --> GID resolution. Path resolution and multicast joining are implemented within cma.c by filling the responses and pushing the callbacks to the cma work queue. IP->GID resolution always yields IPv6 link local addresses - remote GIDs are derived from the destination MAC address of the remote port. Multicast GIDs are always mapped to multicast MACs as is done in IPv6; addition/removal of addresses is done by calling dev_mc_add/delete, thus causing the netdevice driver to update the corresponding port's configuration. IPv4 multicast is not currently supported. Some helper functions are added to ib_addr.h. Signed-off-by: Eli Cohen --- Changes from last version: 1. Add kref to struct cma_multicast to aid in maintaining reference count on the object. This is to avoid freeing the object while the worker thread is still using it. 2. Return an immediate error if we get an invalid mtu in a resolved path. 3. Don't fail resolve path if rate is 0 since this value stands for IB_RATE_PORT_CURRENT. 4. In cma_rdmaoe_join_multicast(), fail immediately if mtu is zero. 5. Add ucma_copy_rdmaoe_route() to copy route to userspace instead of modifying ucma_copy_ib_route(). 
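The two mappings described in the commit message above (MAC to link-local GID, and multicast GID to multicast MAC) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not code taken from the patch: the function names mac_to_link_local_gid() and mcast_gid_to_mac() are hypothetical, and a GID is represented as a raw 16-byte array rather than union ib_gid.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: build an IPv6 link-local address (used as the GID)
 * from a 48-bit MAC with the modified EUI-64 rule: fe80::/64 prefix,
 * ff:fe inserted in the middle of the MAC, universal/local bit flipped.
 */
static void mac_to_link_local_gid(uint8_t gid[16], const uint8_t mac[6])
{
	memset(gid, 0, 16);
	gid[0] = 0xfe;
	gid[1] = 0x80;
	memcpy(gid + 8, mac, 3);      /* upper half of the MAC */
	gid[11] = 0xff;
	gid[12] = 0xfe;
	memcpy(gid + 13, mac + 3, 3); /* lower half of the MAC */
	gid[8] ^= 2;                  /* flip the universal/local bit */
}

/*
 * Hypothetical helper: map a multicast GID to an Ethernet multicast MAC
 * the way IPv6 does: 33:33 followed by the low 32 bits of the address.
 */
static void mcast_gid_to_mac(uint8_t mac[6], const uint8_t gid[16])
{
	mac[0] = 0x33;
	mac[1] = 0x33;
	memcpy(mac + 2, gid + 12, 4);
}
```

Because both mappings are pure byte shuffles, no lookup or network round trip is involved, which is presumably why the patch can fill in the path record and complete the callback from the work queue right away.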
drivers/infiniband/core/cma.c | 207 ++++++++++++++++++++++++++++++++++++++- drivers/infiniband/core/ucma.c | 31 ++++++ include/rdma/ib_addr.h | 92 ++++++++++++++++++ 3 files changed, 324 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 02fd045..6e56e27 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -58,6 +58,7 @@ MODULE_LICENSE("Dual BSD/GPL"); #define CMA_CM_RESPONSE_TIMEOUT 20 #define CMA_MAX_CM_RETRIES 15 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24) +#define RDMAOE_PACKET_LIFETIME 18 static void cma_add_one(struct ib_device *device); static void cma_remove_one(struct ib_device *device); @@ -157,6 +158,7 @@ struct cma_multicast { struct list_head list; void *context; struct sockaddr_storage addr; + struct kref mcref; }; struct cma_work { @@ -173,6 +175,12 @@ struct cma_ndev_work { struct rdma_cm_event event; }; +struct rdmaoe_mcast_work { + struct work_struct work; + struct rdma_id_private *id; + struct cma_multicast *mc; +}; + union cma_ip_addr { struct in6_addr ip6; struct { @@ -290,6 +298,20 @@ static inline void cma_deref_dev(struct cma_device *cma_dev) complete(&cma_dev->comp); } +static inline void release_mc(struct kref *kref) +{ + struct cma_multicast *mc = container_of(kref, struct cma_multicast, mcref); + struct rdma_dev_addr *dev_addr = &mc->id_priv->id.route.addr.dev_addr; + u8 mac[6]; + + rdma_get_mcast_mac((struct in6_addr *)(&mc->multicast.ib->rec.mgid), mac); + rtnl_lock(); + dev_mc_delete(dev_addr->src_dev, mac, 6, 0); + rtnl_unlock(); + kfree(mc->multicast.ib); + kfree(mc); +} + static void cma_detach_from_dev(struct rdma_id_private *id_priv) { list_del(&id_priv->list); @@ -340,6 +362,9 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv) case RDMA_TRANSPORT_IWARP: iw_addr_get_sgid(dev_addr, &gid); break; + case RDMA_TRANSPORT_RDMAOE: + rdmaoe_addr_get_sgid(dev_addr, &gid); + break; default: return -ENODEV; } @@ -568,10 +593,16 @@ 
static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv, { struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; int ret; + u16 pkey; + + if (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num) == + RDMA_TRANSPORT_IB) + pkey = ib_addr_get_pkey(dev_addr); + else + pkey = 0xffff; ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, - ib_addr_get_pkey(dev_addr), - &qp_attr->pkey_index); + pkey, &qp_attr->pkey_index); if (ret) return ret; @@ -601,6 +632,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr, id_priv = container_of(id, struct rdma_id_private, id); switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: if (!id_priv->cm_id.ib || cma_is_ud_ps(id_priv->id.ps)) ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask); else @@ -828,8 +860,17 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv) mc = container_of(id_priv->mc_list.next, struct cma_multicast, list); list_del(&mc->list); - ib_sa_free_multicast(mc->multicast.ib); - kfree(mc); + switch (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num)) { + case RDMA_TRANSPORT_IB: + ib_sa_free_multicast(mc->multicast.ib); + kfree(mc); + break; + case RDMA_TRANSPORT_RDMAOE: + kref_put(&mc->mcref, release_mc); + break; + default: + break; + } } } @@ -847,6 +888,7 @@ void rdma_destroy_id(struct rdma_cm_id *id) mutex_unlock(&lock); switch (rdma_port_get_transport(id_priv->id.device, id_priv->id.port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) ib_destroy_cm_id(id_priv->cm_id.ib); break; @@ -1504,6 +1546,7 @@ int rdma_listen(struct rdma_cm_id *id, int backlog) if (id->device) { switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: ret = cma_ib_listen(id_priv); if (ret) goto err; @@ -1719,6 +1762,66 @@ static 
int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) return 0; } +static int cma_resolve_rdmaoe_route(struct rdma_id_private *id_priv) +{ + struct rdma_route *route = &id_priv->id.route; + struct rdma_addr *addr = &route->addr; + struct cma_work *work; + int ret; + struct sockaddr_in *src_addr = (struct sockaddr_in *)&route->addr.src_addr; + struct sockaddr_in *dst_addr = (struct sockaddr_in *)&route->addr.dst_addr; + + if (src_addr->sin_family != dst_addr->sin_family) + return -EINVAL; + + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler); + + route->path_rec = kzalloc(sizeof *route->path_rec, GFP_KERNEL); + if (!route->path_rec) { + ret = -ENOMEM; + goto err1; + } + + route->num_paths = 1; + + rdmaoe_mac_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr); + rdmaoe_mac_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr); + + route->path_rec->hop_limit = 2; + route->path_rec->reversible = 1; + route->path_rec->pkey = cpu_to_be16(0xffff); + route->path_rec->mtu_selector = 2; + route->path_rec->mtu = rdmaoe_get_mtu(addr->dev_addr.src_dev->mtu); + route->path_rec->rate_selector = 2; + route->path_rec->rate = rdmaoe_get_rate(addr->dev_addr.src_dev); + route->path_rec->packet_life_time_selector = 2; + route->path_rec->packet_life_time = RDMAOE_PACKET_LIFETIME; + if (!route->path_rec->mtu) { + ret = -EINVAL; + goto err2; + } + + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ROUTE_RESOLVED; + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; + work->event.status = 0; + + queue_work(cma_wq, &work->work); + + return 0; + +err2: + kfree(route->path_rec); +err1: + kfree(work); + return ret; +} + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1736,6 +1839,9 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) case RDMA_TRANSPORT_IWARP: ret = cma_resolve_iw_route(id_priv, 
timeout_ms); break; + case RDMA_TRANSPORT_RDMAOE: + ret = cma_resolve_rdmaoe_route(id_priv); + break; default: ret = -ENOSYS; break; @@ -2411,6 +2517,7 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: if (cma_is_ud_ps(id->ps)) ret = cma_resolve_ib_udp(id_priv, conn_param); else @@ -2524,6 +2631,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: if (cma_is_ud_ps(id->ps)) ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS, conn_param->private_data, @@ -2585,6 +2693,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data, switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: if (cma_is_ud_ps(id->ps)) ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, private_data, private_data_len); @@ -2616,6 +2725,7 @@ int rdma_disconnect(struct rdma_cm_id *id) switch (rdma_port_get_transport(id->device, id->port_num)) { case RDMA_TRANSPORT_IB: + case RDMA_TRANSPORT_RDMAOE: ret = cma_modify_qp_err(id_priv); if (ret) goto out; @@ -2742,6 +2852,77 @@ static int cma_join_ib_multicast(struct rdma_id_private *id_priv, return 0; } + +static void rdmaoe_mcast_work_handler(struct work_struct *work) +{ + struct rdmaoe_mcast_work *mw = container_of(work, struct rdmaoe_mcast_work, work); + struct cma_multicast *mc = mw->mc; + struct ib_sa_multicast *m = mc->multicast.ib; + struct rdma_dev_addr *dev_addr = &mw->id->id.route.addr.dev_addr; + u8 mac[6]; + + mc->multicast.ib->context = mc; + rdma_get_mcast_mac((struct in6_addr *)(&mc->multicast.ib->rec.mgid), mac); + rtnl_lock(); + dev_mc_add(dev_addr->src_dev, mac, 6, 0); + rtnl_unlock(); + cma_ib_mc_handler(0, m); + kref_put(&mc->mcref, release_mc); + kfree(mw); +} + +static int 
cma_rdmaoe_join_multicast(struct rdma_id_private *id_priv, + struct cma_multicast *mc) +{ + struct rdmaoe_mcast_work *work; + struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr; + int err; + struct sockaddr *addr = (struct sockaddr *)&mc->addr; + + if (cma_zero_addr((struct sockaddr *)&mc->addr)) + return -EINVAL; + + /* IPv4 multicast is not supported currently */ + if (addr->sa_family == AF_INET) + return -EINVAL; + + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + mc->multicast.ib = kzalloc(sizeof(struct ib_sa_multicast), GFP_KERNEL); + if (!mc->multicast.ib) { + err = -ENOMEM; + goto out1; + } + + cma_set_mgid(id_priv, addr, &mc->multicast.ib->rec.mgid); + mc->multicast.ib->rec.pkey = cpu_to_be16(0xffff); + if (id_priv->id.ps == RDMA_PS_UDP) + mc->multicast.ib->rec.qkey = cpu_to_be32(RDMA_UDP_QKEY); + mc->multicast.ib->rec.rate = rdmaoe_get_rate(dev_addr->src_dev); + mc->multicast.ib->rec.hop_limit = 1; + mc->multicast.ib->rec.mtu = rdmaoe_get_mtu(dev_addr->src_dev->mtu); + if (!mc->multicast.ib->rec.mtu) { + err = -EINVAL; + goto out2; + } + rdmaoe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid); + work->id = id_priv; + work->mc = mc; + INIT_WORK(&work->work, rdmaoe_mcast_work_handler); + kref_get(&mc->mcref); + queue_work(cma_wq, &work->work); + + return 0; + +out2: + kfree(mc->multicast.ib); +out1: + kfree(work); + return err; +} + int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, void *context) { @@ -2770,6 +2951,10 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, case RDMA_TRANSPORT_IB: ret = cma_join_ib_multicast(id_priv, mc); break; + case RDMA_TRANSPORT_RDMAOE: + kref_init(&mc->mcref); + ret = cma_rdmaoe_join_multicast(id_priv, mc); + break; default: ret = -ENOSYS; break; @@ -2781,6 +2966,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, spin_unlock_irq(&id_priv->lock); kfree(mc); } + return ret; } 
EXPORT_SYMBOL(rdma_join_multicast); @@ -2801,8 +2987,17 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr) ib_detach_mcast(id->qp, &mc->multicast.ib->rec.mgid, mc->multicast.ib->rec.mlid); - ib_sa_free_multicast(mc->multicast.ib); - kfree(mc); + switch (rdma_port_get_transport(id_priv->cma_dev->device, id_priv->id.port_num)) { + case RDMA_TRANSPORT_IB: + ib_sa_free_multicast(mc->multicast.ib); + kfree(mc); + break; + case RDMA_TRANSPORT_RDMAOE: + kref_put(&mc->mcref, release_mc); + break; + default: + break; + } return; } } diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index 24d9510..5eb1198 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -580,6 +580,34 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp, } } +static void ucma_copy_rdmaoe_route(struct rdma_ucm_query_route_resp *resp, + struct rdma_route *route) +{ + struct rdma_dev_addr *dev_addr; + + resp->num_paths = route->num_paths; + switch (route->num_paths) { + case 0: + dev_addr = &route->addr.dev_addr; + rdmaoe_mac_to_ll((union ib_gid *) &resp->ib_route[0].dgid, + dev_addr->dst_dev_addr); + rdmaoe_addr_get_sgid(dev_addr, + (union ib_gid *) &resp->ib_route[0].sgid); + resp->ib_route[0].pkey = cpu_to_be16(0xffff); + break; + case 2: + ib_copy_path_rec_to_user(&resp->ib_route[1], + &route->path_rec[1]); + /* fall through */ + case 1: + ib_copy_path_rec_to_user(&resp->ib_route[0], + &route->path_rec[0]); + break; + default: + break; + } +} + static ssize_t ucma_query_route(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) @@ -618,6 +646,9 @@ static ssize_t ucma_query_route(struct ucma_file *file, case RDMA_TRANSPORT_IB: ucma_copy_ib_route(&resp, &ctx->cm_id->route); break; + case RDMA_TRANSPORT_RDMAOE: + ucma_copy_rdmaoe_route(&resp, &ctx->cm_id->route); + break; default: break; } diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index 483057b..ab06fe9 100644 --- 
a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -39,6 +39,8 @@ #include #include #include +#include +#include struct rdma_addr_client { atomic_t refcount; @@ -157,4 +159,94 @@ static inline void iw_addr_get_dgid(struct rdma_dev_addr *dev_addr, memcpy(gid, dev_addr->dst_dev_addr, sizeof *gid); } +static inline void rdmaoe_mac_to_ll(union ib_gid *gid, u8 *mac) +{ + memset(gid->raw, 0, 16); + *((u32 *)gid->raw) = cpu_to_be32(0xfe800000); + gid->raw[12] = 0xfe; + gid->raw[11] = 0xff; + memcpy(gid->raw + 13, mac + 3, 3); + memcpy(gid->raw + 8, mac, 3); + gid->raw[8] ^= 2; +} + +static inline void rdmaoe_addr_get_sgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + rdmaoe_mac_to_ll(gid, dev_addr->src_dev_addr); +} + +static inline enum ib_mtu rdmaoe_get_mtu(int mtu) +{ + /* + * reduce IB headers from effective RDMAoE MTU. 28 stands for + * atomic header which is the biggest possible header after BTH + */ + mtu = mtu - IB_GRH_BYTES - IB_BTH_BYTES - 28; + + if (mtu >= ib_mtu_enum_to_int(IB_MTU_4096)) + return IB_MTU_4096; + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_2048)) + return IB_MTU_2048; + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_1024)) + return IB_MTU_1024; + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_512)) + return IB_MTU_512; + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_256)) + return IB_MTU_256; + else + return 0; +} + +static inline int rdmaoe_get_rate(struct net_device *dev) +{ + struct ethtool_cmd cmd; + + if (!dev->ethtool_ops || !dev->ethtool_ops->get_settings || + dev->ethtool_ops->get_settings(dev, &cmd)) + return IB_RATE_PORT_CURRENT; + + if (cmd.speed >= 40000) + return IB_RATE_40_GBPS; + else if (cmd.speed >= 30000) + return IB_RATE_30_GBPS; + else if (cmd.speed >= 20000) + return IB_RATE_20_GBPS; + else if (cmd.speed >= 10000) + return IB_RATE_10_GBPS; + else + return IB_RATE_PORT_CURRENT; +} + +static inline int rdma_link_local_addr(struct in6_addr *addr) +{ + if (addr->s6_addr32[0] == cpu_to_be32(0xfe800000) && + addr->s6_addr32[1] 
== 0) + return 1; + + return 0; +} + +static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac) +{ + memcpy(mac, &addr->s6_addr[8], 3); + memcpy(mac + 3, &addr->s6_addr[13], 3); + mac[0] ^= 2; +} + +static inline int rdma_is_multicast_addr(struct in6_addr *addr) +{ + return addr->s6_addr[0] == 0xff ? 1 : 0; +} + +static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac) +{ + int i; + + mac[0] = 0x33; + mac[1] = 0x33; + for (i = 2; i < 6; ++i) + mac[i] = addr->s6_addr[i + 10]; +} + #endif /* IB_ADDR_H */ -- 1.6.4 From eli at mellanox.co.il Wed Aug 19 07:39:45 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 19 Aug 2009 17:39:45 +0300 Subject: [ofa-general] [PATCHv5 09/10] mlx4: Add support for RDMAoE - address resolution Message-ID: <20090819143945.GF8675@mtls03> The following patch handles address vector creation for RDMAoE ports. mlx4 needs the MAC address of the remote node to include it in the WQE of a UD QP or in the QP context of connected QPs. Address resolution is done atomically in the case of a link local address or a multicast GID; otherwise -EINVAL is returned. mlx4 transport packets were changed too, to accommodate RDMAoE. Signed-off-by: Eli Cohen --- Changes from previous version: Call ib_register_mad_agent() for RDMA_TRANSPORT_IB type ports. 
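The address resolution rule stated in the commit message above (link-local GID or multicast GID resolves immediately, anything else yields -EINVAL) can be sketched as a standalone C function. This is a simplified illustration under assumptions, not the driver code: resolve_gid_to_mac() and its helpers are hypothetical names, a GID is a raw 16-byte array, and the netdev lookup and locking done by the real driver are left out.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if the GID has the fe80::/64 link-local prefix. */
static int gid_is_link_local(const uint8_t gid[16])
{
	static const uint8_t prefix[8] = { 0xfe, 0x80, 0, 0, 0, 0, 0, 0 };

	return memcmp(gid, prefix, 8) == 0;
}

/* Hypothetical helper: true if the GID is an IPv6-style multicast address. */
static int gid_is_multicast(const uint8_t gid[16])
{
	return gid[0] == 0xff;
}

/*
 * Hypothetical resolver: derive the destination MAC from the destination
 * GID without sleeping. Returns 0 on success, -EINVAL if the GID is
 * neither link-local nor multicast.
 */
static int resolve_gid_to_mac(const uint8_t gid[16], uint8_t mac[6],
			      int *is_mcast)
{
	*is_mcast = 0;

	if (gid_is_link_local(gid)) {
		/* Undo modified EUI-64: drop ff:fe, flip the U/L bit back. */
		memcpy(mac, gid + 8, 3);
		memcpy(mac + 3, gid + 13, 3);
		mac[0] ^= 2;
		return 0;
	}

	if (gid_is_multicast(gid)) {
		/* 33:33 followed by the low 32 bits of the GID. */
		mac[0] = 0x33;
		mac[1] = 0x33;
		memcpy(mac + 2, gid + 12, 4);
		*is_mcast = 1;
		return 0;
	}

	return -EINVAL;
}
```

Since neither branch needs a neighbour lookup, a resolver of this shape can run in atomic context, which matches the "done atomically" wording in the commit message.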
drivers/infiniband/hw/mlx4/ah.c | 187 ++++++++++++++++++++++++++++------ drivers/infiniband/hw/mlx4/mad.c | 32 ++++-- drivers/infiniband/hw/mlx4/mlx4_ib.h | 19 +++- drivers/infiniband/hw/mlx4/qp.c | 172 +++++++++++++++++++++---------- drivers/net/mlx4/fw.c | 3 +- include/linux/mlx4/device.h | 31 ++++++- include/linux/mlx4/qp.h | 8 +- 7 files changed, 347 insertions(+), 105 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c index c75ac94..0a015c3 100644 --- a/drivers/infiniband/hw/mlx4/ah.c +++ b/drivers/infiniband/hw/mlx4/ah.c @@ -31,63 +31,166 @@ */ #include "mlx4_ib.h" +#include +#include +#include -struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr, + u8 *mac, int *is_mcast) { - struct mlx4_dev *dev = to_mdev(pd->device)->dev; - struct mlx4_ib_ah *ah; + struct mlx4_ib_rdmaoe *rdmaoe = &dev->rdmaoe; + struct sockaddr_in6 s6 = {0}; + struct net_device *netdev; + int ifidx; - ah = kmalloc(sizeof *ah, GFP_ATOMIC); - if (!ah) - return ERR_PTR(-ENOMEM); + *is_mcast = 0; + spin_lock(&rdmaoe->lock); + netdev = rdmaoe->netdevs[ah_attr->port_num - 1]; + if (!netdev) { + spin_unlock(&rdmaoe->lock); + return -EINVAL; + } + ifidx = netdev->ifindex; + spin_unlock(&rdmaoe->lock); - memset(&ah->av, 0, sizeof ah->av); + memcpy(s6.sin6_addr.s6_addr, ah_attr->grh.dgid.raw, sizeof ah_attr->grh); + s6.sin6_family = AF_INET6; + s6.sin6_scope_id = ifidx; + if (rdma_link_local_addr(&s6.sin6_addr)) + rdma_get_ll_mac(&s6.sin6_addr, mac); + else if (rdma_is_multicast_addr(&s6.sin6_addr)) { + rdma_get_mcast_mac(&s6.sin6_addr, mac); + *is_mcast = 1; + } else + return -EINVAL; - ah->av.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24)); - ah->av.g_slid = ah_attr->src_path_bits; - ah->av.dlid = cpu_to_be16(ah_attr->dlid); - if (ah_attr->static_rate) { - ah->av.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET; - while 
(ah->av.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET && - !(1 << ah->av.stat_rate & dev->caps.stat_rate_support)) - --ah->av.stat_rate; - } - ah->av.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + return 0; +} + +static struct ib_ah *create_ib_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr, + struct mlx4_ib_ah *ah) +{ + struct mlx4_dev *dev = to_mdev(pd->device)->dev; + + ah->av.ib.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24)); + ah->av.ib.g_slid = ah_attr->src_path_bits; if (ah_attr->ah_flags & IB_AH_GRH) { - ah->av.g_slid |= 0x80; - ah->av.gid_index = ah_attr->grh.sgid_index; - ah->av.hop_limit = ah_attr->grh.hop_limit; - ah->av.sl_tclass_flowlabel |= + ah->av.ib.g_slid |= 0x80; + ah->av.ib.gid_index = ah_attr->grh.sgid_index; + ah->av.ib.hop_limit = ah_attr->grh.hop_limit; + ah->av.ib.sl_tclass_flowlabel |= cpu_to_be32((ah_attr->grh.traffic_class << 20) | ah_attr->grh.flow_label); - memcpy(ah->av.dgid, ah_attr->grh.dgid.raw, 16); + memcpy(ah->av.ib.dgid, ah_attr->grh.dgid.raw, 16); + } + + ah->av.ib.dlid = cpu_to_be16(ah_attr->dlid); + if (ah_attr->static_rate) { + ah->av.ib.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET; + while (ah->av.ib.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET && + !(1 << ah->av.ib.stat_rate & dev->caps.stat_rate_support)) + --ah->av.ib.stat_rate; } + ah->av.ib.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); return &ah->ibah; } +static struct ib_ah *create_rdmaoe_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr, + struct mlx4_ib_ah *ah) +{ + struct mlx4_ib_dev *ibdev = to_mdev(pd->device); + struct mlx4_dev *dev = ibdev->dev; + u8 mac[6]; + int err; + int is_mcast; + + err = mlx4_ib_resolve_grh(ibdev, ah_attr, mac, &is_mcast); + if (err) + return ERR_PTR(err); + + memcpy(ah->av.eth.mac_0_1, mac, 2); + memcpy(ah->av.eth.mac_2_5, mac + 2, 4); + ah->av.ib.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24)); + ah->av.ib.g_slid = 0x80; + if (ah_attr->static_rate) { + 
ah->av.ib.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET; + while (ah->av.ib.stat_rate > IB_RATE_2_5_GBPS + MLX4_STAT_RATE_OFFSET && + !(1 << ah->av.ib.stat_rate & dev->caps.stat_rate_support)) + --ah->av.ib.stat_rate; + } + + /* + * HW requires multicast LID so we just choose one. + */ + if (is_mcast) + ah->av.ib.dlid = cpu_to_be16(0xc000); + + memcpy(ah->av.ib.dgid, ah_attr->grh.dgid.raw, 16); + ah->av.ib.sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + + return &ah->ibah; +} + +struct ib_ah *mlx4_ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct mlx4_ib_ah *ah; + enum rdma_transport_type transport; + struct ib_ah *ret; + + ah = kzalloc(sizeof *ah, GFP_ATOMIC); + if (!ah) + return ERR_PTR(-ENOMEM); + + transport = rdma_port_get_transport(pd->device, ah_attr->port_num); + if (transport == RDMA_TRANSPORT_RDMAOE) { + if (!(ah_attr->ah_flags & IB_AH_GRH)) { + ret = ERR_PTR(-EINVAL); + goto out; + } else { + /* TBD: need to handle the case when we get called + in an atomic context and there we might sleep. 
We + don't expect this currently since we're working with + link local addresses which we can translate without + going to sleep */ + ret = create_rdmaoe_ah(pd, ah_attr, ah); + if (IS_ERR(ret)) + goto out; + else + return ret; + } + } else + return create_ib_ah(pd, ah_attr, ah); /* never fails */ + +out: + kfree(ah); + return ret; +} + int mlx4_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr) { struct mlx4_ib_ah *ah = to_mah(ibah); + enum rdma_transport_type transport; + transport = rdma_port_get_transport(ibah->device, ah_attr->port_num); memset(ah_attr, 0, sizeof *ah_attr); - ah_attr->dlid = be16_to_cpu(ah->av.dlid); - ah_attr->sl = be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28; - ah_attr->port_num = be32_to_cpu(ah->av.port_pd) >> 24; - if (ah->av.stat_rate) - ah_attr->static_rate = ah->av.stat_rate - MLX4_STAT_RATE_OFFSET; - ah_attr->src_path_bits = ah->av.g_slid & 0x7F; + ah_attr->dlid = transport == RDMA_TRANSPORT_IB ? be16_to_cpu(ah->av.ib.dlid) : 0; + ah_attr->sl = be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28; + ah_attr->port_num = be32_to_cpu(ah->av.ib.port_pd) >> 24; + if (ah->av.ib.stat_rate) + ah_attr->static_rate = ah->av.ib.stat_rate - MLX4_STAT_RATE_OFFSET; + ah_attr->src_path_bits = ah->av.ib.g_slid & 0x7F; if (mlx4_ib_ah_grh_present(ah)) { ah_attr->ah_flags = IB_AH_GRH; ah_attr->grh.traffic_class = - be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20; + be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 20; ah_attr->grh.flow_label = - be32_to_cpu(ah->av.sl_tclass_flowlabel) & 0xfffff; - ah_attr->grh.hop_limit = ah->av.hop_limit; - ah_attr->grh.sgid_index = ah->av.gid_index; - memcpy(ah_attr->grh.dgid.raw, ah->av.dgid, 16); + be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) & 0xfffff; + ah_attr->grh.hop_limit = ah->av.ib.hop_limit; + ah_attr->grh.sgid_index = ah->av.ib.gid_index; + memcpy(ah_attr->grh.dgid.raw, ah->av.ib.dgid, 16); } return 0; @@ -98,3 +201,21 @@ int mlx4_ib_destroy_ah(struct ib_ah *ah) kfree(to_mah(ah)); return 0; } + +int 
mlx4_ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac) +{ + int err; + struct mlx4_ib_dev *ibdev = to_mdev(device); + struct ib_ah_attr ah_attr = { + .port_num = port, + }; + int is_mcast; + + memcpy(ah_attr.grh.dgid.raw, gid, 16); + err = mlx4_ib_resolve_grh(ibdev, &ah_attr, mac, &is_mcast); + if (err) + ERR_PTR(err); + + return 0; +} + diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c index 19e68ab..3df4f64 100644 --- a/drivers/infiniband/hw/mlx4/mad.c +++ b/drivers/infiniband/hw/mlx4/mad.c @@ -310,19 +310,25 @@ int mlx4_ib_mad_init(struct mlx4_ib_dev *dev) struct ib_mad_agent *agent; int p, q; int ret; + enum rdma_transport_type tt; - for (p = 0; p < dev->num_ports; ++p) + for (p = 0; p < dev->num_ports; ++p) { + tt = rdma_port_get_transport(&dev->ib_dev, p + 1); for (q = 0; q <= 1; ++q) { - agent = ib_register_mad_agent(&dev->ib_dev, p + 1, - q ? IB_QPT_GSI : IB_QPT_SMI, - NULL, 0, send_handler, - NULL, NULL); - if (IS_ERR(agent)) { - ret = PTR_ERR(agent); - goto err; - } - dev->send_agent[p][q] = agent; + if (tt == RDMA_TRANSPORT_IB) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto err; + } + dev->send_agent[p][q] = agent; + } else + dev->send_agent[p][q] = NULL; } + } return 0; @@ -343,8 +349,10 @@ void mlx4_ib_mad_cleanup(struct mlx4_ib_dev *dev) for (p = 0; p < dev->num_ports; ++p) { for (q = 0; q <= 1; ++q) { agent = dev->send_agent[p][q]; - dev->send_agent[p][q] = NULL; - ib_unregister_mad_agent(agent); + if (agent) { + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } } if (dev->sm_ah[p]) diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 8a7dd67..c644cac 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -138,6 +138,7 @@ struct mlx4_ib_qp { u8 resp_depth; u8 sq_no_prefetch; u8 state; + int mlx_type; }; struct mlx4_ib_srq { @@ -157,7 +158,14 @@ struct mlx4_ib_srq { struct mlx4_ib_ah { struct ib_ah ibah; - struct mlx4_av av; + union mlx4_ext_av av; +}; + +struct mlx4_ib_rdmaoe { + spinlock_t lock; + struct net_device *netdevs[MLX4_MAX_PORTS]; + struct notifier_block nb; + union ib_gid gid_table[MLX4_MAX_PORTS][128]; }; struct mlx4_ib_dev { @@ -175,6 +183,8 @@ struct mlx4_ib_dev { spinlock_t sm_lock; struct mutex cap_mask_mutex; + + struct mlx4_ib_rdmaoe rdmaoe; }; static inline struct mlx4_ib_dev *to_mdev(struct ib_device *ibdev) @@ -313,9 +323,14 @@ int mlx4_ib_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, int npages, int mlx4_ib_unmap_fmr(struct list_head *fmr_list); int mlx4_ib_fmr_dealloc(struct ib_fmr *fmr); +int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr, + u8 *mac, int *is_mcast); + +int mlx4_ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac); + static inline int mlx4_ib_ah_grh_present(struct mlx4_ib_ah *ah) { - return !!(ah->av.g_slid & 0x80); + return !!(ah->av.ib.g_slid & 0x80); } #endif /* MLX4_IB_H */ diff --git a/drivers/infiniband/hw/mlx4/qp.c 
b/drivers/infiniband/hw/mlx4/qp.c index 20724ae..4b391fa 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -32,6 +32,7 @@ */ #include +#include #include #include @@ -47,14 +48,21 @@ enum { enum { MLX4_IB_DEFAULT_SCHED_QUEUE = 0x83, - MLX4_IB_DEFAULT_QP0_SCHED_QUEUE = 0x3f + MLX4_IB_DEFAULT_QP0_SCHED_QUEUE = 0x3f, + MLX4_IB_LINK_TYPE_IB = 0, + MLX4_IB_LINK_TYPE_ETH = 1 }; enum { /* * Largest possible UD header: send with GRH and immediate data. + * 4 bytes added to accommodate for eth header instead of lrh */ - MLX4_IB_UD_HEADER_SIZE = 72 + MLX4_IB_UD_HEADER_SIZE = 76 +}; + +enum { + MLX4_RDMAOE_ETHERTYPE = 0x8915 }; struct mlx4_ib_sqp { @@ -62,7 +70,10 @@ struct mlx4_ib_sqp { int pkey_index; u32 qkey; u32 send_psn; - struct ib_ud_header ud_header; + union { + struct ib_ud_header ib; + struct eth_ud_header eth; + } hdr; u8 header_buf[MLX4_IB_UD_HEADER_SIZE]; }; @@ -546,9 +557,9 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, } } - if (sqpn) { + if (sqpn) qpn = sqpn; - } else { + else { err = mlx4_qp_reserve_range(dev->dev, 1, 1, &qpn); if (err) goto err_wrid; @@ -843,6 +854,12 @@ static void mlx4_set_sched(struct mlx4_qp_path *path, u8 port) static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah, struct mlx4_qp_path *path, u8 port) { + int err; + int is_eth = rdma_port_get_transport(&dev->ib_dev, port) == + RDMA_TRANSPORT_RDMAOE ? 
1 : 0; + u8 mac[6]; + int is_mcast; + path->grh_mylmc = ah->src_path_bits & 0x7f; path->rlid = cpu_to_be16(ah->dlid); if (ah->static_rate) { @@ -873,6 +890,21 @@ static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah, path->sched_queue = MLX4_IB_DEFAULT_SCHED_QUEUE | ((port - 1) << 6) | ((ah->sl & 0xf) << 2); + if (is_eth) { + if (!(ah->ah_flags & IB_AH_GRH)) + return -1; + + err = mlx4_ib_resolve_grh(dev, ah, mac, &is_mcast); + if (err) + return err; + + memcpy(path->dmac_h, mac, 2); + memcpy(path->dmac_l, mac + 2, 4); + path->ackto = MLX4_IB_LINK_TYPE_ETH; + /* use index 0 into MAC table for RDMAoE */ + path->grh_mylmc &= 0x80; + } + return 0; } @@ -972,7 +1004,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_TIMEOUT) { - context->pri_path.ackto = attr->timeout << 3; + context->pri_path.ackto |= (attr->timeout << 3); optpar |= MLX4_QP_OPTPAR_ACK_TIMEOUT; } @@ -1218,79 +1250,109 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr, int header_size; int spc; int i; + void *tmp; + struct ib_ud_header *ib = NULL; + struct eth_ud_header *eth = NULL; + struct ib_unpacked_grh *grh; + struct ib_unpacked_bth *bth; + struct ib_unpacked_deth *deth; send_size = 0; for (i = 0; i < wr->num_sge; ++i) send_size += wr->sg_list[i].length; - ib_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), &sqp->ud_header); + if (rdma_port_get_transport(sqp->qp.ibqp.device, sqp->qp.port) == RDMA_TRANSPORT_IB) { + ib = &sqp->hdr.ib; + grh = &ib->grh; + bth = &ib->bth; + deth = &ib->deth; + ib_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), ib); + ib->lrh.service_level = + be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28; + ib->lrh.destination_lid = ah->av.ib.dlid; + ib->lrh.source_lid = cpu_to_be16(ah->av.ib.g_slid & 0x7f); + } else { + eth = &sqp->hdr.eth; + grh = &eth->grh; + bth = &eth->bth; + deth = &eth->deth; + ib_rdmaoe_ud_header_init(send_size, mlx4_ib_ah_grh_present(ah), eth); + } - 
sqp->ud_header.lrh.service_level = - be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 28; - sqp->ud_header.lrh.destination_lid = ah->av.dlid; - sqp->ud_header.lrh.source_lid = cpu_to_be16(ah->av.g_slid & 0x7f); if (mlx4_ib_ah_grh_present(ah)) { - sqp->ud_header.grh.traffic_class = - (be32_to_cpu(ah->av.sl_tclass_flowlabel) >> 20) & 0xff; - sqp->ud_header.grh.flow_label = - ah->av.sl_tclass_flowlabel & cpu_to_be32(0xfffff); - sqp->ud_header.grh.hop_limit = ah->av.hop_limit; - ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.port_pd) >> 24, - ah->av.gid_index, &sqp->ud_header.grh.source_gid); - memcpy(sqp->ud_header.grh.destination_gid.raw, - ah->av.dgid, 16); + grh->traffic_class = + (be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 20) & 0xff; + grh->flow_label = + ah->av.ib.sl_tclass_flowlabel & cpu_to_be32(0xfffff); + grh->hop_limit = ah->av.ib.hop_limit; + ib_get_cached_gid(ib_dev, be32_to_cpu(ah->av.ib.port_pd) >> 24, + ah->av.ib.gid_index, &grh->source_gid); + memcpy(grh->destination_gid.raw, + ah->av.ib.dgid, 16); } mlx->flags &= cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MLX4_WQE_MLX_VL15 : 0) | - (sqp->ud_header.lrh.destination_lid == - IB_LID_PERMISSIVE ? MLX4_WQE_MLX_SLR : 0) | - (sqp->ud_header.lrh.service_level << 8)); - mlx->rlid = sqp->ud_header.lrh.destination_lid; + + if (ib) { + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MLX4_WQE_MLX_VL15 : 0) | + (ib->lrh.destination_lid == + IB_LID_PERMISSIVE ? 
MLX4_WQE_MLX_SLR : 0) | + (ib->lrh.service_level << 8)); + mlx->rlid = ib->lrh.destination_lid; + } switch (wr->opcode) { case IB_WR_SEND: - sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; - sqp->ud_header.immediate_present = 0; + bth->opcode = IB_OPCODE_UD_SEND_ONLY; + if (ib) + ib->immediate_present = 0; + else + eth->immediate_present = 0; break; case IB_WR_SEND_WITH_IMM: - sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; - sqp->ud_header.immediate_present = 1; - sqp->ud_header.immediate_data = wr->ex.imm_data; + bth->opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + if (ib) { + ib->immediate_present = 1; + ib->immediate_data = wr->ex.imm_data; + } else { + eth->immediate_present = 1; + eth->immediate_data = wr->ex.imm_data; + } break; default: return -EINVAL; } - sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; - if (sqp->ud_header.lrh.destination_lid == IB_LID_PERMISSIVE) - sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE; - sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (ib) { + ib->lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 
15 : 0; + if (ib->lrh.destination_lid == IB_LID_PERMISSIVE) + ib->lrh.source_lid = IB_LID_PERMISSIVE; + } else { + memcpy(eth->eth.dmac_h, ah->av.eth.mac_0_1, 2); + memcpy(eth->eth.dmac_h + 2, ah->av.eth.mac_2_5, 2); + memcpy(eth->eth.dmac_l, ah->av.eth.mac_2_5 + 2, 2); + tmp = to_mdev(sqp->qp.ibqp.device)->rdmaoe.netdevs[sqp->qp.port - 1]->dev_addr; + memcpy(eth->eth.smac_h, tmp, 2); + memcpy(eth->eth.smac_l, tmp + 2, 4); + eth->eth.type = cpu_to_be16(MLX4_RDMAOE_ETHERTYPE); + } + bth->solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) ib_get_cached_pkey(ib_dev, sqp->qp.port, sqp->pkey_index, &pkey); else ib_get_cached_pkey(ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey); - sqp->ud_header.bth.pkey = cpu_to_be16(pkey); - sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); - sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); - sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? + bth->pkey = cpu_to_be16(pkey); + bth->destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + bth->psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + deth->qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? sqp->qkey : wr->wr.ud.remote_qkey); - sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); - - header_size = ib_ud_header_pack(&sqp->ud_header, sqp->header_buf); - - if (0) { - printk(KERN_ERR "built UD header of size %d:\n", header_size); - for (i = 0; i < header_size / 4; ++i) { - if (i % 8 == 0) - printk(" [%02x] ", i * 4); - printk(" %08x", - be32_to_cpu(((__be32 *) sqp->header_buf)[i])); - if ((i + 1) % 8 == 0) - printk("\n"); - } - printk("\n"); - } + deth->source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + if (ib) + header_size = ib_ud_header_pack(ib, sqp->header_buf); + else + header_size = rdmaoe_ud_header_pack(eth, sqp->header_buf); /* * Inline data segments may not cross a 64 byte boundary. 
If @@ -1414,6 +1476,8 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg *dseg, memcpy(dseg->av, &to_mah(wr->wr.ud.ah)->av, sizeof (struct mlx4_av)); dseg->dqpn = cpu_to_be32(wr->wr.ud.remote_qpn); dseg->qkey = cpu_to_be32(wr->wr.ud.remote_qkey); + dseg->vlan = to_mah(wr->wr.ud.ah)->av.eth.vlan; + memcpy(dseg->mac_0_1, to_mah(wr->wr.ud.ah)->av.eth.mac_0_1, 6); } static void set_mlx_icrc_seg(void *dseg) diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index cee199c..20526ce 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -96,7 +96,8 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags) [20] = "Address vector port checking support", [21] = "UD multicast support", [24] = "Demand paging support", - [25] = "Router support" + [25] = "Router support", + [30] = "RDMAoE support" }; int i; diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index 3aff8a6..b73b5f0 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -66,7 +66,8 @@ enum { MLX4_DEV_CAP_FLAG_ATOMIC = 1 << 18, MLX4_DEV_CAP_FLAG_RAW_MCAST = 1 << 19, MLX4_DEV_CAP_FLAG_UD_AV_PORT = 1 << 20, - MLX4_DEV_CAP_FLAG_UD_MCAST = 1 << 21 + MLX4_DEV_CAP_FLAG_UD_MCAST = 1 << 21, + MLX4_DEV_CAP_FLAG_RDMAOE = 1 << 30 }; enum { @@ -371,6 +372,28 @@ struct mlx4_av { u8 dgid[16]; }; +struct mlx4_eth_av { + __be32 port_pd; + u8 reserved1; + u8 smac_idx; + u16 reserved2; + u8 reserved3; + u8 gid_index; + u8 stat_rate; + u8 hop_limit; + __be32 sl_tclass_flowlabel; + u8 dgid[16]; + u32 reserved4[2]; + __be16 vlan; + u8 mac_0_1[2]; + u8 mac_2_5[4]; +}; + +union mlx4_ext_av { + struct mlx4_av ib; + struct mlx4_eth_av eth; +}; + struct mlx4_dev { struct pci_dev *pdev; unsigned long flags; @@ -399,6 +422,12 @@ struct mlx4_init_port_param { if (((type) == MLX4_PORT_TYPE_IB ? 
(dev)->caps.port_mask : \ ~(dev)->caps.port_mask) & 1 << ((port) - 1)) +#define mlx4_foreach_ib_transport_port(port, dev) \ + for ((port) = 1; (port) <= (dev)->caps.num_ports; (port)++) \ + if (((dev)->caps.port_mask & 1 << ((port) - 1)) || \ + ((dev)->caps.flags & MLX4_DEV_CAP_FLAG_RDMAOE)) + + int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, struct mlx4_buf *buf); void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf); diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h index bf8f119..d73534f 100644 --- a/include/linux/mlx4/qp.h +++ b/include/linux/mlx4/qp.h @@ -112,7 +112,9 @@ struct mlx4_qp_path { u8 snooper_flags; u8 reserved3[2]; u8 counter_index; - u8 reserved4[7]; + u8 reserved4; + u8 dmac_h[2]; + u8 dmac_l[4]; }; struct mlx4_qp_context { @@ -218,7 +220,9 @@ struct mlx4_wqe_datagram_seg { __be32 av[8]; __be32 dqpn; __be32 qkey; - __be32 reservd[2]; + __be16 vlan; + u8 mac_0_1[2]; + u8 mac_2_5[4]; }; struct mlx4_wqe_lso_seg { -- 1.6.4

From eli at mellanox.co.il Wed Aug 19 07:39:58 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 17:39:58 +0300
Subject: [ofa-general] [PATCHv5 10/10] mlx4: Add RDMAoE support - allow interfaces to correspond to each other
Message-ID: <20090819143958.GG8675@mtls03>

This patch adds RDMAoE support to mlx4. Since mlx4_ib now needs to reference mlx4_en netdevices, a new mechanism was added. Two new fields were added to struct mlx4_interface to define a protocol and a get_prot_dev method to retrieve the corresponding protocol's net device. An implementation of the new verb ib_get_port_link_type() - mlx4_ib_get_port_link_type - was added. mlx4_ib_query_port() has been modified to support eth link types. An interface is considered to be active if its corresponding eth interface is active. Code for setting the GID table of a port has been added. Currently, each IB port has a single GID entry in its table, and that GID entry equals the link-local IPv6 address.
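As a rough illustration of the GID derivation described above, the following user-space sketch builds a link-local GID from a 6-byte MAC address in the way mlx4_addrconf_ifid_eui48() and update_ipv6_gids() in this patch appear to: an fe80::/64 prefix, the MAC split around ff:fe, and the universal/local bit flipped. The function name mac_to_link_local_gid is hypothetical and is not part of the patch.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not from the patch: fill a 16-byte GID with the
 * IPv6 link-local address derived from a 48-bit MAC (modified EUI-64).
 */
void mac_to_link_local_gid(const uint8_t mac[6], uint8_t gid[16])
{
	memset(gid, 0, 16);

	/* Subnet prefix fe80::/64 occupies the first 8 bytes. */
	gid[0] = 0xfe;
	gid[1] = 0x80;

	/* Interface ID: first 3 MAC bytes, ff:fe, last 3 MAC bytes. */
	memcpy(gid + 8, mac, 3);
	gid[11] = 0xff;
	gid[12] = 0xfe;
	memcpy(gid + 13, mac + 3, 3);

	/* Flip the universal/local bit of the first interface ID byte. */
	gid[8] ^= 0x02;
}
```

For a MAC of 00:02:c9:01:02:03 this would give fe80::202:c9ff:fe01:203, which is the address the IPv6 stack normally autoconfigures on the same netdev.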
Signed-off-by: Eli Cohen --- Changes from previous version: Bug fix - call flush_workqueue after unregistering notifiers. drivers/infiniband/hw/mlx4/main.c | 309 +++++++++++++++++++++++++++++++++---- drivers/net/mlx4/en_main.c | 15 ++- drivers/net/mlx4/en_port.c | 4 +- drivers/net/mlx4/en_port.h | 3 +- drivers/net/mlx4/intf.c | 20 +++ drivers/net/mlx4/main.c | 6 + drivers/net/mlx4/mlx4.h | 1 + include/linux/mlx4/cmd.h | 1 + include/linux/mlx4/driver.h | 16 ++- 9 files changed, 335 insertions(+), 40 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index ae3d759..1828aec 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -34,9 +34,12 @@ #include #include #include +#include +#include #include #include +#include #include #include @@ -57,6 +60,15 @@ static const char mlx4_ib_version[] = DRV_NAME ": Mellanox ConnectX InfiniBand driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; +struct update_gid_work { + struct work_struct work; + union ib_gid gids[128]; + int port; + struct mlx4_ib_dev *dev; +}; + +static struct workqueue_struct *wq; + static void init_query_mad(struct ib_smp *mad) { mad->base_version = 1; @@ -152,28 +164,19 @@ out: return err; } -static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port, - struct ib_port_attr *props) +static enum rdma_transport_type +mlx4_ib_port_get_transport(struct ib_device *device, u8 port_num) { - struct ib_smp *in_mad = NULL; - struct ib_smp *out_mad = NULL; - int err = -ENOMEM; - - in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); - out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); - if (!in_mad || !out_mad) - goto out; - - memset(props, 0, sizeof *props); - - init_query_mad(in_mad); - in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; - in_mad->attr_mod = cpu_to_be32(port); + struct mlx4_dev *dev = to_mdev(device)->dev; - err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad); - if (err) - goto out; + return dev->caps.port_mask & (1 << 
(port_num - 1)) ? + RDMA_TRANSPORT_IB : RDMA_TRANSPORT_RDMAOE; +} +static void ib_link_query_port(struct ib_device *ibdev, u8 port, + struct ib_port_attr *props, + struct ib_smp *out_mad) +{ props->lid = be16_to_cpup((__be16 *) (out_mad->data + 16)); props->lmc = out_mad->data[34] & 0x7; props->sm_lid = be16_to_cpup((__be16 *) (out_mad->data + 18)); @@ -193,6 +196,67 @@ static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port, props->subnet_timeout = out_mad->data[51] & 0x1f; props->max_vl_num = out_mad->data[37] >> 4; props->init_type_reply = out_mad->data[41] >> 4; + props->transport = RDMA_TRANSPORT_IB; +} + +static void eth_link_query_port(struct ib_device *ibdev, u8 port, + struct ib_port_attr *props, + struct ib_smp *out_mad) +{ + struct mlx4_ib_rdmaoe *rdmaoe = &to_mdev(ibdev)->rdmaoe; + struct net_device *ndev; + + props->port_cap_flags = IB_PORT_CM_SUP; + props->gid_tbl_len = to_mdev(ibdev)->dev->caps.gid_table_len[port]; + props->max_msg_sz = to_mdev(ibdev)->dev->caps.max_msg_sz; + props->pkey_tbl_len = 1; + props->bad_pkey_cntr = be16_to_cpup((__be16 *) (out_mad->data + 46)); + props->qkey_viol_cntr = be16_to_cpup((__be16 *) (out_mad->data + 48)); + props->active_width = 0; + props->active_speed = 0; + props->max_mtu = out_mad->data[41] & 0xf; + props->subnet_timeout = 0; + props->max_vl_num = out_mad->data[37] >> 4; + props->init_type_reply = 0; + props->transport = RDMA_TRANSPORT_RDMAOE; + spin_lock(&rdmaoe->lock); + ndev = rdmaoe->netdevs[port - 1]; + if (!ndev) + goto out; + + props->active_mtu = rdmaoe_get_mtu(ndev->mtu); + props->state = netif_running(ndev) && netif_oper_up(ndev) ? 
+ IB_PORT_ACTIVE : IB_PORT_DOWN; + props->phys_state = props->state; +out: + spin_unlock(&rdmaoe->lock); +} + +static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port, + struct ib_port_attr *props) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(props, 0, sizeof *props); + + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); + + err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad); + if (err) + goto out; + + mlx4_ib_port_get_transport(ibdev, port) == RDMA_TRANSPORT_IB ? + ib_link_query_port(ibdev, port, props, out_mad) : + eth_link_query_port(ibdev, port, props, out_mad); out: kfree(in_mad); @@ -201,8 +265,8 @@ out: return err; } -static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index, - union ib_gid *gid) +static int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index, + union ib_gid *gid) { struct ib_smp *in_mad = NULL; struct ib_smp *out_mad = NULL; @@ -239,6 +303,25 @@ out: return err; } +static int rdmaoe_query_gid(struct ib_device *ibdev, u8 port, int index, + union ib_gid *gid) +{ + struct mlx4_ib_dev *dev = to_mdev(ibdev); + + *gid = dev->rdmaoe.gid_table[port - 1][index]; + + return 0; +} + +static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index, + union ib_gid *gid) +{ + if (rdma_port_get_transport(ibdev, port) == RDMA_TRANSPORT_IB) + return __mlx4_ib_query_gid(ibdev, port, index, gid); + else + return rdmaoe_query_gid(ibdev, port, index, gid); +} + static int mlx4_ib_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey) { @@ -287,6 +370,7 @@ static int mlx4_SET_PORT(struct mlx4_ib_dev *dev, u8 port, int reset_qkey_viols, { struct mlx4_cmd_mailbox *mailbox; int err; + u8 is_eth = dev->dev->caps.port_type[port] == 
MLX4_PORT_TYPE_ETH; mailbox = mlx4_alloc_cmd_mailbox(dev->dev); if (IS_ERR(mailbox)) @@ -302,7 +386,7 @@ static int mlx4_SET_PORT(struct mlx4_ib_dev *dev, u8 port, int reset_qkey_viols, ((__be32 *) mailbox->buf)[1] = cpu_to_be32(cap_mask); } - err = mlx4_cmd(dev->dev, mailbox->dma, port, 0, MLX4_CMD_SET_PORT, + err = mlx4_cmd(dev->dev, mailbox->dma, port, is_eth, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B); mlx4_free_cmd_mailbox(dev->dev, mailbox); @@ -538,19 +622,146 @@ static struct device_attribute *mlx4_class_attributes[] = { &dev_attr_board_id }; +static void mlx4_addrconf_ifid_eui48(u8 *eui, struct net_device *dev) +{ + memcpy(eui, dev->dev_addr, 3); + memcpy(eui + 5, dev->dev_addr + 3, 3); + eui[3] = 0xFF; + eui[4] = 0xFE; + eui[0] ^= 2; +} + +static void update_gids_task(struct work_struct *work) +{ + struct update_gid_work *gw = container_of(work, struct update_gid_work, work); + struct mlx4_cmd_mailbox *mailbox; + union ib_gid *gids; + int err; + struct mlx4_dev *dev = gw->dev->dev; + struct ib_event event; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) { + printk(KERN_WARNING "update gid table failed %ld\n", PTR_ERR(mailbox)); + return; + } + + gids = mailbox->buf; + memcpy(gids, gw->gids, sizeof gw->gids); + + err = mlx4_cmd(dev, mailbox->dma, MLX4_SET_PORT_GID_TABLE << 8 | gw->port, + 1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B); + if (err) + printk(KERN_WARNING "set port command failed\n"); + else { + memcpy(gw->dev->rdmaoe.gid_table[gw->port - 1], gw->gids, sizeof gw->gids); + event.device = &gw->dev->ib_dev; + event.element.port_num = gw->port; + event.event = IB_EVENT_LID_CHANGE; + ib_dispatch_event(&event); + } + + mlx4_free_cmd_mailbox(dev, mailbox); + kfree(gw); +} + +static int update_ipv6_gids(struct mlx4_ib_dev *dev, int port, int clear) +{ + struct net_device *ndev = dev->rdmaoe.netdevs[port - 1]; + struct update_gid_work *work; + + work = kzalloc(sizeof *work, GFP_ATOMIC); + if (!work) + return -ENOMEM; + + if (!clear) { + 
mlx4_addrconf_ifid_eui48(&work->gids[0].raw[8], ndev); + work->gids[0].global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL); + } + + INIT_WORK(&work->work, update_gids_task); + work->port = port; + work->dev = dev; + queue_work(wq, &work->work); + + return 0; +} + +static void handle_en_event(struct mlx4_ib_dev *dev, int port, unsigned long event) +{ + switch (event) { + case NETDEV_UP: + update_ipv6_gids(dev, port, 0); + break; + + case NETDEV_DOWN: + update_ipv6_gids(dev, port, 1); + } +} + +static void netdev_added(struct mlx4_ib_dev *dev, int port) +{ + update_ipv6_gids(dev, port, 0); +} + +static void netdev_removed(struct mlx4_ib_dev *dev, int port) +{ + update_ipv6_gids(dev, port, 1); +} + +static int mlx4_ib_netdev_event(struct notifier_block *this, unsigned long event, + void *ptr) +{ + struct net_device *dev = ptr; + struct mlx4_ib_dev *ibdev; + struct net_device *oldnd; + struct mlx4_ib_rdmaoe *rdmaoe; + int port; + + if (!net_eq(dev_net(dev), &init_net)) + return NOTIFY_DONE; + + ibdev = container_of(this, struct mlx4_ib_dev, rdmaoe.nb); + rdmaoe = &ibdev->rdmaoe; + + spin_lock(&rdmaoe->lock); + mlx4_foreach_ib_transport_port(port, ibdev->dev) { + oldnd = rdmaoe->netdevs[port - 1]; + rdmaoe->netdevs[port - 1] = mlx4_get_prot_dev(ibdev->dev, MLX4_PROT_EN, port); + if (oldnd != rdmaoe->netdevs[port - 1]) { + if (rdmaoe->netdevs[port - 1]) + netdev_added(ibdev, port); + else + netdev_removed(ibdev, port); + } + } + + if (dev == rdmaoe->netdevs[0]) + handle_en_event(ibdev, 1, event); + else if (dev == rdmaoe->netdevs[1]) + handle_en_event(ibdev, 2, event); + + spin_unlock(&rdmaoe->lock); + + return NOTIFY_DONE; +} + static void *mlx4_ib_add(struct mlx4_dev *dev) { static int mlx4_ib_version_printed; struct mlx4_ib_dev *ibdev; int num_ports = 0; int i; + int err; + int port; + struct mlx4_ib_rdmaoe *rdmaoe; if (!mlx4_ib_version_printed) { printk(KERN_INFO "%s", mlx4_ib_version); ++mlx4_ib_version_printed; } - mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) 
+ mlx4_foreach_ib_transport_port(i, dev) num_ports++; /* No point in registering a device with no ports... */ @@ -563,6 +774,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) return NULL; } + rdmaoe = &ibdev->rdmaoe; + if (mlx4_pd_alloc(dev, &ibdev->priv_pdn)) goto err_dealloc; @@ -607,10 +820,12 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | - (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | + (1ull << IB_USER_VERBS_CMD_GET_MAC); ibdev->ib_dev.query_device = mlx4_ib_query_device; ibdev->ib_dev.query_port = mlx4_ib_query_port; + ibdev->ib_dev.get_port_transport = mlx4_ib_port_get_transport; ibdev->ib_dev.query_gid = mlx4_ib_query_gid; ibdev->ib_dev.query_pkey = mlx4_ib_query_pkey; ibdev->ib_dev.modify_device = mlx4_ib_modify_device; @@ -654,15 +869,26 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) ibdev->ib_dev.map_phys_fmr = mlx4_ib_map_phys_fmr; ibdev->ib_dev.unmap_fmr = mlx4_ib_unmap_fmr; ibdev->ib_dev.dealloc_fmr = mlx4_ib_fmr_dealloc; + ibdev->ib_dev.get_mac = mlx4_ib_get_mac; + + mlx4_foreach_ib_transport_port(port, dev) + rdmaoe->netdevs[port - 1] = mlx4_get_prot_dev(dev, MLX4_PROT_EN, port); + spin_lock_init(&rdmaoe->lock); + if (dev->caps.flags & MLX4_DEV_CAP_FLAG_RDMAOE && !rdmaoe->nb.notifier_call) { + rdmaoe->nb.notifier_call = mlx4_ib_netdev_event; + err = register_netdevice_notifier(&rdmaoe->nb); + if (err) + goto err_map; + } if (init_node_data(ibdev)) - goto err_map; + goto err_notif; spin_lock_init(&ibdev->sm_lock); mutex_init(&ibdev->cap_mask_mutex); if (ib_register_device(&ibdev->ib_dev)) - goto err_map; + goto err_notif; if (mlx4_ib_mad_init(ibdev)) goto err_reg; @@ -678,6 +904,10 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) err_reg: ib_unregister_device(&ibdev->ib_dev); +err_notif: + unregister_netdevice_notifier(&rdmaoe->nb); + flush_workqueue(wq); + err_map: 
iounmap(ibdev->uar_map); @@ -700,11 +930,16 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr) mlx4_ib_mad_cleanup(ibdev); ib_unregister_device(&ibdev->ib_dev); + if (ibdev->rdmaoe.nb.notifier_call) { + unregister_netdevice_notifier(&ibdev->rdmaoe.nb); + flush_workqueue(wq); + ibdev->rdmaoe.nb.notifier_call = NULL; + } + iounmap(ibdev->uar_map); - for (p = 1; p <= ibdev->num_ports; ++p) + mlx4_foreach_port(p, dev, MLX4_PORT_TYPE_IB) mlx4_CLOSE_PORT(dev, p); - iounmap(ibdev->uar_map); mlx4_uar_free(dev, &ibdev->priv_uar); mlx4_pd_free(dev, ibdev->priv_pdn); ib_dealloc_device(&ibdev->ib_dev); @@ -745,17 +980,31 @@ static void mlx4_ib_event(struct mlx4_dev *dev, void *ibdev_ptr, static struct mlx4_interface mlx4_ib_interface = { .add = mlx4_ib_add, .remove = mlx4_ib_remove, - .event = mlx4_ib_event + .event = mlx4_ib_event, + .protocol = MLX4_PROT_IB }; static int __init mlx4_ib_init(void) { - return mlx4_register_interface(&mlx4_ib_interface); + int err; + + wq = create_singlethread_workqueue("mlx4_ib"); + if (!wq) + return -ENOMEM; + + err = mlx4_register_interface(&mlx4_ib_interface); + if (err) { + destroy_workqueue(wq); + return err; + } + + return 0; } static void __exit mlx4_ib_cleanup(void) { mlx4_unregister_interface(&mlx4_ib_interface); + destroy_workqueue(wq); } module_init(mlx4_ib_init); diff --git a/drivers/net/mlx4/en_main.c b/drivers/net/mlx4/en_main.c index 510633f..6f30eca 100644 --- a/drivers/net/mlx4/en_main.c +++ b/drivers/net/mlx4/en_main.c @@ -51,6 +51,13 @@ static const char mlx4_en_version[] = DRV_NAME ": Mellanox ConnectX HCA Ethernet driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; +static void *get_netdev(struct mlx4_dev *dev, void *ctx, u8 port) +{ + struct mlx4_en_dev *endev = ctx; + + return endev->pndev[port]; +} + static void mlx4_en_event(struct mlx4_dev *dev, void *endev_ptr, enum mlx4_dev_event event, int port) { @@ -229,9 +236,11 @@ err_free_res: } static struct mlx4_interface mlx4_en_interface = { - .add = 
mlx4_en_add, - .remove = mlx4_en_remove, - .event = mlx4_en_event, + .add = mlx4_en_add, + .remove = mlx4_en_remove, + .event = mlx4_en_event, + .get_prot_dev = get_netdev, + .protocol = MLX4_PROT_EN, }; static int __init mlx4_en_init(void) diff --git a/drivers/net/mlx4/en_port.c b/drivers/net/mlx4/en_port.c index a29abe8..a249887 100644 --- a/drivers/net/mlx4/en_port.c +++ b/drivers/net/mlx4/en_port.c @@ -127,8 +127,8 @@ int mlx4_SET_PORT_qpn_calc(struct mlx4_dev *dev, u8 port, u32 base_qpn, memset(context, 0, sizeof *context); context->base_qpn = cpu_to_be32(base_qpn); - context->promisc = cpu_to_be32(promisc << SET_PORT_PROMISC_SHIFT | base_qpn); - context->mcast = cpu_to_be32(1 << SET_PORT_PROMISC_SHIFT | base_qpn); + context->promisc = cpu_to_be32(promisc << SET_PORT_PROMISC_EN_SHIFT | base_qpn); + context->mcast = cpu_to_be32(1 << SET_PORT_PROMISC_MODE_SHIFT | base_qpn); context->intra_no_vlan = 0; context->no_vlan = MLX4_NO_VLAN_IDX; context->intra_vlan_miss = 0; diff --git a/drivers/net/mlx4/en_port.h b/drivers/net/mlx4/en_port.h index e6477f1..9354891 100644 --- a/drivers/net/mlx4/en_port.h +++ b/drivers/net/mlx4/en_port.h @@ -36,7 +36,8 @@ #define SET_PORT_GEN_ALL_VALID 0x7 -#define SET_PORT_PROMISC_SHIFT 31 +#define SET_PORT_PROMISC_EN_SHIFT 31 +#define SET_PORT_PROMISC_MODE_SHIFT 30 enum { MLX4_CMD_SET_VLAN_FLTR = 0x47, diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c index 0e7eb10..d64530e 100644 --- a/drivers/net/mlx4/intf.c +++ b/drivers/net/mlx4/intf.c @@ -159,3 +159,23 @@ void mlx4_unregister_device(struct mlx4_dev *dev) mutex_unlock(&intf_mutex); } + +void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_device_context *dev_ctx; + unsigned long flags; + void *result = NULL; + + spin_lock_irqsave(&priv->ctx_lock, flags); + + list_for_each_entry(dev_ctx, &priv->ctx_list, list) + if (dev_ctx->intf->protocol == proto && dev_ctx->intf->get_prot_dev) { 
+ result = dev_ctx->intf->get_prot_dev(dev, dev_ctx->context, port); + break; + } + + spin_unlock_irqrestore(&priv->ctx_lock, flags); + + return result; +} diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 30bea96..c72af51 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -100,6 +100,12 @@ module_param_named(use_prio, use_prio, bool, 0444); MODULE_PARM_DESC(use_prio, "Enable steering by VLAN priority on ETH ports " "(0/1, default 0)"); +void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port) +{ + return mlx4_find_get_prot_dev(dev, proto, port); +} +EXPORT_SYMBOL(mlx4_get_prot_dev); + int mlx4_check_port_params(struct mlx4_dev *dev, enum mlx4_port_type *port_type) { diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 5bd79c2..db068c9 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -364,6 +364,7 @@ int mlx4_restart_one(struct pci_dev *pdev); int mlx4_register_device(struct mlx4_dev *dev); void mlx4_unregister_device(struct mlx4_dev *dev); void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_dev_event type, int port); +void *mlx4_find_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port); struct mlx4_dev_cap; struct mlx4_init_hca_param; diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h index 0f82293..22bd8d3 100644 --- a/include/linux/mlx4/cmd.h +++ b/include/linux/mlx4/cmd.h @@ -140,6 +140,7 @@ enum { MLX4_SET_PORT_MAC_TABLE = 0x2, MLX4_SET_PORT_VLAN_TABLE = 0x3, MLX4_SET_PORT_PRIO_MAP = 0x4, + MLX4_SET_PORT_GID_TABLE = 0x5, }; struct mlx4_dev; diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h index 53c5fdb..0083256 100644 --- a/include/linux/mlx4/driver.h +++ b/include/linux/mlx4/driver.h @@ -44,15 +44,23 @@ enum mlx4_dev_event { MLX4_DEV_EVENT_PORT_REINIT, }; +enum mlx4_prot { + MLX4_PROT_IB, + MLX4_PROT_EN, +}; + struct mlx4_interface { - void * (*add) (struct mlx4_dev *dev); - void (*remove)(struct mlx4_dev *dev, 
void *context); - void (*event) (struct mlx4_dev *dev, void *context, - enum mlx4_dev_event event, int port); + void * (*add) (struct mlx4_dev *dev); + void (*remove)(struct mlx4_dev *dev, void *context); + void (*event) (struct mlx4_dev *dev, void *context, + enum mlx4_dev_event event, int port); + void * (*get_prot_dev) (struct mlx4_dev *dev, void *context, u8 port); + enum mlx4_prot protocol; struct list_head list; }; int mlx4_register_interface(struct mlx4_interface *intf); void mlx4_unregister_interface(struct mlx4_interface *intf); +void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port); #endif /* MLX4_DRIVER_H */ -- 1.6.4 From eli at dev.mellanox.co.il Wed Aug 19 10:19:35 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 19 Aug 2009 20:19:35 +0300 Subject: [ofa-general] [PATCHv5 0/10] RDMAoE support Message-ID: <20090819171935.GA14411@mtls03> RDMA over Ethernet (RDMAoE) allows running the IB transport protocol using Ethernet frames, enabling the deployment of IB semantics on lossless Ethernet fabrics. RDMAoE packets are standard Ethernet frames with an IEEE assigned Ethertype, a GRH, unmodified IB transport headers and payload. IB subnet management and SA services are not required for RDMAoE operation; Ethernet management practices are used instead. RDMAoE encodes IP addresses into its GIDs and resolves MAC addresses using the host IP stack. For multicast GIDs, standard IP to MAC mappings apply. To support RDMAoE, a new transport protocol was added to the IB core. An RDMA device can have ports with different transports, which are identified by a port transport attribute. The RDMA Verbs API is syntactically unmodified. When referring to RDMAoE ports, Address handles are required to contain GIDs while LID fields are ignored. The Ethernet L2 information is subsequently obtained by the vendor-specific driver (both in kernel- and user-space) while modifying QPs to RTR and creating address handles. 
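The cover letter says that standard IP to MAC mappings apply to multicast GIDs. Below is a minimal sketch of what that could look like, assuming the IPv6 multicast mapping of RFC 2464 (33:33 followed by the low 32 bits of the address) is the one meant. The helper name mgid_to_mac is made up for illustration and does not appear in the series.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not from the patch set: map a 16-byte multicast
 * GID (laid out like an IPv6 multicast address) to an Ethernet multicast
 * MAC using the RFC 2464 rule.
 */
void mgid_to_mac(const uint8_t mgid[16], uint8_t mac[6])
{
	/* Fixed 33:33 prefix reserved for IPv6 multicast. */
	mac[0] = 0x33;
	mac[1] = 0x33;

	/* The low-order four bytes of the address complete the MAC. */
	memcpy(mac + 2, mgid + 12, 4);
}
```

With this rule the all-nodes group ff02::1 maps to 33:33:00:00:00:01, so multicast traffic does not have to fall back to the broadcast MAC.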
As there is no SA in RDMAoE, the CMA code is modified to fill the necessary path record attributes locally before sending CM packets. Similarly, the CMA provides to the user the required address handle attributes when processing SIDR requests and joining multicast groups.

In this patch set, an RDMAoE port is currently assigned a single GID, encoding the IPv6 link-local address of the corresponding netdev; the CMA RDMAoE code temporarily uses IPv6 link-local addresses as GIDs instead of the IP address provided by the user, thereby supporting any IP address.

To enable RDMAoE with the mlx4 driver stack, both the mlx4_en and mlx4_ib drivers must be loaded, and the netdevice for the corresponding RDMAoE port must be running. Individual ports of a multi-port HCA can be independently configured as Ethernet (with support for RDMAoE) or IB, as is already the case.

We have successfully tested MPI, SDP, RDS, and native Verbs applications over RDMAoE.

Following is a series of 10 patches based on version 2.6.30 of the Linux kernel. This new series reflects changes based on feedback from the community on the previous set of patches, and is tagged v5.

Changes from v4:

1. Added rdma_is_transport_supported() and used it to simplify conditionals throughout the code.
2. ib_register_mad_agent() for QP0 is only called for IB ports.
3. PATCH 5/10 changed from "Enable support for RDMAoE ports" to "Enable support only for IB ports".
4. MAD services from userspace are currently not supported for RDMAoE ports.
5. Add kref to struct cma_multicast to aid in maintaining reference count on the object. This is to avoid freeing the object while the worker thread is still using it.
6. Return immediate error for invalid MTU when resolving an RDMAoE path.
7. Don't fail resolve path if rate is 0 since this value stands for IB_RATE_PORT_CURRENT.
8. In cma_rdmaoe_join_multicast(), fail immediately if mtu is zero.
9. Add ucma_copy_rdmaoe_route() instead of modifying ucma_copy_ib_route().
10.
Bug fix: in PATCH 10/10, call flush_workqueue after unregistering netdev notifiers.
11. Multicast no longer uses the broadcast MAC.
12. No changes to patches 2, 7 and 8 from the v4 series.

Signed-off-by: Eli Cohen
---
b/drivers/infiniband/core/agent.c | 38 ++-
b/drivers/infiniband/core/cm.c | 25 +-
b/drivers/infiniband/core/cma.c | 54 ++--
b/drivers/infiniband/core/mad.c | 41 ++-
b/drivers/infiniband/core/multicast.c | 4
b/drivers/infiniband/core/sa_query.c | 39 ++-
b/drivers/infiniband/core/ucm.c | 8
b/drivers/infiniband/core/ucma.c | 2
b/drivers/infiniband/core/ud_header.c | 111 ++++++++++
b/drivers/infiniband/core/user_mad.c | 6
b/drivers/infiniband/core/uverbs.h | 1
b/drivers/infiniband/core/uverbs_cmd.c | 32 ++
b/drivers/infiniband/core/uverbs_main.c | 1
b/drivers/infiniband/core/verbs.c | 25 ++
b/drivers/infiniband/hw/mlx4/ah.c | 187 +++++++++++++---
b/drivers/infiniband/hw/mlx4/mad.c | 32 +-
b/drivers/infiniband/hw/mlx4/main.c | 309 +++++++++++++++++++++++++---
b/drivers/infiniband/hw/mlx4/mlx4_ib.h | 19 +
b/drivers/infiniband/hw/mlx4/qp.c | 172 ++++++++++-----
b/drivers/infiniband/ulp/ipoib/ipoib_main.c | 12 -
b/drivers/net/mlx4/en_main.c | 15 +
b/drivers/net/mlx4/en_port.c | 4
b/drivers/net/mlx4/en_port.h | 3
b/drivers/net/mlx4/fw.c | 3
b/drivers/net/mlx4/intf.c | 20 +
b/drivers/net/mlx4/main.c | 6
b/drivers/net/mlx4/mlx4.h | 1
b/include/linux/mlx4/cmd.h | 1
b/include/linux/mlx4/device.h | 31 ++
b/include/linux/mlx4/driver.h | 16 +
b/include/linux/mlx4/qp.h | 8
b/include/rdma/ib_addr.h | 92 ++++++++
b/include/rdma/ib_pack.h | 26 ++
b/include/rdma/ib_user_verbs.h | 21 +
b/include/rdma/ib_verbs.h | 11
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3
b/net/sunrpc/xprtrdma/svc_rdma_transport.c | 2
drivers/infiniband/core/cm.c | 5
drivers/infiniband/core/cma.c | 207 ++++++++++++++++++
drivers/infiniband/core/mad.c | 37 ++-
drivers/infiniband/core/ucm.c | 12 -
drivers/infiniband/core/ucma.c | 31 ++
drivers/infiniband/core/user_mad.c | 15 -
drivers/infiniband/core/verbs.c | 10
include/rdma/ib_verbs.h | 15 +
45 files changed, 1440 insertions(+), 273 deletions(-)

From eli at mellanox.co.il Wed Aug 19 10:20:07 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 20:20:07 +0300
Subject: [ofa-general] [PATCHv5 02/10] ib_core: Add RDMAoE transport protocol
Message-ID: <20090819172007.GB14411@mtls03>

Add a new transport protocol, RDMAoE, used for transporting InfiniBand traffic over Ethernet fabrics.

Signed-off-by: Eli Cohen
---
include/rdma/ib_verbs.h | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 4cf42f3..d9146c4 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -69,7 +69,8 @@ enum rdma_node_type {

 enum rdma_transport_type {
 	RDMA_TRANSPORT_IB,
-	RDMA_TRANSPORT_IWARP
+	RDMA_TRANSPORT_IWARP,
+	RDMA_TRANSPORT_RDMAOE
 };

 enum ib_device_cap_flags {
--
1.6.4

From eli at mellanox.co.il Wed Aug 19 10:20:24 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 19 Aug 2009 20:20:24 +0300
Subject: [ofa-general] [PATCHv5 07/10] ib_core: RDMAoE UD packet packing support
Message-ID: <20090819172024.GC14411@mtls03>

Add support functions to aid in packing RDMAoE packets.
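To make the packing arithmetic easier to follow, here is a small stand-alone sketch of the length calculations that the helpers in this patch appear to perform. The byte counts mirror the IB_ETH_BYTES, IB_GRH_BYTES, IB_BTH_BYTES and IB_DETH_BYTES values in ib_pack.h; the function names and the IMM_BYTES constant are illustrative only and are not part of the patch.

```c
#include <stddef.h>

/* Header sizes in bytes, matching the enum values in ib_pack.h. */
enum {
	ETH_BYTES  = 14,
	GRH_BYTES  = 40,
	BTH_BYTES  = 12,
	DETH_BYTES = 8,
	IMM_BYTES  = 4	/* assumed: sizeof(__be32) immediate data */
};

/* Total packed header length for an RDMAoE UD packet (hypothetical name). */
size_t rdmaoe_ud_header_len(int grh_present, int immediate_present)
{
	size_t len = ETH_BYTES + BTH_BYTES + DETH_BYTES;

	if (grh_present)
		len += GRH_BYTES;
	if (immediate_present)
		len += IMM_BYTES;
	return len;
}

/* GRH payload length: BTH + DETH + payload + 4-byte ICRC, rounded up to 4. */
unsigned int grh_payload_length(unsigned int payload_bytes)
{
	return (BTH_BYTES + DETH_BYTES + payload_bytes + 4 + 3) & ~3u;
}

/* Number of pad bytes needed to bring the payload to a 4-byte multiple. */
unsigned int bth_pad_count(unsigned int payload_bytes)
{
	return (4 - payload_bytes) & 3;
}
```

Under these assumptions, a packet with a GRH and no immediate data carries 14 + 40 + 12 + 8 = 74 bytes of headers, and a 5-byte payload needs 3 pad bytes, giving a GRH payload length of 32.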
Signed-off-by: Eli Cohen --- drivers/infiniband/core/ud_header.c | 111 +++++++++++++++++++++++++++++++++++ include/rdma/ib_pack.h | 26 ++++++++ 2 files changed, 137 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/ud_header.c b/drivers/infiniband/core/ud_header.c index 8ec7876..d04b6f2 100644 --- a/drivers/infiniband/core/ud_header.c +++ b/drivers/infiniband/core/ud_header.c @@ -80,6 +80,29 @@ static const struct ib_field lrh_table[] = { .size_bits = 16 } }; +static const struct ib_field eth_table[] = { + { STRUCT_FIELD(eth, dmac_h), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { STRUCT_FIELD(eth, dmac_l), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { STRUCT_FIELD(eth, smac_h), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 }, + { STRUCT_FIELD(eth, smac_l), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 32 }, + { STRUCT_FIELD(eth, type), + .offset_words = 3, + .offset_bits = 0, + .size_bits = 16 } +}; + static const struct ib_field grh_table[] = { { STRUCT_FIELD(grh, ip_version), .offset_words = 0, @@ -241,6 +264,53 @@ void ib_ud_header_init(int payload_bytes, EXPORT_SYMBOL(ib_ud_header_init); /** + * ib_rdmaoe_ud_header_init - Initialize UD header structure + * @payload_bytes:Length of packet payload + * @grh_present:GRH flag (if non-zero, GRH will be included) + * @header:Structure to initialize + * + * ib_rdmaoe_ud_header_init() initializes the grh.ip_version, grh.payload_length, + * grh.next_header, bth.opcode, bth.pad_count and + * bth.transport_header_version fields of a &struct eth_ud_header given + * the payload length and whether a GRH will be included. 
+ */ +void ib_rdmaoe_ud_header_init(int payload_bytes, + int grh_present, + struct eth_ud_header *header) +{ + int header_len; + + memset(header, 0, sizeof *header); + + header_len = + sizeof header->eth + + IB_BTH_BYTES + + IB_DETH_BYTES; + if (grh_present) + header_len += IB_GRH_BYTES; + + header->grh_present = grh_present; + if (grh_present) { + header->grh.ip_version = 6; + header->grh.payload_length = + cpu_to_be16((IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) & ~3); /* round up */ + header->grh.next_header = 0x1b; + } + + if (header->immediate_present) + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + else + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY; + header->bth.pad_count = (4 - payload_bytes) & 3; + header->bth.transport_header_version = 0; +} +EXPORT_SYMBOL(ib_rdmaoe_ud_header_init); + +/** * ib_ud_header_pack - Pack UD header struct into wire format * @header:UD header struct * @buf:Buffer to pack into @@ -281,6 +351,47 @@ int ib_ud_header_pack(struct ib_ud_header *header, EXPORT_SYMBOL(ib_ud_header_pack); /** + * rdmaoe_ud_header_pack - Pack UD header struct into eth wire format + * @header:UD header struct + * @buf:Buffer to pack into + * + * ib_ud_header_pack() packs the UD header structure @header into wire + * format in the buffer @buf. 
+ */ +int rdmaoe_ud_header_pack(struct eth_ud_header *header, + void *buf) +{ + int len = 0; + + ib_pack(eth_table, ARRAY_SIZE(eth_table), + &header->eth, buf); + len += IB_ETH_BYTES; + + if (header->grh_present) { + ib_pack(grh_table, ARRAY_SIZE(grh_table), + &header->grh, buf + len); + len += IB_GRH_BYTES; + } + + ib_pack(bth_table, ARRAY_SIZE(bth_table), + &header->bth, buf + len); + len += IB_BTH_BYTES; + + ib_pack(deth_table, ARRAY_SIZE(deth_table), + &header->deth, buf + len); + len += IB_DETH_BYTES; + + if (header->immediate_present) { + memcpy(buf + len, &header->immediate_data, + sizeof header->immediate_data); + len += sizeof header->immediate_data; + } + + return len; +} +EXPORT_SYMBOL(rdmaoe_ud_header_pack); + +/** * ib_ud_header_unpack - Unpack UD header struct from wire format * @header:UD header struct * @buf:Buffer to pack into diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h index d7fc45c..bf199eb 100644 --- a/include/rdma/ib_pack.h +++ b/include/rdma/ib_pack.h @@ -37,6 +37,7 @@ enum { IB_LRH_BYTES = 8, + IB_ETH_BYTES = 14, IB_GRH_BYTES = 40, IB_BTH_BYTES = 12, IB_DETH_BYTES = 8 @@ -210,6 +211,14 @@ struct ib_unpacked_deth { __be32 source_qpn; }; +struct ib_unpacked_eth { + u8 dmac_h[4]; + u8 dmac_l[2]; + u8 smac_h[2]; + u8 smac_l[4]; + __be16 type; +}; + struct ib_ud_header { struct ib_unpacked_lrh lrh; int grh_present; @@ -220,6 +229,16 @@ struct ib_ud_header { __be32 immediate_data; }; +struct eth_ud_header { + struct ib_unpacked_eth eth; + int grh_present; + struct ib_unpacked_grh grh; + struct ib_unpacked_bth bth; + struct ib_unpacked_deth deth; + int immediate_present; + __be32 immediate_data; +}; + void ib_pack(const struct ib_field *desc, int desc_len, void *structure, @@ -234,10 +253,17 @@ void ib_ud_header_init(int payload_bytes, int grh_present, struct ib_ud_header *header); +void ib_rdmaoe_ud_header_init(int payload_bytes, + int grh_present, + struct eth_ud_header *header); + int ib_ud_header_pack(struct ib_ud_header 
*header, void *buf); int ib_ud_header_unpack(void *buf, struct ib_ud_header *header); +int rdmaoe_ud_header_pack(struct eth_ud_header *header, + void *buf); + #endif /* IB_PACK_H */ -- 1.6.4 From eli at mellanox.co.il Wed Aug 19 10:20:39 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 19 Aug 2009 20:20:39 +0300 Subject: [ofa-general] [PATCHv5 08/10] ib_core: Add API to support RDMAoE from userspace Message-ID: <20090819172039.GD14411@mtls03> Add ib_uverbs_get_mac() to be used by ibv_create_ah() to retrieve the remote port's MAC address. Port transport is also returned by ibv_query_port(). ABI version is incremented from 6 to 7. Signed-off-by: Eli Cohen --- drivers/infiniband/core/uverbs.h | 1 + drivers/infiniband/core/uverbs_cmd.c | 32 ++++++++++++++++++++++++++++++++ drivers/infiniband/core/uverbs_main.c | 1 + drivers/infiniband/core/verbs.c | 10 ++++++++++ include/rdma/ib_user_verbs.h | 21 ++++++++++++++++++--- include/rdma/ib_verbs.h | 12 ++++++++++++ 6 files changed, 74 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index b3ea958..e69b04c 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -194,5 +194,6 @@ IB_UVERBS_DECLARE_CMD(create_srq); IB_UVERBS_DECLARE_CMD(modify_srq); IB_UVERBS_DECLARE_CMD(query_srq); IB_UVERBS_DECLARE_CMD(destroy_srq); +IB_UVERBS_DECLARE_CMD(get_mac); #endif /* UVERBS_H */ diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 56feab6..012aadf 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -452,6 +452,7 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file, resp.active_width = attr.active_width; resp.active_speed = attr.active_speed; resp.phys_state = attr.phys_state; + resp.transport = attr.transport; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) @@ -1824,6 +1825,37 @@ err: return ret; } +ssize_t
ib_uverbs_get_mac(struct ib_uverbs_file *file, const char __user *buf, + int in_len, int out_len) +{ + struct ib_uverbs_get_mac cmd; + struct ib_uverbs_get_mac_resp resp; + int ret; + struct ib_pd *pd; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) + return -EINVAL; + + ret = ib_get_mac(pd->device, cmd.port, cmd.gid, resp.mac); + put_pd_read(pd); + if (!ret) { + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + return -EFAULT; + + return in_len; + } + + return ret; +} + ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) { diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index eb36a81..2641845 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -108,6 +108,7 @@ static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file, [IB_USER_VERBS_CMD_MODIFY_SRQ] = ib_uverbs_modify_srq, [IB_USER_VERBS_CMD_QUERY_SRQ] = ib_uverbs_query_srq, [IB_USER_VERBS_CMD_DESTROY_SRQ] = ib_uverbs_destroy_srq, + [IB_USER_VERBS_CMD_GET_MAC] = ib_uverbs_get_mac, }; static struct vfsmount *uverbs_event_mnt; diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index d81e217..aaa8778 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -925,3 +925,13 @@ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) return qp->device->detach_mcast(qp, gid, lid); } EXPORT_SYMBOL(ib_detach_mcast); + +int ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac) +{ + if (!device->get_mac) + return -ENOSYS; + + return device->get_mac(device, port, gid, mac); +} +EXPORT_SYMBOL(ib_get_mac); + diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h index a17f771..49eee8a 100644 --- 
a/include/rdma/ib_user_verbs.h +++ b/include/rdma/ib_user_verbs.h @@ -42,7 +42,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. */ -#define IB_USER_VERBS_ABI_VERSION 6 +#define IB_USER_VERBS_ABI_VERSION 7 enum { IB_USER_VERBS_CMD_GET_CONTEXT, @@ -81,7 +81,8 @@ enum { IB_USER_VERBS_CMD_MODIFY_SRQ, IB_USER_VERBS_CMD_QUERY_SRQ, IB_USER_VERBS_CMD_DESTROY_SRQ, - IB_USER_VERBS_CMD_POST_SRQ_RECV + IB_USER_VERBS_CMD_POST_SRQ_RECV, + IB_USER_VERBS_CMD_GET_MAC }; /* @@ -205,7 +206,8 @@ struct ib_uverbs_query_port_resp { __u8 active_width; __u8 active_speed; __u8 phys_state; - __u8 reserved[3]; + __u8 transport; + __u8 reserved[2]; }; struct ib_uverbs_alloc_pd { @@ -621,6 +623,19 @@ struct ib_uverbs_destroy_ah { __u32 ah_handle; }; +struct ib_uverbs_get_mac { + __u64 response; + __u32 pd_handle; + __u8 port; + __u8 reserved[3]; + __u8 gid[16]; +}; + +struct ib_uverbs_get_mac_resp { + __u8 mac[6]; + __u16 reserved; +}; + struct ib_uverbs_attach_mcast { __u8 gid[16]; __u32 qp_handle; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index d9146c4..bf6e860 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1131,6 +1131,9 @@ struct ib_device { struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad); + int (*get_mac)(struct ib_device *device, u8 port, + u8 *gid, u8 *mac); + struct ib_dma_mapping_ops *dma_ops; @@ -2037,4 +2040,13 @@ int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); */ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); +/** + * ib_get_mac - get the mac address for the specified gid + * @device: IB device used for traffic + * @port: port number used. 
+ * @gid: gid to be resolved into mac + * @mac: mac of the port bearing this gid + */ +int ib_get_mac(struct ib_device *device, u8 port, u8 *gid, u8 *mac); + #endif /* IB_VERBS_H */ -- 1.6.4 From devel-ofed at morey-chaisemartin.com Wed Aug 19 11:49:39 2009 From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Wed, 19 Aug 2009 20:49:39 +0200 Subject: [ofa-general] [PATCH 0/3] Fat-Tree code cleanup Message-ID: <4A8C4943.6090408@morey-chaisemartin.com> Except the first one, patches in this series are trivial cleanups. The first one removes the reverse_hop parameter, which is not used anymore thanks to the current_hop. Nicolas Morey-Chaisemartin (3): osm_ucast_ftree.c: Removed reverse_hop parameters from fabric_route_upgoing_by_going_down osm_ucast_ftree.c: Cleaned up many comments osm_ucast_ftree.c: Applied osm_indent opensm/opensm/osm_ucast_ftree.c | 169 +++++++++++++++++---------------------- 1 files changed, 72 insertions(+), 97 deletions(-) From devel-ofed at morey-chaisemartin.com Wed Aug 19 11:50:04 2009 From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Wed, 19 Aug 2009 20:50:04 +0200 Subject: [ofa-general] [PATCH 1/3] osm_ucast_ftree.c: Removed reverse_hop parameters from fabric_route_upgoing_by_going_down In-Reply-To: References: Message-ID: <4A8C495C.202@morey-chaisemartin.com> The parameter was only used to calculate the number of hops done up to this point, but this is not required anymore as there is a current_hops parameter now. Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/opensm/osm_ucast_ftree.c | 5 +---- 1 files changed, 1 insertions(+), 4 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed...
Name: d432935addec203b1a0b3d8343c30016fc49f5c0.diff Type: text/x-patch Size: 1476 bytes Desc: not available URL: From nicolas at morey-chaisemartin.com Wed Aug 19 11:50:23 2009 From: nicolas at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Wed, 19 Aug 2009 20:50:23 +0200 Subject: [ofa-general] [PATCH 2/3] osm_ucast_ftree.c: Cleaned up many comments In-Reply-To: References: Message-ID: <4A8C496F.4020705@morey-chaisemartin.com> Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/opensm/osm_ucast_ftree.c | 33 +++++++++++---------------------- 1 files changed, 11 insertions(+), 22 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... Name: 9dd7818726e5c3062bf14e0b026dfc7e12f6e8fe.diff Type: text/x-patch Size: 5748 bytes Desc: not available URL: From devel-ofed at morey-chaisemartin.com Wed Aug 19 11:50:52 2009 From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Wed, 19 Aug 2009 20:50:52 +0200 Subject: [ofa-general] [PATCH 2/3] osm_ucast_ftree.c: Cleaned up many comments In-Reply-To: References: Message-ID: <4A8C498C.8000603@morey-chaisemartin.com> Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/opensm/osm_ucast_ftree.c | 33 +++++++++++---------------------- 1 files changed, 11 insertions(+), 22 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 9dd7818726e5c3062bf14e0b026dfc7e12f6e8fe.diff Type: text/x-patch Size: 5748 bytes Desc: not available URL: From devel-ofed at morey-chaisemartin.com Wed Aug 19 11:51:15 2009 From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Wed, 19 Aug 2009 20:51:15 +0200 Subject: [ofa-general] [PATCH 3/3] osm_ucast_ftree.c: Applied osm_indent In-Reply-To: References: Message-ID: <4A8C49A3.7020404@morey-chaisemartin.com> Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/opensm/osm_ucast_ftree.c | 135 ++++++++++++++++++--------------------- 1 files changed, 62 insertions(+), 73 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... Name: c09f7c217e85b91938b57660f17fa212e6d27b1d.diff Type: text/x-patch Size: 11863 bytes Desc: not available URL: From kononov at ftml.net Wed Aug 19 15:01:06 2009 From: kononov at ftml.net (Roman Kononov) Date: Wed, 19 Aug 2009 17:01:06 -0500 Subject: [ofa-general] ibv_rc_pingpong hangs with forks Message-ID: <4A8C7622.8080505@ftml.net> Hello, The attached modification to ibv_rc_pingpong makes it never complete. Forking seems to do something bad. I noticed that forking right after ibv_post_recv() "cancels" the posted WR (as if it was never issued): sender keeps retrying. Is it expected behavior or a bug? This happens with ConnectX MT25418 and InfiniHost MT25208 adapters. Extracted from OFED-1.4.tgz: libibverbs-1.1.2 libmlx4-1.0 libmthca-1.0.5 Kernel: 2.6.30.5 x86_64 SMP Roman Kononov -------------- next part -------------- A non-text attachment was scrubbed... 
Name: fork-test.patch Type: text/x-patch Size: 904 bytes Desc: not available URL: From rdreier at cisco.com Wed Aug 19 15:14:29 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Aug 2009 15:14:29 -0700 Subject: [ofa-general] ibv_rc_pingpong hangs with forks In-Reply-To: <4A8C7622.8080505@ftml.net> (Roman Kononov's message of "Wed, 19 Aug 2009 17:01:06 -0500") References: <4A8C7622.8080505@ftml.net> Message-ID: > The attached modification to ibv_rc_pingpong makes it never complete. > > Forking seems to do something bad. I noticed that forking right after > ibv_post_recv() "cancels" the posted WR (as if it was never issued): > sender keeps retrying. > > Is it expected behavior or a bug? Yes, forking is expected to cause issues. Do things work better if you set the environment variable IBV_FORK_SAFE=1 ? - R. From kononov at ftml.net Wed Aug 19 16:06:21 2009 From: kononov at ftml.net (Roman Kononov) Date: Wed, 19 Aug 2009 18:06:21 -0500 Subject: [ofa-general] ibv_rc_pingpong hangs with forks In-Reply-To: References: <4A8C7622.8080505@ftml.net> Message-ID: <4A8C856D.7010705@ftml.net> On 2009-08-19 17:14, Roland Dreier wrote: > Yes, forking is expected to cause issues. Do things work better if you > set the environment variable IBV_FORK_SAFE=1 ? Yes, indeed. This thing sucked tonnes of my blood. Thanks. 
Roman From vlad at lists.openfabrics.org Thu Aug 20 03:00:41 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 20 Aug 2009 03:00:41 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090820-0200 daily build status Message-ID: <20090820100042.02B9F10201B5@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': 
/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build 
failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090820-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From kliteyn at dev.mellanox.co.il Thu Aug 20 06:06:55 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 20 Aug 2009 16:06:55 +0300 Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR query to QoS level with pkey Message-ID: <4A8D4A6F.9050404@dev.mellanox.co.il> Hi Sasha, Fixing a bug in matching PR query to QoS levels when pkey specified - pkeys in QoS policy are held w/o the MSB. 
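As a rough illustration of the pkey matching described here: bit 15 of a P_Key is the membership bit and the low 15 bits are the base value, so a lookup against ranges stored without the MSB has to mask that bit off first. This is a minimal sketch in host byte order, not the OpenSM code; PKEY_BASE_MASK, struct pkey_range, pkey_base() and pkey_in_ranges() are hypothetical names used only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: low 15 bits of a P_Key, host byte order. */
#define PKEY_BASE_MASK 0x7fffu

/* Hypothetical range type; bounds are stored without the membership bit. */
struct pkey_range {
	uint16_t lo;
	uint16_t hi;
};

/* Strip the membership bit (bit 15) and keep the 15-bit base value. */
static uint16_t pkey_base(uint16_t pkey_host)
{
	return (uint16_t)(pkey_host & PKEY_BASE_MASK);
}

/* Return true if the base of pkey_host falls inside any of the n ranges. */
static bool pkey_in_ranges(const struct pkey_range *r, unsigned n,
			   uint16_t pkey_host)
{
	uint16_t base = pkey_base(pkey_host);
	unsigned i;

	for (i = 0; i < n; i++)
		if (base >= r[i].lo && base <= r[i].hi)
			return true;
	return false;
}
```

Without the mask, a full-member pkey such as 0x8005 would compare as 32773 and miss a range like 0x0001-0x0010 even though its base value 0x0005 is inside it.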
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_qos_policy.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index febd7f6..9b72293 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -303,7 +303,7 @@ boolean_t osm_qos_level_has_pkey(IN const osm_qos_level_t * p_qos_level, return FALSE; return __is_num_in_range_arr(p_qos_level->pkey_range_arr, p_qos_level->pkey_range_len, - cl_ntoh16(pkey)); + cl_ntoh16(ib_pkey_get_base(pkey))); } /*************************************************** -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu Aug 20 06:07:16 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 20 Aug 2009 16:07:16 +0300 Subject: [ofa-general] [PATCH] opensm: fixing some data types in osm_req_get/set Message-ID: <4A8D4A84.3050605@dev.mellanox.co.il> Hi Sasha, Attribute ID and attribute modifier are used in osm_req_get/set in network order - fixing data types. 
Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_sm.h | 8 ++++---- opensm/opensm/osm_req.c | 8 ++++---- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h index cc8321d..152ecd7 100644 --- a/opensm/include/opensm/osm_sm.h +++ b/opensm/include/opensm/osm_sm.h @@ -404,8 +404,8 @@ osm_sm_bind(IN osm_sm_t * const p_sm, IN const ib_net64_t port_guid); ib_api_status_t osm_req_get(IN osm_sm_t * sm, IN const osm_dr_path_t * const p_path, - IN const uint16_t attr_id, - IN const uint32_t attr_mod, + IN const ib_net16_t attr_id, + IN const ib_net32_t attr_mod, IN const cl_disp_msgid_t err_msg, IN const osm_madw_context_t * const p_context); /* @@ -452,8 +452,8 @@ osm_req_set(IN osm_sm_t * sm, IN const osm_dr_path_t * const p_path, IN const uint8_t * const p_payload, IN const size_t payload_size, - IN const uint16_t attr_id, - IN const uint32_t attr_mod, + IN const ib_net16_t attr_id, + IN const ib_net32_t attr_mod, IN const cl_disp_msgid_t err_msg, IN const osm_madw_context_t * const p_context); /* diff --git a/opensm/opensm/osm_req.c b/opensm/opensm/osm_req.c index baeeed7..f79e3ab 100644 --- a/opensm/opensm/osm_req.c +++ b/opensm/opensm/osm_req.c @@ -62,8 +62,8 @@ **********************************************************************/ ib_api_status_t osm_req_get(IN osm_sm_t * sm, IN const osm_dr_path_t * const p_path, - IN const uint16_t attr_id, - IN const uint32_t attr_mod, + IN const ib_net16_t attr_id, + IN const ib_net32_t attr_mod, IN const cl_disp_msgid_t err_msg, IN const osm_madw_context_t * const p_context) { @@ -134,8 +134,8 @@ ib_api_status_t osm_req_set(IN osm_sm_t * sm, IN const osm_dr_path_t * const p_path, IN const uint8_t * const p_payload, IN const size_t payload_size, - IN const uint16_t attr_id, - IN const uint32_t attr_mod, + IN const ib_net16_t attr_id, + IN const ib_net32_t attr_mod, IN const cl_disp_msgid_t err_msg, IN const osm_madw_context_t * const p_context) 
{ -- 1.5.1.4 From arlin.r.davis at intel.com Thu Aug 20 11:04:56 2009 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 20 Aug 2009 11:04:56 -0700 Subject: [ofa-general] [PATCH] uDAPL v2 - dapltest patches for mdep processor yield Message-ID: <43692CDDC47B4BDDA59238385903E0D8@amr.corp.intel.com> Be thread scheduler friendly and release the current thread thus allowing other threads to run. Signed off by Stan Smith stan.smith at intel.com --- a/test/dapltest/mdep/linux/dapl_mdep_user.h Wed Aug 19 14:09:52 2009 +++ b/test/dapltest/mdep/linux/dapl_mdep_user.h Wed Aug 19 13:32:36 2009 @@ -200,4 +200,9 @@ #define DT_Mdep_flush() fflush(NULL) +/* + * Release processor to reschedule + */ +#define DT_Mdep_yield pthread_yield + #endif --- a/test/dapltest/mdep/solaris/dapl_mdep_user.h Thu Aug 20 08:49:11 2009 +++ b/test/dapltest/mdep/solaris/dapl_mdep_user.h Wed Aug 19 16:23:28 2009 @@ -74,6 +74,10 @@ #define DT_Mdep_printf printf #define DT_Mdep_flush() fflush(NULL) +/* + * Release processor to reschedule + */ +#define DT_Mdep_yield pthread_yield /* * Locks --- a/test/dapltest/mdep/windows/dapl_mdep_user.h Wed Aug 19 14:08:50 2009 +++ b/test/dapltest/mdep/windows/dapl_mdep_user.h Tue Aug 18 13:57:09 2009 @@ -80,6 +80,11 @@ #define DT_Mdep_flush() fflush(NULL) /* + * Release processor to reschedule + */ +#define DT_Mdep_yield() Sleep(0) + +/* * Locks */ --- a/test/dapltest/test/dapl_test_util.c Wed Aug 19 14:20:07 2009 +++ b/test/dapltest/test/dapl_test_util.c Wed Aug 19 14:20:00 2009 @@ -415,7 +415,7 @@ DAT_EVD_HANDLE evd_handle, DAT_DTO_COMPLETION_EVENT_DATA * dto_statusp) { - for (;;) { + for (;;DT_Mdep_yield()) { DAT_RETURN ret; DAT_EVENT event; From rdreier at cisco.com Thu Aug 20 11:24:41 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Aug 2009 11:24:41 -0700 Subject: [ofa-general] [PATCH] uDAPL v2 - dapltest patches for mdep processor yield In-Reply-To: <43692CDDC47B4BDDA59238385903E0D8@amr.corp.intel.com> (Arlin Davis's message of "Thu, 20 Aug 
2009 11:04:56 -0700") References: <43692CDDC47B4BDDA59238385903E0D8@amr.corp.intel.com> Message-ID: > +#define DT_Mdep_yield pthread_yield Be aware that on Linux I believe this turns into sched_yield(), which basically means "put me at the end of the thread list" ie wait for everyone else to get a turn ie possibly huge latency... From stan.smith at intel.com Thu Aug 20 11:43:13 2009 From: stan.smith at intel.com (Smith, Stan) Date: Thu, 20 Aug 2009 11:43:13 -0700 Subject: [ofw] Re: [ofa-general] [PATCH] uDAPL v2 - dapltest patches for mdep processor yield In-Reply-To: References: <43692CDDC47B4BDDA59238385903E0D8@amr.corp.intel.com> Message-ID: <3F6F638B8D880340AB536D29CD4C1E1912C553BDC2@orsmsx501.amr.corp.intel.com> Roland Dreier wrote: > > +#define DT_Mdep_yield pthread_yield > > Be aware that on Linux I believe this turns into sched_yield(), which > basically means "put me at the end of the thread list" ie wait for > everyone else to get a turn ie possibly huge latency... > _______________________________________________ > ofw mailing list > ofw at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw Is sleep(0) a preferred way to go? From rdreier at cisco.com Thu Aug 20 11:45:38 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Aug 2009 11:45:38 -0700 Subject: [ofw] Re: [ofa-general] [PATCH] uDAPL v2 - dapltest patches for mdep processor yield In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E1912C553BDC2@orsmsx501.amr.corp.intel.com> (Stan Smith's message of "Thu, 20 Aug 2009 11:43:13 -0700") References: <43692CDDC47B4BDDA59238385903E0D8@amr.corp.intel.com> <3F6F638B8D880340AB536D29CD4C1E1912C553BDC2@orsmsx501.amr.corp.intel.com> Message-ID: > Is sleep(0) a preferred way to go? I think the best solution is not coding spin-loops. Not sure what sleep(0) ends up turning into, but if you can tell the system "I'm waiting for this object, wake me up when it's available" then that should produce the best behavior. - R. 
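A minimal sketch of the "wake me up when it's available" approach suggested above, assuming POSIX threads. The struct completion type and the completion_init(), completion_wait() and completion_signal() helpers are hypothetical names for this example and are not part of dapltest or uDAPL; how such a wait would be wired into the DTO completion path is not shown.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical completion object; not a dapltest or uDAPL type. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool            done;
};

static void completion_init(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Block until completion_signal() has been called; no spinning, no yield. */
static void completion_wait(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)	/* the loop guards against spurious wakeups */
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Record that the event happened and wake a waiter, if there is one. */
static void completion_signal(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}
```

The waiting thread sleeps inside pthread_cond_wait() and consumes no CPU until the producer calls completion_signal(), so the scheduler never has to guess when to run it again.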
From rdreier at cisco.com Thu Aug 20 13:46:53 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Aug 2009 13:46:53 -0700 Subject: [ofa-general][PATCH] mlx4_core: Avoid double icms free In-Reply-To: <4A8805D6.10803@mellanox.co.il> (Yevgeny Petrilin's message of "Sun, 16 Aug 2009 16:12:54 +0300") References: <4A8805D6.10803@mellanox.co.il> Message-ID: thanks, applied. From worleys at gmail.com Thu Aug 20 14:17:01 2009 From: worleys at gmail.com (Chris Worley) Date: Thu, 20 Aug 2009 15:17:01 -0600 Subject: [ofa-general] iSER issues in RHEL 5.3 Message-ID: Configuration: RHEL 5.3, 2.6.18-128 kernel, OFED 1.4.1 With one LUN exported, running "fio" with 1MB blocks, I get sporadic errors on the initiator like: sd 4:0:0:1: timing out command, waited 360s sd 4:0:0:1: SCSI error: return code = 0x06000000 end_request: I/O error, dev sdb, sector 3921920 sd 4:0:0:1: timing out command, waited 360s sd 4:0:0:1: SCSI error: return code = 0x06000000 end_request: I/O error, dev sdb, sector 3962880 No problems reported on the target. With multiple LUNs exported, the tgtd hangs on the target as soon as I try to benchmark the LUNs (login was successful). iSCSI seems fine (but slow), iSER seems problematic. Any ideas? What are known-good configurations (distros/kernel/OFED) for iSER? Thanks, Chris From rdreier at cisco.com Thu Aug 20 14:33:41 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Aug 2009 14:33:41 -0700 Subject: [ofa-general] Better way to get sufficient EQ context memory? Message-ID: Eli, it occurs to me that since we're doing more than one page for EQ context now, we might as well use the normal ICM table stuff that everything else uses. Seems the code becomes much simpler and I don't think there's any real overhead added... thoughts?
(Christoph, I tested this with "possible_cpus=32" and it still works for
me -- if you get a chance on your Dell systems that would be helpful too)

commit 58cafda0c3010fc2cdb0fc9be3fbd6d09640dd6f
Author: Roland Dreier
Date:   Thu Aug 20 14:26:21 2009 -0700

    mlx4_core: Allocate and map sufficient ICM memory for EQ context

    The current implementation allocates a single host page for EQ context
    memory, which was OK when we only allocated a few EQs.  However, since
    we now allocate an EQ for each CPU core, this patch removes the
    hard-coded limit (which we exceed with 4 KB pages and 128 byte EQ
    context entries with 32 CPUs) and uses the same ICM table code as all
    other context tables.

    Signed-off-by: Roland Dreier
---
 drivers/net/mlx4/eq.c   |   42 ------------------------------------------
 drivers/net/mlx4/main.c |    9 ++++++---
 drivers/net/mlx4/mlx4.h |    7 +------
 3 files changed, 7 insertions(+), 51 deletions(-)

diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c
index c11a052..d7974a6 100644
--- a/drivers/net/mlx4/eq.c
+++ b/drivers/net/mlx4/eq.c
@@ -525,48 +525,6 @@ static void mlx4_unmap_clr_int(struct mlx4_dev *dev)
 	iounmap(priv->clr_base);
 }
 
-int mlx4_map_eq_icm(struct mlx4_dev *dev, u64 icm_virt)
-{
-	struct mlx4_priv *priv = mlx4_priv(dev);
-	int ret;
-
-	/*
-	 * We assume that mapping one page is enough for the whole EQ
-	 * context table.  This is fine with all current HCAs, because
-	 * we only use 32 EQs and each EQ uses 64 bytes of context
-	 * memory, or 1 KB total.
-	 */
-	priv->eq_table.icm_virt = icm_virt;
-	priv->eq_table.icm_page = alloc_page(GFP_HIGHUSER);
-	if (!priv->eq_table.icm_page)
-		return -ENOMEM;
-	priv->eq_table.icm_dma  = pci_map_page(dev->pdev, priv->eq_table.icm_page, 0,
-					       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
-	if (pci_dma_mapping_error(dev->pdev, priv->eq_table.icm_dma)) {
-		__free_page(priv->eq_table.icm_page);
-		return -ENOMEM;
-	}
-
-	ret = mlx4_MAP_ICM_page(dev, priv->eq_table.icm_dma, icm_virt);
-	if (ret) {
-		pci_unmap_page(dev->pdev, priv->eq_table.icm_dma, PAGE_SIZE,
-			       PCI_DMA_BIDIRECTIONAL);
-		__free_page(priv->eq_table.icm_page);
-	}
-
-	return ret;
-}
-
-void mlx4_unmap_eq_icm(struct mlx4_dev *dev)
-{
-	struct mlx4_priv *priv = mlx4_priv(dev);
-
-	mlx4_UNMAP_ICM(dev, priv->eq_table.icm_virt, 1);
-	pci_unmap_page(dev->pdev, priv->eq_table.icm_dma, PAGE_SIZE,
-		       PCI_DMA_BIDIRECTIONAL);
-	__free_page(priv->eq_table.icm_page);
-}
-
 int mlx4_alloc_eq_table(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 5c1afe0..528f89b 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -525,7 +525,10 @@ static int mlx4_init_icm(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap,
 		goto err_unmap_aux;
 	}
 
-	err = mlx4_map_eq_icm(dev, init_hca->eqc_base);
+	err = mlx4_init_icm_table(dev, &priv->eq_table.table,
+				  init_hca->eqc_base, dev_cap->eqc_entry_sz,
+				  dev->caps.num_eqs, dev->caps.num_eqs,
+				  0, 0);
 	if (err) {
 		mlx4_err(dev, "Failed to map EQ context memory, aborting.\n");
 		goto err_unmap_cmpt;
@@ -668,7 +671,7 @@ err_unmap_mtt:
 	mlx4_cleanup_icm_table(dev, &priv->mr_table.mtt_table);
 
 err_unmap_eq:
-	mlx4_unmap_eq_icm(dev);
+	mlx4_cleanup_icm_table(dev, &priv->eq_table.table);
 
 err_unmap_cmpt:
 	mlx4_cleanup_icm_table(dev, &priv->eq_table.cmpt_table);
@@ -698,11 +701,11 @@ static void mlx4_free_icms(struct mlx4_dev *dev)
 	mlx4_cleanup_icm_table(dev, &priv->qp_table.qp_table);
 	mlx4_cleanup_icm_table(dev, &priv->mr_table.dmpt_table);
 	mlx4_cleanup_icm_table(dev, &priv->mr_table.mtt_table);
+	mlx4_cleanup_icm_table(dev, &priv->eq_table.table);
 	mlx4_cleanup_icm_table(dev, &priv->eq_table.cmpt_table);
 	mlx4_cleanup_icm_table(dev, &priv->cq_table.cmpt_table);
 	mlx4_cleanup_icm_table(dev, &priv->srq_table.cmpt_table);
 	mlx4_cleanup_icm_table(dev, &priv->qp_table.cmpt_table);
-	mlx4_unmap_eq_icm(dev);
 
 	mlx4_UNMAP_ICM_AUX(dev);
 	mlx4_free_icm(dev, priv->fw.aux_icm, 0);
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 5bd79c2..bc72d6e 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -205,9 +205,7 @@ struct mlx4_eq_table {
 	void __iomem	      **uar_map;
 	u32			clr_mask;
 	struct mlx4_eq	       *eq;
-	u64			icm_virt;
-	struct page	       *icm_page;
-	dma_addr_t		icm_dma;
+	struct mlx4_icm_table	table;
 	struct mlx4_icm_table	cmpt_table;
 	int			have_irq;
 	u8			inta_pin;
@@ -373,9 +371,6 @@ u64 mlx4_make_profile(struct mlx4_dev *dev,
 		      struct mlx4_dev_cap *dev_cap,
 		      struct mlx4_init_hca_param *init_hca);
 
-int mlx4_map_eq_icm(struct mlx4_dev *dev, u64 icm_virt);
-void mlx4_unmap_eq_icm(struct mlx4_dev *dev);
-
 int mlx4_cmd_init(struct mlx4_dev *dev);
 void mlx4_cmd_cleanup(struct mlx4_dev *dev);
 void mlx4_cmd_event(struct mlx4_dev *dev, u16 token, u8 status, u64 out_param);

From rdreier at cisco.com  Thu Aug 20 15:00:14 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 20 Aug 2009 15:00:14 -0700
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing list to vger?
Message-ID:

Lately, I've had a few emails that I thought would have been of interest
to both lkml and also to general at lists.openfabrics.org.  I've held back
on cross-posting them because I know that general@ is subscribers-only,
and the bounce messages are quite annoying to replies coming from lkml.

The general@ list is subscribers-only because the openfabrics.org
sysadmin team is already overworked without trying to keep an open list
spam free.
(I say that with no intention to criticize the
openfabrics.org admins -- they do a terrific job of keeping things
running with the limited resources available; it's more a testament to
how impressive the vger mailing list admins are)

I've also noticed one or two messages about the possibility of moving
another moderated list to vger.  Certainly I prefer open lists that
don't require subscriptions to post.

So with that background, what would people think about creating an open
vger list (say, linux-rdma at vger.kernel.org) to carry the discussion
currently on general at lists.openfabrics.org?  (The transition plan would
probably be to keep the general@ list for a month or two, with frequent
announcements of the new list, until archives etc. have caught up with
the switch)

Thanks,
  Roland

From Jeffrey.C.Becker at nasa.gov  Thu Aug 20 15:02:01 2009
From: Jeffrey.C.Becker at nasa.gov (Jeff Becker)
Date: Thu, 20 Aug 2009 15:02:01 -0700
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing list to vger?
In-Reply-To:
References:
Message-ID: <4A8DC7D9.10005@nasa.gov>

Roland Dreier wrote:
> Lately, I've had a few emails that I thought would have been of interest
> to both lkml and also to general at lists.openfabrics.org. I've held back
> on cross-posting them because I know that general@ is subscribers-only,
> and the bounce messages are quite annoying to replies coming from lkml.
>
> The general@ list is subscribers-only because the openfabrics.org
> sysadmin team is already overworked without trying to keep an open list
> spam free. (I say that with no intention to criticize the
> openfabrics.org admins -- they do a terrific job of keeping things
> running with the limited resources available; it's more a testament to
> how impressive the vger mailing list admins are)
>
> I've also noticed one or two messages about the possibility of moving
> another moderated list to vger. Certainly I prefer open lists that
> don't require subscriptions to post.
>
> So with that background, what would people think about creating an open
> vger list (say, linux-rdma at vger.kernel.org) to carry the discussion
> currently on general at lists.openfabrics.org? (The transition plan would
> probably be to keep the general@ list for a month or two, with frequent
> announcements of the new list, until archives etc. have caught up with
> the switch)
>
+1

-jeff
> Thanks,
> Roland
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From jsquyres at cisco.com  Thu Aug 20 15:07:24 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 20 Aug 2009 18:07:24 -0400
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailinglist to vger?
In-Reply-To: <4A8DC7D9.10005@nasa.gov>
References: <4A8DC7D9.10005@nasa.gov>
Message-ID: <463C72F9-8011-4F6B-BE5B-619AA958B171@cisco.com>

+1

On Aug 20, 2009, at 6:02 PM, Jeff Becker wrote:

> Roland Dreier wrote:
> > Lately, I've had a few emails that I thought would have been of interest
> > to both lkml and also to general at lists.openfabrics.org. I've held back
> > on cross-posting them because I know that general@ is subscribers-only,
> > and the bounce messages are quite annoying to replies coming from lkml.
> >
> > The general@ list is subscribers-only because the openfabrics.org
> > sysadmin team is already overworked without trying to keep an open list
> > spam free. (I say that with no intention to criticize the
> > openfabrics.org admins -- they do a terrific job of keeping things
> > running with the limited resources available; it's more a testament to
> > how impressive the vger mailing list admins are)
> >
> > I've also noticed one or two messages about the possibility of moving
> > another moderated list to vger. Certainly I prefer open lists that
> > don't require subscriptions to post.
> >
> > So with that background, what would people think about creating an open
> > vger list (say, linux-rdma at vger.kernel.org) to carry the discussion
> > currently on general at lists.openfabrics.org? (The transition plan would
> > probably be to keep the general@ list for a month or two, with frequent
> > announcements of the new list, until archives etc. have caught up with
> > the switch)
> >
>
> +1
>
> -jeff
>
> > Thanks,
> > Roland
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

-- 
Jeff Squyres
jsquyres at cisco.com

From robert.j.woodruff at intel.com  Thu Aug 20 15:41:39 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 20 Aug 2009 15:41:39 -0700
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing list to vger?
In-Reply-To:
References:
Message-ID: <382A478CAD40FA4FB46605CF81FE39F43A52A9BF@orsmsx507.amr.corp.intel.com>

Roland wrote,

>Lately, I've had a few emails that I thought would have been of interest
>to both lkml and also to general at lists.openfabrics.org. I've held back
>on cross-posting them because I know that general@ is subscribers-only,
>and the bounce messages are quite annoying to replies coming from lkml.

>The general@ list is subscribers-only because the openfabrics.org
>sysadmin team is already overworked without trying to keep an open list
>spam free. (I say that with no intention to criticize the
>openfabrics.org admins -- they do a terrific job of keeping things
>running with the limited resources available; it's more a testament to
>how impressive the vger mailing list admins are)

>I've also noticed one or two messages about the possibility of moving
>another moderated list to vger. Certainly I prefer open lists that
>don't require subscriptions to post.

>So with that background, what would people think about creating an open
>vger list (say, linux-rdma at vger.kernel.org) to carry the discussion
>currently on general at lists.openfabrics.org? (The transition plan would
>probably be to keep the general@ list for a month or two, with frequent
>announcements of the new list, until archives etc. have caught up with
>the switch)

The one question I would have would be do the kernel.org people want to
see all of the traffic that we currently have on the open fabrics
general list for all of the user-space components?
I do not think it would be good if we had to have one list on vger for
kernel work and another one for all the user-space work.

Other than that, I do not really care where the general development
list is hosted.

my 2 cents,

woody

From davem at davemloft.net  Thu Aug 20 16:08:00 2009
From: davem at davemloft.net (David Miller)
Date: Thu, 20 Aug 2009 16:08:00 -0700 (PDT)
Subject: [ofa-general] Re: Opinions on moving Linux InfiniBand/RDMA mailing list to vger?
In-Reply-To:
References:
Message-ID: <20090820.160800.50693597.davem@davemloft.net>

From: Roland Dreier
Date: Thu, 20 Aug 2009 15:00:14 -0700

> linux-rdma at vger.kernel.org

It's there, ready and waiting, should you choose to use it :-)

From jgunthorpe at obsidianresearch.com  Thu Aug 20 17:04:31 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 20 Aug 2009 18:04:31 -0600
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
Message-ID: <20090821000431.GA5713@obsidianresearch.com>

Check that the format of the multicast link address is correct before
taking it from dev->mc_list to priv->multicast_list. This way we never
try to send a bogus address to the SA, which prevents badness from
erroneous 'ip maddr addr add', broken bonding drivers, or whatever.

Signed-off-by: Jason Gunthorpe
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)

Same problem Moni was working on, but let's just address it directly.
There is work to try and fix the bonding driver but no fixed version
is in mainline yet. This is a cheap and simple workaround that is
worth having even once the driver is fixed.

Despite this, I think it is still necessary to do something like Moni
was trying - to prevent the MCG join queue from head of line blocking
on a single bad SA response. This can happen even if everything is
correct.
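
For readers who want to try the check outside the kernel, here is a minimal
user-space sketch of the same comparison. It is only an illustration under
stated assumptions, not the driver code: the names mcast_addr_ok and
IPOIB_HW_ADDR_LEN are made up for this example, and the byte ranges simply
mirror the three tests described in this patch (length must be 20, bytes 0-5
and bytes 7-9 must equal the device broadcast address, byte 6 is not
compared).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constant for this sketch: the length the patch insists on. */
#define IPOIB_HW_ADDR_LEN 20

/*
 * Returns 1 if addr has the same fixed fields as the device broadcast
 * address, 0 otherwise.  Byte 6 is deliberately left out of the
 * comparison, exactly as in the patch.
 */
static int mcast_addr_ok(const uint8_t *addr, size_t addrlen,
                         const uint8_t *broadcast)
{
    if (addrlen != IPOIB_HW_ADDR_LEN)
        return 0;
    /* bytes 0..5 must match the broadcast address */
    if (memcmp(addr, broadcast, 6) != 0)
        return 0;
    /* bytes 7..9 must match the broadcast address */
    if (memcmp(addr + 7, broadcast + 7, 3) != 0)
        return 0;
    return 1;
}
```

With a broadcast address of 00:ff:ff:ff:ff:12:40:1b:ff:ff:..., an address
that differs only in byte 6 (0x60 instead of 0x40) and in the trailing group
bytes still passes, while a 6-byte Ethernet-style address or one with a
different value in bytes 8-9 is skipped.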
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 425e311..973a24b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -758,6 +758,20 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 	}
 }
 
+static int check_mcast(const u8 *addr,unsigned int addrlen,
+		       const u8 *broadcast)
+{
+	if (addrlen != 20)
+		return 0;
+	/* QPN, scope, reserved, signature upper */
+	if (memcmp(addr,broadcast,6) != 0)
+		return 0;
+	/* signature lower, pkey */
+	if (memcmp(addr + 7,broadcast+7,3) != 0)
+		return 0;
+	return 1;
+}
+
 void ipoib_mcast_restart_task(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
@@ -791,6 +805,10 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	for (mclist = dev->mc_list; mclist; mclist = mclist->next) {
 		union ib_gid mgid;
 
+		if (!check_mcast(mclist->dmi_addr,mclist->dmi_addrlen,
+				 dev->broadcast))
+			continue;
+
 		memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid);
 
 		mcast = __ipoib_mcast_find(dev, &mgid);
-- 
1.5.4.2

From jenos at ncsa.uiuc.edu  Fri Aug 21 00:10:12 2009
From: jenos at ncsa.uiuc.edu (Jeremy Enos)
Date: Fri, 21 Aug 2009 02:10:12 -0500
Subject: [ofa-general] Fedora 10 OFED support plans
Message-ID: <4A8E4854.2060909@ncsa.uiuc.edu>

Coming up on a year of Fedora 10 GA... Fedora 9 no longer maintained.
No OFED support for FC10 yet creates a tough spot if trying to stay
secure. Is there *any* version (1.5, etc) that will even build on FC10?
thx-

Jeremy

From rdreier at cisco.com  Fri Aug 21 02:10:04 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 21 Aug 2009 02:10:04 -0700
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing list to vger?
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F43A52A9BF@orsmsx507.amr.corp.intel.com> (Robert J.
Woodruff's message of "Thu, 20 Aug 2009 15:41:39 -0700") References: <382A478CAD40FA4FB46605CF81FE39F43A52A9BF@orsmsx507.amr.corp.intel.com> Message-ID: > The one question I would have would be do the kernel.org people want to > see all of the traffic that we currently have on the open fabrics > general list for all of the user-space components? I don't believe there's any problem with that. vger already hosts quite a few lists that are userspace only (eg git) or span user and kernel (eg alsa and kvm). And in any case the total traffic (# of subscribers, # of messages) that the current general@ list generates is pretty minimal compared to what vger already handles, so I think there's no problem with having a linux-rdma at vger.kernel.org carry everything that general at lists.openfabrics.org does today. - R. From vlad at lists.openfabrics.org Fri Aug 21 03:03:36 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 21 Aug 2009 03:03:36 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090821-0200 daily build status Message-ID: <20090821100336.A7718E282A2@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 
Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast 
/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090821-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory 
`/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From jsquyres at cisco.com  Fri Aug 21 07:44:47 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 21 Aug 2009 10:44:47 -0400
Subject: [ofa-general] Update on Roland's ummunotify kernel module
Message-ID: <912100AC-5AEE-4793-9A83-F62424CB027A@cisco.com>

Roland has pushed his new Linux "ummunotify" kernel module upstream
(i.e., it's in his -next git branch):

http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=2fadea9acc19674c07ae7a9d90758f4b9b793940

It's not yet guaranteed that it will be accepted, but it looks good so
far. With some bug fixes from Pasha/Mellanox and Lenny+Mike/Voltaire,
I think it's ready for wide-spread testing (I mailed some OMPI
community members yesterday asking for specific testing).

I'm asking all MPI implementors to give the prototype code a whirl to
shake out any remaining design bugs. Others are welcome to review the
design concepts and code as well; the more eyes, the better. Bug
fixes are easy later; design flaws are [much] better to be fixed now.

I describe the issue that we're fixing in my new MPI-themed blog:

http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking

The HG where this OMPI work is being done is here:

http://bitbucket.org/jsquyres/ummunot/

You need to have a very recent Linux kernel (2.6.31+) and Roland's
ummunotify module installed/running. Build the OMPI HG tree with
"--enable-mca-no-build=memory-ptmalloc2" to disable ptmalloc2 and
enable the ummunotify stuff.

This hack-ish "disable ptmalloc2" step is only necessary while we're
shaking out the design issues. I'm halfway through merging the
ummunot+ptmalloc2 code into a new opal/mca/memory component named
"linux". This component will choose at run time whether to use
ptmalloc2 or the ummunotify stuff (i.e., the --enable-mca-no-build...
step won't be necessary when all is said and done; a default OMPI
Linux build will do the Right Things).

Thanks.

-- 
Jeff Squyres
jsquyres at cisco.com

From arlin.r.davis at intel.com  Fri Aug 21 12:56:10 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Fri, 21 Aug 2009 12:56:10 -0700
Subject: [ofa-general] [ANNOUNCE] uDAPL v2.0 - dapl-2.0.22 release
Message-ID:

New release for uDAPL 2.0 available on the OFA download page and in my
git tree.

New UCM provider uses its own CM protocol on top of IB-UD queue pairs.
During device open, this provider creates a UD queue pair and returns
local address information via dat_ia_query. This 24 byte opaque address
must be exchanged out-of-band before connecting to a server via
dat_ep_connect. This provider is targeted for MPI implementations that
already exchange address information during boot/init phase and offers
better scaling than existing scm and cma providers.

md5sum: 9a8be3e780a6105fb4d9c85dacf556af dapl-2.0.22.tar.gz

Summary of changes for last 2 releases:

2.0.22
v2 - ucm: new provider using DAPL based IB-UD cm mechanism for MPI
v2 - dapltest: add processor yield when polling for completions

2.0.21
v2 - scm: Fix disconnect. QP's need to move to ERROR state in
v2 - dtest: modify dtest.c to cleanup CNO wait code and consolidate into
v2 - common: CNO events, once triggered will not be returned during the cno wait.
v2 - scm, cma: CNO support broken in both CMA and SCM providers.
v2 - common osd: include winsock2.h for IPv6 definitions.
v2 - common osd: include w2tcpip.h for sockaddr_in6 definitions.
v2 - DAPL introduced the concept of directly waiting on the CQ for
v2 - dapltest: Implement a malloc() threshold for the completion reaping.
v2 - scm: handle connected state when freeing CM objects
v2 - scm, dtest: changes for winof gettimeofday and FD_SETSIZE settings.
v2 - scm: set TCP_NODELAY sockopt on the server side for sends.
v2 - windows: remove obsolete files in dapl/udapl source tree
v2 - dtestcm: add UD type QP option to test
v2 - scm: destroy QP called before disconnect
v2 - cma: add support for rdma_cm TIME_WAIT event.
v2 - scm: remove old udapl_scm code replaced by openib_scm.
v2 - winof: fix build issues after consolidating cma, scm code base.
v2 - cma: lock held when exiting as a result of a rdma_create_event_channel failure
v2 - windows: all dlist functions have been moved to the header file.
v2 - dtestcm windows: add build infrastructure for new dtestcm test suite
v2 - openib_common: reorganize provider code base to share common mem, cq, qp, dto
v2 - scm: fixes and optimizations for connection scaling
v2 - scm: double the default fd_set_size
v2 - scm: EP reference in CR should be cleared during ep_destroy
v2 - dtestx: fix conn establishment event checking
v2 - dtestcm: new test to measure dapl connection rates.

Vlad, please pull new v2 package into OFED 1.5 beta build and install
the following:

dapl-2.0.22-1
dapl-utils-2.0.22-1
dapl-devel-2.0.22-1
dapl-debuginfo-2.0.22-1
compat-dapl-1.2.14-1
compat-dapl-devel-1.2.14-1

See http://www.openfabrics.org/downloads/dapl/ for more details.

-arlin

From rdreier at cisco.com  Fri Aug 21 15:02:48 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 21 Aug 2009 15:02:48 -0700
Subject: [ofa-general] Re: Opinions on moving Linux InfiniBand/RDMA mailing list to vger?
In-Reply-To: <20090820.160800.50693597.davem@davemloft.net> (David Miller's message of "Thu, 20 Aug 2009 16:08:00 -0700 (PDT)")
References: <20090820.160800.50693597.davem@davemloft.net>
Message-ID:

 > > linux-rdma at vger.kernel.org

 > It's there, ready and waiting, should you choose to use it :-)

Thanks!

 - R.

From vlad at lists.openfabrics.org Sat Aug 22 03:01:21 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 22 Aug 2009 03:01:21 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090822-0200 daily build status Message-ID: <20090822100121.61CC6E28204@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': 
/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build 
failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090822-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From ogerlitz at voltaire.com Sat Aug 22 23:02:29 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 23 Aug 2009 09:02:29 +0300 Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing list to vger? In-Reply-To: References: Message-ID: <4A90DB75.9070303@voltaire.com> Roland Dreier wrote: > what would people think about creating an open vger list (say, linux-rdma at vger.kernel.org) to carry the discussion currently on general at lists.openfabrics.org? yes, lets do that Or. 
From ogerlitz at voltaire.com Sat Aug 22 23:04:52 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 23 Aug 2009 09:04:52 +0300 Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR query to QoS level with pkey In-Reply-To: <4A8D4A6F.9050404@dev.mellanox.co.il> References: <4A8D4A6F.9050404@dev.mellanox.co.il> Message-ID: <4A90DC04.3020906@voltaire.com> Yevgeny Kliteynik wrote: > Fixing a bug in matching PR query to QoS levels when pkey specified - pkeys in QoS > policy are held w/o the MSB. > Hi Yevgeny, so what's the impact of this bug in the field? does it create false positives or false negatives? Or. From bart.vanassche at gmail.com Sun Aug 23 01:10:17 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Sun, 23 Aug 2009 10:10:17 +0200 Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock (cancel_delayed_work inside spinlock) In-Reply-To: References: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com> Message-ID: On Sat, Aug 15, 2009 at 12:15 AM, Roland Dreier wrote: > How about this approach?  Basically it just open-codes delayed work by > splitting the timer and the work struct, and switches to mod_timer() > instead of del_timer() + add_timer().  It passes very light testing here > (basically I started ipoib and nothing blew up). [ ... ] Hello Roland, I'm now using the SRP initiator from a kernel compiled from the http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git repository, and the lockdep complaints also occur on this system. The system even deadlocks during boot about one out of two times. Do you already know when you will have the time to commit the locking-inversion fixes to the infiniband.git repository ? Thanks, Bart. 
From tziporet at dev.mellanox.co.il Sun Aug 23 01:16:24 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 23 Aug 2009 11:16:24 +0300
Subject: [ofa-general] Fedora 10 OFED support plans
In-Reply-To: <4A8E4854.2060909@ncsa.uiuc.edu>
References: <4A8E4854.2060909@ncsa.uiuc.edu>
Message-ID: <4A90FAD8.6000701@mellanox.co.il>

Jeremy Enos wrote:
> Coming up on a year of Fedora 10 GA... Fedora 9 no longer maintained.
> No OFED support for FC10 yet creates a tough spot if trying to stay
> secure. Is there *any* version (1.5, etc) that will even build on FC10?
> thx-
>
> Jeremy
>
>
>

I think OFED 1.5 might work on it, but I'm not sure. Which kernel version does FC10 use?
In general, OFED 1.5 supports FC11.

Tziporet

From tziporet at dev.mellanox.co.il Sun Aug 23 01:21:14 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 23 Aug 2009 11:21:14 +0300
Subject: [ofa-general] Opinions on moving Linux InfiniBand/RDMA mailing list to vger?
In-Reply-To:
References:
Message-ID: <4A90FBFA.3010504@mellanox.co.il>

Roland Dreier wrote:
> Lately, I've had a few emails that I thought would have been of interest
> to both lkml and also to general at lists.openfabrics.org. I've held back
> on cross-posting them because I know that general@ is subscribers-only,
> and the bounce messages are quite annoying to replies coming from lkml.
>
> The general@ list is subscribers-only because the openfabrics.org
> sysadmin team is already overworked without trying to keep an open list
> spam free. (I say that with no intention to criticize the
> openfabrics.org admins -- they do a terrific job of keeping things
> running with the limited resources available; it's more a testament to
> how impressive the vger mailing list admins are)
>
> I've also noticed one or two messages about the possibility of moving
> another moderated list to vger. Certainly I prefer open lists that
> don't require subscriptions to post.
>
> So with that background, what would people think about creating an open
> vger list (say, linux-rdma at vger.kernel.org) to carry the discussion
> currently on general at lists.openfabrics.org? (The transition plan would
> probably be to keep the general@ list for a month or two, with frequent
> announcements of the new list, until archives etc. have caught up with
> the switch)
>
>
>

Very good initiative.

Tziporet

From kliteyn at dev.mellanox.co.il Sun Aug 23 02:04:09 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 23 Aug 2009 12:04:09 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR query to QoS level with pkey
In-Reply-To: <4A90DC04.3020906@voltaire.com>
References: <4A8D4A6F.9050404@dev.mellanox.co.il> <4A90DC04.3020906@voltaire.com>
Message-ID: <4A910609.3040305@dev.mellanox.co.il>

Or Gerlitz wrote:
> Yevgeny Kliteynik wrote:
>> Fixing a bug in matching PR query to QoS levels when pkey specified -
>> pkeys in QoS
>> policy are held w/o the MSB.
>>
> Hi Yevgeny, so what's the impact of this bug in the field? does it
> create false positives or false negatives?

False negatives. PR queries with PKeys (e.g. IPoIB interfaces) weren't matched to their rules.

-- Yevgeny

> Or.
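The false-negative case described in the reply above can be sketched in a few lines. This is only an illustration, not the OpenSM code: the helper names (pkey_match_exact, pkey_match_masked) are hypothetical, and the values are assumed to be 16-bit P_Keys in host byte order. In an InfiniBand P_Key the top bit is the membership bit, so if the policy holds keys without it, an exact comparison misses any query that carries it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 15 of a P_Key is the membership bit; the low 15 bits are the key. */
#define PKEY_BASE_MASK 0x7fffu

/* Hypothetical "buggy" match: compares all 16 bits, so a query P_Key
 * with the membership bit set never equals a policy P_Key stored
 * without it (a false negative). */
static bool pkey_match_exact(uint16_t query_pkey, uint16_t policy_pkey)
{
	return query_pkey == policy_pkey;
}

/* Hypothetical "fixed" match: ignore the membership bit on both sides
 * before comparing. */
static bool pkey_match_masked(uint16_t query_pkey, uint16_t policy_pkey)
{
	return (query_pkey & PKEY_BASE_MASK) == (policy_pkey & PKEY_BASE_MASK);
}
```

With a policy entry of 0x7fff, a query carrying 0xffff fails the exact comparison but passes the masked one, which is the kind of miss the patch description points at.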
> > From vlad at lists.openfabrics.org Sun Aug 23 03:00:42 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 23 Aug 2009 03:00:42 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090823-0200 daily build status Message-ID: <20090823100042.8E634E2816D@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': 
/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build 
failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: error: implicit declaration of function 'srp_attach_transport' /home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2343: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.c:2358: error: implicit declaration of function 'srp_release_transport' make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp/ib_srp.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/srp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090823-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From sashak at voltaire.com Sun Aug 23 04:08:05 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 23 Aug 2009 14:08:05 +0300 Subject: [ofa-general] Re: [PATCH] opensm: fixing some data types in osm_req_get/set In-Reply-To: <4A8D4A84.3050605@dev.mellanox.co.il> References: <4A8D4A84.3050605@dev.mellanox.co.il> Message-ID: <20090823110804.GC9547@me> On 16:07 Thu 20 Aug , Yevgeny Kliteynik wrote: > Hi Sasha, > > Attribute ID and attribute modifier are used in osm_req_get/set > in network order - fixing data types. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. 
Sasha

From sashak at voltaire.com Sun Aug 23 04:10:30 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Aug 2009 14:10:30 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_qos_policy.c: matching PR query to QoS level with pkey
In-Reply-To: <4A8D4A6F.9050404@dev.mellanox.co.il>
References: <4A8D4A6F.9050404@dev.mellanox.co.il>
Message-ID: <20090823111030.GD9547@me>

On 16:06 Thu 20 Aug, Yevgeny Kliteynik wrote:
>
> Hi Sasha,
>
> Fixing a bug in matching PR query to QoS
> levels when pkey specified - pkeys in QoS
> policy are held w/o the MSB.
>
> Signed-off-by: Yevgeny Kliteynik

Applied. Thanks.

Sasha

From sashak at voltaire.com Sun Aug 23 04:52:58 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Aug 2009 14:52:58 +0300
Subject: [ofa-general] Re: [PATCH 3/5 V2] libibnetdisc: make all fields of ibnd_fabric_t public
In-Reply-To: <20090817140338.edd83fe0.weiny2@llnl.gov>
References: <20090813204251.df6446c1.weiny2@llnl.gov> <20090816114127.GW25501@me> <20090817140338.edd83fe0.weiny2@llnl.gov>
Message-ID: <20090823115258.GF9547@me>

On 14:03 Mon 17 Aug, Ira Weiny wrote:
>
> You are right, good catch. I just copied it blindly with HTSZ which must be
> there.
>
> git am is not working now on the last two patches [4/5 and 5/5] so I am
> sending new versions of them so that they apply cleanly.
>
> V2 below,
> Ira
>
>
> From: Ira Weiny
> Date: Thu, 13 Aug 2009 20:08:51 -0700
> Subject: [PATCH] libibnetdisc: make all fields of ibnd_fabric_t public
>
> In addition clean up the name of the chassis struct
>
> Signed-off-by: Ira Weiny

Applied. Thanks.

Sasha

From sashak at voltaire.com Sun Aug 23 05:06:09 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Aug 2009 15:06:09 +0300
Subject: [ofa-general] Re: [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a context object.
In-Reply-To: <20090817083023.da17378b.weiny2@llnl.gov>
References: <20090813204306.dffc3237.weiny2@llnl.gov> <20090816110200.GS25501@me> <20090817083023.da17378b.weiny2@llnl.gov>
Message-ID: <20090823120609.GG9547@me>

Hi Ira,

On 08:30 Mon 17 Aug, Ira Weiny wrote:
>
> The immediate benefit is coming with the multi-threaded implementation where
> I plan on adding the following function.[*]

Ok, but could we first discuss how the multithreading architecture will be implemented with libibnetdisc: goals (in particular, is it support for multithreaded apps or just a multithreaded discovery function), interaction with the caller application, etc.?

One desired feature I can think of would be to keep the API simple for single-threaded use.

Sasha

From sashak at voltaire.com Sun Aug 23 08:01:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Aug 2009 18:01:27 +0300
Subject: [ofa-general] Re: [PATCH v2] opensm/complib: account for nsec overflow in timeout values
In-Reply-To: <20090813090602.226b2695.weiny2@llnl.gov>
References: <20090806183716.c08bbea3.weiny2@llnl.gov> <20090813113620.GV25501@me> <20090813090602.226b2695.weiny2@llnl.gov>
Message-ID: <20090823150127.GI9547@me>

On 09:06 Thu 13 Aug, Ira Weiny wrote:
> > > @@ -148,9 +148,11 @@ cl_event_wait_on(IN cl_event_t * const p_event,
> > > } else {
> > > /* Get the current time */
> > > if (gettimeofday(&curtime, NULL) == 0) {
> > > - timeout.tv_sec = curtime.tv_sec + (wait_us / 1000000);
> > > - timeout.tv_nsec =
> > > - (curtime.tv_usec + (wait_us % 1000000)) * 1000;
> > > + uint32_t n_sec = (curtime.tv_usec + (wait_us % 1000000))
> >
> > Do you really need fixed size (uint32_t) variable here?
>
> Well I need at least int32_t. I chose unsigned because we are not trying to go back in time. I don't like leaving this as "int". As rare as it might be, a compiler could choose 16 bits for an int and that is not big enough, right?

Right, but tv_nsec field of struct timespec has 'long' type, not 'int'.
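The overflow under discussion can be shown in isolation. The following is a standalone sketch, not the complib code: deadline_from is a hypothetical helper, wait_us is assumed to be a 32-bit microsecond count, and the carry into tv_sec is an assumption about what the fix has to do, since the quoted hunk is cut off after the n_sec line. It uses unsigned long for the intermediate value, in line with the type discussion above.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

#define ONE_SEC_IN_USEC 1000000UL
#define ONE_SEC_IN_NSEC 1000000000UL

/* Hypothetical helper: absolute deadline `wait_us` microseconds after `now`,
 * with tv_nsec kept below one second. */
static struct timespec deadline_from(const struct timeval *now, uint32_t wait_us)
{
	struct timespec ts;
	unsigned long n_sec;

	assert(now->tv_usec >= 0 && (unsigned long)now->tv_usec < ONE_SEC_IN_USEC);

	/* Sub-second parts of both operands, in nanoseconds. The largest
	 * possible value is (999999 + 999999) * 1000 = 1999998000, which
	 * still fits in a 32-bit unsigned long. */
	n_sec = ((unsigned long)now->tv_usec + wait_us % ONE_SEC_IN_USEC) * 1000UL;

	/* Carry whole seconds out of n_sec instead of leaving tv_nsec >= 1e9. */
	ts.tv_sec = now->tv_sec + (time_t)(wait_us / ONE_SEC_IN_USEC)
		    + (time_t)(n_sec / ONE_SEC_IN_NSEC);
	ts.tv_nsec = (long)(n_sec % ONE_SEC_IN_NSEC);
	return ts;
}
```

For example, with a current time of 10 s + 900000 us and a wait of 1200000 us, the pre-patch arithmetic quoted above would give tv_nsec = 1100000000, which is outside the valid range; the sketch yields 12 s and 100000000 ns instead.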
Actually, my question was more about using a *fixed* size type. I think we should avoid fixed-size types where they are not really needed (they are needed for things such as protocol structures, etc.).

So I'm changing this uint32_t to unsigned long, which should be fine.

> From: Ira Weiny
> Date: Thu, 6 Aug 2009 18:31:46 -0700
> Subject: [PATCH] opensm/complib: account for nsec overflow in timeout values
>
>
> Signed-off-by: Ira Weiny

Applied. Thanks.

Sasha

From worleys at gmail.com Sun Aug 23 08:15:13 2009
From: worleys at gmail.com (Chris Worley)
Date: Sun, 23 Aug 2009 09:15:13 -0600
Subject: [ofa-general] WinOF 2.1 RC3 issues
Message-ID:

1) SRP driver says "cannot start".
2) (trivial) x86_64 installs into the x86 program files directory.
3) Uninstall hangs interminably.

From rdreier at cisco.com Sun Aug 23 08:21:46 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 23 Aug 2009 08:21:46 -0700
Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock (cancel_delayed_work inside spinlock)
In-Reply-To: (Bart Van Assche's message of "Sun, 23 Aug 2009 10:10:17 +0200")
References: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>
Message-ID:

> I'm now using the SRP initiator from a kernel compiled from the
> http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git
> repository, and the lockdep complaints also occur on this system. The
> system even deadlocks during boot about one out of two times. Do you
> already know when you will have the time to commit the
> locking-inversion fixes to the infiniband.git repository ?

Everything I know of is already in my tree now. And I just checked
http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=shortlog;h=for-next
and I see both "IB/mad: Fix possible lock-lock-timer deadlock" and "IPoIB: Drop priv->lock before calling ipoib_send()" there. Those are all the lockdep-related things I know of.

I have a hard time imagining that either of those issues could cause a deadlock on half of boots either.
Are you sure the deadlock you see is related to one of those fixes?

 - R.

From bugzilla-daemon at bugzilla.kernel.org Sun Aug 23 09:42:06 2009
From: bugzilla-daemon at bugzilla.kernel.org (bugzilla-daemon at bugzilla.kernel.org)
Date: Sun, 23 Aug 2009 16:42:06 GMT
Subject: [ofa-general] [Bug 14042] New: mlx4: device driver tries to sync DMA memory it has not allocated
Message-ID:

http://bugzilla.kernel.org/show_bug.cgi?id=14042

Summary: mlx4: device driver tries to sync DMA memory it has not allocated
Product: Drivers
Version: 2.5
Kernel Version: 2.6.30.4
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Infiniband/RDMA
AssignedTo: drivers_infiniband-rdma at kernel-bugs.osdl.org
ReportedBy: bart.vanassche at gmail.com
Regression: No

The following message was generated while booting a system with 2.6.30.4 kernel compiled with CONFIG_DMA_API_DEBUG=y and before any out-of-tree kernel modules were loaded:

------------[ cut here ]------------
WARNING: at lib/dma-debug.c:635 check_sync+0x47c/0x4b0()
Hardware name: P5Q DELUXE
mlx4_core 0000:01:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x0000000139482000] [size=4096 bytes]
Modules linked in: snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd rtc_cmos soundcore i2c_i801 rtc_core hid_belkin mlx4_core(+) rtc_lib sr_mod sg snd_page_alloc pcspkr button intel_agp i2c_core joydev serio_raw cdrom usbhid hid raid456 raid6_pq async_xor async_memcpy async_tx xor raid0 sd_mod crc_t10dif ehci_hcd uhci_hcd usbcore edd raid1 ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix pata_marvell ahci libata scsi_mod thermal processor thermal_sys hwmon
Pid: 1325, comm: work_for_cpu Not tainted 2.6.30.4-scst-debug #6
Call Trace:
 [] ? check_sync+0x47c/0x4b0
 [] warn_slowpath_common+0x78/0xd0
 [] warn_slowpath_fmt+0x3c/0x40
 [] ? _spin_lock_irqsave+0x49/0x60
 [] ? check_sync+0xab/0x4b0
 [] check_sync+0x47c/0x4b0
 [] ? mark_held_locks+0x6c/0x90
 [] debug_dma_sync_single_for_cpu+0x1d/0x20
 [] mlx4_write_mtt+0x159/0x1e0 [mlx4_core]
 [] mlx4_create_eq+0x222/0x650 [mlx4_core]
 [] ? trace_hardirqs_on+0xd/0x10
 [] mlx4_init_eq_table+0x1c5/0x4a0 [mlx4_core]
 [] mlx4_setup_hca+0x98/0x550 [mlx4_core]
 [] ? __mlx4_init_one+0x8d1/0x920 [mlx4_core]
 [] __mlx4_init_one+0x371/0x920 [mlx4_core]
 [] mlx4_init_one+0x22/0x44 [mlx4_core]
 [] ? do_work_for_cpu+0x0/0x30
 [] local_pci_probe+0x12/0x20
 [] do_work_for_cpu+0x13/0x30
 [] kthread+0x56/0x90
 [] child_rip+0xa/0x20
 [] ? restore_args+0x0/0x30
 [] ? kthread+0x0/0x90
 [] ? child_rip+0x0/0x20
---[ end trace 4480af29bc755c6a ]---

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

From bart.vanassche at gmail.com Sun Aug 23 11:53:44 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Sun, 23 Aug 2009 20:53:44 +0200
Subject: [ofa-general] Re: [PATCH/RFC] IB/mad: Fix possible deadlock (cancel_delayed_work inside spinlock)
In-Reply-To:
References: <2604ADDDE9F4467BA962BBA8B60F25AA@amr.corp.intel.com>
Message-ID:

On Sun, Aug 23, 2009 at 5:21 PM, Roland Dreier wrote:
>
>  > I'm now using the SRP initiator from a kernel compiled from the
>  > http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git
>  > repository, and the lockdep complaints also occur on this system. The
>  > system even deadlocks during boot about one out of two times. Do you
>  > already know when you will have the time to commit the
>  > locking-inversion fixes to the infiniband.git repository ?
>
> Everything I know of is already in my tree now. And I just checked
> http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=shortlog;h=for-next
> and I see both "IB/mad: Fix possible lock-lock-timer deadlock" and
> "IPoIB: Drop priv->lock before calling ipoib_send()" there.
> Those are all the lockdep-related things I know of.
>
> I have a hard time imagining that either of those issues could cause a
> deadlock on half of boots either. Are you sure the deadlock you see is
> related to one of those fixes?

After having switched from the master branch to the for-next branch, I now also see the patches mentioned above. And apparently what I observed during boot was not a deadlock but some other strange phenomenon. See also http://bugzilla.kernel.org/show_bug.cgi?id=14043 for the details.

Bart.

From eli at dev.mellanox.co.il Mon Aug 24 01:22:18 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Mon, 24 Aug 2009 11:22:18 +0300
Subject: [ofa-general] Better way to get sufficient EQ context memory?y
In-Reply-To:
References:
Message-ID: <20090824082218.GA16493@mtls03>

On Thu, Aug 20, 2009 at 02:33:41PM -0700, Roland Dreier wrote:
> Eli, it occurs to me that since we're doing more than one page for EQ
> context now, we might as well use the normal ICM table stuff that
> everything else uses. Seems the code becomes much simpler and I don't
> think there's any real overhead added... thoughts?

Yes, it looks cleaner, and it works well on my 4-core system. Let's wait for Christoph's approval on his system.

>
> (Christoph, I tested this with "possible_cpus=32" and it still works for
> me -- if you get a chance on your Dell systems that would be helpful too)
>

From monis at Voltaire.COM Mon Aug 24 01:23:01 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Mon, 24 Aug 2009 11:23:01 +0300
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
In-Reply-To: <20090821000431.GA5713@obsidianresearch.com>
References: <20090821000431.GA5713@obsidianresearch.com>
Message-ID: <4A924DE5.30505@Voltaire.COM>

Jason Gunthorpe wrote:
> Check that the format of the multicast link address is correct before
> taking it from dev->mc_list to priv->multicast_list. This way we never
> try to send a bogus address to the SA, and prevents badness from
> erroneous 'ip maddr addr add', broken bonding drivers, or whatever.
>
> Signed-off-by: Jason Gunthorpe
> ---
> drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 18 ++++++++++++++++++
> 1 files changed, 18 insertions(+), 0 deletions(-)
>
> Same problem Moni was working on, but lets just address it directly.
>
> There is work to try and fix the bonding driver but no fixed version
> is in mainline yet. This is a cheap and simple work around that is
> worth having even once the driver is fixed.
>

The fix has been available since at least 2.6.31-rc2. However, I still need to check your claim that it doesn't work for you.

> Despite this, I think it is still necessary to do something like Moni
> was trying - to prevent the MCG join queue from head of line blocking
> on a single bad SA response. This can happen even if everything is
> correct.
>

I'll resend my patch, since it has been a long time since I last sent it.

From monis at Voltaire.COM Mon Aug 24 01:36:44 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Mon, 24 Aug 2009 11:36:44 +0300
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
In-Reply-To: <20090821000431.GA5713@obsidianresearch.com>
References: <20090821000431.GA5713@obsidianresearch.com>
Message-ID: <4A92511C.90402@Voltaire.COM>

Jason Gunthorpe wrote:
> Check that the format of the multicast link address is correct before
> taking it from dev->mc_list to priv->multicast_list. This way we never
> try to send a bogus address to the SA, and prevents badness from
> erroneous 'ip maddr addr add', broken bonding drivers, or whatever.
>
> Signed-off-by: Jason Gunthorpe
> ---
> drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 18 ++++++++++++++++++
> 1 files changed, 18 insertions(+), 0 deletions(-)
>
> Same problem Moni was working on, but lets just address it directly.
>
> There is work to try and fix the bonding driver but no fixed version
> is in mainline yet. This is a cheap and simple work around that is
> worth having even once the driver is fixed.
>
> Despite this, I think it is still necessary to do something like Moni
> was trying - to prevent the MCG join queue from head of line blocking
> on a single bad SA response. This can happen even if everything is
> correct.
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 425e311..973a24b 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -758,6 +758,20 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
>  	}
>  }
>
> +static int check_mcast(const u8 *addr,unsigned int addrlen,
> +		       const u8 *broadcast)
> +{
> +	if (addrlen != 20)
> +		return 0;
> +	/* QPN, scope, reserved, signature upper */
> +	if (memcmp(addr,broadcast,6) != 0)
> +		return 0;
> +	/* signature lower, pkey */
> +	if (memcmp(addr + 7,broadcast+7,3) != 0)
> +		return 0;
> +	return 1;
> +}
> +
>  void ipoib_mcast_restart_task(struct work_struct *work)
>  {
>  	struct ipoib_dev_priv *priv =
> @@ -791,6 +805,10 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>  	for (mclist = dev->mc_list; mclist; mclist = mclist->next) {
>  		union ib_gid mgid;
>
> +		if (!check_mcast(mclist->dmi_addr,mclist->dmi_addrlen,
> +				 dev->broadcast))
> +			continue;
> +
>  		memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid);
>
>  		mcast = __ipoib_mcast_find(dev, &mgid);

Why not check validity with something that looks like the reverse operation of the kernel function that maps ip -> link mcast addresses?
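The check in the quoted patch can be tried outside the kernel. The sketch below is a userspace rendition under stated assumptions, not the kernel code: uint8_t stands in for u8, map_ipv4_mcast is a hypothetical stand-in for the kernel's IPv4 mapping helper that takes the address in host byte order, and the remark about why byte 6 is skipped is an inference (the IPv4 and IPv6 signatures differ in that byte), not something the patch states.

```c
#include <stdint.h>
#include <string.h>

#define IPOIB_HW_ADDR_LEN 20	/* hypothetical name for the 20-byte link address */

/* Userspace copy of the check from the patch: accept only 20-byte
 * addresses that agree with the broadcast address in bytes 0-5 and 7-9. */
static int check_mcast(const uint8_t *addr, unsigned int addrlen,
		       const uint8_t *broadcast)
{
	if (addrlen != IPOIB_HW_ADDR_LEN)
		return 0;
	/* bytes 0-5: reserved byte, QPN, 0xff prefix, flags/scope */
	if (memcmp(addr, broadcast, 6) != 0)
		return 0;
	/* byte 6 is not compared (assumed: it differs between IPv4 and IPv6) */
	/* bytes 7-9: low signature byte and P_Key */
	if (memcmp(addr + 7, broadcast + 7, 3) != 0)
		return 0;
	return 1;
}

/* Hypothetical helper: build the link-layer multicast address for an
 * IPv4 group given in host byte order, taking scope and P_Key from the
 * broadcast address. */
static void map_ipv4_mcast(uint32_t addr, const uint8_t *broadcast, uint8_t *buf)
{
	memset(buf, 0, IPOIB_HW_ADDR_LEN);
	buf[1] = buf[2] = buf[3] = 0xff;	/* multicast QPN */
	buf[4] = 0xff;
	buf[5] = 0x10 | (broadcast[5] & 0x0f);	/* scope from broadcast */
	buf[6] = 0x40;				/* IPv4 signature */
	buf[7] = 0x1b;
	buf[8] = broadcast[8];			/* P_Key */
	buf[9] = broadcast[9];
	buf[19] = addr & 0xff;
	buf[18] = (addr >> 8) & 0xff;
	buf[17] = (addr >> 16) & 0xff;
	buf[16] = (addr >> 24) & 0x0f;		/* only 28 group bits are kept */
}
```

An address produced by the mapping helper for the same broadcast address passes the check, while a 6-byte Ethernet-style address or one with a different P_Key is rejected, which is the filtering effect the patch is after.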
for example this is the function for IPv4

static inline void ip_ib_mc_map(__be32 naddr, const unsigned char *broadcast, char *buf)
{
	__u32 addr;
	unsigned char scope = broadcast[5] & 0xF;

	buf[0] = 0;		/* Reserved */
	buf[1] = 0xff;		/* Multicast QPN */
	buf[2] = 0xff;
	buf[3] = 0xff;
	addr = ntohl(naddr);
	buf[4] = 0xff;
	buf[5] = 0x10 | scope;	/* scope from broadcast address */
	buf[6] = 0x40;		/* IPv4 signature */
	buf[7] = 0x1b;
	buf[8] = broadcast[8];	/* P_Key */
	buf[9] = broadcast[9];
	buf[10] = 0;
	buf[11] = 0;
	buf[12] = 0;
	buf[13] = 0;
	buf[14] = 0;
	buf[15] = 0;
	buf[19] = addr & 0xff;
	addr >>= 8;
	buf[18] = addr & 0xff;
	addr >>= 8;
	buf[17] = addr & 0xff;
	addr >>= 8;
	buf[16] = addr & 0x0f;
}

From vlad at lists.openfabrics.org Mon Aug 24 02:44:41 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 24 Aug 2009 02:44:41 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090824-0200 daily build status
Message-ID: <20090824094441.95F30E61E7C@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters:

Passed:
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26

Failed:
Build failed on i686 with linux-2.6.19
Build failed on i686 with linux-2.6.18
Build failed on i686 with linux-2.6.21.1
Build failed on i686 with linux-2.6.22
Build failed on i686 with linux-2.6.24
Build failed on x86_64 with linux-2.6.16.60-0.21-smp
Log:
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb':
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb'
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18' make: *** [kernel] Error 2 
---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.18-128.el5 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-128.el5_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18-128.el5' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.18-93.el5 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18-93.el5_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18-93.el5' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.21.1 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': 
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.21.1' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.20 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.20_x86_64_check] Error 2 make[1]: 
Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.20' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.22 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.22' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: 
*** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:1313: warning: pointer targets in passing argument 2 of 'memcpy_toiovec' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_bz_setup': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:1430: warning: pointer targets in assignment differ in signedness make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:1313: warning: pointer targets in passing 
argument 2 of 'memcpy_toiovec' differ in signedness /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c: In function 'sdp_bz_setup': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.c:1430: warning: pointer targets in assignment differ in signedness make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.21.1 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.21.1_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.21.1' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.22 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.22_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.22' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': 
/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.24_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.23 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.23_ia64_check] Error 2 make[1]: Leaving directory 
`/home/vlad/kernel.org/ia64/linux-2.6.23' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.18_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h: In function 'sdp_alloc_skb': /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h:564: error: implicit declaration of function 'sdp_stream_alloc_skb' /home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband/ulp/sdp/sdp.h:564: warning: assignment makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband/ulp/sdp/sdp_main.o] Error 1 make[3]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband/ulp/sdp] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090824-0200_linux-2.6.19_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From monis at Voltaire.COM Mon Aug 24 04:20:41 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 24 Aug 2009 14:20:41 +0300 Subject: [ofa-general] [PATCH] IPoIB: check multicast address format In-Reply-To: <4A92511C.90402@Voltaire.COM> References: <20090821000431.GA5713@obsidianresearch.com> <4A92511C.90402@Voltaire.COM> Message-ID: <4A927789.2060300@Voltaire.COM> > > Why not check validity with something that looks like the reverse operation > of the kernel function that maps ip -> link mcast addresses? Sorry. Please ignore this comment. Jason's patch does exactly that. From eli at dev.mellanox.co.il Mon Aug 24 05:13:07 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 24 Aug 2009 15:13:07 +0300 Subject: [ofa-general] [PATCHv5 0/10] RDMAoE support In-Reply-To: <20090819171935.GA14411@mtls03> References: <20090819171935.GA14411@mtls03> Message-ID: <20090824121307.GA3919@mtls03> Roland, what about this series of patches? Would you like me to re-create them over your xrc branch or would you rather take them before xrc? On Wed, Aug 19, 2009 at 08:19:35PM +0300, Eli Cohen wrote: > RDMA over Ethernet (RDMAoE) allows running the IB transport protocol using > Ethernet frames, enabling the deployment of IB semantics on lossless Ethernet > fabrics. RDMAoE packets are standard Ethernet frames with an IEEE assigned > Ethertype, a GRH, unmodified IB transport headers and payload. 
IB subnet > management and SA services are not required for RDMAoE operation; Ethernet > management practices are used instead. RDMAoE encodes IP addresses into its > GIDs and resolves MAC addresses using the host IP stack. For multicast GIDs, > standard IP to MAC mappings apply. > > To support RDMAoE, a new transport protocol was added to the IB core. An RDMA > device can have ports with different transports, which are identified by a port > transport attribute. The RDMA Verbs API is syntactically unmodified. When > referring to RDMAoE ports, Address handles are required to contain GIDs while > LID fields are ignored. The Ethernet L2 information is subsequently obtained by > the vendor-specific driver (both in kernel- and user-space) while modifying QPs > to RTR and creating address handles. As there is no SA in RDMAoE, the CMA code > is modified to fill the necessary path record attributes locally before sending > CM packets. Similarly, the CMA provides to the user the required address handle > attributes when processing SIDR requests and joining multicast groups. > > In this patch set, an RDMAoE port is currently assigned a single GID, encoding > the IPv6 link-local address of the corresponding netdev; the CMA RDMAoE code > temporarily uses IPv6 link-local addresses as GIDs instead of the IP address > provided by the user, thereby supporting any IP address. > > To enable RDMAoE with the mlx4 driver stack, both the mlx4_en and mlx4_ib > drivers must be loaded, and the netdevice for the corresponding RDMAoE port > must be running. Individual ports of a multi port HCA can be independently > configured as Ethernet (with support for RDMAoE) or IB, as is already the case. > We have successfully tested MPI, SDP, RDS, and native Verbs applications over > RDMAoE. > > Following is a series of 10 patches based on version 2.6.30 of the Linux > kernel. This new series reflects changes based on feedback from the community > on the previous set of patches, and is tagged v5. 
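As a rough illustration of the GID encoding described above, the sketch below builds a 16-byte GID that holds an IPv6 link-local address derived from a netdev's MAC address. It assumes the link-local address has the usual modified EUI-64 form (fe80::/64 prefix, universal/local bit flipped, ff:fe inserted in the middle); the helper name `mac_to_link_local_gid()` is hypothetical and is not taken from the patch set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the RDMAoE patches): fill a 16-byte GID
 * with the IPv6 link-local address derived from a 6-byte Ethernet MAC,
 * assuming the standard modified EUI-64 interface identifier.
 */
static void mac_to_link_local_gid(const uint8_t *mac, uint8_t *gid)
{
	assert(mac != NULL && gid != NULL);

	memset(gid, 0, 16);
	gid[0] = 0xfe;			/* fe80::/64 link-local prefix */
	gid[1] = 0x80;

	gid[8]  = mac[0] ^ 0x02;	/* flip the universal/local bit */
	gid[9]  = mac[1];
	gid[10] = mac[2];
	gid[11] = 0xff;			/* ff:fe filler between the MAC halves */
	gid[12] = 0xfe;
	gid[13] = mac[3];
	gid[14] = mac[4];
	gid[15] = mac[5];
}
```

Under this assumption, a port whose netdev has MAC 00:02:c9:01:02:03 would get the GID fe80::202:c9ff:fe01:203.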
>
> Changes from v4:
> 1. Added rdma_is_transport_supported() and used it to simplify conditionals throughout the code.
> 2. ib_register_mad_agent() for QP0 is only called for IB ports.
> 3. PATCH 5/10 changed from "Enable support for RDMAoE ports" to "Enable support only for IB ports".
> 4. MAD services from userspace are currently not supported for RDMAoE ports.
> 5. Add kref to struct cma_multicast to aid in maintaining reference count on the object. This is to avoid freeing the object while the worker thread is still using it.
> 6. Return immediate error for invalid MTU when resolving an RDMAoE path.
> 7. Don't fail resolve path if rate is 0 since this value stands for IB_RATE_PORT_CURRENT.
> 8. In cma_rdmaoe_join_multicast(), fail immediately if mtu is zero.
> 9. Add ucma_copy_rdmaoe_route() instead of modifying ucma_copy_ib_route().
> 10. Bug fix: in PATCH 10/10, call flush_workqueue after unregistering netdev notifiers.
> 11. Multicast no longer uses the broadcast MAC.
> 12. No changes to patches 2, 7 and 8 from the v4 series.
> > Signed-off-by: Eli Cohen > --- > > b/drivers/infiniband/core/agent.c | 38 ++- > b/drivers/infiniband/core/cm.c | 25 +- > b/drivers/infiniband/core/cma.c | 54 ++-- > b/drivers/infiniband/core/mad.c | 41 ++- > b/drivers/infiniband/core/multicast.c | 4 > b/drivers/infiniband/core/sa_query.c | 39 ++- > b/drivers/infiniband/core/ucm.c | 8 > b/drivers/infiniband/core/ucma.c | 2 > b/drivers/infiniband/core/ud_header.c | 111 ++++++++++ > b/drivers/infiniband/core/user_mad.c | 6 > b/drivers/infiniband/core/uverbs.h | 1 > b/drivers/infiniband/core/uverbs_cmd.c | 32 ++ > b/drivers/infiniband/core/uverbs_main.c | 1 > b/drivers/infiniband/core/verbs.c | 25 ++ > b/drivers/infiniband/hw/mlx4/ah.c | 187 +++++++++++++--- > b/drivers/infiniband/hw/mlx4/mad.c | 32 +- > b/drivers/infiniband/hw/mlx4/main.c | 309 +++++++++++++++++++++++++--- > b/drivers/infiniband/hw/mlx4/mlx4_ib.h | 19 + > b/drivers/infiniband/hw/mlx4/qp.c | 172 ++++++++++----- > b/drivers/infiniband/ulp/ipoib/ipoib_main.c | 12 - > b/drivers/net/mlx4/en_main.c | 15 + > b/drivers/net/mlx4/en_port.c | 4 > b/drivers/net/mlx4/en_port.h | 3 > b/drivers/net/mlx4/fw.c | 3 > b/drivers/net/mlx4/intf.c | 20 + > b/drivers/net/mlx4/main.c | 6 > b/drivers/net/mlx4/mlx4.h | 1 > b/include/linux/mlx4/cmd.h | 1 > b/include/linux/mlx4/device.h | 31 ++ > b/include/linux/mlx4/driver.h | 16 + > b/include/linux/mlx4/qp.h | 8 > b/include/rdma/ib_addr.h | 92 ++++++++ > b/include/rdma/ib_pack.h | 26 ++ > b/include/rdma/ib_user_verbs.h | 21 + > b/include/rdma/ib_verbs.h | 11 > b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3 > b/net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 > drivers/infiniband/core/cm.c | 5 > drivers/infiniband/core/cma.c | 207 ++++++++++++++++++ > drivers/infiniband/core/mad.c | 37 ++- > drivers/infiniband/core/ucm.c | 12 - > drivers/infiniband/core/ucma.c | 31 ++ > drivers/infiniband/core/user_mad.c | 15 - > drivers/infiniband/core/verbs.c | 10 > include/rdma/ib_verbs.h | 15 + > 45 files changed, 1440 insertions(+), 273 
deletions(-) > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From monis at Voltaire.COM Mon Aug 24 06:51:03 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 24 Aug 2009 16:51:03 +0300 Subject: [ofa-general] [PATCHv2 RESEND] IB/IPoIB: Don't let a bad muticast address in the join list stop subsequent joins Message-ID: <4A929AC7.4060402@Voltaire.COM> Hi Roland http://lists.openfabrics.org/pipermail/general/2009-July/060496.html The discussion in the link above didn't end with a decision. You were asking about a way to inject illegal mcast addresses from userspace to ib_ipoib and Jason pointed out such a way (described below). Could you please review the patch? thanks MoniS ------------------- An illegal multicast address can be handed to IPoIB from userspace. For example the command ip maddr add 33:33:00:00:00:01 dev ib0 injects an illegal multicast address to IPoIB that will start a join task for this address. However, whenever an illegal multicast address is passed to IPoIB it blocks join attempts for all subsequent addresses. That happens because IPoIB joins multicast addresses in the order they arrived and doesn't handle the next address until the join for the current address finishes successfully. This patch moves the multicast address to the end of the list after a join attempt. Even if the join fails the next attempt will be with a different address.
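The move-to-tail behaviour described above can be sketched in plain userspace C. This is only an illustration, not the ipoib code: the list helpers merely mimic the kernel's `list_head` style, and `struct mcast`, its `attached` flag and `next_join_candidate()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the spirit of the kernel's list_head. */
struct node { struct node *prev, *next; };

struct mcast {
	struct node list;	/* first member, so a node pointer can be cast back */
	int attached;		/* non-zero once the join has succeeded */
	int id;
};

static void list_init(struct node *h) { h->prev = h->next = h; }

static void list_del_(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void list_add_tail_(struct node *n, struct node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_move_tail_(struct node *n, struct node *h)
{
	list_del_(n);
	list_add_tail_(n, h);
}

/*
 * Pick the first entry that is not joined yet and rotate it to the tail,
 * so that a join which keeps failing cannot starve the entries behind it.
 * Returns NULL when every entry is attached.
 */
static struct mcast *next_join_candidate(struct node *head)
{
	struct node *n;

	assert(head != NULL);
	for (n = head->next; n != head; n = n->next) {
		struct mcast *m = (struct mcast *)n;

		if (!m->attached) {
			list_move_tail_(n, head);
			return m;
		}
	}
	return NULL;
}
```

With entries 1, 2 and 3 queued and entry 1 never able to join, successive calls return 1, 2, 3 and then 1 again, so entries 2 and 3 are attempted even though entry 1 keeps failing.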
Signed-off-by: Moni Shoua
--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a0e9753..3c3c63d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -379,6 +379,7 @@ static int ipoib_mcast_join_complete(int status,
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_mcast *next_mcast;
 
 	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
 			mcast->mcmember.mgid.raw, status);
@@ -427,9 +428,17 @@ static int ipoib_mcast_join_complete(int status,
 
 	mutex_lock(&mcast_mutex);
 	spin_lock_irq(&priv->lock);
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
-				   mcast->backoff * HZ);
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
+		list_for_each_entry(next_mcast, &priv->multicast_list, list) {
+			if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &next_mcast->flags)
+			    && !test_bit(IPOIB_MCAST_FLAG_BUSY, &next_mcast->flags)
+			    && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &next_mcast->flags))
+				break;
+		}
+		if (&next_mcast->list != &priv->multicast_list)
+			queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
+					   next_mcast->backoff * HZ);
+	}
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
@@ -570,13 +579,16 @@ void ipoib_mcast_join_task(struct work_struct *work)
 				break;
 			}
 		}
-		spin_unlock_irq(&priv->lock);
 
 		if (&mcast->list == &priv->multicast_list) {
 			/* All done */
+			spin_unlock_irq(&priv->lock);
 			break;
 		}
 
+		list_move_tail(&mcast->list, &priv->multicast_list);
+		spin_unlock_irq(&priv->lock);
+
 		ipoib_mcast_join(dev, mcast, 1);
 		return;
 	}
_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jenos at ncsa.uiuc.edu Mon Aug 24 07:16:38 2009 From: jenos at ncsa.uiuc.edu (Jeremy Enos) Date: Mon, 24 Aug 2009 09:16:38 -0500 Subject: [ofa-general] Fedora 10 OFED support plans In-Reply-To: <4A90FAD8.6000701@mellanox.co.il> References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il> Message-ID: <4A92A0C6.9030501@ncsa.uiuc.edu> 2.6.27.29-170.2.79 is the current fc10 x64 kernel. I had tried the latest tarball for 1.5- perhaps that's too late? I can try something older but would be great to have a starting point. Thx- Jeremy Tziporet Koren wrote: > Jeremy Enos wrote: >> Coming up on a year of Fedora 10 GA... Fedora 9 no longer >> maintained. No OFED support for FC10 yet creates a tough spot if >> trying to stay >> secure. Is there *any* version (1.5, etc) that will even build on >> FC10? thx- >> >> Jeremy >> >> >> > > I think OFED 1.5 might work on it but not sure. Which kernel version > FC10 use? > In general OFED 1.5 supports FC11 > > Tziporet > > From jgunthorpe at obsidianresearch.com Mon Aug 24 09:15:09 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 24 Aug 2009 10:15:09 -0600 Subject: [ofa-general] Re: [PATCHv2 RESEND] IB/IPoIB: Don't let a bad muticast address in the join list stop subsequent joins In-Reply-To: <4A929AC7.4060402@Voltaire.COM> References: <4A929AC7.4060402@Voltaire.COM> Message-ID: <20090824161509.GC4973@obsidianresearch.com> On Mon, Aug 24, 2009 at 04:51:03PM +0300, Moni Shoua wrote: > http://lists.openfabrics.org/pipermail/general/2009-July/060496.html > The discussion in the link above didn't end with a decision. You were asking > about a way to inject illegal mcast addresses from userspace to ib_ipoib and > Jason pointed about such (described below). Could you please review the patch? 
FWIW, upon looking at this more closely, I would rather see this patch of yours fix the timeout problem. This actually has nothing to do with illegal addresses but with any situation where the SA returns failure (e.g. MLID exhaustion, etc.). There is already a per-event increasing backoff; it just needs a little fussing to keep track of time properly and sort the list by expiration. Jason From jgl at johngroves.net Mon Aug 24 09:37:51 2009 From: jgl at johngroves.net (John Groves) Date: Mon, 24 Aug 2009 11:37:51 -0500 Subject: [ofa-general] OFED Source Code Cross Reference Server Announcement Message-ID: I'm pleased to announce that System Fabric Works is hosting a code cross reference server for the OFED distributions at http://SystemFabricWorks.com/ofed-xr.html. We've used the LXR indexing engine, which will already be familiar to most Linux kernel developers. The code can be browsed and searched, and symbols appear as hyperlinks that retrieve all references to the symbols. We already have many of the recent OFED distributions indexed. Feel free to send questions, suggestions or problem reports directly to me. Regards, John Groves System Fabric Works John at SystemFabricWorks.com From yosefe at voltaire.com Mon Aug 24 09:48:51 2009 From: yosefe at voltaire.com (Yossi Etigin) Date: Mon, 24 Aug 2009 19:48:51 +0300 Subject: [ofa-general] Re: [PATCHv2 RESEND] IB/IPoIB: Don't let a bad muticast address in the join list stop subsequent joins In-Reply-To: <20090824161509.GC4973@obsidianresearch.com> References: <4A929AC7.4060402@Voltaire.COM> <20090824161509.GC4973@obsidianresearch.com> Message-ID: <4A92C473.4090208@voltaire.com> On 24/08/09 19:15, Jason Gunthorpe wrote: > On Mon, Aug 24, 2009 at 04:51:03PM +0300, Moni Shoua wrote: > >> http://lists.openfabrics.org/pipermail/general/2009-July/060496.html >> The discussion in the link above didn't end with a decision.
You were asking >> about a way to inject illegal mcast addresses from userspace to ib_ipoib and >> Jason pointed about such (described below). Could you please review the patch? > > FWIW, upon looking at this more closely, I would rather see this patch > of yours fix the timeout problem. This actually has nothing to do with > illegal addreses but with any situation where the SA returns failure > (ie MLID exhaustion, etc) > > There is already a per-event increasing back off, it just needs a > little fussing to keep track of time properly and sort the list by > expiration. > > Jason Are you suggesting to sort the list each time we have add/remove a new entry, or search for the correct location to insert the new entry? I'm afraid that would add too much complexity and be inefficient (in O() terms). Moreover, I believe that moving a failed mcast entry to the end of the list behaves the same as always joining the least-backoff-value mcast entry (since everybody start with the same backoff). BTW Moni - Do send-only joins need the same solution too? --Yossi From John at SystemFabricWorks.com Mon Aug 24 09:30:19 2009 From: John at SystemFabricWorks.com (John Groves) Date: Mon, 24 Aug 2009 11:30:19 -0500 Subject: [ofa-general] OFED Source Code Cross Reference Server Announcement Message-ID: I'm pleased to announce that System Fabric Works is hosting a code cross reference server for the OFED distributions at http://SystemFabricWorks.com/ofed-xr.html. We've used the LXR indexing engine, which will already be familiar to most Linux kernel developers. The code can be browsed and searched, and symbols appear as hyper links that retrieve all references to the symbols. We already have many of the recent OFED distributions indexed. Feel free to send questions, suggestions or problem reports directly to me. Regards, John Groves System Fabric Works John at SystemFabricWorks.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jgunthorpe at obsidianresearch.com Mon Aug 24 10:18:58 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 24 Aug 2009 11:18:58 -0600 Subject: [ofa-general] Re: [PATCHv2 RESEND] IB/IPoIB: Don't let a bad muticast address in the join list stop subsequent joins In-Reply-To: <4A92C473.4090208@voltaire.com> References: <4A929AC7.4060402@Voltaire.COM> <20090824161509.GC4973@obsidianresearch.com> <4A92C473.4090208@voltaire.com> Message-ID: <20090824171858.GJ406@obsidianresearch.com> On Mon, Aug 24, 2009 at 07:48:51PM +0300, Yossi Etigin wrote: > Are you suggesting to sort the list each time we have add/remove a new entry, > or search for the correct location to insert the new entry? I'm afraid that > would add too much complexity and be inefficient (in O() terms). 1) This is an unlikely failure path 2) It is only O(n) to insert a entry into the proper place in an already sorted linked list. 3) I think you can do it about four lines of code. list_for_each_entry_reverse(i,priv->multicast_list,list) { if (i->xx < mcast->xx || priv->multicast_list == i) { list_move(mcast->list,i->list); break; } } Which is actually O(1) in the most common cases. > Moreover, I believe that moving a failed mcast entry to the end of > the list behaves the same as always joining the least-backoff-value > mcast entry (since everybody start with the same backoff). Nope.. New entries can be added at any time which unsorts things. 
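To make the sorted-insert idea above concrete, here is a minimal userspace sketch of inserting an entry into a list kept ordered by retry time, scanning from the tail. It is an illustration under stated assumptions rather than a proposed ipoib change: `struct entry`, the `retry_at` field and `queue_insert_sorted()` are hypothetical names, and plain integer comparison stands in for the wraparound-safe jiffies helpers a kernel version would need.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list; the head is a sentinel entry. */
struct entry {
	struct entry *prev, *next;
	unsigned long retry_at;	/* hypothetical: time of the next join attempt */
};

static void queue_init(struct entry *head)
{
	head->prev = head->next = head;
}

/*
 * Insert e so that the list stays sorted by ascending retry_at.
 * The scan starts at the tail, so an entry that expires last (the
 * common case after a fresh backoff) is placed in O(1).
 */
static void queue_insert_sorted(struct entry *head, struct entry *e)
{
	struct entry *pos;

	assert(head != NULL && e != NULL);

	pos = head->prev;
	while (pos != head && pos->retry_at > e->retry_at)
		pos = pos->prev;

	/* Link e right after pos; pos == head means e becomes the first entry. */
	e->prev = pos;
	e->next = pos->next;
	pos->next->prev = e;
	pos->next = e;
}
```

The worst case is a full O(n) walk, which matches the estimate given in the message, while entries with the largest expiration stop at the first comparison.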
Jason From yosefe at voltaire.com Mon Aug 24 10:23:36 2009 From: yosefe at voltaire.com (Yossi Etigin) Date: Mon, 24 Aug 2009 20:23:36 +0300 Subject: [ofa-general] Re: [PATCHv2 RESEND] IB/IPoIB: Don't let a bad muticast address in the join list stop subsequent joins In-Reply-To: <20090824171858.GJ406@obsidianresearch.com> References: <4A929AC7.4060402@Voltaire.COM> <20090824161509.GC4973@obsidianresearch.com> <4A92C473.4090208@voltaire.com> <20090824171858.GJ406@obsidianresearch.com> Message-ID: <4A92CC98.4090508@voltaire.com> On 24/08/09 20:18, Jason Gunthorpe wrote: > 1) This is an unlikely failure path 2) It is only O(n) to insert a > entry into the proper place in an already sorted linked list. 3) I > think you can do it about four lines of code. > > list_for_each_entry_reverse(i,priv->multicast_list,list) { > if (i->xx < mcast->xx || priv->multicast_list == i) { > list_move(mcast->list,i->list); > break; > } > } > So you suggest putting these 4 lines instead of list_move_tail() ? From cl at linux-foundation.org Mon Aug 24 10:23:33 2009 From: cl at linux-foundation.org (Christoph Lameter) Date: Mon, 24 Aug 2009 13:23:33 -0400 (EDT) Subject: [ofa-general] Re: Better way to get sufficient EQ context memory? In-Reply-To: References: Message-ID: On Thu, 20 Aug 2009, Roland Dreier wrote: > (Christoph, I tested this with "possible_cpus=32" and it still works for > me -- if you get a chance on your Dell systems that would be helpful too) Works here. Tested-by: Christoph Lameter From Rafael.Tinoco at Sun.COM Mon Aug 24 11:46:04 2009 From: Rafael.Tinoco at Sun.COM (Rafael David Tinoco) Date: Mon, 24 Aug 2009 15:46:04 -0300 Subject: [ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH topology. Message-ID: <4A92DFEC.3010300@Sun.COM> Hello, I'm installing an HPC cluster using 2 Sun Blades 6048 with QNEMs (2 asics each, 8 qnems). They are configured in a MESH topology. I'm using Centos 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5. 
I'm booting PXE from IB, my initrd image is bringing up the ib0 interface,
getting the squashfs image and mounting with aufs.

The problem is: when booting more than 60 nodes, I start to get the errors
below on the subnet manager. The problem seems to be intermittent, because
each time it gives errors on a different path.

Any ideas?

Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713838 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1
Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713842 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1
Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713847 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1
Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713866 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1
Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713871 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1
Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
Aug 24 15:36:19 714805 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 19. Adding to light sweep sampling list
Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
			Path = 0,1,15,15,15
Aug 24 15:36:19 714822 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 21. Adding to light sweep sampling list
Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
			Path = 0,1,15,15,15
Aug 24 15:36:19 714831 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 25. Adding to light sweep sampling list
Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
			Path = 0,1,15,15,15
Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) -- dropping
Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0
Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path:
			Initial path = 0,0,0,0,0,0
			Return path  = 0,0,0,0,0,0
Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x5
				trans_id................0x36595
				attr_id.................0x15 (PortInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................65535
				dr_dlid.................65535

				Initial path: 0,1,15,15,15,19
				Return path:
0,0,0,0,0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) -- dropping Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path: Initial path = 0,0,0,0,0,0 Return path = 0,0,0,0,0,0 Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x5 trans_id................0x36596 attr_id.................0x15 (PortInfo) resv....................0x0 .... -------------- next part -------------- An HTML attachment was scrubbed... URL: From stan.smith at intel.com Mon Aug 24 18:01:51 2009 From: stan.smith at intel.com (Smith, Stan) Date: Mon, 24 Aug 2009 18:01:51 -0700 Subject: [ofa-general] WinOF 2.1 RC 4 available for download. Message-ID: <3F6F638B8D880340AB536D29CD4C1E1912C58BC901@orsmsx501.amr.corp.intel.com> WinOF 2.1 Release Candidate #4 (RC4) is available @ http://www.openfabrics.org/downloads/WinOF/v2.1-RC4/ just in case. Please address comments and concerns to the ofw at lists.openfabrics.org Changes since RC3 ----------------- All RC4 WinOF installers are signed with a new OpenFabrics Alliance Digital-ID from Verisign. Only unattended installs (HPC compute nodes generally) are affected. 
Console installs will be prompted for an answer to the question 'Do you trust drivers from the OpenFabrics Alliance 3rd party SW Publisher?' Check the 'always trust the OpenFabrics Alliance SW Publisher' box and answer OK to install drivers. Server 2008 HPC installs only: The implication for HPC compute node installation is the head-node script '%ProgramFiles(x86)\WinOF\HPC\cert-add.bat' must be run to install the 'new' OFA cert in the compute nodes cert store prior to performing an unattended WinOF install; otherwise the unattended install attempts to prompt for an answer to 'Do you trust the OpenFabrics Alliance 3rd party SW Publisher?' and fails as it's unattended (no one home to answer). See Release_notes.htm (Server 2008 install) and HPC\ReadMe-HPC.txt for details. SVN Commits 2381 - [WinOF] RC4 staging; again. Modified : /gen1/branches/WOF2-1/WinOF/buildrelease.bat Modified : /gen1/branches/WOF2-1/WinOF/Wix/ReadMe_release.txt Modified : /gen1/branches/WOF2-1/WinOF/Wix/Release_notes.htm 2380 - [LIBRDMACM] Fix a potential race with ucma_init() and calls that check whether the library is ready for use. Modified : /gen1/branches/WOF2-1/ulp/librdmacm/src/cma.cpp 2379 - [WinOF] OFA Digital-ID expired 8/20/09, added new OFA cert signature so the 'new' cert can be added to compute nodes cert store. Modified : /gen1/branches/WOF2-1/WinOF/WIX/HPC/cert-add.bat 2373 - [WinOF] RC4 staging. 
Modified : /gen1/branches/WOF2-1/WinOF/buildrelease.bat Modified : /gen1/branches/WOF2-1/WinOF/Wix/common/WinOF_cfg.inc Modified : /gen1/branches/WOF2-1/WinOF/Wix/ReadMe_release.txt Modified : /gen1/branches/WOF2-1/WinOF/Wix/Release_notes.htm 2372 - [DAPL2] Completion Channel refactoring Modified : /gen1/branches/WOF2-1/etc/user/comp_channel.cpp Modified : /gen1/branches/WOF2-1/inc/user/comp_channel.h Modified : /gen1/branches/WOF2-1/ulp/dapl2/dapl/openib_cma/device.c Modified : /gen1/branches/WOF2-1/ulp/dapl2/dapl/openib_scm/device.c 2371 - [DAPL2] dapltest.exe: yield the processor as the Windows thread scheduler will starve other threads unlike the Linux scheduler. Modified : /gen1/branches/WOF2-1/ulp/dapl2/test/dapltest/mdep/linux/dapl_mdep_user.h Modified : /gen1/branches/WOF2-1/ulp/dapl2/test/dapltest/mdep/solaris/dapl_mdep_user.h Modified : /gen1/branches/WOF2-1/ulp/dapl2/test/dapltest/mdep/windows/dapl_mdep_user.h Modified : /gen1/branches/WOF2-1/ulp/dapl2/test/dapltest/test/dapl_test_util.c 2370,2369 [WIX] wix\win7\x64\wof.wxs: added comment indicating Win7_x64 installer was for Server 2008 R2 also. WIX/build-OFA-dist.bat: update usage instructions to reflect WinOF is now under Trunk\ and no longer under branches\ 2368 - [DAPL2] dt-cli.bat: %ERRORLEVEL% inside a for loop does not evaluate as expected; change to !ERRORLEVEL! 2367 - [DOCS] manual.htm: Update DAPL provider text w.r.t. names in DAT.conf file. 2364 - [IPOIB] IPoIB PXE boot support: Clear remainder of chaddr The IPoIB PXE boot firmware (gPXE) now sends the 8-byte port GUID in the DHCP chaddr field. WinOF replaces the first 6 bytes of chaddr with the Ethernet-style MAC address, but leaves the remain untouched. This results in trailing garbage after the Ethernet-style MAC in the modified chaddr. Fix by explicitly zeroing the remainder of chaddr. 
Modified : /gen1/branches/WOF2-1/ulp/ipoib/kernel/ipoib_port.c 2363 - [IBAT] allow simultaneous IBAT device access from user mode by adding RW sharing attributes to CreateFileW() call. Modified : /gen1/branches/WOF2-1/core/ibat/user/ibat.cpp 2362 - [LIBRDMACM] retry IBAT call on E_PENDING return. Modified : /gen1/branches/WOF2-1/ulp/librdmacm/src/cma.cpp 2361 - [MLX4] on catastrophic error, dump error buffer before reset. [winof: 2358] Modified : /gen1/branches/WOF2-1/hw/mlx4/kernel/bus/net/catas.c 2360 - [MLX4] bug fix in error flow: doesn't return error on allocation failure. [winof: 2356] Modified : /gen1/branches/WOF2-1/hw/mlx4/kernel/bus/core/l2w_umem.c 2359 - [IBAL] fix to 2226. cause an asynchronic event to be handled immediately (and not after SMI_POLL_INTERVAL, which is 20 secs) Modified : /gen1/branches/WOF2-1/core/al/al_ci_ca_shared.c Modified : /gen1/branches/WOF2-1/core/al/kernel/al_pnp.c 2353 - [ND provider] patch to fix to 2333. Eemove a facility to define MaxDataInlineSize from application, because it breaks MS API [ND porvider] Improved latency of ND provider by using INLINE send. [winof: 2333, 2352] This patch adds usage of INLINE DATA facility of Mellanox HCAs for improving latency of ND provider. Ideas of the patch: - by default, ND provider will create QP with inline data of 160 bytes; (this can enlarge user's QP size) - one can change this default by defining environment variable IBNDPROV_MAX_INLINE_SIZE; Modified : /gen1/branches/WOF2-1/ulp/nd/user/NdEndpoint.cpp Modified : /gen1/branches/WOF2-1/ulp/nd/user/NdEndpoint.h Modified : /gen1/branches/WOF2-1/ulp/nd/user/NdProv.cpp 2351 - [IPOIB] Prevent a BSOD which happens when restarting the opensm more than once (if the local endpoint was not in the lid_endpts list). 
Modified : /gen1/branches/WOF2-1/ulp/ipoib/kernel/ipoib_port.c Modified : /gen1/trunk/ulp/ipoib/kernel/ipoib_port.c WinOF 2.1 Summary ----------------- 1) The WinOF 2.1 release is based on openib-windows source svn revision (branches\WOF2-1 svn.2381). Last WinOF release (2.0.2) based on svn.1975. 2) Bug fixes in IB Core IPoIB WSD SRP DAT/DAPL WinVerbs WinMAD OFED (Open Fabrics Enterprise Distribution [Linux]) verbs API OFED Diagnostic utilities WinOF Installer 3) Integrated Functionality - OFED Compatibility layers allow for easy porting of OFED applications into the WinOF environment. libibverbs - OFED verbs API library. libmad - InfiniBand MAD (Management Datagram) library. libumad - IB MAD exported user-mode interface library. librdmacm - OFED RDMA CM (Comunications Manager). - OFED Fabric Diagnostics available ( for usage info, see --help ). ibaddr - query InfiniBand address(es) ibnetdiscover - generate a fabric topology. iblinkinfo - report link info for all links in the fabric ibping - ping an InfiniBand address ibportstate - manage IB port (physical) state and link speed ibqueryerrors - query and report non-zero IB port counters ibroute - query InfiniBand switch forwarding tables ibstat - display HCA information. ibsysstat - system status for an InfiniBand address ibtracert - trace InfiniBand path saquery - SA (Subnet Administrator) query test. sminfo - query InfiniBand SMInfo attributes smpdump - dump InfiniBand subnet management attributes smpquery - query InfiniBand subnet management attributes vendstat - query InfiniBand vendor specific functions 4) New Functionality - All WinOF installs now utilize the Windows Driver Store along with the Plug-n-Play (PNP) subsystem to install the correct HCA driver(s). Selection of a specific Mellanox HCA device type is no longer required. 
  - Server 2008-HPC install support has been enhanced to provide a
    no-drivers-installed mode to ease WinOF installation when drivers have
    been previously installed with WDM (Windows Deployment Manager) node
    templates.

    From an msiexec.exe command line, when NODRV=1 is given, device driver
    '.inf' files are not processed during the WinOF install. The base
    assumption is that the WDM node provisioning template (see Cluster
    Manager) will install HCA drivers. All other WinOF files are installed
    to the standard WinOF location '%ProgramFiles(x86)%\WinOF'.

    When uninstalling a WinOF install which was done with NODRV=1, you MUST
    include NODRV=1 on the msiexec.exe uninstall command line, or the
    uninstall will remove HCA drivers installed via WDM templates.

    Incorporating an msiexec based WinOF install into a node provisioning
    template works well. See the examples in
    '%ProgramFiles(x86)%\WinOF\HPC\ReadMe-HPC.txt'.

    For 'first' time HPC WinOF installs or node provisioning with WinOF
    drivers via WDM, the batch script cert-add.bat, in
    '%ProgramFiles(x86)%\WinOF\HPC', should be used to extract the 3rd Party
    Software Publisher certificate from the WinOF_2-1_wlh_x64.msi installer
    and insert it into the certificate store of every compute node.
    Suggested order: install WinOF on the head node, then run 'cert-add.bat'
    from the head node. This requires a common share visible from all remote
    nodes prior to execution.
    For WDM node provisioning, invoke cert-add.bat followed by
    WinOF-Install.bat from the node provisioning template.

    Examples

    unattended install (for use with clusrun.bat):
      start/wait msiexec /I WOF.msi /quiet NODRV=1

    console based non-interactive install with progress window:
      start/wait msiexec /I WOF.msi /passive NODRV=1

    Install selectable features (No drivers):
      start/wait msiexec /I WOF.msi NODRV=1

    Extract WinOF install files (aka driver files for WDM install):
      start/wait msiexec /A WinOF_wlh_x64.msi TARGETDIR=%TEMP%
      The folder %TEMP%\PFiles\WinOF will be created.

    console based unattended uninstall with auto-reboot:
      start/wait msiexec /X WOF.msi /passive

    clusrun unattended uninstall with auto-reboot:
      start/wait msiexec /X WOF.msi /quiet /forcereboot

  - Subnet Management started as a local Windows Service from a command line:
      start/wait msiexec /I WOF.msi /passive OSMS=1

  - HCA drivers now load WinVerbs and WinMad filter drivers by default.

  - ndinstall.exe command line interface changes - see manual.htm

5) Technology Previews

   DAPL2 Socket-CM provider
     Uses IPv4 sockets to exchange Queue pair setup information, thus
     bypassing IB Path Record lookups.

   DAPL rdma-CM
     Compatible with OFED rdma-CM interfaces; facilitates IB application
     portability between Linux/OFED and Windows/WinOF.

6) Vista installs Only:

   Vista installs must be performed from an Administrator privileged command
   window. Right-clicking the .msi installer file for a Vista installation
   will fail due to insufficient privileges to install the HCA driver!

   From the Administrator privileged cmd-window (Interactive install) say
     start/wait msiexec /I WinOF_wlh_xxx.msi
   -or- a quiet, default install:
     start/wait msiexec /I WinOF_wlh_xxx.msi /passive

**** WARNING ****

After the WinOF.msi file has started installation execution, an errant
"Welcome to the Found New Hardware Wizard" window 'may' pop up.

Just ignore or 'cancel' the errant FNHW popup window in order to proceed
with the installation. XP requires a cancel; for WLH & WNET, the notifiers
will disappear on their own.
You do need to answer 'Yes' or 'Continue' to those popup windows which refer
to non-WHQL'ed drivers.

If the install appears to hang, look around for popup windows requesting
input which are covered by other windows.
Such is the case on Server 2008 initial install - answer 'yes' to always
trust the OpenFabrics Alliance as a SW publisher.

Please:
  read the Release_notes.htm file!
  make 'sure' your HCA firmware is recent; vstat.exe displays HCA firmware version.

Thank you,

WinOF Developers.

From weiny2 at llnl.gov  Mon Aug 24 18:52:06 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 24 Aug 2009 18:52:06 -0700
Subject: [ofa-general] Combined DR path with empty DR path, what is the expected behavior?
Message-ID: <20090824185206.39e5e377.weiny2@llnl.gov>

If I send a combined DR path with a start LID but an empty (0 length) DR
path, what is the expected behavior? I know this could be specified with LID
routing, but I don't see anywhere in the specification which says this is an
error. I do however seem to have 2 different implementations on 2 different
switches.

For example: I have Switch A (Lid 1) and Switch B (Lid 7). I attempt to query
PortInfo of Port 1 of each switch using the LID followed by an empty DR path.

17:55:22 > ./smpquery -c portinfo 1 0 1
ibwarn: [21005] mad_rpc: _do_madrpc failed; dport (Lid 1)
./smpquery: iberror: failed: operation portinfo: port info query failed

17:55:31 > ./smpquery -c portinfo 7 0 1
# Port info: Lid 7 port 1
Mkey:............................0x0000000000000000
GidPrefix:.......................0x0000000000000000
...

Detecting this special case in libibmad and turning the packet into a LID
routed one succeeds, but I wonder if this is an error in the SMI?

I also notice this is an error on the HCA I am running from (lid 2).

17:57:42 > ./smpquery -c portinfo 2 0 1
ibwarn: [21008] mad_rpc: _do_madrpc failed; dport (Lid 2)
./smpquery: iberror: failed: operation portinfo: port info query failed

Running with a simple DR path works, I guess because this is the loopback
case mentioned on page 805.

17:58:16 > ./smpquery -D portinfo 0 1
# Port info: DR path slid 65535; dlid 65535; 0 port 1
Mkey:............................0x0000000000000000
GidPrefix:.......................0x2007000000000000
...

I guess that the comment "Since each part may be empty, there are eight
combinations, although only four are really useful:" on line 36 Page 805 can
be interpreted to mean that only those 4 combinations need to be supported.
Is this true?

On the other hand I think strictly this should be supported. Item 4 of C14-9
(line 24 page 810) requires the SMI to handle the packet if the HopPointer
equals HopCount + 1, which it is in my case (HopCount == 0, HopPointer == 1).
Then after processing the SMI should return the packet as specified in C14-13
item 3 on line 9 page 812. Am I wrong?

In the end it does not matter as I have to make the software work for all the
hardware I have; so I will change the software. However, I wonder where
exactly the spec falls on this, because I think it will influence where the
fix resides. If the spec does not allow this then I think it is fine to have
libibmad return an error since the user specified an invalid combined DR
path. However, if this should be legal I think libibmad should work around
the bad hardware out there.

Thoughts?
Ira

-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov

From ofedrnicuser at yahoo.com  Mon Aug 24 21:25:51 2009
From: ofedrnicuser at yahoo.com (Bill N)
Date: Mon, 24 Aug 2009 21:25:51 -0700 (PDT)
Subject: [ofa-general] when to use get_dma_mr, which doesn't take physical buffer list size argument
Message-ID: <612829.47304.qm@web111214.mail.gq1.yahoo.com>

Hi,

When should get_dma_mr() be used instead of reg_phys_mr()?

Looking at the current implementation of get_dma_mr() for the Chelsio and
NetEffect drivers, they seem to register all the possible system memory to
the device: 4GB for Chelsio & 64GB for NetEffect.

Isn't this a hole in the case where the system has less memory than the
capability of the address bus? (Say, 2GB of physical memory and we
register 4GB?)

Is get_dma_mr() used for kernel space ULPs with only SGEs (stag=0) instead of
stag based send(), recv()?

Can a get_dma_mr() user re-register a smaller memory region? If yes, how?

Is get_dma_mr() not equivalent to 9.2.6.1 Allocate Non-Shared Memory Region
STag of the iWarp-RDMAC specification?

Regards, Bill From vlad at dev.mellanox.co.il Tue Aug 25 02:02:19 2009 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 25 Aug 2009 12:02:19 +0300 Subject: [ofa-general] Re: [ANNOUNCE] uDAPL v2.0 - dapl-2.0.22 release In-Reply-To: References: Message-ID: <4A93A89B.7080200@dev.mellanox.co.il> Davis, Arlin R wrote: > Vlad, please pull new v2 package into OFED 1.5 beta build and install the following: > > dapl-2.0.22-1 > dapl-utils-2.0.22-1 > dapl-devel-2.0.22-1 > dapl-debuginfo-2.0.22-1 > compat-dapl-1.2.14-1 > compat-dapl-devel-1.2.14-1 > > See http://www.openfabrics.org/downloads/dapl/ more details. > > -arlin > > Done, Regards, Vladimir From vlad at lists.openfabrics.org Tue Aug 25 03:08:34 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 25 Aug 2009 03:08:34 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090825-0200 daily build status Message-ID: <20090825100835.0CC58E61DD3@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with 
linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090825-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From kovlensky at interia.pl Tue Aug 25 09:25:16 2009 From: kovlensky at interia.pl (kovlensky at interia.pl) Date: 25 Aug 2009 18:25:16 +0200 Subject: [ofa-general] ofed 1.3.2 opensmd failover Message-ID: <20090825162517.7955C21C827@f28.poczta.interia.pl> Hi all, Quick question - is there a need to run anything except opensmd 
daemons to provide failover capability on the IB network in OFED 1.3?

I'm aware that when the master manager dies the standby one comes in and
manages the network, but that does not necessarily mean that LIDs are
preserved, especially for nodes joining in. I used to run sldd.sh for
distributing the LID list on OFED 1.2.5, but while this script seems to be
in place, no one mentions that it is needed.

So subnet manager failover is provided by running a standby opensm.
And how is LID preservation provided?

Regards,

Zdenek Kovlensky

From akepner at sgi.com  Tue Aug 25 14:12:50 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Tue, 25 Aug 2009 14:12:50 -0700
Subject: [ofa-general] [TRIVIAL PATCH] ibutils: fix regexp for pkey matching
Message-ID: <20090825211250.GI16590@sgi.com>

There's an error in a regular expression for matching pkeys in
ibdebug.tcl. The following fixes it.

Signed-off-by: Arthur Kepner
---

diff -rup a/ibutils-1.2/ibdiag/src/ibdebug.tcl b/ibutils-1.2/ibdiag/src/ibdebug.tcl
--- a/ibutils-1.2/ibdiag/src/ibdebug.tcl	2009-08-25 12:38:45.646392453 -0700
+++ b/ibutils-1.2/ibdiag/src/ibdebug.tcl	2009-08-25 12:39:23.180706933 -0700
@@ -3048,7 +3048,7 @@ proc GetPortPkeys {drPath portNum numPKe
             continue
         }
         foreach pkey $pkeyTable {
-            if {[regexp {^0x[0-9a-fA-F]$} $pkey]} {
+            if {[regexp {^0x[0-9a-fA-F]+$} $pkey]} {
                 lappend pkeys $pkey
             }
         }

From hal.rosenstock at gmail.com  Tue Aug 25 14:59:16 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 25 Aug 2009 17:59:16 -0400
Subject: [ofa-general] ofed 1.3.2 opensmd failover
In-Reply-To: <20090825162517.7955C21C827@f28.poczta.interia.pl>
References: <20090825162517.7955C21C827@f28.poczta.interia.pl>
Message-ID:

On 8/25/09, kovlensky at interia.pl wrote:
>
> Hi all,
>
> Quick question - is there a need to run anything except opensmd deamons to
> provide failover capability on ib network in ofed 1.3?

In terms of SM failover, modulo bugs fixed relative to this feature since
OFED 1.3 (there are a couple of things here which may affect your environment
if I recall correctly), you only need to run more than 1 SM for this (one
will become master, the other standby).

> I'm aware that when master manager dies standby one comes in and manages
> the network, but that does not necessary means that lids are preserved,
> especially for nodes joining in. I used to run sldd.sh for distributing lids
> list on ofed 1.2.5, but while this script seems to be in place noone
> mentions necessity for it.
>
> So subnet manager failover is provided by running standby opensm.
> And how LID preservation is provided?

If you want LIDs to be preserved, the guid2lid file needs to be sync'd
(copied from the master SM once it's fully assembled to the node which is
running the standby SM). That's what the sldd.sh script does.

-- Hal Regards, > > Zdenek Kovlensky > > ---------------------------------------------------------------------- > Kup wlasne mieszkanie za 33 tys. zl! > Sprawdz >>> http://link.interia.pl/f22f2 > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Tue Aug 25 15:04:55 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 25 Aug 2009 18:04:55 -0400 Subject: [ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH topology. In-Reply-To: <4A92DFEC.3010300@Sun.COM> References: <4A92DFEC.3010300@Sun.COM> Message-ID: On 8/24/09, Rafael David Tinoco wrote: > > Hello, > > I'm installing an HPC cluster using 2 Sun Blades 6048 with QNEMs (2 asics > each, 8 qnems). > They are configured in a MESH topology. > I'm using Centos 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5. > > I'm booting PXE from IB, my initrd image is bringing the ib0 interface, > getting the squashfs image and mounting with aufs. > > The problem is.. When booting more then 60 nodes, I start to get above > errors on subnet manager. > And the problem seems to be intermitent, because each time it gives errors > on different path. > > Any ideas ? 
> > Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:64 (GID in service) from LID:1 > GID:fe80::5080:200:8d:9931 > Aug 24 15:36:19 713838 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: > Discovered new port with GUID:0x50800200008d9381 LID range [78,78] of > node:b03n06 HCA-1 > Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:64 (GID in service) from LID:1 > GID:fe80::5080:200:8d:9931 > Aug 24 15:36:19 713842 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: > Discovered new port with GUID:0x50800200008d4689 LID range [76,76] of > node:b03n04 HCA-1 > Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:64 (GID in service) from LID:1 > GID:fe80::5080:200:8d:9931 > Aug 24 15:36:19 713847 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: > Discovered new port with GUID:0x50800200008e5191 LID range [82,82] of > node:b03n11 HCA-1 > Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:64 (GID in service) from LID:1 > GID:fe80::5080:200:8d:9931 > Aug 24 15:36:19 713866 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: > Discovered new port with GUID:0x50800200008d94c9 LID range [80,80] of > node:b03n08 HCA-1 > Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:64 (GID in service) from LID:1 > GID:fe80::5080:200:8d:9931 > Aug 24 15:36:19 713871 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: > Discovered new port with GUID:0x50800200008daedd LID range [83,83] of > node:b03n12 HCA-1 > Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP > Aug 24 15:36:19 714805 [48D7D940] 0x01 -> > __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node > 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 19. 
> Adding to light sweep sampling list > Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path: > Path = 0,1,15,15,15 > Aug 24 15:36:19 714822 [48D7D940] 0x01 -> > __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node > 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 21. > Adding to light sweep sampling list > Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path: > Path = 0,1,15,15,15 > Aug 24 15:36:19 714831 [48D7D940] 0x01 -> > __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node > 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 25. > Adding to light sweep sampling list > Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path: > Path = 0,1,15,15,15 > Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send > completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) -- > dropping > Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP > Hop Ptr: 0x0 > Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path: > Initial path = 0,0,0,0,0,0 > Return path = 0,0,0,0,0,0 > Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: > ERR 3113: MAD completed in error (IB_TIMEOUT) > Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x5 > trans_id................0x36595 > attr_id.................0x15 (PortInfo) > resv....................0x0 > attr_mod................0x0 > m_key...................0x0000000000000000 > dr_slid.................65535 > dr_dlid.................65535 > > Initial path: 0,1,15,15,15,19 > Return path: 0,0,0,0,0,0 > Reserved: [0][0][0][0][0][0][0] > > 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send > completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) -- > dropping > Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP > Hop Ptr: 0x0 > Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path: > Initial path = 0,0,0,0,0,0 > Return path = 0,0,0,0,0,0 > Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: > ERR 3113: MAD completed in error (IB_TIMEOUT) > Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x5 > trans_id................0x36596 > attr_id.................0x15 (PortInfo) > resv....................0x0 > .... > These errors are transient as you indicate. They mean that some node has brought the link physically up but there is no SMA at the remote side of the link. The different paths are paths to the HCAs. This occurs during PXE boot as the node transitions from the boot ROM to the Linux environment. Other than these messages, do things seem to work in terms of the end nodes ? -- Hal _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From kalaiya.1 at buckeyemail.osu.edu Tue Aug 25 16:03:41 2009
From: kalaiya.1 at buckeyemail.osu.edu (MANIKANTAN KALAIYA)
Date: Tue, 25 Aug 2009 23:03:41 +0000
Subject: [ofa-general] Number of devices returned by ibv_get_device_list()
In-Reply-To: <122E98244B88344D9AFE4F6AFF09706316F0F295@BL2PRD0102MB012.prod.exchangelabs.com>
References: <122E98244B88344D9AFE4F6AFF09706316F0F295@BL2PRD0102MB012.prod.exchangelabs.com>
Message-ID: <122E98244B88344D9AFE4F6AFF09706316F0F2AB@BL2PRD0102MB012.prod.exchangelabs.com>

Resending to the mailing list...

We have OFED 1.3.1 installed; one of its sub-packages is libibverbs version 1.1.1. We have a small program that reports the number of IB cards available in the system through ibv_get_device_list(). See below for the sample code.

The system has two IB cards, and the value returned by ibv_get_device_list() in 'num_devices' is two, as expected. However, when we disable one of the cards using the modprobe command, the program continues to report two cards present (monitoring is continuous in a while loop). Killing and restarting the test process results in the correct number of IB cards being reported (one, after the restart).

One of the prior versions was known to report the correct number of IB cards without requiring a restart of the program. We would like to determine the number of cards present without having to go through a restart. Any input on this behavior is appreciated.

modprobe command - "sudo modprobe -r ib_mthca"

Test program:
=================================================
#include <stdio.h>
#include <unistd.h>            /* sleep() */
#include <infiniband/verbs.h>  /* ibv_get_device_list() */

int main(int argc, char **argv)
{
	int num_devices;
	struct ibv_device **dev_list;

	while(1) {
		dev_list = ibv_get_device_list(&num_devices);
		/* dev_list is NULL on failure; num_devices is only meaningful otherwise */
		if (dev_list != NULL && num_devices != 0) {
			printf("IB ADAPTER AVAILABLE:%d\n", num_devices);
		} else {
			printf("IB ADAPTER UNAVAILABLE\n");
		}
		sleep(2);
		if (dev_list != NULL)
			ibv_free_device_list(dev_list);
	}
	return(0);
}
=================================================

Thanks,
Mani.
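A possible workaround for the stale count, offered as a sketch rather than something confirmed in this thread: count the entries under the sysfs class directory on each poll instead of relying on the device list returned by the library. The assumption is that the kernel exposes one entry per IB device under /sys/class/infiniband and removes it when the driver is unloaded. The function names `count_class_devices` and `report_adapters` are hypothetical helpers, not part of libibverbs.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Count the entries in a sysfs class directory, skipping "." and "..".
 * Returns -1 if the directory cannot be opened (for example, when the
 * class directory does not exist because no IB core module is loaded). */
int count_class_devices(const char *class_dir)
{
	DIR *dir = opendir(class_dir);
	struct dirent *entry;
	int count = 0;

	if (dir == NULL)
		return -1;

	while ((entry = readdir(dir)) != NULL) {
		if (strcmp(entry->d_name, ".") == 0 ||
		    strcmp(entry->d_name, "..") == 0)
			continue;
		count++;
	}
	closedir(dir);
	return count;
}

/* Print one status line in the style of the test program above.
 * class_dir is assumed to be "/sys/class/infiniband" on a real system. */
void report_adapters(const char *class_dir)
{
	int n = count_class_devices(class_dir);

	if (n > 0)
		printf("IB ADAPTER AVAILABLE:%d\n", n);
	else
		printf("IB ADAPTER UNAVAILABLE\n");
}
```

A monitoring loop could call `report_adapters("/sys/class/infiniband")` every two seconds in place of the ibv_get_device_list() call. Because the directory is re-read on every call, the count does not depend on any state cached inside the process.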
-------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Tue Aug 25 16:15:19 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 25 Aug 2009 19:15:19 -0400 Subject: [ofa-general] Combined DR path with empty DR path, what is the expected behavior? In-Reply-To: <20090824185206.39e5e377.weiny2@llnl.gov> References: <20090824185206.39e5e377.weiny2@llnl.gov> Message-ID: On 8/24/09, Ira Weiny wrote: > If I send a combined DR path with a start lid but an empty (0 length) DR > path. Hop Count 0 ? > What is the expected behavior? Not sure what you mean by expected here. Are you referring to expectation based on the spec ? > I know this could be specified with LID routing, but I don't see anywhere > in > the specification which says this is an error. I don't think it should be an error (certainly not for the form you are using LID routed part followed by a DR part) but a null DR part is a little funny/odd. > I do however seem to have 2 > different implementations on 2 different switches. For example: > > I have Switch A (Lid 1) and Switch B (Lid 7). I attempt to query PortInfo > of > Port 1 of each switch using the LID followed by an empty DR path. > > 17:55:22 > ./smpquery -c portinfo 1 0 1 > ibwarn: [21005] mad_rpc: _do_madrpc failed; dport (Lid 1) > ./smpquery: iberror: failed: operation portinfo: port info query failed Is this a timeout ? > 17:55:31 > ./smpquery -c portinfo 7 0 1 > # Port info: Lid 7 port 1 > Mkey:............................0x0000000000000000 > GidPrefix:.......................0x0000000000000000 > ... > > > Detecting this special case in libibmad and turning the packet into a LID > routed one Ugh... Is this special case really needed ? I don't think the underlying issue is understood sufficiently yet. > succeeds but I wonder if this is an error in the SMI? Switch SMI ? Is this a proprietary implementation ? > I also notice this is an error on the HCA I am running from (lid 2). 
Is this HCA node OpenIB based ? 17:57:42 > ./smpquery -c portinfo 2 0 1 > ibwarn: [21008] mad_rpc: _do_madrpc failed; dport (Lid 2) > ./smpquery: iberror: failed: operation portinfo: port info query failed Is this also a timeout ? Also, does the result differ based on where you source these from matter (locally v. remotely)? > Running with a simple DR path works, You're referring to the same DR path here that fails in the combined route examples above, right ? > I guess because this is the loopback case mentioned on page 805. Yes but that's the high level requirement rather than the SMI rules which make that work. > 17:58:16 > ./smpquery -D portinfo 0 1 > # Port info: DR path slid 65535; dlid 65535; 0 port 1 > Mkey:............................0x0000000000000000 > GidPrefix:.......................0x2007000000000000 > ... > > > It guess that the comment "Since each part may be empty, there are eight > combinations, although only four are really useful:" on line 36 Page 805 > can > be interpreted to mean that only those 4 combinations need to be supported. > Is this true? Not all 4 combinations are supported/known to work. When this was added for ibportstate, the only combined routing form that was important was LID routed part followed by a DR part. > On the other hand I think strictly this should be supported. In an ideal world yes but are they all required or is it just the one form most heavily used ? > Item 4 of C14-9 > (line 24 page 810) requires the SMI to handle the packet if the HopPointer > equals HopCount +1, which it is in my case (HopCount == 0, HopPointer == 1) By handle, this means "The SMI *shall *output the packet on the port whose number is in the entry indexed by Hop Pointer in the Initial Path. If that port number is invalid, the SMI *shall *discard the SMP." Are you sure the Hop Pointer is 1 ? Where do you see this ? If so, what's the initial path at this point (or more specifically index 1 of the initial path) ? 
I think that needs to be port 0 (if a switch), but this is a little odd, as I would think it should be handed to the SMA, which is a different case in the spec.

> Then after processing

by the SMA and doing the required returning initialization

> the SMI should return the packet as specified in C14-13
> item 3 on line 9 page 812.

I'm not sure it would use this case for an empty DR path on return.

> Am I wrong? In the end it does not matter as I have to make the software
> work for all the hardware I have; so I will change the software.

IMO it does matter where the problem lies (SMI or otherwise) and how the layers are comprised in the implementation.

> However, I wonder where exactly the spec falls on this, because I think it
> will influence where the fix resides. If the spec does not allow this then
> I think it is fine to have libibmad return an error since the user specified
> an invalid combined DR path. However, if this should be legal I think
> libibmad should work around the bad hardware out there.

Is it hardware or firmware that needs fixing? Whether a workaround is acceptable may depend on the specific workaround, as it might harm something else or might violate the spec.

-- Hal

> Thoughts?
> Ira
>
> --
> Ira Weiny
> Math Programmer/Computer Scientist
> Lawrence Livermore National Lab
> 925-423-8008
> weiny2 at llnl.gov
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From hnrose at comcast.net Tue Aug 25 16:20:24 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 25 Aug 2009 19:20:24 -0400
Subject: [ofa-general] [PATCH] opensm/osm_helper.c: Only change method when > rather than >=
Message-ID: <20090825232024.GA17650@comcast.net>

Also, cosmetic formatting change to combine lines like:
	uint16_t host_attr;
	host_attr = cl_ntoh16(attr);

Signed-off-by: Hal Rosenstock
---
diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index 23392a4..3692474 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -458,12 +458,12 @@ const char *ib_get_sa_method_str(IN uint8_t method)
 {
 	if (method & 0x80) {
 		method = method & 0x7f;
-		if (method >= OSM_SA_METHOD_STR_UNKNOWN_VAL)
+		if (method > OSM_SA_METHOD_STR_UNKNOWN_VAL)
 			method = OSM_SA_METHOD_STR_UNKNOWN_VAL;
 		/* it is a response - use the response table */
 		return (__ib_sa_resp_method_str[method]);
 	} else {
-		if (method >= OSM_SA_METHOD_STR_UNKNOWN_VAL)
+		if (method > OSM_SA_METHOD_STR_UNKNOWN_VAL)
 			method = OSM_SA_METHOD_STR_UNKNOWN_VAL;
 		return (__ib_sa_method_str[method]);
 	}
@@ -475,7 +475,7 @@ const char *ib_get_sm_method_str(IN uint8_t method)
 {
 	if (method & 0x80)
 		method = (method & 0x0F) | 0x10;
-	if (method >= OSM_SM_METHOD_STR_UNKNOWN_VAL)
+	if (method > OSM_SM_METHOD_STR_UNKNOWN_VAL)
 		method = OSM_SM_METHOD_STR_UNKNOWN_VAL;
 	return (__ib_sm_method_str[method]);
 }
@@ -484,10 +484,9 @@ const char *ib_get_sm_method_str(IN uint8_t method)
  **********************************************************************/
 const char *ib_get_sm_attr_str(IN ib_net16_t attr)
 {
-	uint16_t host_attr;
-	host_attr = cl_ntoh16(attr);
+	uint16_t host_attr = cl_ntoh16(attr);
 
-	if (host_attr >= OSM_SM_ATTR_STR_UNKNOWN_VAL)
+	if (host_attr > OSM_SM_ATTR_STR_UNKNOWN_VAL)
 		host_attr = OSM_SM_ATTR_STR_UNKNOWN_VAL;
 
 	return (__ib_sm_attr_str[host_attr]);
@@ -497,10 +496,9 @@ const char *ib_get_sa_attr_str(IN ib_net16_t attr)
  **********************************************************************/
 const char *ib_get_sa_attr_str(IN ib_net16_t attr)
 {
-	uint16_t host_attr;
-	host_attr = cl_ntoh16(attr);
+	uint16_t host_attr = cl_ntoh16(attr);
 
-	if (host_attr >= OSM_SA_ATTR_STR_UNKNOWN_VAL)
+	if (host_attr > OSM_SA_ATTR_STR_UNKNOWN_VAL)
 		host_attr = OSM_SA_ATTR_STR_UNKNOWN_VAL;
 
 	return (__ib_sa_attr_str[host_attr]);

From kalaiya.1 at buckeyemail.osu.edu Tue Aug 25 15:55:54 2009
From: kalaiya.1 at buckeyemail.osu.edu (MANIKANTAN KALAIYA)
Date: Tue, 25 Aug 2009 22:55:54 +0000
Subject: [ofa-general] Number of devices returned by ibv_get_device_list()
Message-ID: <122E98244B88344D9AFE4F6AFF09706316F0F295@BL2PRD0102MB012.prod.exchangelabs.com>

We have OFED 1.3.1 installed; one of its sub-packages is libibverbs version 1.1.1. We have a small program that reports the number of IB cards available in the system through ibv_get_device_list(). See below for the sample code.

The system has two IB cards, and the value returned by ibv_get_device_list() in 'num_devices' is two, as expected. However, when we disable one of the cards using the modprobe command, the program continues to report two cards present (monitoring is continuous in a while loop). Killing and restarting the test process results in the correct number of IB cards being reported (one, after the restart).

One of the prior versions was known to report the correct number of IB cards without requiring a restart of the program. We would like to determine the number of cards present without having to go through a restart. Any input on this behavior is appreciated.

modprobe command - "sudo modprobe -r ib_mthca"

Test program:
=================================================
#include <stdio.h>
#include <unistd.h>            /* sleep() */
#include <infiniband/verbs.h>  /* ibv_get_device_list() */

int main(int argc, char **argv)
{
	int num_devices;
	struct ibv_device **dev_list;

	while(1) {
		dev_list = ibv_get_device_list(&num_devices);
		/* dev_list is NULL on failure; num_devices is only meaningful otherwise */
		if (dev_list != NULL && num_devices != 0) {
			printf("IB ADAPTER AVAILABLE:%d\n", num_devices);
		} else {
			printf("IB ADAPTER UNAVAILABLE\n");
		}
		sleep(2);
		if (dev_list != NULL)
			ibv_free_device_list(dev_list);
	}
	return(0);
}
=================================================

Thanks,
Mani.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From jenos at ncsa.uiuc.edu Tue Aug 25 17:31:30 2009
From: jenos at ncsa.uiuc.edu (Jeremy Enos)
Date: Tue, 25 Aug 2009 19:31:30 -0500
Subject: [ofa-general] Fedora 10 OFED support plans
In-Reply-To: <4A92A0C6.9030501@ncsa.uiuc.edu>
References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il> <4A92A0C6.9030501@ncsa.uiuc.edu>
Message-ID: <4A948262.7030508@ncsa.uiuc.edu>

The latest available 1.5 tarball fails to build with this error. This is an fc10 x64 machine, up to date as of last week.
thx- Jeremy Failed to build ofa_kernel RPM See /tmp/OFED.26978.logs/ofa_kernel.rpmbuild.log [root at ac27 OFED-1.5-20090825-0729]# tail -50 /tmp/OFED.26978.logs/ofa_kernel.rpmbuild.log mkdir -p /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions ; rm -f /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions/* make -f scripts/Makefile.build obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5 make -f scripts/Makefile.build obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband make -f scripts/Makefile.build obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core gcc -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/.addr.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/4.3.2/include -D__KERNEL__ \ -D__OFED_BUILD__ \ -include include/linux/autoconf.h \ -include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/linux/autoconf.h \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/ \ \ \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/debug \ -I/usr/local/include/scst \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/ulp/srpt \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/net/cxgb3 \ -Iinclude \ \ -I/usr/src/kernels/2.6.27.29-170.2.79.fc10.x86_64/arch/x86/include \ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -fno-delete-null-pointer-checks -Os -m64 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Iinclude/asm-x86/mach-default -fno-stack-protector -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(addr)" 
-D"KBUILD_MODNAME=KBUILD_STR(ib_addr)" -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/linux/cpumask.h:7, from include/asm/paravirt.h:33, from include/asm/page.h:159, from include/asm/pda.h:9, from include/asm/current.h:20, from include/asm/processor.h:16, from include/linux/prefetch.h:15, from include/linux/list.h:7, from include/linux/mutex.h:14, from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:37: include/asm/topology.h: In function 'cpu_to_node': include/asm/topology.h:93: error: implicit declaration of function 'cpu_pda' include/asm/topology.h:93: error: invalid type argument of '->' (have 'int') include/asm/topology.h: In function 'early_cpu_to_node': include/asm/topology.h:102: error: invalid type argument of '->' (have 'int') make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.o] Error 1 make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core] Error 2 make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband] Error 2 make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5] Error 2 make[1]: Leaving directory `/usr/src/kernels/2.6.27.29-170.2.79.fc10.x86_64' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.ynduey (%build) RPM build errors: user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root Bad exit status from /var/tmp/rpm-tmp.ynduey (%build) [root at ac27 OFED-1.5-20090825-0729]# uname -r 2.6.27.29-170.2.79.fc10.x86_64 [root at ac27 OFED-1.5-20090825-0729]# Jeremy Enos wrote: > 2.6.27.29-170.2.79 is the current fc10 x64 kernel. I had tried the > latest tarball for 1.5- perhaps that's too late? 
I can try something > older but would be great to have a starting point. Thx- > > Jeremy > > Tziporet Koren wrote: >> Jeremy Enos wrote: >>> Coming up on a year of Fedora 10 GA... Fedora 9 no longer >>> maintained. No OFED support for FC10 yet creates a tough spot if >>> trying to stay >>> secure. Is there *any* version (1.5, etc) that will even build on >>> FC10? thx- >>> >>> Jeremy >>> >>> >>> >> >> I think OFED 1.5 might work on it but not sure. Which kernel version >> FC10 use? >> In general OFED 1.5 supports FC11 >> >> Tziporet >> >> > From weiny2 at llnl.gov Tue Aug 25 17:55:43 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 25 Aug 2009 17:55:43 -0700 Subject: [ofa-general] Combined DR path with empty DR path, what is the expected behavior? In-Reply-To: References: <20090824185206.39e5e377.weiny2@llnl.gov> Message-ID: <20090825175543.4f929646.weiny2@llnl.gov> On Tue, 25 Aug 2009 19:15:19 -0400 Hal Rosenstock wrote: > On 8/24/09, Ira Weiny wrote: > > > If I send a combined DR path with a start lid but an empty (0 length) DR > > path. > > > Hop Count 0 ? Yes > > > > What is the expected behavior? > > > Not sure what you mean by expected here. Are you referring to expectation > based on the spec ? > yes > > > I know this could be specified with LID routing, but I don't see anywhere > > in > > the specification which says this is an error. > > > I don't think it should be an error (certainly not for the form you are > using LID routed part followed by a DR part) but a null DR part is a little > funny/odd. Yea I know. It turns out that the new iblinkinfo issues queries like this when it is removes recurses back from the last DR portion of the combined route path. It only showed up as an error when using the -S option of iblinkinfo with this new switch I have. Works fine with the old switches. > > > I do however seem to have 2 > > different implementations on 2 different switches. For example: > > > > I have Switch A (Lid 1) and Switch B (Lid 7). 
I attempt to query PortInfo > > of > > Port 1 of each switch using the LID followed by an empty DR path. > > > > 17:55:22 > ./smpquery -c portinfo 1 0 1 > > ibwarn: [21005] mad_rpc: _do_madrpc failed; dport (Lid 1) > > ./smpquery: iberror: failed: operation portinfo: port info query failed > > > Is this a timeout ? yes 16:26:25 > ./smpquery -e -c portinfo 1 0 1 ibwarn: [27150] _do_madrpc: retry 1 (timeout 1000 ms) ibwarn: [27150] _do_madrpc: retry 2 (timeout 1000 ms) ibwarn: [27150] _do_madrpc: timeout after 3 retries, 3000 ms ibwarn: [27150] mad_rpc: _do_madrpc failed; dport (Lid 1) ./smpquery: iberror: failed: operation portinfo: port info query failed > > > > 17:55:31 > ./smpquery -c portinfo 7 0 1 > > # Port info: Lid 7 port 1 > > Mkey:............................0x0000000000000000 > > GidPrefix:.......................0x0000000000000000 > > ... > > > > > > Detecting this special case in libibmad and turning the packet into a LID > > routed one > > > Ugh... Is this special case really needed ? I don't think the underlying > issue is understood sufficiently yet. Well I just did it to prove that what I was doing would work with a "simple" lid routed packet. Like I said it might be that this portid which is being specified to libibmad by libibnetdisc is not valid. If that is true then libibnetdisc should detect when the DR path is empty and go back to LID routed requests. That is a valid fix in my mind. > > > succeeds but I wonder if this is an error in the SMI? > > > Switch SMI ? Is this a proprietary implementation ? > Yes I see the bug with 2 different vendors switches. One is managed and the other is not. My "old" switches (3 different vendors) do not show this behavior. (Just to be clear I now I have 5 switches in my 5 node cluster! ;-) > > > > I also notice this is an error on the HCA I am running from (lid 2). > > > Is this HCA node OpenIB based ? 
yes > > 17:57:42 > ./smpquery -c portinfo 2 0 1 > > ibwarn: [21008] mad_rpc: _do_madrpc failed; dport (Lid 2) > > ./smpquery: iberror: failed: operation portinfo: port info query failed > > > Is this also a timeout ? yes > > Also, does the result differ based on where you source these from matter > (locally v. remotely)? Same result local and remote. > > > > > Running with a simple DR path works, > > > You're referring to the same DR path here that fails in the combined route > examples above, right ? > No. the example below is a DR path with Hop Count == 0 but without the initial LID routing. > > > I guess because this is the loopback case mentioned on page 805. > > > Yes but that's the high level requirement rather than the SMI rules which > make that work. > > > > > 17:58:16 > ./smpquery -D portinfo 0 1 > > # Port info: DR path slid 65535; dlid 65535; 0 port 1 > > Mkey:............................0x0000000000000000 > > GidPrefix:.......................0x2007000000000000 > > ... > > > > > > It guess that the comment "Since each part may be empty, there are eight > > combinations, although only four are really useful:" on line 36 Page 805 > > can > > be interpreted to mean that only those 4 combinations need to be supported. > > Is this true? > > > Not all 4 combinations are supported/known to work. When this was added for > ibportstate, the only combined routing form that was important was LID > routed part followed by a DR part. > When you say "known to work" you mean implemented with the diags? Or known to work in all hardware? > > > On the other hand I think strictly this should be supported. > > > In an ideal world yes but are they all required or is it just the one form > most heavily used ? That is what I am unclear on. Does the spec require that all 8 combinations are required to work? I don't see a specific compliance which says that and I am not sure if C14-9 and C14-13 cover all 8 combinations. 
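The case analysis being debated here can be sketched as a small classifier over Hop Pointer and Hop Count for an outgoing directed-route request. Only case 3 (Hop Pointer equal to Hop Count) and case 4 (Hop Pointer equal to Hop Count + 1) are mentioned in this thread; the conditions used for cases 1, 2 and 5 are assumptions and should be checked against C14-9 in the specification. All names below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical labels for the numbered items of C14-9. */
enum dr_send_case {
	DR_CASE_1_ORIGIN = 1,      /* assumed: Hop Count != 0 and Hop Pointer == 0 */
	DR_CASE_2_INTERMEDIATE,    /* assumed: 0 < Hop Pointer < Hop Count */
	DR_CASE_3_END_OF_DR_PART,  /* Hop Pointer == Hop Count */
	DR_CASE_4_DELIVER_LOCAL,   /* Hop Pointer == Hop Count + 1 */
	DR_CASE_5_INVALID          /* assumed: anything else is discarded */
};

/* Decide which case applies to an outgoing DR request (D bit clear).
 * The checks are evaluated in order, so Hop Count == 0 with
 * Hop Pointer == 0 falls through to case 3 rather than case 1. */
enum dr_send_case classify_dr_request(uint8_t hop_ptr, uint8_t hop_cnt)
{
	if (hop_cnt != 0 && hop_ptr == 0)
		return DR_CASE_1_ORIGIN;
	if (hop_ptr != 0 && hop_ptr < hop_cnt)
		return DR_CASE_2_INTERMEDIATE;
	if (hop_ptr == hop_cnt)
		return DR_CASE_3_END_OF_DR_PART;
	if (hop_ptr == hop_cnt + 1)
		return DR_CASE_4_DELIVER_LOCAL;
	return DR_CASE_5_INVALID;
}
```

Under these assumptions, a request with Hop Count 0 and Hop Pointer 0 selects case 3, while Hop Count 0 with Hop Pointer 1 selects case 4. Which of the two a given switch actually sees depends on where in the stack the Hop Pointer is incremented.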
> > > Item 4 of C14-9 > > (line 24 page 810) requires the SMI to handle the packet if the HopPointer > > equals HopCount +1, which it is in my case (HopCount == 0, HopPointer == 1) > > > By handle, this means "The SMI *shall *output the packet on the port whose > number is in the entry indexed by Hop Pointer in the Initial Path. If that > port number is invalid, the SMI *shall *discard the SMP." > > Are you sure the Hop Pointer is 1 ? Where do you see this ? No I was wrong. I think I read the wrong madeye packet as I see the packet right before this one did have a hop pointer of 1. I Added some debug prints to mad_encode to get the following output: 17:26:10 > ./smpquery -e -c portinfo 1 0 1 trid 2a0f0cb5; HopCount 0; HopPointer 0; slid 0; dlid 0; 0, drpath->cnt 0 trid 2a0f0cb6; HopCount 0; HopPointer 0; slid 0; dlid 0; 0, drpath->cnt 0 trid 2a0f0cb7; HopCount 0; HopPointer 0; slid 2; dlid 65535; 0, drpath->cnt 0 ibwarn: [27322] _do_madrpc: recv failed: Connection timed out ibwarn: [27322] mad_rpc: _do_madrpc failed; dport (Lid 1) ./smpquery: iberror: failed: operation portinfo: port info query failed madeye for these packets: Aug 25 17:28:03 woprjr0 Madeye:recv SMP Aug 25 17:28:03 woprjr0 MAD version....0x1 Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP) Aug 25 17:28:03 woprjr0 Class version..0x1 Aug 25 17:28:03 woprjr0 Method.........0x81 (Get response) Aug 25 17:28:03 woprjr0 Status.........0x8000 Aug 25 17:28:03 woprjr0 Hop pointer....0x1 Aug 25 17:28:03 woprjr0 Hop counter....0x0 Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb5 Aug 25 17:28:03 woprjr0 Attr ID........0x11 (node info) Aug 25 17:28:03 woprjr0 Attr modifier..0x0000 Aug 25 17:28:03 woprjr0 Mkey...........0x0 Aug 25 17:28:03 woprjr0 DR SLID........0xffff Aug 25 17:28:03 woprjr0 DR DLID........0xffff Aug 25 17:28:03 woprjr0 Madeye:sent SMP Aug 25 17:28:03 woprjr0 MAD version....0x1 Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP) Aug 25 17:28:03 woprjr0 Class 
version..0x1 Aug 25 17:28:03 woprjr0 Method.........0x1 (Get) Aug 25 17:28:03 woprjr0 Status.........0x00 Aug 25 17:28:03 woprjr0 Hop pointer....0x1 Aug 25 17:28:03 woprjr0 Hop counter....0x0 Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb5 Aug 25 17:28:03 woprjr0 Attr ID........0x11 (node info) Aug 25 17:28:03 woprjr0 Attr modifier..0x0000 Aug 25 17:28:03 woprjr0 Mkey...........0x0 Aug 25 17:28:03 woprjr0 DR SLID........0xffff Aug 25 17:28:03 woprjr0 DR DLID........0xffff Aug 25 17:28:03 woprjr0 Madeye:recv SMP Aug 25 17:28:03 woprjr0 MAD version....0x1 Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP) Aug 25 17:28:03 woprjr0 Class version..0x1 Aug 25 17:28:03 woprjr0 Method.........0x81 (Get response) Aug 25 17:28:03 woprjr0 Status.........0x8000 Aug 25 17:28:03 woprjr0 Hop pointer....0x1 Aug 25 17:28:03 woprjr0 Hop counter....0x0 Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb6 Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info) Aug 25 17:28:03 woprjr0 Attr modifier..0x0000 Aug 25 17:28:03 woprjr0 Mkey...........0x0 Aug 25 17:28:03 woprjr0 DR SLID........0xffff Aug 25 17:28:03 woprjr0 DR DLID........0xffff Aug 25 17:28:03 woprjr0 Madeye:sent SMP Aug 25 17:28:03 woprjr0 MAD version....0x1 Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP) Aug 25 17:28:03 woprjr0 Class version..0x1 Aug 25 17:28:03 woprjr0 Method.........0x1 (Get) Aug 25 17:28:03 woprjr0 Status.........0x00 Aug 25 17:28:03 woprjr0 Hop pointer....0x1 Aug 25 17:28:03 woprjr0 Hop counter....0x0 Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb6 Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info) Aug 25 17:28:03 woprjr0 Attr modifier..0x0000 Aug 25 17:28:03 woprjr0 Mkey...........0x0 Aug 25 17:28:03 woprjr0 DR SLID........0xffff Aug 25 17:28:03 woprjr0 DR DLID........0xffff Aug 25 17:28:03 woprjr0 Madeye:sent SMP Aug 25 17:28:03 woprjr0 MAD version....0x1 Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP) Aug 25 17:28:03 woprjr0 Class 
version..0x1 Aug 25 17:28:03 woprjr0 Method.........0x1 (Get) Aug 25 17:28:03 woprjr0 Status.........0x00 Aug 25 17:28:03 woprjr0 Hop pointer....0x0 Aug 25 17:28:03 woprjr0 Hop counter....0x0 Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb7 Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info) Aug 25 17:28:03 woprjr0 Attr modifier..0x0001 Aug 25 17:28:03 woprjr0 Mkey...........0x0 Aug 25 17:28:03 woprjr0 DR SLID........0x02 Aug 25 17:28:03 woprjr0 DR DLID........0xffff No response is shown for trid 0x1b9d2a0f0cb7... As an aside I see the hop pointer is set to 1 at a lower level since mad_encode does not do it. So I guess the proper case for C14-9 would be "3) If Hop Pointer is equal to Hop Count". (They are both 0.) > > If so, what's the initial path at this point (or more specifically index 1 > of the initial path) ? I think that needs to be port 0 (if a switch) but > this is a little weird as I would think it should be handed to the SMA which > is different cases in the spec. Yes I think I was wrong on the case. But still wouldn't the SMI detect that this is the end of the DRPath and simply hand it to the SMA. > > > > Then after processing > > > by the SMA and doing the required returning initialization > > the SMI should return the packet as specified in C14-13 > > item 3 on line 9 page 812. > > > I'm not sure it would use this case in the case of an empty DR pafh on > return. Actually I think it will use this. C14-9 item 3) states "the Hop Pointer shall be incremented by 1" Therefore when the response is handed back to the SMI the Hop pointer will be 1 and the hop count 0. And the SMI uses the DRSLID to send the packet back to the requester. > > Am I wrong? In the end it does not matter as I have to make the software > > work > > for all the hardware I have; so I will change the software. > > > IMO it does matter as to where the problem lies (SMI or otherwise) and how > the layers are comprised in the implementation. Agreed. 
I am mainly confused because I have 2 different implementations of this. My "old" switches seem to handle this case just fine. My "new" switches do not. So I am really wondering what is going on. Here is the above output for the same query which works with an "old" switch. 17:28:04 > ./smpquery -e -c portinfo 7 0 1 ... trid 1a4329de; HopCount 0; HopPointer 0; slid 2; dlid 65535; 0, drpath->cnt 0 ... Aug 25 17:46:40 woprjr0 Madeye:sent SMP Aug 25 17:46:40 woprjr0 MAD version....0x1 Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP) Aug 25 17:46:40 woprjr0 Class version..0x1 Aug 25 17:46:40 woprjr0 Method.........0x1 (Get) Aug 25 17:46:40 woprjr0 Status.........0x00 Aug 25 17:46:40 woprjr0 Hop pointer....0x0 Aug 25 17:46:40 woprjr0 Hop counter....0x0 Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info) Aug 25 17:46:40 woprjr0 Attr modifier..0x0001 Aug 25 17:46:40 woprjr0 Mkey...........0x0 Aug 25 17:46:40 woprjr0 DR SLID........0x02 Aug 25 17:46:40 woprjr0 DR DLID........0xffff Aug 25 17:46:40 woprjr0 Madeye:recv SMP Aug 25 17:46:40 woprjr0 MAD version....0x1 Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP) Aug 25 17:46:40 woprjr0 Class version..0x1 Aug 25 17:46:40 woprjr0 Method.........0x81 (Get response) Aug 25 17:46:40 woprjr0 Status.........0x8000 Aug 25 17:46:40 woprjr0 Hop pointer....0x0 Aug 25 17:46:40 woprjr0 Hop counter....0x0 Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info) Aug 25 17:46:40 woprjr0 Attr modifier..0x0001 Aug 25 17:46:40 woprjr0 Mkey...........0x0 Aug 25 17:46:40 woprjr0 DR SLID........0x02 Aug 25 17:46:40 woprjr0 DR DLID........0xffff Hop Pointer and Count are both 0 and things work just fine... > > However, I wonder > > where exactly the spec falls on this, because I think it will influence > > where > > the fix resides. 
If the spec does not allow this then I think it is fine
> > to
> > have libibmad return an error since the user specified an invalid combined
> > DR
> > path. However, if this should be legal I think libibmad should work around
> > the bad hardware out there.
> >
> Is it hardware or firmware that needs fixing ? I think it may depend on the
> specific workaround for this as to whether it is acceptable as it might harm
> something else or might violate the spec.

I agree; however, if the switch hardware needs fixing I fear it is too late for the ones I have. Firmware might be upgradable, although I have had issues with un-managed switches in the past.

So where do we put the fix in software?

Ira

> -- Hal
>
> > Thoughts?
> > Ira
> >
> > --
> > Ira Weiny
> > Math Programmer/Computer Scientist
> > Lawrence Livermore National Lab
> > 925-423-8008
> > weiny2 at llnl.gov

--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov

From poknam at gmail.com Tue Aug 25 19:15:07 2009
From: poknam at gmail.com (PN)
Date: Wed, 26 Aug 2009 10:15:07 +0800
Subject: [ofa-general] ofed 1.3.2 opensmd failover
In-Reply-To:
References: <20090825162517.7955C21C827@f28.poczta.interia.pl>
Message-ID: <92daa7bf0908251915m35f9c28fg4aee596db24a544b@mail.gmail.com>

Hi,

I can think of a situation in which all servers have dual port IB cards and need failover of OpenSM to achieve HA. As far as I know, OpenSM can only bind to 1 port at a time, so do I need to start 2 OpenSM in server A and 2 OpenSM in server B? Will they use the same guid2lid file? Do I need to set something in the config file, or will they automatically communicate with each other?
Do I need to run sldd.sh manually, or will it automatically sync with the other OpenSM?

Thanks a lot.

Regards,
PN

2009/8/26 Hal Rosenstock
>
>
> On 8/25/09, kovlensky at interia.pl wrote:
>>
>> Hi all,
>>
>> Quick question - is there a need to run anything except opensmd daemons to
>> provide failover capability on ib network in ofed 1.3?
>
>
> In terms of SM failover, modulo bugs fixed relative to this feature since
> OFED 1.3 (there are a couple of things here which may affect your
> environment if I recall correctly), you only need to run more than 1 SM for
> this (one will become master, the other standby).
>
> I'm aware that when master manager dies standby one comes in and manages
>> the network, but that does not necessarily mean that lids are preserved,
>> especially for nodes joining in. I used to run sldd.sh for distributing lids
>> list on ofed 1.2.5, but while this script seems to be in place no one
>> mentions necessity for it.
>
>
> So subnet manager failover is provided by running standby opensm.
>
>
> And how is LID preservation provided?
>
>
> If you want LIDs to be preserved, the guid2lid file needs to be sync'd
> (copied from the master SM once it's fully assembled to the node which is
> running the standby SM). That's what the sldd.sh script does.
>
> -- Hal
>
> Regards,
>>
>> Zdenek Kovlensky
--
Best Regards,
PN Lai
HPC Specialist
Galactic Computng Corp.
Tel: 86-755-26733939 ext 826
Mobile: 86-13823161729
Fax: 86-755-26733780
URL: http://www.galactic.com.hk

From Rafael.Tinoco at Sun.COM Tue Aug 25 19:33:34 2009
From: Rafael.Tinoco at Sun.COM (Rafael David Tinoco)
Date: Tue, 25 Aug 2009 23:33:34 -0300
Subject: [ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH topology.
In-Reply-To:
References: <4A92DFEC.3010300@Sun.COM>
Message-ID: <4A949EFE.9000302@Sun.COM>

Hello Hal,

Below...

Hal Rosenstock wrote:
>
>
> On 8/24/09, *Rafael David Tinoco*
> wrote:
>
> Hello,
>
> I'm installing an HPC cluster using 2 Sun Blades 6048 with QNEMs
> (2 asics each, 8 qnems).
> They are configured in a MESH topology.
> I'm using Centos 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.
>
> I'm booting PXE from IB, my initrd image is bringing the ib0
> interface, getting the squashfs image and mounting with aufs.
>
> The problem is.. When booting more than 60 nodes, I start to get
> the errors below on the subnet manager.
> And the problem seems to be intermittent, because each time it
> gives errors on a different path.
>
> Any ideas ?
> > Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: > Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 > GID:fe80::5080:200:8d:9931 > Aug 24 15:36:19 713838 [48D7D940] 0x02 -> > __osm_state_mgr_report_new_ports: Discovered new port with > GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1 > Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: > Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 > GID:fe80::5080:200:8d:9931 > Aug 24 15:36:19 713842 [48D7D940] 0x02 -> > __osm_state_mgr_report_new_ports: Discovered new port with > GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1 > Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: > Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 > GID:fe80::5080:200:8d:9931 > Aug 24 15:36:19 713847 [48D7D940] 0x02 -> > __osm_state_mgr_report_new_ports: Discovered new port with > GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1 > Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: > Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 > GID:fe80::5080:200:8d:9931 > Aug 24 15:36:19 713866 [48D7D940] 0x02 -> > __osm_state_mgr_report_new_ports: Discovered new port with > GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1 > Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: > Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 > GID:fe80::5080:200:8d:9931 > Aug 24 15:36:19 713871 [48D7D940] 0x02 -> > __osm_state_mgr_report_new_ports: Discovered new port with > GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1 > Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP > Aug 24 15:36:19 714805 [48D7D940] 0x01 -> > __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side > for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched > NEM I4A) port 19. 
Adding to light sweep sampling list > Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 > hop path: > Path = 0,1,15,15,15 > Aug 24 15:36:19 714822 [48D7D940] 0x01 -> > __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side > for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched > NEM I4A) port 21. Adding to light sweep sampling list > Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 > hop path: > Path = 0,1,15,15,15 > Aug 24 15:36:19 714831 [48D7D940] 0x01 -> > __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side > for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched > NEM I4A) port 25. Adding to light sweep sampling list > Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 > hop path: > Path = 0,1,15,15,15 > Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x15 > trans_id=0x4700036595) -- dropping > Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: > DR SMP Hop Ptr: 0x0 > Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop > path: > Initial path = 0,0,0,0,0,0 > Return path = 0,0,0,0,0,0 > Aug 24 15:36:20 514333 [4977E940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x5 > trans_id................0x36595 > attr_id.................0x15 (PortInfo) > resv....................0x0 > attr_mod................0x0 > m_key...................0x0000000000000000 > dr_slid.................65535 > dr_dlid.................65535 > > Initial path: 0,1,15,15,15,19 > Return path: 0,0,0,0,0,0 > Reserved: [0][0][0][0][0][0][0] > > 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x15 > trans_id=0x4700036596) -- dropping > Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: > DR SMP Hop Ptr: 0x0 > Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop > path: > Initial path = 0,0,0,0,0,0 > Return path = 0,0,0,0,0,0 > Aug 24 15:36:20 514375 [4977E940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x5 > trans_id................0x36596 > attr_id.................0x15 (PortInfo) > resv....................0x0 > .... > > > These errors are transient as you indicate. They mean that some node > has brought the link physically up but there is no SMA at the remote > side of the link. The different paths are paths to the HCAs. This > occurs during PXE boot as the node transitions from the boot ROM to > the Linux environment. > They are transient.. but sometimes opensm hangs with the same message and loops this errors messages. First I was using centos 5.3 kernel with updates .. and the IPoIB stopped working after these messages. Using the "vanilla" centos 5.3 kernel solved this issue. But SOMETIMES, booting the nodes, these messages appear and dont go away. > Other than these messages, do things seem to work in terms of the end > nodes ? They seem to work with vanilla kernel. Even with the messages, no problems reaching the nodes so far. 
Tks

Rafael Tinoco

>
> -- Hal
>

From ogerlitz at voltaire.com Wed Aug 26 00:04:57 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 26 Aug 2009 10:04:57 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR query to QoS level with pkey
In-Reply-To: <4A910609.3040305@dev.mellanox.co.il>
References: <4A8D4A6F.9050404@dev.mellanox.co.il> <4A90DC04.3020906@voltaire.com> <4A910609.3040305@dev.mellanox.co.il>
Message-ID: <4A94DE99.5050308@voltaire.com>

Yevgeny Kliteynik wrote:
> False negatives. PR queries with PKeys (e.g. IPoIB interfaces) weren't
> matched to their rules.

Yevgeny,

Our understanding is that the bug comes into play only for queries done on a partial membership pkey, do you agree?

Or.

From sashak at voltaire.com Tue Aug 25 12:01:41 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 25 Aug 2009 22:01:41 +0300
Subject: [ofa-general] Re: [PATCHv3] opensm: Parallelize (Stripe) LFT sets across switches
In-Reply-To: <20090807110811.GA23431@comcast.net>
References: <20090807110811.GA23431@comcast.net>
Message-ID: <20090825190141.GG28379@me>

Hi Hal,

On 07:08 Fri 07 Aug , Hal Rosenstock wrote:
>
> Signed-off-by: Hal Rosenstock

I'm applying this patch as it is, but have a couple of comments below. Actually I even prepared patches for those comments and will push them to the list soon, please review.

> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> index 78a7031..e28752a 100644
> --- a/opensm/opensm/osm_ucast_mgr.c
> +++ b/opensm/opensm/osm_ucast_mgr.c

[snip...]
> @@ -516,6 +471,101 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item, > OSM_LOG_EXIT(p_mgr->p_log); > } > > +static void ucast_mgr_process_top(IN cl_map_item_t * p_map_item, > + IN void *context) > +{ > + osm_ucast_mgr_t *p_mgr = context; > + osm_switch_t *const p_sw = (osm_switch_t *) p_map_item; > + > + set_fwd_tbl_top(p_mgr, p_sw); > +} > + > +static boolean_t set_next_lft_block(IN osm_switch_t * p_sw, IN osm_sm_t * p_sm, > + IN uint8_t * p_block, > + IN osm_dr_path_t * p_path, > + IN uint16_t block_id_ho, > + IN osm_madw_context_t * p_context) > +{ > + ib_api_status_t status; > + boolean_t sts; > + > + OSM_LOG_ENTER(p_sm->p_log); > + > + for (; > + (sts = osm_switch_get_lft_block(p_sw, block_id_ho, p_block)); > + block_id_ho++) { > + if (!p_sw->need_update && !p_sm->p_subn->need_update && > + !memcmp(p_block, > + p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, > + IB_SMP_DATA_SIZE)) > + continue; This function is called in loop with block number incremented. Inside it loops by itself in looking for changed block, caller will repeat this looping again and again. It would be really nice to avoid such useless action. I prepared the patch, please review. > @@ -940,6 +1025,9 @@ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t * osm) > > osm->routing_engine_used = osm_routing_engine_type(r->name); > > + if (r->ucast_build_fwd_tables) > + osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr); > + Any reason to not simplify (and unify) fwd table decision flow over routing engines with and without ucast_build_fwd_tables method? The patch to follow. 
Sasha From kliteyn at dev.mellanox.co.il Wed Aug 26 01:00:53 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 26 Aug 2009 11:00:53 +0300 Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR query to QoS level with pkey In-Reply-To: <4A94DE99.5050308@voltaire.com> References: <4A8D4A6F.9050404@dev.mellanox.co.il> <4A90DC04.3020906@voltaire.com> <4A910609.3040305@dev.mellanox.co.il> <4A94DE99.5050308@voltaire.com> Message-ID: <4A94EBB5.7050107@dev.mellanox.co.il> Or Gerlitz wrote: > Yevgeny Kliteynik wrote: >> False negatives. PR queries with PKeys (e.g. IPoIB interfaces) weren't >> matched to their rules. > Yevgeny, > > Our understanding is that the bug comes into play only for queries done > on a partial membership pkey, do you agree? Nope, just the other way around. When some pkey is defined in QoS policy, it is stored internally w/o the MSB. When query comes with a full member pkey (such as 0xFFFF for IPoIB), this pkey is not matched to the stored QoS policy rule. The fix was to treat any pkey that comes from request as partial membership pkey. Note that this is done for the QoS policy rules matching only. The two sides of this PR query still have to comply to the usual IB spec pkey rules. -- Yevgeny > Or. > > From ogerlitz at voltaire.com Wed Aug 26 02:07:51 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 26 Aug 2009 12:07:51 +0300 Subject: [ofa-general] [PATCH] IPoIB: check multicast address format In-Reply-To: <20090821000431.GA5713@obsidianresearch.com> References: <20090821000431.GA5713@obsidianresearch.com> Message-ID: <4A94FB67.6050600@voltaire.com> Jason Gunthorpe wrote: > Check that the format of the multicast link address is correct before taking it from dev->mc_list to priv->multicast_list. This way we never try to send a bogus address to the SA, and prevents badness from erronous 'ip maddr addr add', broken bonding drivers, or whatever. Jason, This is great (and simple!) idea, lets go for it. 
> Same problem Moni was working on, but let's just address it directly. There is work to try and fix the bonding driver but no fixed version is in mainline yet. This is a cheap and simple workaround that is worth having even once the driver is fixed.

Moni, isn't Jason's approach enough for the bonding case?! I saw that your patch ("bonding: clean muticast addresses when device changes type" commit e36b9d16c6a6d0f59803b3ef04ff3c22c3844c10) is present in net-next and maybe also in mainline .31-rcX. However, it has the side effect of, e.g., losing routes already set for the bond while adding the underlying IPoIB devices, so if Jason's patch is enough we can just ask to revert the bonding fix, saying we have something better.

Or.

From o.w.saastad at usit.uio.no Wed Aug 26 02:09:21 2009
From: o.w.saastad at usit.uio.no (Ole Widar Saastad)
Date: Wed, 26 Aug 2009 11:09:21 +0200
Subject: [ofa-general] Problems using ofed 1.4.2 and Infinipath cards
Message-ID: <1251277761.28564.45.camel@pyren.uio.no>

I am experiencing problems using the Infinipath cards and the OFED stack (details are given below).

It seems to be a problem somewhere when the MPI packet size grows above 2k. As I recall, this is the changeover point from one transport mechanism to another?

The test is easy to run; it is just a bandwidth program. (I got far better latency using the Pathscale stack than with OFED. Is this something that will be looked at in the newer releases?)

Two nodes are listed in the nodes.txt file: compute-1-0 and compute-1-1. They are connected to a SilverStorm switch.
[olews at login-0-2 bandwidth]$ mpirun -np 2 -machinefile ./nodes.txt ./bandwidth.openmpi.x -b o
Resolution (usec): 2.145767
Benchmark ping-pong
===================
  lenght   iterations   elapsed time   transfer rate   latency
 (bytes)      (count)      (seconds)      (Mbytes/s)    (usec)
--------------------------------------------------------------------------
       0        10046          0.121           0.000     6.011
       1        10261          0.124           0.166     6.026
    1024         7695          0.140         112.615     9.093
    1536         6260          0.133         144.469    10.632
    2048         5275          0.128         168.420    12.160
[0,1,0][btl_openib_component.c:1375:btl_openib_component_progress] from compute-1-0 to: compute-1-1 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 278309104 opcode 1
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been exceeded. "Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the InfiniBand fabric itself. You should note the hosts on which this error has occurred; it has been observed that rebooting or removing a particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will attempt to retry (defaulted to 7, the maximum value).

* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 10). The actual timeout value used is calculated as:

     4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
--------------------------------------------------------------------------
mpirun noticed that job rank 1 with PID 9184 on node compute-1-1 exited on signal 15 (Terminated).
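
For reference, the timeout formula quoted in the help text above can be evaluated directly. This is only a minimal sketch in Python: the function name `ack_timeout_usec` is made up for illustration and is not part of Open MPI, and the only inputs are the formula and the default value of 10 stated in the message. With that default it gives roughly 4.2 ms per attempt.

```python
# Sketch of the local ACK timeout formula from the Open MPI help text:
#   timeout = 4.096 microseconds * 2**btl_openib_ib_timeout
# The function name is hypothetical; only the formula comes from the message.

BASE_USEC = 4.096  # base unit stated in the help text, in microseconds


def ack_timeout_usec(ib_timeout: int) -> float:
    """Return the local ACK timeout in microseconds for a given exponent."""
    return BASE_USEC * (2 ** ib_timeout)


if __name__ == "__main__":
    # 10 is the default named in the help text; 14 and 18 are arbitrary
    # larger exponents, shown only to illustrate the doubling per step.
    for exponent in (10, 14, 18):
        usec = ack_timeout_usec(exponent)
        print(f"btl_openib_ib_timeout={exponent}: {usec / 1000.0:.3f} ms")
```

Each increment of the exponent doubles the timeout, so a small change in the parameter has a large effect on how long a sender waits before a retry is counted.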
[olews at login-0-2 bandwidth]$ Background information : 07:00.0 InfiniBand: QLogic, Corp. InfiniPath PE-800 (rev 02) Subsystem: QLogic, Corp. InfiniPath PE-800 Flags: bus master, fast devsel, latency 0, IRQ 66 Memory at fde00000 (64-bit, non-prefetchable) [size=2M] Capabilities: [40] Power Management version 2 Capabilities: [50] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+ Capabilities: [70] Express Endpoint IRQ 0 compute-1-0.local# uname -a Linux compute-1-0.local 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux compute-1-0.local# compute-1-0.local# rpm -qa| grep ofed libibverbs-utils-1.1.2-1.ofed1.4.2 librdmacm-utils-1.0.8-1.ofed1.4.2 libcxgb3-1.2.2-1.ofed1.4.2 ofed-scripts-1.4.2-0 libmlx4-1.0-1.ofed1.4.2 libibverbs-devel-1.1.2-1.ofed1.4.2 ofed-docs-1.4.2-0 ibvexdmtools-0.0.1-1.ofed1.4.2 libmthca-1.0.5-1.ofed1.4.2 libipathverbs-1.1-1.ofed1.4.2 mstflint-1.4-1.ofed1.4.2 libibumad-1.2.3_20090314-1.ofed1.4.2 libnes-0.6-1.ofed1.4.2 libibcommon-1.1.2_20090314-1.ofed1.4.2 libibverbs-1.1.2-1.ofed1.4.2 librdmacm-1.0.8-1.ofed1.4.2 qlgc_vnic_daemon-0.0.1-1.ofed1.4.2 compute-1-0.local# OpenMPI is : openmpi-1.2.8 compiled for gcc. -- Ole W. Saastad, dr. scient. Scientific Computing Group, USIT, University of Oslo http://hpc.uio.no From kliteyn at dev.mellanox.co.il Wed Aug 26 02:25:45 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 26 Aug 2009 12:25:45 +0300 Subject: [ofa-general] [TRIVIAL PATCH] ibutils: fix regexp for pkey matching In-Reply-To: <20090825211250.GI16590@sgi.com> References: <20090825211250.GI16590@sgi.com> Message-ID: <4A94FF99.5030501@dev.mellanox.co.il> akepner at sgi.com wrote: > There's an error in a regular expression for matching pkeys > in ibdebug.tcl. The following fixes it. Thanks. Applied. 
-- Yevgeny

> Signed-off-by: Arthur Kepner
> ---

From sneha0930 at gmail.com Wed Aug 26 02:34:31 2009
From: sneha0930 at gmail.com (Sneha Mistry)
Date: Wed, 26 Aug 2009 15:04:31 +0530
Subject: [ofa-general] OFED-1.5-alpha4 installation problem
Message-ID:

Hi,

I am new to InfiniBand and am trying to install OFED-1.5-alpha4 on openSUSE 10.3. The kernel version is 2.6.26-2-686, but the installer gives me this error message:

Failed to build ofa_kernel RPM
See /tmp/OFED.29482.logs/ofa_kernel.rpmbuild.log

Regards,
sgm

From keshetti.mahesh at gmail.com Wed Aug 26 02:43:30 2009
From: keshetti.mahesh at gmail.com (Keshetti Mahesh)
Date: Wed, 26 Aug 2009 15:13:30 +0530
Subject: [ofa-general] Problems using ofed 1.4.2 and Infinipath cards
Message-ID: <829ded920908260243g6a9e5217h4886cb7ec460fc35@mail.gmail.com>

There was a similar thread, "Retry count error with ipath on OFED-1.3", dated 27 May 2008, and it turned out to be a hardware problem with the InfiniPath cards.

- Mahesh

From vlad at lists.openfabrics.org Wed Aug 26 03:09:28 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 26 Aug 2009 03:09:28 -0700 (PDT)
Subject: [ofa-general] ofa_1_5_kernel 20090826-0200 daily build status
Message-ID: <20090826100928.40648E28249@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters:

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with
linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' 
/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090826-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From sneha0930 at gmail.com Wed 
Aug 26 03:17:51 2009
From: sneha0930 at gmail.com (Sneha Mistry)
Date: Wed, 26 Aug 2009 15:47:51 +0530
Subject: [ofa-general] Fwd: OFED-1.5-alpha4 installation problem
In-Reply-To:
References:
Message-ID:

Hi,

I am new to InfiniBand and am trying to install OFED-1.5-alpha4 on openSUSE 10.3. The kernel version is 2.6.26-2-686, but the installer gives me this error message:

Failed to build ofa_kernel RPM
See /tmp/OFED.29482.logs/ofa_kernel.rpmbuild.log

I checked the release notes, which say that SUSE 10.3 is supported.

Output of uname -a is:
Linux linux-ljhr 2.6.22.5-31-default #1 SMP 2007/09/21 22:29:00 UTC i686 i686 i386 GNU/Linux

The last few lines of the log are given below.

make[1]: Entering directory `/usr/src/linux-2.6.22.5-31-obj/i386/default'
make -C ../../../linux-2.6.22.5-31 O=../linux-2.6.22.5-31-obj/i386/default modules
make -C /usr/src/linux-2.6.22.5-31-obj/i386/default \
KBUILD_SRC=/usr/src/linux-2.6.22.5-31 \
KBUILD_EXTMOD="/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5" -f /usr/src/linux-2.6.22.5-31/Makefile modules
test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( \
echo; \
echo " ERROR: Kernel configuration is invalid."; \
echo " include/linux/autoconf.h or include/config/auto.conf are missing."; \
echo " Run 'make oldconfig && make prepare' on kernel src to fix it."; \
echo; \
/bin/false)
mkdir -p /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions
rm -f /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions/*
make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5
make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband
make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core
gcc -m32 -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/.addr.o.d -nostdinc -isystem /usr/lib/gcc/i586-suse-linux/4.2.1/include -D__KERNEL__ \
-D__OFED_BUILD__ \
-include
include/linux/autoconf.h \ -include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/linux/autoconf.h \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.22_suse10_3/include/ \ \ \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/debug \ -I/usr/local/include/scst \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/ulp/srpt \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/net/cxgb3 \ -Iinclude \ -Iinclude2 -I/usr/src/linux-2.6.22.5-31/include \ -I/usr/src/linux-2.6.22.5-31/arch//include \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -Os -pipe -msoft-float -mregparm=3 -freg-struct-return -mpreferred-stack-boundary=2 -march=i586 -mtune=generic -ffreestanding -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -I/usr/src/linux-2.6.22.5-31/include/asm-i386/mach-generic -Iinclude/asm-i386/mach-generic -I/usr/src/linux-2.6.22.5-31/include/asm-i386/mach-default -Iinclude/asm-i386/mach-default -fomit-frame-pointer -g -fno-stack-protector -Wdeclaration-after-statement -Wno-pointer-sign -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(addr)" -D"KBUILD_MODNAME=KBUILD_STR(ib_addr)" -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/.tmp_addr.o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_addr.h:41, from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:46: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In function ‘ib_dma_mapping_error’: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1677: warning: passing argument 1 of ‘dma_mapping_error’ makes integer from pointer without a cast 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1677: error: too many arguments to function ‘dma_mapping_error’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1716: warning: ‘struct dma_attrs’ declared inside parameter list /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1716: warning: its scope is only this definition or declaration, which is probably not what you want /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In function ‘ib_dma_map_single_attrs’: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1718: error: implicit declaration of function ‘dma_map_single_attrs’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1725: warning: ‘struct dma_attrs’ declared inside parameter list /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In function ‘ib_dma_unmap_single_attrs’: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1727: error: implicit declaration of function ‘dma_unmap_single_attrs’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1728: warning: ‘return’ with a value, in function returning void /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1803: warning: ‘struct dma_attrs’ declared inside parameter list /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In function ‘ib_dma_map_sg_attrs’: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1805: error: implicit declaration of function ‘dma_map_sg_attrs’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1811: warning: ‘struct dma_attrs’ declared inside parameter list 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In function ‘ib_dma_unmap_sg_attrs’: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1813: error: implicit declaration of function ‘dma_unmap_sg_attrs’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c: In function ‘rdma_translate_ip’: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122: error: ‘init_net’ undeclared (first use in this function) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122: error: (Each undeclared identifier is reported only once /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122: error: for each function it appears in.) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:123: error: too many arguments to function ‘ip_dev_find’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134:33: error: macro "for_each_netdev" passed 2 arguments, but takes just 1 /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134: error: ‘for_each_netdev’ undeclared (first use in this function) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134: error: expected ‘;’ before ‘{’ token /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c: In function ‘addr_send_arp’: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191: error: ‘init_net’ undeclared (first use in this function) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191: warning: passing argument 2 of ‘ip_route_output_key’ from incompatible pointer type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191: error: too many arguments to function ‘ip_route_output_key’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:206: error: too many arguments to function ‘ip6_route_output’ 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c: In function ‘addr4_resolve_remote’: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232: error: ‘init_net’ undeclared (first use in this function) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232: warning: passing argument 2 of ‘ip_route_output_key’ from incompatible pointer type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232: error: too many arguments to function ‘ip_route_output_key’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c: In function ‘addr6_resolve_remote’: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:281: error: ‘init_net’ undeclared (first use in this function) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:281: error: too many arguments to function ‘ip6_route_output’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c: In function ‘addr_resolve_local’: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:368: error: ‘init_net’ undeclared (first use in this function) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:368: error: too many arguments to function ‘ip_dev_find’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:372: error: implicit declaration of function ‘ipv4_is_zeronet’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:376: error: implicit declaration of function ‘ipv4_is_loopback’ /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:394:33: error: macro "for_each_netdev" passed 2 arguments, but takes just 1 /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:394: error: ‘for_each_netdev’ undeclared (first use in this function) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:395: error: expected ‘;’ before ‘if’ 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:410: error: implicit declaration of function ‘ipv6_addr_loopback’ make[6]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.o] Error 1 make[5]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core] Error 2 make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband] Error 2 make[3]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5] Error 2 make[2]: *** [modules] Error 2 make[1]: *** [modules] Error 2 make[1]: Leaving directory `/usr/src/linux-2.6.22.5-31-obj/i386/default' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.64786 (%build) RPM build errors: user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root Bad exit status from /var/tmp/rpm-tmp.64786 (%build) Regards, sgm From fenkes at de.ibm.com Wed Aug 26 04:37:55 2009 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed, 26 Aug 2009 13:37:55 +0200 Subject: [ofa-general] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD Message-ID: <200908261337.56128.fenkes@de.ibm.com> The old code used a lot of hardcoded values, which might not be valid in all environments (especially routed fabrics or partitioned subnets). Copy as much information as possible from the incoming request to prevent that. Signed-off-by: Joachim Fenkes --- Hal, Jason -- here's the change I promised. Looks okay to you? Roland -- if Hal and Jason don't object, please queue this up for the next kernel. Thanks! 
Regards, Joachim drivers/infiniband/hw/ehca/ehca_sqp.c | 47 ++++++++++++++++++++++++++++---- 1 files changed, 41 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_sqp.c b/drivers/infiniband/hw/ehca/ehca_sqp.c index c568b28..8c1213f 100644 --- a/drivers/infiniband/hw/ehca/ehca_sqp.c +++ b/drivers/infiniband/hw/ehca/ehca_sqp.c @@ -125,14 +125,30 @@ struct ib_perf { u8 data[192]; } __attribute__ ((packed)); +/* TC/SL/FL packed into 32 bits, as in ClassPortInfo */ +struct tcslfl { + u32 tc:8; + u32 sl:4; + u32 fl:20; +} __attribute__ ((packed)); + +/* IP Version/TC/FL packed into 32 bits, as in GRH */ +struct vertcfl { + u32 ver:4; + u32 tc:8; + u32 fl:20; +} __attribute__ ((packed)); static int ehca_process_perf(struct ib_device *ibdev, u8 port_num, + struct ib_wc *in_wc, struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad) { struct ib_perf *in_perf = (struct ib_perf *)in_mad; struct ib_perf *out_perf = (struct ib_perf *)out_mad; struct ib_class_port_info *poi = (struct ib_class_port_info *)out_perf->data; + struct tcslfl *tcslfl = + (struct tcslfl *)&poi->redirect_tcslfl; struct ehca_shca *shca = container_of(ibdev, struct ehca_shca, ib_device); struct ehca_sport *sport = &shca->sport[port_num - 1]; @@ -158,10 +174,29 @@ static int ehca_process_perf(struct ib_device *ibdev, u8 port_num, poi->base_version = 1; poi->class_version = 1; poi->resp_time_value = 18; - poi->redirect_lid = sport->saved_attr.lid; - poi->redirect_qp = sport->pma_qp_nr; + + /* copy local routing information from WC where applicable */ + tcslfl->sl = in_wc->sl; + poi->redirect_lid = + sport->saved_attr.lid | in_wc->dlid_path_bits; + poi->redirect_qp = sport->pma_qp_nr; poi->redirect_qkey = IB_QP1_QKEY; - poi->redirect_pkey = IB_DEFAULT_PKEY_FULL; + + ehca_query_pkey(ibdev, port_num, in_wc->pkey_index, + &poi->redirect_pkey); + + /* if request was globally routed, copy route info */ + if (in_grh) { + struct vertcfl *vertcfl = + (struct vertcfl 
*)&in_grh->version_tclass_flow; + memcpy(poi->redirect_gid, in_grh->dgid.raw, + sizeof(poi->redirect_gid)); + tcslfl->tc = vertcfl->tc; + tcslfl->fl = vertcfl->fl; + } else + /* else only fill in default GID */ + ehca_query_gid(ibdev, port_num, 0, + (union ib_gid *)&poi->redirect_gid); ehca_dbg(ibdev, "ehca_pma_lid=%x ehca_pma_qp=%x", sport->saved_attr.lid, sport->pma_qp_nr); @@ -183,8 +218,7 @@ perf_reply: int ehca_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, struct ib_wc *in_wc, struct ib_grh *in_grh, - struct ib_mad *in_mad, - struct ib_mad *out_mad) + struct ib_mad *in_mad, struct ib_mad *out_mad) { int ret; @@ -196,7 +230,8 @@ int ehca_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, return IB_MAD_RESULT_SUCCESS; ehca_dbg(ibdev, "port_num=%x src_qp=%x", port_num, in_wc->src_qp); - ret = ehca_process_perf(ibdev, port_num, in_mad, out_mad); + ret = ehca_process_perf(ibdev, port_num, in_wc, in_grh, + in_mad, out_mad); return ret; } -- 1.6.0.4 From jean-vincent.ficet at bull.net Wed Aug 26 05:03:04 2009 From: jean-vincent.ficet at bull.net (Vincent Ficet) Date: Wed, 26 Aug 2009 14:03:04 +0200 Subject: [ofa-general] [PATCH] Duplicated file man/umad_get_mad.3 in libibumad/Makefile.am Message-ID: <4A952478.7060407@bull.net> Hello, the file man/umad_get_mad.3 was listed twice in libibumad/Makefile.am resulting in the following error: /usr/bin/install: will not overwrite just-created `/home/vficet/work/infiniband/I686/usr/share/man/man3/umad_get_mad.3' with `man/umad_get_mad.3' This patch removes the duplicated entry. Cheers, Vincent Signed-off-by: Jean-Vincent Ficet --- libibumad/Makefile.am | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0001-Duplicated-file-man-umad_get_mad.3-in-libibumad-Make.patch Type: text/x-patch Size: 622 bytes Desc: not available URL: From hnrose at comcast.net Wed Aug 26 07:02:02 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 26 Aug 2009 10:02:02 -0400 Subject: [ofa-general] [PATCH] libibmad: Add support for MulticastFDBTop Message-ID: <20090826140202.GA19158@comcast.net> Add support for SwitchInfo:MulticastFDBTop and PortInfo:CapabilityMask.IsMulticastFDBTopSupported Added by MgtWG errata #4505-4508 Signed-off-by: Hal Rosenstock --- diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 3093fbd..5f3b52b 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -1,6 +1,7 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. * Copyright (c) 2009 HNR Consulting. All rights reserved. + * Copyright (c) 2009 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -400,6 +401,7 @@ enum MAD_FIELDS { IB_SW_FILTER_RAW_INB_F, IB_SW_FILTER_RAW_OUTB_F, IB_SW_ENHANCED_PORT0_F, + IB_SW_MCAST_FDB_TOP_F, IB_SW_LAST_F, /* diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c index 051c708..d97d359 100644 --- a/libibmad/src/dump.c +++ b/libibmad/src/dump.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2009 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -518,6 +519,8 @@ void mad_dump_portcapmask(char *buf, int bufsz, void *val, int valsz) if (mask & (1 << 27)) s += sprintf(s, "\t\t\t\tIsLinkSpeedWidthPairsTableSupported\n"); + if (mask & (1 << 30)) + s += sprintf(s, "\t\t\t\tIsMulticastFDBTopSupported\n"); if (s != buf) *(--s) = 0; diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index c8e4e79..5f30116 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. * Copyright (c) 2009 HNR Consulting. All rights reserved. + * Copyright (c) 2009 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -206,6 +207,7 @@ static const ib_field_t ib_mad_f[] = { {BITSOFFS(130, 1), "FilterRawInbound", mad_dump_uint}, {BITSOFFS(131, 1), "FilterRawOutbound", mad_dump_uint}, {BITSOFFS(132, 1), "EnhancedPort0", mad_dump_uint}, + {BITSOFFS(144, 16), "MulticastFDBTop", mad_dump_hex}, {0, 0}, /* IB_SW_LAST_F */ /* From hnrose at comcast.net Wed Aug 26 07:04:50 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 26 Aug 2009 10:04:50 -0400 Subject: [ofa-general] [PATCH] opensm: Add infrastructure support for MulticastFDBTop Message-ID: <20090826140450.GC19158@comcast.net> Add support for SwitchInfo:MulticastFDBTop Added by MgtWG errata #4505-4508 Add OpenSM infrastructure support to ib_types.h and osm_helper.c Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index fe3f051..e1e2bdb 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. 
* Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2009 HNR Consulting. All rights reserved. * @@ -4492,7 +4492,7 @@ typedef struct _ib_port_info { #define IB_PORT_CAP_HAS_LINK_SPEED_WIDTH_PAIRS_TBL (CL_HTON32(0x08000000)) #define IB_PORT_CAP_RESV28 (CL_HTON32(0x10000000)) #define IB_PORT_CAP_RESV29 (CL_HTON32(0x20000000)) -#define IB_PORT_CAP_RESV30 (CL_HTON32(0x40000000)) +#define IB_PORT_CAP_HAS_MCAST_FDB_TOP (CL_HTON32(0x40000000)) #define IB_PORT_CAP_RESV31 (CL_HTON32(0x80000000)) /****f* IBA Base: Types/ib_port_info_get_port_state @@ -5899,6 +5899,8 @@ typedef struct _ib_switch_info { ib_net16_t lids_per_port; ib_net16_t enforce_cap; uint8_t flags; + uint8_t resvd; + ib_net16_t mcast_top; } PACK_SUFFIX ib_switch_info_t; #include /************/ @@ -5908,7 +5910,7 @@ typedef struct _ib_switch_info_record { ib_net16_t lid; uint16_t resv0; ib_switch_info_t switch_info; - uint8_t pad[3]; + uint8_t pad[1]; } PACK_SUFFIX ib_switch_info_record_t; #include diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index 23392a4..b8a6523 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2009 HNR Consulting. All rights reserved. * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. 
@@ -766,9 +766,9 @@ static void dbg_get_capabilities_str(IN char *p_buf, IN const uint32_t buf_size, &total_len) != IB_SUCCESS) return; } - if (p_pi->capability_mask & IB_PORT_CAP_RESV30) { + if (p_pi->capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) { if (dbg_do_line(&p_local, buf_size, p_prefix_str, - "IB_PORT_CAP_RESV30\n", + "IB_PORT_CAP_HAS_MCAST_FDB_TOP\n", &total_len) != IB_SUCCESS) return; } @@ -1514,7 +1514,8 @@ void osm_dump_switch_info(IN osm_log_t * p_log, "\t\t\t\tlife_state..............0x%X\n" "\t\t\t\tlids_per_port...........%u\n" "\t\t\t\tpartition_enf_cap.......0x%X\n" - "\t\t\t\tflags...................0x%X\n", + "\t\t\t\tflags...................0x%X\n" + "\t\t\t\tmcast_top...............0x%X\n", cl_ntoh16(p_si->lin_cap), cl_ntoh16(p_si->rand_cap), cl_ntoh16(p_si->mcast_cap), @@ -1524,7 +1525,8 @@ void osm_dump_switch_info(IN osm_log_t * p_log, p_si->def_mcast_not_port, p_si->life_state, cl_ntoh16(p_si->lids_per_port), - cl_ntoh16(p_si->enforce_cap), p_si->flags); + cl_ntoh16(p_si->enforce_cap), p_si->flags, + cl_ntoh16(p_si->mcast_top)); } } From hnrose at comcast.net Wed Aug 26 07:03:50 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 26 Aug 2009 10:03:50 -0400 Subject: [ofa-general] [PATCH] infiniband-diags/ibroute: Add support for MulticastFDBTop Message-ID: <20090826140350.GB19158@comcast.net> Add support for SwitchInfo:MulticastFDBTop Added by MgtWG errata #4505-4508 and #4640 If MulticastFDBTop is set to other than 0, only fetch MulticastForwardingTable blocks up through MulticastFDBTop rather than MulticastFDBCap If MulticastFDBTop is set to 0xbfff, this means no entries (per #4640) Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c index 106c934..f3ebe56 100644 --- a/infiniband-diags/src/ibroute.c +++ b/infiniband-diags/src/ibroute.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. + * Copyright (c) 2009 Mellanox Technologies LTD. 
All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid, char *s; uint64_t nodeguid; uint32_t mod; - unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock; + unsigned block, i, j, e, nports, cap, top, chunks, + startblock, lastblock; int n = 0; if ((s = check_switch(portid, &nports, &nodeguid, sw, nd))) return s; mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap); + mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top); if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1) endlid = IB_MIN_MCAST_LID + cap - 1; + if (!dump_all && top && top < endlid) { + if (top < IB_MIN_MCAST_LID - 1 || top == 0xffff) + IBWARN("illegal top mlid %x", top); + else + endlid = top; + } if (!startlid) startlid = IB_MIN_MCAST_LID; @@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid, printf(" MLid\n"); } if (ibverbose) - printf("Switch multicast mlid capability is %d\n", cap); + printf("Switch multicast mlid capability is %d top is %d\n", + cap, top); chunks = ALIGN(nports + 1, 16) / 16; From dorfman.eli at gmail.com Wed Aug 26 07:37:30 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Wed, 26 Aug 2009 17:37:30 +0300 Subject: [ofa-general] [PATCH] infiniband-diags: Fix IB network discovery from switch node. Message-ID: <4A9548AA.4020900@gmail.com> Subject: [PATCH] Fix IB network discovery from switch node. 
Signed-off-by: Eli Dorfman --- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 16 +++++++++------- 1 files changed, 9 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index c69467e..779e659 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -590,13 +590,15 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, if (!port) goto error; - rc = get_remote_node(ibmad_port, fabric, node, port, from, - mad_get_field(node->info, 0, - IB_NODE_LOCAL_PORT_F), 0); - if (rc < 0) - goto error; - if (rc > 0) /* non-fatal error, nothing more to be done */ - return ((ibnd_fabric_t *) fabric); + if (node->node.type != IB_NODE_SWITCH) { + rc = get_remote_node(ibmad_port, fabric, node, port, from, + mad_get_field(node->info, 0, + IB_NODE_LOCAL_PORT_F), 0); + if (rc < 0) + goto error; + if (rc > 0) /* non-fatal error, nothing more to be done */ + return ((ibnd_fabric_t *) fabric); + } for (dist = 0; dist <= max_hops; dist++) { -- 1.5.5 From hal.rosenstock at gmail.com Wed Aug 26 07:55:41 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 26 Aug 2009 10:55:41 -0400 Subject: [ofa-general] Combined DR path with empty DR path, what is the expected behavior? In-Reply-To: <20090825175543.4f929646.weiny2@llnl.gov> References: <20090824185206.39e5e377.weiny2@llnl.gov> <20090825175543.4f929646.weiny2@llnl.gov> Message-ID: On 8/25/09, Ira Weiny wrote: > > On Tue, 25 Aug 2009 19:15:19 -0400 > Hal Rosenstock wrote: > > > On 8/24/09, Ira Weiny wrote: > > > > > If I send a combined DR path with a start lid but an empty (0 length) > DR > > > path. > > > > > > Hop Count 0 ? > > Yes > > > > > > > > What is the expected behavior? > > > > > > Not sure what you mean by expected here. Are you referring to expectation > > based on the spec ? 
> > > > yes > > > > > > I know this could be specified with LID routing, but I don't see > anywhere > > > in > > > the specification which says this is an error. > > > > > > I don't think it should be an error (certainly not for the form you are > > using LID routed part followed by a DR part) but a null DR part is a > little > > funny/odd. > > Yea I know. It turns out that the new iblinkinfo issues queries like this > when it recurses back from the last DR portion of the combined > route path. It only showed up as an error when using the -S option > of > iblinkinfo with this new switch I have. Works fine with the old switches. > > > > > > I do however seem to have 2 > > > different implementations on 2 different switches. For example: > > > > > > I have Switch A (Lid 1) and Switch B (Lid 7). I attempt to query > PortInfo > > > of > > > Port 1 of each switch using the LID followed by an empty DR path. > > > > > > 17:55:22 > ./smpquery -c portinfo 1 0 1 > > > ibwarn: [21005] mad_rpc: _do_madrpc failed; dport (Lid 1) > > > ./smpquery: iberror: failed: operation portinfo: port info query failed > > > > > > Is this a timeout ? > > yes > > 16:26:25 > ./smpquery -e -c portinfo 1 0 1 > ibwarn: [27150] _do_madrpc: retry 1 (timeout 1000 ms) > ibwarn: [27150] _do_madrpc: retry 2 (timeout 1000 ms) > ibwarn: [27150] _do_madrpc: timeout after 3 retries, 3000 ms > ibwarn: [27150] mad_rpc: _do_madrpc failed; dport (Lid 1) > ./smpquery: iberror: failed: operation portinfo: port info query failed > > > > > > > > > 17:55:31 > ./smpquery -c portinfo 7 0 1 > > > # Port info: Lid 7 port 1 > > > Mkey:............................0x0000000000000000 > > > GidPrefix:.......................0x0000000000000000 > > > ... > > > > > > > > > Detecting this special case in libibmad and turning the packet into a > LID > > > routed one > > > > > > Ugh... Is this special case really needed ? I don't think the underlying > > issue is understood sufficiently yet. 
> > Well I just did it to prove that what I was doing would work with a > "simple" > lid routed packet. Like I said it might be that this portid which is being > specified to libibmad by libibnetdisc is not valid. If that is true then > libibnetdisc should detect when the DR path is empty and go back to LID > routed > requests. That is a valid fix in my mind. Sure; there's no real need for combined route when the DR path is empty but it should work (at least with switches). > > > > succeeds but I wonder if this is an error in the SMI? > > > > > > Switch SMI ? Is this a proprietary implementation ? > > > > Yes I see the bug with 2 different vendors' switches. One is managed and > the > other is not. My "old" switches (3 different vendors) do not show this > behavior. (Just to be clear I now have 5 switches in my 5 node cluster! > ;-) > > > > > > > > I also notice this is an error on the HCA I am running from (lid 2). > > > > > > Is this HCA node OpenIB based ? > > yes If I recall correctly, there is something in the spec that disallows combined routing on HCA (and router) ports so this seems correct. I can dig this out if really needed. > > > 17:57:42 > ./smpquery -c portinfo 2 0 1 > > > ibwarn: [21008] mad_rpc: _do_madrpc failed; dport (Lid 2) > > > ./smpquery: iberror: failed: operation portinfo: port info query failed > > > > > > Is this also a timeout ? > > yes > > > > > Also, does the result differ based on where you source these from > > (locally v. remotely)? > > Same result local and remote. > > > > > > > > > > Running with a simple DR path works, > > > > > > You're referring to the same DR path here that fails in the combined > route > > examples above, right ? > > > > No. The example below is a DR path with Hop Count == 0 but without the > initial > LID routing. > > > > > > I guess because this is the loopback case mentioned on page 805. 
> > > > > > Yes but that's the high level requirement rather than the SMI rules which > > make that work. > > > > > > > > > 17:58:16 > ./smpquery -D portinfo 0 1 > > > # Port info: DR path slid 65535; dlid 65535; 0 port 1 > > > Mkey:............................0x0000000000000000 > > > GidPrefix:.......................0x2007000000000000 > > > ... > > > > > > > > > I guess that the comment "Since each part may be empty, there are > eight > > > combinations, although only four are really useful:" on line 36 Page > 805 > > > can > > > be interpreted to mean that only those 4 combinations need to be > supported. > > > Is this true? > > > > > > Not all 4 combinations are supported/known to work. When this was added > for > > ibportstate, the only combined routing form that was important was LID > > routed part followed by a DR part. > > > > When you say "known to work" you mean implemented with the diags? Or known > to > work in all hardware? The former with most hardware up to some time ago. Note there is no compliance testing of combined routing and heavy reliance on this makes some a little nervous. > > > > On the other hand I think strictly this should be supported. > > > > > > In an ideal world yes but are they all required or is it just the one > form > > most heavily used ? > > That is what I am unclear on. Does the spec require that all 8 > combinations > work? I don't see a specific compliance statement which says that > and I > am not sure if C14-9 and C14-13 cover all 8 combinations. I don't think there's any compliance statement on this. It all appears to be informative text. Perhaps a shortcoming of the spec. So there's nothing definitive. It just says there are 8 combinations (2**3 as there are 3 parts with 2 possibilities in each part) and that only 4 are really useful. 
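
The count of eight follows from three route parts that can each be empty or non-empty. A minimal sketch in C that enumerates them, assuming the three parts are the source-side LID-routed part, the directed-route part, and the destination-side LID-routed part. The enum names and enumerate_combinations() are hypothetical and not part of libibmad, and the sketch does not attempt to say which four combinations the spec considers useful.

```c
#include <stdio.h>

/* Hypothetical labels for the three parts of a combined route; the
 * names are illustrative and do not come from libibmad or the spec. */
enum {
	PART_SRC_LID = 1 << 0,	/* LID-routed part on the source side */
	PART_DR      = 1 << 1,	/* directed-route part */
	PART_DST_LID = 1 << 2,	/* LID-routed part on the destination side */
	PART_ALL     = (1 << 3) - 1
};

/* Print every empty/non-empty combination and return how many exist. */
static int enumerate_combinations(void)
{
	int mask, count = 0;

	for (mask = 0; mask <= PART_ALL; mask++) {
		printf("%d: src-LID=%s DR=%s dst-LID=%s\n", mask,
		       (mask & PART_SRC_LID) ? "used" : "empty",
		       (mask & PART_DR) ? "used" : "empty",
		       (mask & PART_DST_LID) ? "used" : "empty");
		count++;
	}
	return count;
}
```

Calling enumerate_combinations() from a main() prints one line per combination and returns 8. Under the labeling assumed here, the query discussed in this thread (a LID-routed part followed by an empty DR part) corresponds to mask 1.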
> > > > Item 4 of C14-9 > > > (line 24 page 810) requires the SMI to handle the packet if the > HopPointer > > > equals HopCount +1, which it is in my case (HopCount == 0, HopPointer > == 1) > > > > > > By handle, this means "The SMI *shall *output the packet on the port > whose > > number is in the entry indexed by Hop Pointer in the Initial Path. If > that > > port number is invalid, the SMI *shall *discard the SMP." > > > > Are you sure the Hop Pointer is 1 ? Where do you see this ? > > No I was wrong. I think I read the wrong madeye packet as I see the packet > right before this one did have a hop pointer of 1. I Added some debug > prints > to mad_encode to get the following output: > > 17:26:10 > ./smpquery -e -c portinfo 1 0 1 > trid 2a0f0cb5; HopCount 0; HopPointer 0; slid 0; dlid 0; 0, drpath->cnt 0 > trid 2a0f0cb6; HopCount 0; HopPointer 0; slid 0; dlid 0; 0, drpath->cnt 0 > trid 2a0f0cb7; HopCount 0; HopPointer 0; slid 2; dlid 65535; 0, drpath->cnt > 0 > ibwarn: [27322] _do_madrpc: recv failed: Connection timed out > ibwarn: [27322] mad_rpc: _do_madrpc failed; dport (Lid 1) > ./smpquery: iberror: failed: operation portinfo: port info query failed > > madeye for these packets: > > Aug 25 17:28:03 woprjr0 Madeye:recv SMP > Aug 25 17:28:03 woprjr0 MAD version....0x1 > Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP) > Aug 25 17:28:03 woprjr0 Class version..0x1 > Aug 25 17:28:03 woprjr0 Method.........0x81 (Get response) > Aug 25 17:28:03 woprjr0 Status.........0x8000 > Aug 25 17:28:03 woprjr0 Hop pointer....0x1 > Aug 25 17:28:03 woprjr0 Hop counter....0x0 > Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb5 > Aug 25 17:28:03 woprjr0 Attr ID........0x11 (node info) > Aug 25 17:28:03 woprjr0 Attr modifier..0x0000 > Aug 25 17:28:03 woprjr0 Mkey...........0x0 > Aug 25 17:28:03 woprjr0 DR SLID........0xffff > Aug 25 17:28:03 woprjr0 DR DLID........0xffff > Aug 25 17:28:03 woprjr0 Madeye:sent SMP > Aug 25 17:28:03 woprjr0 MAD version....0x1 > 
Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP) > Aug 25 17:28:03 woprjr0 Class version..0x1 > Aug 25 17:28:03 woprjr0 Method.........0x1 (Get) > Aug 25 17:28:03 woprjr0 Status.........0x00 > Aug 25 17:28:03 woprjr0 Hop pointer....0x1 > Aug 25 17:28:03 woprjr0 Hop counter....0x0 > Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb5 > Aug 25 17:28:03 woprjr0 Attr ID........0x11 (node info) > Aug 25 17:28:03 woprjr0 Attr modifier..0x0000 > Aug 25 17:28:03 woprjr0 Mkey...........0x0 > Aug 25 17:28:03 woprjr0 DR SLID........0xffff > Aug 25 17:28:03 woprjr0 DR DLID........0xffff > Aug 25 17:28:03 woprjr0 Madeye:recv SMP > Aug 25 17:28:03 woprjr0 MAD version....0x1 > Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP) > Aug 25 17:28:03 woprjr0 Class version..0x1 > Aug 25 17:28:03 woprjr0 Method.........0x81 (Get response) > Aug 25 17:28:03 woprjr0 Status.........0x8000 > Aug 25 17:28:03 woprjr0 Hop pointer....0x1 > Aug 25 17:28:03 woprjr0 Hop counter....0x0 > Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb6 > Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info) > Aug 25 17:28:03 woprjr0 Attr modifier..0x0000 > Aug 25 17:28:03 woprjr0 Mkey...........0x0 > Aug 25 17:28:03 woprjr0 DR SLID........0xffff > Aug 25 17:28:03 woprjr0 DR DLID........0xffff > Aug 25 17:28:03 woprjr0 Madeye:sent SMP > Aug 25 17:28:03 woprjr0 MAD version....0x1 > Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP) > Aug 25 17:28:03 woprjr0 Class version..0x1 > Aug 25 17:28:03 woprjr0 Method.........0x1 (Get) > Aug 25 17:28:03 woprjr0 Status.........0x00 > Aug 25 17:28:03 woprjr0 Hop pointer....0x1 > Aug 25 17:28:03 woprjr0 Hop counter....0x0 > Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb6 > Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info) > Aug 25 17:28:03 woprjr0 Attr modifier..0x0000 > Aug 25 17:28:03 woprjr0 Mkey...........0x0 > Aug 25 17:28:03 woprjr0 DR SLID........0xffff > Aug 25 17:28:03 woprjr0 DR DLID........0xffff > Aug 25 
17:28:03 woprjr0 Madeye:sent SMP > Aug 25 17:28:03 woprjr0 MAD version....0x1 > Aug 25 17:28:03 woprjr0 Class..........0x81 (Directed route SMP) > Aug 25 17:28:03 woprjr0 Class version..0x1 > Aug 25 17:28:03 woprjr0 Method.........0x1 (Get) > Aug 25 17:28:03 woprjr0 Status.........0x00 > Aug 25 17:28:03 woprjr0 Hop pointer....0x0 > Aug 25 17:28:03 woprjr0 Hop counter....0x0 > Aug 25 17:28:03 woprjr0 Trans ID.......0x1b9d2a0f0cb7 > Aug 25 17:28:03 woprjr0 Attr ID........0x15 (port info) > Aug 25 17:28:03 woprjr0 Attr modifier..0x0001 > Aug 25 17:28:03 woprjr0 Mkey...........0x0 > Aug 25 17:28:03 woprjr0 DR SLID........0x02 > Aug 25 17:28:03 woprjr0 DR DLID........0xffff > > No response is shown for trid 0x1b9d2a0f0cb7... > > As an aside I see the hop pointer is set to 1 at a lower level since > mad_encode does not do it. Right; the SMI would do that. So I guess the proper case for C14-9 would be "3) If Hop Pointer is equal to > Hop Count". (They are both 0.) I'm not sure; maybe C14-9 4) > > > If so, what's the initial path at this point (or more specifically index > 1 > > of the initial path) ? I think that needs to be port 0 (if a switch) but > > this is a little weird as I would think it should be handed to the SMA > which > > is a different case in the spec. > > Yes I think I was wrong on the case. But still wouldn't the SMI detect > that > this is the end of the DRPath and simply hand it to the SMA? Yes, that's what should happen. > > > > > > Then after processing > > > > > > by the SMA and doing the required returning initialization > > > > the SMI should return the packet as specified in C14-13 > > > item 3 on line 9 page 812. > > > > > > I'm not sure it would use this case in the case of an empty DR path on > > return. > > Actually I think it will use this. C14-9 item 3) states "the Hop Pointer > shall be incremented by 1" Therefore when the response is handed back to > the > SMI the Hop pointer will be 1 and the hop count 0. 
And the SMI uses the > DRSLID to send the packet back to the requester. It goes up to the SMA and then when the response is to be made it goes through returning SMI initialization and handling. -- Hal > > > Am I wrong? In the end it does not matter as I have to make the software > > > work > > > for all the hardware I have; so I will change the software. > > > > > > IMO it does matter as to where the problem lies (SMI or otherwise) and > how > > the layers are comprised in the implementation. > > Agreed. I am mainly confused because I have 2 different implementations of > this. My "old" switches seem to handle this case just fine. My "new" > switches do not. So I am really wondering what is going on. > > Here is the above output for the same query which works with an "old" > switch. > > 17:28:04 > ./smpquery -e -c portinfo 7 0 1 > ... > trid 1a4329de; HopCount 0; HopPointer 0; slid 2; dlid 65535; 0, drpath->cnt > 0 > ... > > Aug 25 17:46:40 woprjr0 Madeye:sent SMP > Aug 25 17:46:40 woprjr0 MAD version....0x1 > Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP) > Aug 25 17:46:40 woprjr0 Class version..0x1 > Aug 25 17:46:40 woprjr0 Method.........0x1 (Get) > Aug 25 17:46:40 woprjr0 Status.........0x00 > Aug 25 17:46:40 woprjr0 Hop pointer....0x0 > Aug 25 17:46:40 woprjr0 Hop counter....0x0 > Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de > Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info) > Aug 25 17:46:40 woprjr0 Attr modifier..0x0001 > Aug 25 17:46:40 woprjr0 Mkey...........0x0 > Aug 25 17:46:40 woprjr0 DR SLID........0x02 > Aug 25 17:46:40 woprjr0 DR DLID........0xffff > Aug 25 17:46:40 woprjr0 Madeye:recv SMP > Aug 25 17:46:40 woprjr0 MAD version....0x1 > Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP) > Aug 25 17:46:40 woprjr0 Class version..0x1 > Aug 25 17:46:40 woprjr0 Method.........0x81 (Get response) > Aug 25 17:46:40 woprjr0 Status.........0x8000 > Aug 25 17:46:40 woprjr0 Hop pointer....0x0 > Aug 25 17:46:40 
woprjr0 Hop counter....0x0 > Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de > Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info) > Aug 25 17:46:40 woprjr0 Attr modifier..0x0001 > Aug 25 17:46:40 woprjr0 Mkey...........0x0 > Aug 25 17:46:40 woprjr0 DR SLID........0x02 > Aug 25 17:46:40 woprjr0 DR DLID........0xffff > > Hop Pointer and Count are both 0 and things work just fine... > > > > > However, I wonder > > > where exactly the spec falls on this, because I think it will influence > > > where > > > the fix resides. If the spec does not allow this then I think it is > fine > > > to > > > have libibmad return an error since the user specified an invalid > combined > > > DR > > > path. However, if this should be legal I think libibmad should work > around > > > the bad hardware out there. > > > > > > Is it hardware or firmware that needs fixing ? I think it may depend on > the > > specific workaround for this as to whether it is acceptable as it might > harm > > something else or might violate the spec. > > I agree, however, if the switch hardware needs fixing I fear it is too late > for the ones I have. Firmware might be upgradable although I have had > issues > with un-managed switches in the past. > > So where do we put the fix in software? > Ira > > > -- Hal > > > > > > Thoughts? > > > Ira > > > > > > -- > > > Ira Weiny > > > Math Programmer/Computer Scientist > > > Lawrence Livermore National Lab > > > 925-423-8008 > > > weiny2 at llnl.gov > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://*lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit > > > http://*openib.org/mailman/listinfo/openib-general > > > > > > > > -- > Ira Weiny > Math Programmer/Computer Scientist > Lawrence Livermore National Lab > 925-423-8008 > weiny2 at llnl.gov > -------------- next part -------------- An HTML attachment was scrubbed... 
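The C14-9 case question discussed above (which item applies when Hop Pointer and Hop Count are both 0, and which applies once the Hop Pointer has been incremented) can be sketched as a small classifier. This is only an illustration: the function name c14_9_case is hypothetical and not part of libibmad or any SMI, and the item numbering is an assumption taken from how the items are quoted in this thread, so check it against the spec text before relying on it.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns which C14-9 item (1..5) would apply to an outgoing
 * directed-route SMP, looking only at Hop Pointer and Hop Count.
 * The item numbering is an assumption based on the discussion above.
 */
static int c14_9_case(uint8_t hop_ptr, uint8_t hop_cnt)
{
	if (hop_cnt != 0 && hop_ptr == 0)
		return 1;	/* non-zero Hop Count, Hop Pointer still 0 */
	if (hop_ptr != 0 && hop_ptr < hop_cnt)
		return 2;	/* somewhere in the middle of the DR path */
	if (hop_ptr == hop_cnt)
		return 3;	/* Hop Pointer equal to Hop Count */
	if (hop_ptr == hop_cnt + 1)
		return 4;	/* Hop Pointer one past Hop Count */
	return 5;		/* anything else is out of range */
}
```

With an empty DR path, c14_9_case(0, 0) selects item 3, and after the increment by 1 mentioned above, c14_9_case(1, 0) selects item 4.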
URL: From hal.rosenstock at gmail.com Wed Aug 26 08:15:03 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 26 Aug 2009 11:15:03 -0400 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD In-Reply-To: <200908261337.56128.fenkes@de.ibm.com> References: <200908261337.56128.fenkes@de.ibm.com> Message-ID: On 8/26/09, Joachim Fenkes wrote: > > The old code used a lot of hardcoded values, which might not be valid in > all environments (especially routed fabrics or partitioned subnets). Copy as > much information as possible from the incoming request to prevent that. > > Signed-off-by: Joachim Fenkes > --- > > Hal, Jason -- here's the change I promised. Looks okay to you? > Roland -- if Hal and Jason don't object, please queue this up for the next > kernel. Thanks! Thanks for doing this. It looks sane to me. The only issue I recall that appears to remain is a better setting of ClassPortInfo:RespTimeValue rather than hardcoding it. Perhaps using the value from PortInfo is the way to go (ideally it would be that value from the port to which the requester is being redirected, but that might not be so easy to get from this port; I guess that could be an SA Get PortInfoRecord for that port, but that is a larger change and it is likely to be the same as for the local port issuing the redirect response).
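For a sense of what the hardcoded RespTimeValue means in practice, here is a minimal sketch that converts the field to a time. It assumes the usual IBA encoding, in which the response time is 4.096 usec * 2^RespTimeValue and the field is 5 bits wide; the helper name resp_time_usec is hypothetical and is not an existing kernel or libibmad function.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Assumes the IBA encoding: time = 4.096 usec * 2^RespTimeValue,
 * with RespTimeValue being a 5-bit field (0..31).
 */
static double resp_time_usec(uint8_t resp_time_value)
{
	/* Mask to 5 bits so the shift stays within the defined range. */
	return 4.096 * (double)(1ULL << (resp_time_value & 0x1f));
}
```

Under that assumption, the value 18 set by the patch works out to 4.096 usec * 262144, or roughly 1.07 seconds.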
-- Hal Regards, > Joachim > > drivers/infiniband/hw/ehca/ehca_sqp.c | 47 > ++++++++++++++++++++++++++++---- > 1 files changed, 41 insertions(+), 6 deletions(-) > > diff --git a/drivers/infiniband/hw/ehca/ehca_sqp.c > b/drivers/infiniband/hw/ehca/ehca_sqp.c > index c568b28..8c1213f 100644 > --- a/drivers/infiniband/hw/ehca/ehca_sqp.c > +++ b/drivers/infiniband/hw/ehca/ehca_sqp.c > @@ -125,14 +125,30 @@ struct ib_perf { > u8 data[192]; > } __attribute__ ((packed)); > > +/* TC/SL/FL packed into 32 bits, as in ClassPortInfo */ > +struct tcslfl { > + u32 tc:8; > + u32 sl:4; > + u32 fl:20; > +} __attribute__ ((packed)); > + > +/* IP Version/TC/FL packed into 32 bits, as in GRH */ > +struct vertcfl { > + u32 ver:4; > + u32 tc:8; > + u32 fl:20; > +} __attribute__ ((packed)); > > static int ehca_process_perf(struct ib_device *ibdev, u8 port_num, > + struct ib_wc *in_wc, struct ib_grh *in_grh, > struct ib_mad *in_mad, struct ib_mad *out_mad) > { > struct ib_perf *in_perf = (struct ib_perf *)in_mad; > struct ib_perf *out_perf = (struct ib_perf *)out_mad; > struct ib_class_port_info *poi = > (struct ib_class_port_info *)out_perf->data; > + struct tcslfl *tcslfl = > + (struct tcslfl *)&poi->redirect_tcslfl; > struct ehca_shca *shca = > container_of(ibdev, struct ehca_shca, ib_device); > struct ehca_sport *sport = &shca->sport[port_num - 1]; > @@ -158,10 +174,29 @@ static int ehca_process_perf(struct ib_device *ibdev, > u8 port_num, > poi->base_version = 1; > poi->class_version = 1; > poi->resp_time_value = 18; > - poi->redirect_lid = sport->saved_attr.lid; > - poi->redirect_qp = sport->pma_qp_nr; > + > + /* copy local routing information from WC where applicable > */ > + tcslfl->sl = in_wc->sl; > + poi->redirect_lid = > + sport->saved_attr.lid | in_wc->dlid_path_bits; > + poi->redirect_qp = sport->pma_qp_nr; > poi->redirect_qkey = IB_QP1_QKEY; > - poi->redirect_pkey = IB_DEFAULT_PKEY_FULL; > + > + ehca_query_pkey(ibdev, port_num, in_wc->pkey_index, > + &poi->redirect_pkey); > + 
> + /* if request was globally routed, copy route info */ > + if (in_grh) { > + struct vertcfl *vertcfl = > + (struct vertcfl > *)&in_grh->version_tclass_flow; > + memcpy(poi->redirect_gid, in_grh->dgid.raw, > + sizeof(poi->redirect_gid)); > + tcslfl->tc = vertcfl->tc; > + tcslfl->fl = vertcfl->fl; > + } else > + /* else only fill in default GID */ > + ehca_query_gid(ibdev, port_num, 0, > + (union ib_gid *)&poi->redirect_gid); > > ehca_dbg(ibdev, "ehca_pma_lid=%x ehca_pma_qp=%x", > sport->saved_attr.lid, sport->pma_qp_nr); > @@ -183,8 +218,7 @@ perf_reply: > > int ehca_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, > struct ib_wc *in_wc, struct ib_grh *in_grh, > - struct ib_mad *in_mad, > - struct ib_mad *out_mad) > + struct ib_mad *in_mad, struct ib_mad *out_mad) > { > int ret; > > @@ -196,7 +230,8 @@ int ehca_process_mad(struct ib_device *ibdev, int > mad_flags, u8 port_num, > return IB_MAD_RESULT_SUCCESS; > > ehca_dbg(ibdev, "port_num=%x src_qp=%x", port_num, in_wc->src_qp); > - ret = ehca_process_perf(ibdev, port_num, in_mad, out_mad); > + ret = ehca_process_perf(ibdev, port_num, in_wc, in_grh, > + in_mad, out_mad); > > return ret; > } > -- > 1.6.0.4 > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Wed Aug 26 08:20:16 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 26 Aug 2009 11:20:16 -0400 Subject: [ofa-general] ofed 1.3.2 opensmd failover In-Reply-To: <92daa7bf0908251915m35f9c28fg4aee596db24a544b@mail.gmail.com> References: <20090825162517.7955C21C827@f28.poczta.interia.pl> <92daa7bf0908251915m35f9c28fg4aee596db24a544b@mail.gmail.com> Message-ID: On 8/25/09, PN wrote: > > HI, > > I can think of a situation in which all servers have dual port IB cards and > need failover of OpenSM to achieve HA. 
> As I know, OpenSM can only bind to 1 port at a time, Yes. so do I need to start 2 OpenSM in server A and 2 OpenSM in server B? That would be one valid configuration. I'm assuming all ports are connected to the same subnet. Will they use the same guid2lid file? Depends how the OpenSM configuration is done. Do I need to set something in the config file or they will automatically > communicate with each other? What communication are you referring to ? They all need to share the same subnet prefix. > Do I need to run sldd.sh manually or it will automatically sync with other > OpenSM? You can manually copy the guid2lid file around to the appropriate places. I'm not that familiar with sldd.sh but I think it can either be run manually or made to run automatically but I'm not familiar with the details. -- Hal Thanks a lot. > > Regards, > PN > > > > > 2009/8/26 Hal Rosenstock > >> >> >> On 8/25/09, kovlensky at interia.pl wrote: >>> >>> Hi all, >>> >>> Quick question - is there a need to run anything except opensmd daemons >>> to provide failover capability on ib network in ofed 1.3? >> >> >> In terms of SM failover, modulo bugs fixed relative to this feature since >> OFED 1.3 (there are a couple of things here which may affect your >> environment if I recall correctly), you only need to run more than 1 SM for >> this (one will become master, the other standby). >> >> I'm aware that when master manager dies standby one comes in and manages >>> the network, but that does not necessarily mean that lids are preserved, >>> especially for nodes joining in. I used to run sldd.sh for distributing lids >>> list on ofed 1.2.5, but while this script seems to be in place no one >>> mentions necessity for it. >> >> >> So subnet manager failover is provided by running standby opensm. >> >> >> And how is LID preservation provided?
>> >> >> If you want LIDs to be preserved, the guid2lid file needs to be sync'd >> (copied from the master SM once it's fully assembled to the node which is >> running the standby SM). That's what the sldd.sh script does. >> >> -- Hal >> >> Regards, >>> >>> Zdenek Kovlensky >>> >>> ---------------------------------------------------------------------- >>> Kup wlasne mieszkanie za 33 tys. zl! >>> Sprawdz >>> http://link.interia.pl/f22f2 >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >>> >> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > > > > -- > Best Regards, > PN Lai > HPC Specialist > Galactic Computng Corp. > Tel: 86-755-26733939 ext 826 > Mobile: 86-13823161729 > Fax: 86-755-26733780 > URL: http://www.galactic.com.hk > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Wed Aug 26 08:23:53 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 26 Aug 2009 11:23:53 -0400 Subject: [ofa-general] Problems with OpenSM from ofed 1.4.1 and MESH topology. In-Reply-To: <4A949EFE.9000302@Sun.COM> References: <4A92DFEC.3010300@Sun.COM> <4A949EFE.9000302@Sun.COM> Message-ID: Hi Rafael, On 8/25/09, Rafael David Tinoco wrote: > > Hello Hal, > > Bellow... > > Hal Rosenstock wrote: > > > > On 8/24/09, Rafael David Tinoco wrote: >> >> Hello, >> >> I'm installing an HPC cluster using 2 Sun Blades 6048 with QNEMs (2 asics >> each, 8 qnems). >> They are configured in a MESH topology. >> I'm using Centos 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5. 
>> >> I'm booting PXE from IB, my initrd image is bringing the ib0 interface, >> getting the squashfs image and mounting with aufs. >> >> The problem is.. When booting more then 60 nodes, I start to get above >> errors on subnet manager. >> And the problem seems to be intermitent, because each time it gives errors >> on different path. >> >> Any ideas ? >> >> Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting >> Generic Notice type:3 num:64 (GID in service) from LID:1 >> GID:fe80::5080:200:8d:9931 >> Aug 24 15:36:19 713838 [48D7D940] 0x02 -> >> __osm_state_mgr_report_new_ports: Discovered new port with >> GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1 >> Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting >> Generic Notice type:3 num:64 (GID in service) from LID:1 >> GID:fe80::5080:200:8d:9931 >> Aug 24 15:36:19 713842 [48D7D940] 0x02 -> >> __osm_state_mgr_report_new_ports: Discovered new port with >> GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1 >> Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting >> Generic Notice type:3 num:64 (GID in service) from LID:1 >> GID:fe80::5080:200:8d:9931 >> Aug 24 15:36:19 713847 [48D7D940] 0x02 -> >> __osm_state_mgr_report_new_ports: Discovered new port with >> GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1 >> Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting >> Generic Notice type:3 num:64 (GID in service) from LID:1 >> GID:fe80::5080:200:8d:9931 >> Aug 24 15:36:19 713866 [48D7D940] 0x02 -> >> __osm_state_mgr_report_new_ports: Discovered new port with >> GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1 >> Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting >> Generic Notice type:3 num:64 (GID in service) from LID:1 >> GID:fe80::5080:200:8d:9931 >> Aug 24 15:36:19 713871 [48D7D940] 0x02 -> >> __osm_state_mgr_report_new_ports: Discovered new port with >> 
GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1 >> Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP >> Aug 24 15:36:19 714805 [48D7D940] 0x01 -> >> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node >> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 19. >> Adding to light sweep sampling list >> Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop >> path: >> Path = 0,1,15,15,15 >> Aug 24 15:36:19 714822 [48D7D940] 0x01 -> >> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node >> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 21. >> Adding to light sweep sampling list >> Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop >> path: >> Path = 0,1,15,15,15 >> Aug 24 15:36:19 714831 [48D7D940] 0x01 -> >> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node >> 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 25. 
>> Adding to light sweep sampling list >> Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop >> path: >> Path = 0,1,15,15,15 >> Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send >> completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) -- >> dropping >> Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP >> Hop Ptr: 0x0 >> Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path: >> Initial path = 0,0,0,0,0,0 >> Return path = 0,0,0,0,0,0 >> Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: >> ERR 3113: MAD completed in error (IB_TIMEOUT) >> Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump: >> base_ver................0x1 >> mgmt_class..............0x81 >> class_ver...............0x1 >> method..................0x1 (SubnGet) >> D bit...................0x0 >> status..................0x0 >> hop_ptr.................0x0 >> hop_count...............0x5 >> trans_id................0x36595 >> attr_id.................0x15 (PortInfo) >> resv....................0x0 >> attr_mod................0x0 >> m_key...................0x0000000000000000 >> dr_slid.................65535 >> dr_dlid.................65535 >> >> Initial path: 0,1,15,15,15,19 >> Return path: 0,0,0,0,0,0 >> Reserved: [0][0][0][0][0][0][0] >> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send >> completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) -- >> dropping >> Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP >> Hop Ptr: 0x0 >> Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path: >> Initial path = 0,0,0,0,0,0 >> Return path = 0,0,0,0,0,0 >> Aug 24 15:36:20 514375 [4977E940] 0x01 -> 
__osm_sm_mad_ctrl_send_err_cb: >> ERR 3113: MAD completed in error (IB_TIMEOUT) >> Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump: >> base_ver................0x1 >> mgmt_class..............0x81 >> class_ver...............0x1 >> method..................0x1 (SubnGet) >> D bit...................0x0 >> status..................0x0 >> hop_ptr.................0x0 >> hop_count...............0x5 >> trans_id................0x36596 >> attr_id.................0x15 (PortInfo) >> resv....................0x0 >> .... >> > > These errors are transient as you indicate. They mean that some node has > brought the link physically up but there is no SMA at the remote side of the > link. The different paths are paths to the HCAs. This occurs during PXE boot > as the node transitions from the boot ROM to the Linux environment. > > > They are transient.. but sometimes opensm hangs with the same message and > loops this errors messages. > Are you sure OpenSM hangs ? If so, any idea where ? First I was using centos 5.3 kernel with updates .. and the IPoIB stopped > working after these messages. > Any specifics ? Using the "vanilla" centos 5.3 kernel solved this issue. > But SOMETIMES, booting the nodes, these messages appear and dont go away. > In those cases, do the nodes succesfully boot up ? Other than these messages, do things seem to work in terms of the end > nodes ? > > They seem to work with vanilla kernel. Even with the messages, no problems > reaching the nodes so far. > Do your ULPs work (like IPoIB, etc.) ? -- Hal Tks > > Rafael Tinoco > > > -- Hal > > _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dave.olson at qlogic.com Wed Aug 26 08:36:49 2009 From: dave.olson at qlogic.com (Dave Olson) Date: Wed, 26 Aug 2009 08:36:49 -0700 Subject: [ofa-general] Problems using ofed 1.4.2 and Infinipath cards In-Reply-To: <1251277761.28564.45.camel@pyren.uio.no> References: <1251277761.28564.45.camel@pyren.uio.no> Message-ID: On Wed, 26 Aug 2009, Ole Widar Saastad wrote: | I am experiencing problems using the Infinipath cards and the OFED | stack. (details are given below). | | It seems to be a problem somewhere when mpi packet size grows above 2k. | This is what I recall the changeover from one transport mechanism to | another ? The problem is that openmpi prior to fairly recent releases had problems with MTUs that didn't match the config file (mellanox as well as infinipath). Since infinipath cards are defaulted in the config file to 4KB MTU, this showed up most on our cards, if you were running with 2K MTU. So you either need to fix mca-btl-openib-device-params.ini on all nodes to say 2048, or override in your local configs or command lines. More recent versions of openmpi have this fixed (1.3.2 for sure, maybe all 1.3, I don't remember). Dave Olson dave.olson at qlogic.com From poknam at gmail.com Wed Aug 26 08:44:11 2009 From: poknam at gmail.com (PN) Date: Wed, 26 Aug 2009 23:44:11 +0800 Subject: [ofa-general] ofed 1.3.2 opensmd failover In-Reply-To: References: <20090825162517.7955C21C827@f28.poczta.interia.pl> <92daa7bf0908251915m35f9c28fg4aee596db24a544b@mail.gmail.com> Message-ID: <92daa7bf0908260844s7d0d5fat1215283cbc66965e@mail.gmail.com> 2009/8/26 Hal Rosenstock > > > On 8/25/09, PN wrote: >> >> HI, >> >> I can think of a situation in which all servers have dual port IB cards >> and need failover of OpenSM to achieve HA. >> As I know, OpenSM can only bind to 1 port at a time, > > > Yes. > > so do I need to start 2 OpenSM in server A and 2 OpenSM in server B? > > > That would be one valid configuration. 
I'm assuming all ports are connected > to the same subnet. > In some cases, I will use IB bonding, while in other cases I may use 1 port for calculation and another port to connect the storage. I'm not sure which configuration will provide better performance. > Will they use the same guid2lid file? > > > Depends how the OpenSM configuration is done. > > Do I need to set something in the config file or they will automatically >> communicate with each other? > > > What communication are you referring to ? They all need to share the same > subnet prefix. > I mean the handover mechanism. I remember that in the previous OpenSM config file (in OFED 1.2.x/1.3.x) there was a field listing all the subnet managers in the subnet, but this field is omitted in the new version. I wonder whether all the OpenSM instances will automatically discover each other and do the handover correctly. Thanks. PN > > > >> Do I need to run sldd.sh manually or it will automatically sync with other >> OpenSM? > > > You can manually copy the guid2lid file around to the appropriate > places. I'm not that familiar with sldd.sh but I think it can either be run > manually or made to run automatically but I'm not familiar with the details. > > -- Hal > > > Thanks a lot. >> >> Regards, >> PN >> >> >> >> >> 2009/8/26 Hal Rosenstock >> >>> >>> >>> On 8/25/09, kovlensky at interia.pl wrote: >>>> >>>> Hi all, >>>> >>>> Quick question - is there a need to run anything except opensmd daemons >>>> to provide failover capability on ib network in ofed 1.3? >>> >>> >>> In terms of SM failover, modulo bugs fixed relative to this feature since >>> OFED 1.3 (there are a couple of things here which may affect your >>> environment if I recall correctly), you only need to run more than 1 SM for >>> this (one will become master, the other standby).
>>> >>> I'm aware that when master manager dies standby one comes in and manages >>>> the network, but that does not necessary means that lids are preserved, >>>> especially for nodes joining in. I used to run sldd.sh for distributing lids >>>> list on ofed 1.2.5, but while this script seems to be in place noone >>>> mentions necessity for it. >>> >>> >>> So subnet manager failover is provided by running standby opensm. >>> >>> >>> And how LID preservation is provided? >>> >>> >>> If you want LIDs to be preserved, the guid2lid file needs to be sync'd >>> (copied from the master SM once it's fully assembled to the node which is >>> running the standby SM). That's what the sldd.sh script does. >>> >>> -- Hal >>> >>> Regards, >>>> >>>> Zdenek Kovlensky >>>> >>>> ---------------------------------------------------------------------- >>>> Kup wlasne mieszkanie za 33 tys. zl! >>>> Sprawdz >>> http://link.interia.pl/f22f2 >>>> >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit >>>> http://openib.org/mailman/listinfo/openib-general >>>> >>> >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >>> >> >> >> >> -- >> Best Regards, >> PN Lai >> HPC Specialist >> Galactic Computng Corp. >> Tel: 86-755-26733939 ext 826 >> Mobile: 86-13823161729 >> Fax: 86-755-26733780 >> URL: http://www.galactic.com.hk >> > > -- Best Regards, PN Lai HPC Specialist Galactic Computng Corp. Tel: 86-755-26733939 ext 826 Mobile: 86-13823161729 Fax: 86-755-26733780 URL: http://www.galactic.com.hk -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hnrose at comcast.net Wed Aug 26 08:54:47 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 26 Aug 2009 11:54:47 -0400 Subject: [ofa-general] [PATCH] opensm/ib_types.h: Add CounterSelect2 field to PortCounters attribute Message-ID: <20090826155447.GA25235@comcast.net> Per MgtWG RefID #4527 Also, cosmetic commentary change Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index fe3f051..42ec794 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -4377,8 +4377,8 @@ ib_node_info_get_vendor_id(IN const ib_node_info_t * const p_ni) #include typedef struct _ib_node_desc { - // Node String is an array of UTF-8 character that - // describes the node in text format + // Node String is an array of UTF-8 characters + // that describe the node in text format // Note that this string is NOT NULL TERMINATED! uint8_t description[IB_NODE_DESCRIPTION_SIZE]; } PACK_SUFFIX ib_node_desc_t; @@ -7737,9 +7737,9 @@ typedef struct _ib_port_counters { ib_net16_t xmit_discards; uint8_t xmit_constraint_err; uint8_t rcv_constraint_err; - uint8_t res1; + uint8_t counter_select2; uint8_t link_int_buffer_overrun; - ib_net16_t res2; + ib_net16_t resv; ib_net16_t vl15_dropped; ib_net32_t xmit_data; ib_net32_t rcv_data; From hnrose at comcast.net Wed Aug 26 09:12:23 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 26 Aug 2009 12:12:23 -0400 Subject: [ofa-general] [PATCH] infiniband-diags/perfquery.c: Indicate whether PortXmitWait counter is supported Message-ID: <20090826161223.GA30257@comcast.net> Indicate extended v. 
(normal) port counters in output Also, some cosmetic formatting changes and commentary typo fixed Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 39ae2f6..0fd083e 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -277,8 +278,8 @@ static void output_aggregate_perfcounters_ext(ib_portid_t * portid) mad_dump_perfcounters_ext(buf, sizeof buf, pc, sizeof pc); - printf("# Port counters: %s port %d\n%s", portid2str(portid), ALL_PORTS, - buf); + printf("# Port extended counters: %s port %d\n%s", portid2str(portid), + ALL_PORTS, buf); } static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, @@ -291,7 +292,8 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, IB_GSI_PORT_COUNTERS, srcport)) IBERROR("perfquery"); if (!(cap_mask & 0x1000)) { - /* if PortCounters:PortXmitWait not suppported clear this counter */ + /* if PortCounters:PortXmitWait not supported clear this counter */ + IBWARN("PortXmitWait not indicated so ignore this counter"); perf_count.xmtwait = 0; mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); @@ -316,9 +318,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, sizeof pc); } - if (!aggregate) - printf("# Port counters: %s port %d\n%s", portid2str(portid), - port, buf); + if (!aggregate) { + if (extended) + printf("# Port extended counters: %s port %d\n%s", + portid2str(portid), port, buf); + else + printf("# Port counters: %s port %d\n%s", + portid2str(portid), port, buf); + } } static void reset_counters(int extended, int timeout, 
int mask, @@ -421,9 +428,8 @@ static int process_opt(void *context, int ch, char *optarg) int main(int argc, char **argv) { - int mgmt_classes[4] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, - IB_PERFORMANCE_CLASS - }; + int mgmt_classes[4] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, + IB_PERFORMANCE_CLASS}; ib_portid_t portid = { 0 }; int mask = 0xffff; uint16_t cap_mask; @@ -553,7 +559,6 @@ int main(int argc, char **argv) goto done; do_reset: - if (argc <= 2 && !extended && (cap_mask & 0x1000)) mask |= (1 << 16); /* reset portxmitwait */ From weiny2 at llnl.gov Wed Aug 26 10:29:57 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 26 Aug 2009 10:29:57 -0700 Subject: [ofa-general] [PATCH] infiniband-diags/libibnetdisc: add missing '\n' to error message Message-ID: <20090826102957.bed66987.weiny2@llnl.gov> From: Ira Weiny Date: Fri, 21 Aug 2009 15:01:00 -0700 Subject: [PATCH] infiniband-diags/libibnetdisc: add missing '\n' to error message Signed-off-by: Ira Weiny --- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index c69467e..bbb0fbb 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -615,7 +615,7 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port * ibmad_port, if (get_port_info(ibmad_port, fabric, &port_buf, i, path)) { IBND_ERROR - ("can't reach node %s port %d", + ("can't reach node %s port %d\n", portid2str(path), i); continue; } -- 1.5.4.5 From weiny2 at llnl.gov Wed Aug 26 10:31:20 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 26 Aug 2009 10:31:20 -0700 Subject: [ofa-general] Combined DR path with empty DR path, what is the expected behavior? 
In-Reply-To: References: <20090824185206.39e5e377.weiny2@llnl.gov> <20090825175543.4f929646.weiny2@llnl.gov> Message-ID: <20090826103120.569b5deb.weiny2@llnl.gov> On Wed, 26 Aug 2009 10:55:41 -0400 Hal Rosenstock wrote: > On 8/25/09, Ira Weiny wrote: > > > > On Tue, 25 Aug 2009 19:15:19 -0400 > > Hal Rosenstock wrote: > > > > > On 8/24/09, Ira Weiny wrote: > > > [snip] > > > > > > > > > Not all 4 combinations are supported/known to work. When this was added > > for > > > ibportstate, the only combined routing form that was important was LID > > > routed part followed by a DR part. > > > > > > > When you say "known to work" you mean implemented with the diags? Or known > > to > > work in all hardware? > > > The former with most hardware up to some time ago. Note there is no > compliance testing of combined routing and heavy reliance on this makes some > a little nervous. Ok, Good to know. With this, and the rest of your response, in mind I went ahead and created a patch to libibnetdisc which will go back to LID routing when the Hop Count is returned to 0. Patch to follow. > > > > > > > On the other hand I think strictly this should be supported. > > > > > > > > > In an ideal world yes but are they all required or is it just the one > > form > > > most heavily used ? > > > > That is what I am unclear on. Does the spec require that all 8 > > combinations > > are required to work? I don't see a specific compliance which says that > > and I > > am not sure if C14-9 and C14-13 cover all 8 combinations. > > > I don't think there's any compliance on this. It all appears to be > informative text. Perhaps a shortcoming of the spec. So there's nothing > definitive. It just says there are 8 combinations (2**3 as there are 3 parts > with 2 possibilities in each part) and that only 4 are really useful. Well I agree that only 4 are "useful". It is just the algorithm which libibnetdisc used which resulted in this "weird" case. 
[snip] > > > > > > If so, what's the initial path at this point (or more specifically index > > 1 > > > of the initial path) ? I think that needs to be port 0 (if a switch) but > > > this is a little weird as I would think it should be handed to the SMA > > which > > > is different cases in the spec. > > > > Yes I think I was wrong on the case. But still wouldn't the SMI detect > > that > > this is the end of the DRPath and simply hand it to the SMA. > > > Yes, that's what should happen. I am going to take this up with the switch vendors and see what their interpretation is. For the time being I think my patch will fix libibnetdisc (iblinkinfo). Thanks again! Ira > > > > > > > > > > Then after processing > > > > > > > > > by the SMA and doing the required returning initialization > > > > > > the SMI should return the packet as specified in C14-13 > > > > item 3 on line 9 page 812. > > > > > > > > > I'm not sure it would use this case in the case of an empty DR pafh on > > > return. > > > > Actually I think it will use this. C14-9 item 3) states "the Hop Pointer > > shall be incremented by 1" Therefore when the response is handed back to > > the > > SMI the Hop pointer will be 1 and the hop count 0. And the SMI uses the > > DRSLID to send the packet back to the requester. > > > It goes up to the SMA and then when the response is to be made it goes > through returning SMI initialization and handling. > > -- Hal > > > > > > Am I wrong? In the end it does not matter as I have to make the software > > > > work > > > > for all the hardware I have; so I will change the software. > > > > > > > > > IMO it does matter as to where the problem lies (SMI or otherwise) and > > how > > > the layers are comprised in the implementation. > > > > Agreed. I am mainly confused because I have 2 different implementations of > > this. My "old" switches seem to handle this case just fine. My "new" > > switches do not. So I am really wondering what is going on. 
> > > > Here is the above output for the same query which works with an "old" > > switch. > > > > 17:28:04 > ./smpquery -e -c portinfo 7 0 1 > > ... > > trid 1a4329de; HopCount 0; HopPointer 0; slid 2; dlid 65535; 0, drpath->cnt > > 0 > > ... > > > > Aug 25 17:46:40 woprjr0 Madeye:sent SMP > > Aug 25 17:46:40 woprjr0 MAD version....0x1 > > Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP) > > Aug 25 17:46:40 woprjr0 Class version..0x1 > > Aug 25 17:46:40 woprjr0 Method.........0x1 (Get) > > Aug 25 17:46:40 woprjr0 Status.........0x00 > > Aug 25 17:46:40 woprjr0 Hop pointer....0x0 > > Aug 25 17:46:40 woprjr0 Hop counter....0x0 > > Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de > > Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info) > > Aug 25 17:46:40 woprjr0 Attr modifier..0x0001 > > Aug 25 17:46:40 woprjr0 Mkey...........0x0 > > Aug 25 17:46:40 woprjr0 DR SLID........0x02 > > Aug 25 17:46:40 woprjr0 DR DLID........0xffff > > Aug 25 17:46:40 woprjr0 Madeye:recv SMP > > Aug 25 17:46:40 woprjr0 MAD version....0x1 > > Aug 25 17:46:40 woprjr0 Class..........0x81 (Directed route SMP) > > Aug 25 17:46:40 woprjr0 Class version..0x1 > > Aug 25 17:46:40 woprjr0 Method.........0x81 (Get response) > > Aug 25 17:46:40 woprjr0 Status.........0x8000 > > Aug 25 17:46:40 woprjr0 Hop pointer....0x0 > > Aug 25 17:46:40 woprjr0 Hop counter....0x0 > > Aug 25 17:46:40 woprjr0 Trans ID.......0x1ba01a4329de > > Aug 25 17:46:40 woprjr0 Attr ID........0x15 (port info) > > Aug 25 17:46:40 woprjr0 Attr modifier..0x0001 > > Aug 25 17:46:40 woprjr0 Mkey...........0x0 > > Aug 25 17:46:40 woprjr0 DR SLID........0x02 > > Aug 25 17:46:40 woprjr0 DR DLID........0xffff > > > > Hop Pointer and Count are both 0 and things work just fine... > > > > > > > > However, I wonder > > > > where exactly the spec falls on this, because I think it will influence > > > > where > > > > the fix resides. 
If the spec does not allow this then I think it is > > fine > > > > to > > > > have libibmad return an error since the user specified an invalid > > combined > > > > DR > > > > path. However, if this should be legal I think libibmad should work > > around > > > > the bad hardware out there. > > > > > > > > > Is it hardware or firmware that needs fixing ? I think it may depend on > > the > > > specific workaround for this as to whether it is acceptable as it might > > harm > > > something else or might violate the spec. > > > > I agree, however, if the switch hardware needs fixing I fear it is too late > > for the ones I have. Firmware might be upgradable although I have had > > issues > > with un-managed switches in the past. > > > > So where do we put the fix in software? > > > Ira > > > > > -- Hal > > > > > > > > > Thoughts? > > > > Ira > > > > > > > > -- > > > > Ira Weiny > > > > Math Programmer/Computer Scientist > > > > Lawrence Livermore National Lab > > > > 925-423-8008 > > > > weiny2 at llnl.gov > > > > _______________________________________________ > > > > general mailing list > > > > general at lists.openfabrics.org > > > > http://**lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > To unsubscribe, please visit > > > > http://**openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > -- > > Ira Weiny > > Math Programmer/Computer Scientist > > Lawrence Livermore National Lab > > 925-423-8008 > > weiny2 at llnl.gov > > > -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab 925-423-8008 weiny2 at llnl.gov From weiny2 at llnl.gov Wed Aug 26 10:31:42 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 26 Aug 2009 10:31:42 -0700 Subject: [ofa-general] [PATCH] libibnetdisc: add retract_dpath function Message-ID: <20090826103142.660ac83b.weiny2@llnl.gov> From: Ira Weiny Date: Wed, 26 Aug 2009 09:25:00 -0700 Subject: [PATCH] libibnetdisc: add retract_dpath function When using combined routing some 
switches do not handle Hop Count of 0 well. Detect when the drpath count is 0
and return to lid based routing in this case.

Signed-off-by: Ira Weiny
---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index c69467e..da8572c 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -175,6 +175,16 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport)
 	return path->cnt;
 }
 
+static void retract_dpath(ib_portid_t * path)
+{
+	path->drpath.cnt--;	/* restore path */
+	if (path->drpath.cnt == 0 && path->lid) {
+		/* return to lid based routing on this path */
+		path->drpath.drslid = 0;
+		path->drpath.drdlid = 0;
+	}
+}
+
 static int extend_dpath(struct ibmad_port *ibmad_port, ibnd_fabric_t * fabric,
 			ib_portid_t * portid, int nextport)
 {
@@ -502,7 +512,7 @@ static int get_remote_node(struct ibmad_port *ibmad_port,
 	if (query_node(ibmad_port, fabric, &node_buf, &port_buf, path)) {
 		IBND_ERROR("Query remote node (%s) failed, skipping port\n",
 			   portid2str(path));
-		path->drpath.cnt--;	/* restore path */
+		retract_dpath(path);
 		return 1;	/* positive == non-fatal error */
 	}
 
@@ -530,7 +540,7 @@ static int get_remote_node(struct ibmad_port *ibmad_port,
 	link_ports(node, port, remotenode, remoteport);
 
 error:
-	path->drpath.cnt--;	/* restore path */
+	retract_dpath(path);
 	return (rc);
 }
 
-- 
1.5.4.5

From jgunthorpe at obsidianresearch.com Wed Aug 26 11:04:57 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 26 Aug 2009 12:04:57 -0600
Subject: [ofa-general] [PATCH] IPoIB: check multicast address format
In-Reply-To: <4A94FB67.6050600@voltaire.com>
References: <20090821000431.GA5713@obsidianresearch.com> <4A94FB67.6050600@voltaire.com>
Message-ID: <20090826180457.GR406@obsidianresearch.com>

On Wed, Aug 26, 2009 at 12:07:51PM +0300,
Or Gerlitz wrote: > isn't Jason's approach enough for the bonding case?! I saw that your > patch ("bonding: clean muticast addresses when device changes type" I think working versions of all three patches are required: 1) Fix the bonding driver. Otherwise the right groups might not be joined. 2) Check the address format, to protect against 'ip maddr add' and other wakkyness 3) Fix the timeout handling, so mlid exhaustion and other SA side errors are handled elegantly. All are bugs.. > and maybe also in mainline .31-rcX . However, it has the > down-side-effect of e.g loosing routes already set for the the bond > while adding the underline IPoIB devices, so if Jason's patch is > enough Is this true? That is pretty ugly, but probably manageable.. -- Jason Gunthorpe (780)4406067x832 Chief Technology Officer, Obsidian Research Corp Edmonton, Canada From ralph.campbell at qlogic.com Wed Aug 26 12:01:27 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 26 Aug 2009 12:01:27 -0700 Subject: [ofa-general] Problems using ofed 1.4.2 and Infinipath cards In-Reply-To: <1251277761.28564.45.camel@pyren.uio.no> References: <1251277761.28564.45.camel@pyren.uio.no> Message-ID: <1251313287.3535.237.camel@chromite.mv.qlogic.com> Is your switch configured for 4K MTU? The default openmpi parameter for QLogic is to use a 4K MTU. Try using a 2K MTU with: "mpirun -mca btl_openib_mtu=4 ..." and see if that works. On Wed, 2009-08-26 at 02:09 -0700, Ole Widar Saastad wrote: > I am experiencing problems using the Infinipath cards and the OFED > stack. (details are given below). > > It seems to be a problem somewhere when mpi packet size grows above 2k. > This is what I recall the changeover from one transport mechanism to > another ? > > The test is easy to run and to test, it is just a bandwidth program : > (I got far better latency using the Pathscale stack that the OFED. Is this > something that will be looked up in the newer releases?). 
> > Two nodes in node.txt file compute-1-0 and compute-1-1. They are connected > to a SilverStorm switch. > > [olews at login-0-2 bandwidth]$ mpirun -np 2 -machinefile ./nodes.txt ./bandwidth.openmpi.x -b o > Resolution (usec): 2.145767 > Benchmark ping-pong > =================== > lenght iterations elapsed time transfer rate latency > (bytes) (count) (seconds) (Mbytes/s) (usec) > -------------------------------------------------------------------------- > 0 10046 0.121 0.000 6.011 > 1 10261 0.124 0.166 6.026 > > 1024 7695 0.140 112.615 9.093 > 1536 6260 0.133 144.469 10.632 > 2048 5275 0.128 168.420 12.160 > [0,1,0][btl_openib_component.c:1375:btl_openib_component_progress] from compute-1-0 to: compute-1-1 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 278309104 opcode 1 > -------------------------------------------------------------------------- > The InfiniBand retry count between two MPI processes has been > exceeded. "Retry count" is defined in the InfiniBand spec 1.2 > (section 12.7.38): > > The total number of times that the sender wishes the receiver to > retry timeout, packet sequence, etc. errors before posting a > completion error. > > This error typically means that there is somethin/site/VERSIONS/openmpi-1.2.8.gnu/bin/g awry within the > InfiniBand fabric itself. You should note the hosts on which this > error has occurred; it has been observed that rebooting or removing a > particular host from the job can sometimes resolve this issue. > > Two MCA parameters can be used to control Open MPI's behavior with > respect to the retry count: > > * btl_openib_ib_retry_count - The number of times the sender will > attempt to retry (defaulted to 7, the maximum value). > > * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted > to 10). The actual timeout value used is calculated as: > > 4.096 microseconds * (2^btl_openib_ib_timeout) > > See the InfiniBand spec 1.2 (section 12.7.34) for more details. 
> -------------------------------------------------------------------------- > mpirun noticed that job rank 1 with PID 9184 on node compute-1-1 exited on signal 15 (Terminated). > [olews at login-0-2 bandwidth]$ > > > Background information : > > > 07:00.0 InfiniBand: QLogic, Corp. InfiniPath PE-800 (rev 02) > Subsystem: QLogic, Corp. InfiniPath PE-800 > Flags: bus master, fast devsel, latency 0, IRQ 66 > Memory at fde00000 (64-bit, non-prefetchable) [size=2M] > Capabilities: [40] Power Management version 2 > Capabilities: [50] Message Signalled Interrupts: 64bit+ > Queue=0/0 Enable+ > Capabilities: [70] Express Endpoint IRQ 0 > > compute-1-0.local# uname -a > Linux compute-1-0.local 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 > EDT 2008 x86_64 x86_64 x86_64 GNU/Linux > compute-1-0.local# > > > compute-1-0.local# rpm -qa| grep ofed > libibverbs-utils-1.1.2-1.ofed1.4.2 > librdmacm-utils-1.0.8-1.ofed1.4.2 > libcxgb3-1.2.2-1.ofed1.4.2 > ofed-scripts-1.4.2-0 > libmlx4-1.0-1.ofed1.4.2 > libibverbs-devel-1.1.2-1.ofed1.4.2 > ofed-docs-1.4.2-0 > ibvexdmtools-0.0.1-1.ofed1.4.2 > libmthca-1.0.5-1.ofed1.4.2 > libipathverbs-1.1-1.ofed1.4.2 > mstflint-1.4-1.ofed1.4.2 > libibumad-1.2.3_20090314-1.ofed1.4.2 > libnes-0.6-1.ofed1.4.2 > libibcommon-1.1.2_20090314-1.ofed1.4.2 > libibverbs-1.1.2-1.ofed1.4.2 > librdmacm-1.0.8-1.ofed1.4.2 > qlgc_vnic_daemon-0.0.1-1.ofed1.4.2 > compute-1-0.local# > > OpenMPI is : > openmpi-1.2.8 compiled for gcc. > From ralph.campbell at qlogic.com Wed Aug 26 12:06:37 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 26 Aug 2009 12:06:37 -0700 Subject: [ofa-general] Problems using ofed 1.4.2 and Infinipath cards In-Reply-To: <1251313287.3535.237.camel@chromite.mv.qlogic.com> References: <1251277761.28564.45.camel@pyren.uio.no> <1251313287.3535.237.camel@chromite.mv.qlogic.com> Message-ID: <1251313597.3535.239.camel@chromite.mv.qlogic.com> Sorry, I meant "mpirun -mca btl_openib_mtu 4 ..." (no equal). 
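
The value 4 in the command above is an InfiniBand MTU code rather than a byte count. A small sketch of the mapping, with hypothetical helper names, assuming the standard IB encoding in which codes 1 through 5 stand for 256, 512, 1024, 2048 and 4096 bytes:

```c
/* Map an InfiniBand MTU code (1..5) to its size in bytes.
 * Returns -1 for a value that is not a valid MTU code. */
static int ib_mtu_code_to_bytes(int code)
{
	if (code < 1 || code > 5)
		return -1;
	return 128 << code;	/* 1 -> 256, 2 -> 512, ..., 5 -> 4096 */
}

/* Map a byte count back to its MTU code, or -1 if it is not one of
 * the five sizes the encoding can express. */
static int ib_mtu_bytes_to_code(int bytes)
{
	int code;

	for (code = 1; code <= 5; code++)
		if ((128 << code) == bytes)
			return code;
	return -1;
}
```

Under that encoding, code 4 selects the 2K MTU suggested here, and code 5 would be the 4K MTU that the switch may not be configured for.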
On Wed, 2009-08-26 at 12:01 -0700, Ralph Campbell wrote: > Is your switch configured for 4K MTU? > The default openmpi parameter for QLogic is to use a 4K MTU. > Try using a 2K MTU with: > "mpirun -mca btl_openib_mtu=4 ..." and see if that works. > > > On Wed, 2009-08-26 at 02:09 -0700, Ole Widar Saastad wrote: > > I am experiencing problems using the Infinipath cards and the OFED > > stack. (details are given below). > > > > It seems to be a problem somewhere when mpi packet size grows above 2k. > > This is what I recall the changeover from one transport mechanism to > > another ? > > > > The test is easy to run and to test, it is just a bandwidth program : > > (I got far better latency using the Pathscale stack that the OFED. Is this > > something that will be looked up in the newer releases?). > > > > Two nodes in node.txt file compute-1-0 and compute-1-1. They are connected > > to a SilverStorm switch. > > > > [olews at login-0-2 bandwidth]$ mpirun -np 2 -machinefile ./nodes.txt ./bandwidth.openmpi.x -b o > > Resolution (usec): 2.145767 > > Benchmark ping-pong > > =================== > > lenght iterations elapsed time transfer rate latency > > (bytes) (count) (seconds) (Mbytes/s) (usec) > > -------------------------------------------------------------------------- > > 0 10046 0.121 0.000 6.011 > > 1 10261 0.124 0.166 6.026 > > > > 1024 7695 0.140 112.615 9.093 > > 1536 6260 0.133 144.469 10.632 > > 2048 5275 0.128 168.420 12.160 > > [0,1,0][btl_openib_component.c:1375:btl_openib_component_progress] from compute-1-0 to: compute-1-1 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 278309104 opcode 1 > > -------------------------------------------------------------------------- > > The InfiniBand retry count between two MPI processes has been > > exceeded. 
"Retry count" is defined in the InfiniBand spec 1.2 > > (section 12.7.38): > > > > The total number of times that the sender wishes the receiver to > > retry timeout, packet sequence, etc. errors before posting a > > completion error. > > > > This error typically means that there is somethin/site/VERSIONS/openmpi-1.2.8.gnu/bin/g awry within the > > InfiniBand fabric itself. You should note the hosts on which this > > error has occurred; it has been observed that rebooting or removing a > > particular host from the job can sometimes resolve this issue. > > > > Two MCA parameters can be used to control Open MPI's behavior with > > respect to the retry count: > > > > * btl_openib_ib_retry_count - The number of times the sender will > > attempt to retry (defaulted to 7, the maximum value). > > > > * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted > > to 10). The actual timeout value used is calculated as: > > > > 4.096 microseconds * (2^btl_openib_ib_timeout) > > > > See the InfiniBand spec 1.2 (section 12.7.34) for more details. > > -------------------------------------------------------------------------- > > mpirun noticed that job rank 1 with PID 9184 on node compute-1-1 exited on signal 15 (Terminated). > > [olews at login-0-2 bandwidth]$ > > > > > > Background information : > > > > > > 07:00.0 InfiniBand: QLogic, Corp. InfiniPath PE-800 (rev 02) > > Subsystem: QLogic, Corp. 
InfiniPath PE-800 > > Flags: bus master, fast devsel, latency 0, IRQ 66 > > Memory at fde00000 (64-bit, non-prefetchable) [size=2M] > > Capabilities: [40] Power Management version 2 > > Capabilities: [50] Message Signalled Interrupts: 64bit+ > > Queue=0/0 Enable+ > > Capabilities: [70] Express Endpoint IRQ 0 > > > > compute-1-0.local# uname -a > > Linux compute-1-0.local 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 > > EDT 2008 x86_64 x86_64 x86_64 GNU/Linux > > compute-1-0.local# > > > > > > compute-1-0.local# rpm -qa| grep ofed > > libibverbs-utils-1.1.2-1.ofed1.4.2 > > librdmacm-utils-1.0.8-1.ofed1.4.2 > > libcxgb3-1.2.2-1.ofed1.4.2 > > ofed-scripts-1.4.2-0 > > libmlx4-1.0-1.ofed1.4.2 > > libibverbs-devel-1.1.2-1.ofed1.4.2 > > ofed-docs-1.4.2-0 > > ibvexdmtools-0.0.1-1.ofed1.4.2 > > libmthca-1.0.5-1.ofed1.4.2 > > libipathverbs-1.1-1.ofed1.4.2 > > mstflint-1.4-1.ofed1.4.2 > > libibumad-1.2.3_20090314-1.ofed1.4.2 > > libnes-0.6-1.ofed1.4.2 > > libibcommon-1.1.2_20090314-1.ofed1.4.2 > > libibverbs-1.1.2-1.ofed1.4.2 > > librdmacm-1.0.8-1.ofed1.4.2 > > qlgc_vnic_daemon-0.0.1-1.ofed1.4.2 > > compute-1-0.local# > > > > OpenMPI is : > > openmpi-1.2.8 compiled for gcc. > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From weiny2 at llnl.gov Wed Aug 26 16:40:26 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 26 Aug 2009 16:40:26 -0700 Subject: [ofa-general] Multi-threaded diags (Was: Re: [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a context object.) 
In-Reply-To: <20090823120609.GG9547@me>
References: <20090813204306.dffc3237.weiny2@llnl.gov> <20090816110200.GS25501@me> <20090817083023.da17378b.weiny2@llnl.gov> <20090823120609.GG9547@me>
Message-ID: <20090826164026.8dcce4b2.weiny2@llnl.gov>

On Sun, 23 Aug 2009 15:06:09 +0300
Sasha Khapyorsky wrote:

> Hi Ira,
>
> On 08:30 Mon 17 Aug , Ira Weiny wrote:
> >
> > The immediate benefit is coming with the multi-threaded implementation where
> > I plan on adding the following function.[*]
>
> Ok, but could we discuss first how will multithreading architecture be

Of course! :-) But first I would like to mention some numbers from the
prototype code I have. When running on a small fabric the additional overhead
of thread creation actually slows down the scan. :-(

        Current master:   Threaded version:
real    0m0.101s          0m0.266s
user    0m0.000s          0m0.000s
sys     0m0.011s          0m0.014s

But, as expected, on a large system (1152 nodes) there is a decent speed up.

        Current Master:   Threaded version:
real    0m3.046s          0m1.748s
user    0m0.073s          0m0.331s
sys     0m0.158s          0m0.822s

However, the biggest speed up comes when there are errors on the fabric. This
is the same 1152 node cluster with just 14 "bad" ports on the fabric. This is
of course because the scan continues "around" the bad ports.

        Current Master:   Threaded version:
real    0m33.051s         0m5.609s
user    0m0.071s          0m0.353s
sys     0m0.156s          0m1.113s

Since you are usually running these tools when things are bad I think there is
a big gain here. Even running with a faster timeout of 200ms results in a big
difference.

        Current Master:   Threaded version:
real    0m9.149s          0m2.223s
user    0m0.016s          0m0.374s
sys     0m0.372s          0m1.056s

With that in mind...

> implemented with libibnetdisc: goals (in particular is it support for
> multithreaded apps or just multithreaded discovery function), interaction
> with caller application, etc.?

My initial goal was to make the libibnetdisc safe for multithreaded apps and
make a multithreaded discovery function.
However, since libibmad itself is not thread safe, and you expressed a desire to keep it that way[*], I reduced that goal to just making the discovery function multithreaded (using mad_[send|receive]_via). Although I don't like this restriction I can see it as a valid design decision as long as it is documented that the discover function is not thread safe in regards to the ibmad_port object. This is because the ibnd_discover_fabric uses libibmad calls and would require a complicated API to allow the user app to synchronize with those calls. In order to make things thread safe for the user apps as well as the library I can see 3 options. 1) make libibmad thread safe (which you were hesitant to do) 2) add a thread safe interface to libibmad. User apps will need to know to use this interface while using libibnetdisc and libibnetdisc will use this interface. 3) Create a wrapper lib which is thread safe. In this case the apps and libibnetdisc would call into this wrapper lib and we would have to change the API to libibnetdisc. Right now I have the multithreaded discover code separated out somewhat. I think it would not be hard to extract the multithreaded parts and either create the wrapper lib or extend libibmad with thread safe calls. That said, I personally do not like option 2. I think it further complicates an already overly complex API in libibmad. As far as option 1 vs 3 I can see arguments for and against each. 1 makes things very nice because it would be taken care of for all apps currently using libibmad. On the down side it would add some overhead for single threaded apps. Although I do not believe too much.[$] The downside of 3 is that to be done correctly it would change the libibnetdisc API and apps which use it. > > One of the desired feature of this I could think would be to keep API > simple for single threaded stuff. Agreed. I don't think the API is going to get to complicated. 
A big reason for adding the context is to allow the API to be flexible without breaking things. Ira [*] http://lists.openfabrics.org/pipermail/general/2009-July/060677.html "madrpc() is too primitive interface for such applications. There would be better to use umad_send/recv() directly or may be mad_send_via(). Example is mcast_storm.c distributed with ibsim." [$] It is my opinion that mad_rpc is _not_ primitive. In my mind it _is_ the wrapper around the primitive umad_send/recv calls. If you are interested perhaps I can try to explain what I wanted to do in the library to make it thread safe more clearly. The point I might not have made clear was that I don't think the library will have to do any threading on it's own, just some locks and storing of responses. Of course the down side to this is the libibmad code would be slightly slower. But I don't think by very much. -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab 925-423-8008 weiny2 at llnl.gov From jgunthorpe at obsidianresearch.com Wed Aug 26 17:24:20 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 26 Aug 2009 18:24:20 -0600 Subject: [ofa-general] Multi-threaded diags (Was: Re: [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a context object.) In-Reply-To: <20090826164026.8dcce4b2.weiny2@llnl.gov> References: <20090813204306.dffc3237.weiny2@llnl.gov> <20090816110200.GS25501@me> <20090817083023.da17378b.weiny2@llnl.gov> <20090823120609.GG9547@me> <20090826164026.8dcce4b2.weiny2@llnl.gov> Message-ID: <20090827002420.GT406@obsidianresearch.com> On Wed, Aug 26, 2009 at 04:40:26PM -0700, Ira Weiny wrote: > Of course! :-) But first I would like to mention some numbers from the > prototype code I have. When running on a small fabric the additional overhead > of thread creation actually slows down the scan. :-( It seems strange to me to thread something like this (and alot of hard work).. 
FSM multiplexing the recv path usually gives much better performance,
something like net discovery is quite easy..

main loop:
   fill tx queue from next list
   receive replies and correlate with next list
     each entry: add to next list additional ports
Repeat until dead.

Where a 'next list' would be a set of actions along the lines of
'query node' or 'query port'. The action on a 'query node' completion
is to generate 'query port' next list items for all the ports, and
on 'query port' completion is to generate 'query node' items for all
enabled ports..

libumad is nonblocking, parallel, etc...

Jason

From FENKES at de.ibm.com Thu Aug 27 02:44:30 2009
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Thu, 27 Aug 2009 11:44:30 +0200
Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD
In-Reply-To:
References: <200908261337.56128.fenkes@de.ibm.com>
Message-ID:

Hal Rosenstock wrote on 26.08.2009 17:15:03:

> Thanks for doing this. It looks sane to me. The only issue I recall that
> appears to be remaining is a better setting of ClassPortInfo:RespTimeValue
> rather than hardcoding. Perhaps using the value from PortInfo is the way to go
> (ideally it would be that value from the port to which the the requester is
> being redirected to but that might not be so easy to get from this port.

I don't think that effort will be necessary or even legal. The requestor
will react to the redirection with another Get(ClassPortInfo) to the
redirection target, which will reply with its own RespTimeValue, so our
driver should speak for itself. Since we don't know when our MAD
processing and sending of the response is going to be scheduled (we're not
running on real-time constraints here), we play it safe and return 18,
which amounts to roughly a second.

Make sense?
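
As a quick check of the "roughly a second" figure, here is a small sketch with a hypothetical helper name. It assumes RespTimeValue uses the same 4.096 microseconds * 2^n encoding that the Open MPI help text quoted elsewhere in this archive gives for the local ACK timeout, and that the field is 5 bits wide.

```c
/* Convert an InfiniBand timeout exponent to microseconds, assuming the
 * 4.096 us * 2^n encoding. The exponent is masked to 5 bits, which is
 * an assumption about the field width. */
static double ib_timeout_usec(unsigned int exponent)
{
	exponent &= 0x1f;
	return 4.096 * (double)(1UL << exponent);
}
```

Under those assumptions, `ib_timeout_usec(18)` comes to about 1,073,742 microseconds, or roughly 1.07 seconds, which matches the estimate above. For comparison, an exponent of 10 gives about 4194 microseconds.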
Regards Joachim From vlad at lists.openfabrics.org Thu Aug 27 03:05:15 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 27 Aug 2009 03:05:15 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090827-0200 daily build status Message-ID: <20090827100516.402E4E30266@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': 
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: 
/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090827-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From aaron.knister at gmail.com Thu Aug 27 05:30:52 2009 From: aaron.knister at gmail.com (Aaron Knister) Date: Thu, 27 Aug 2009 08:30:52 -0400 Subject: [ofa-general] IPoIB connected vs datagram Message-ID: <4A967C7C.7080509@gmail.com> Hi! I'm having some strange problems on an InfiniBand fabric at work. We have upwards of 30 nodes running OFED 1.4 with DDR HCAs and a cisco 7012 IB switch. There are also several Sun "thumpers" running solaris that are also connected to the infiniband fabric, however their HCAs are only SDR. There are several 20 odd terabyte nfs mounts exported from the thumpers and mounted to the compute nodes over IPoIB (we're not using NFS RDMA). Opensm is running on the head node and all of the compute nodes for redundancys sake. 
Things were running OK until yesterday when a user crashed the head node by sucking up all of its memory, and at the time the head node's subnet manager was in the master state. Well, a different node quickly picked up subnet management until the head node was rebooted at which point the head node became the subnet master. Since logging back in to the cluster after rebooting the head node, the nfs mounts from the thumpers have been hanging periodically all over the place. I know that two of the thumpers and their nfs exports are being hit with an aggregate of about 120MB/s of nfs traffic from about 30 or so compute nodes, so I'm sure that's not helping things, however one of the other thumpers that has no active jobs hitting its exports periodically shows nfs server "not responding" message on the clients/compute nodes. I checked the log files for the past week- these nfs server not responding messages all started since the head node crash yesterday. From what I've been told, every time this happens the only fix is to reboot the switch. Of course, any general debugging suggestions would be appreciated, but I have a few specific questions regarding IPoIB and connected vs datagram. All of the compute nodes and the head node (running ofed 1.4) are using "connected mode" for IPoIB -> [root at headnode ~]# cat /sys/class/net/ib0/mode connected and the mtu of the interface is 65520 I don't know how to determine if the solaris (the thumpers) systems are using connected mode, but their MTUs are 2044 which leads me to believe they're probably not. I cannot log into these machines as I don't manage them, but is there a way to determine the IPoIB mtu using an ib* utility? Or am I misunderstanding IPoIB that such information wouldn't be useful. And lastly, I recall that with TCP over ethernet if you have the mtu said to say 9000 and try and sling data to a box with an mtu of 1500 you get some weird performance hits. 
Is it likely that the compute nodes use of the larger MTU + connected mode paired with the thumpers much smaller MTU + probably datagram mode could be causing timeouts under heavy load? Does anybody think that settings the compute/head nodes to datagram mode and subsequently dropping the mtu to 2044 would help my situation? Again, any suggestions are greatly appreciated, and thanks in advance for any replies! -Aaron From hal.rosenstock at gmail.com Thu Aug 27 06:31:40 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 27 Aug 2009 09:31:40 -0400 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD In-Reply-To: References: <200908261337.56128.fenkes@de.ibm.com> Message-ID: On 8/27/09, Joachim Fenkes wrote: > > Hal Rosenstock wrote on 26.08.2009 17:15:03: > > > Thanks for doing this. It looks sane to me. The only issue I recall that > > > appears to be remaining is a better setting of > ClassPortInfo:RespTimeValue > > rather than hardcoding. Perhaps using the value from PortInfo is the way > to go > > (ideally it would be that value from the port to which the the requester > is > > being redirected to but that might not be so easy to get from this port. > > I don't think that effort will be necessary or even legal. The requestor > will react to the redirection with another Get(ClassPortInfo) to the > redirection target, which will reply with its own RespTimeValue, so our > driver should speak for itself. I overreached with my comment on how this works. Since we don't know when our MAD > processing and sending of the response is going to be scheduled (we're not > running on real-time constraints here), we play it safe and return 18, > which amounts to roughly a second. > > Make sense? I don't think it should be hard coded. IMO it would be better to default to 18 and somehow able to be adjusted (via a (dynamic) module parameter ?). 
-- Hal > Regards > Joachim > -------------- next part -------------- An HTML attachment was scrubbed... URL: From niftyompi at niftyegg.com Thu Aug 27 08:26:34 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Thu, 27 Aug 2009 08:26:34 -0700 Subject: [ofa-general] IPoIB connected vs datagram In-Reply-To: <4A967C7C.7080509@gmail.com> References: <4A967C7C.7080509@gmail.com> Message-ID: <20090827152634.GC3272@tosh2egg.ca.sanfran.comcast.net> On Thu, Aug 27, 2009 at 08:30:52AM -0400, Aaron Knister wrote: > > Hi! > > I'm having some strange problems on an InfiniBand fabric at work. We > have upwards of 30 nodes running OFED 1.4 with DDR HCAs and a cisco 7012 > IB switch. There are also several Sun "thumpers" running solaris that > are also connected to the infiniband fabric, however their HCAs are only > SDR. There are several 20 odd terabyte nfs mounts exported from the > thumpers and mounted to the compute nodes over IPoIB (we're not using > NFS RDMA). Opensm is running on the head node and all of the compute > nodes for redundancys sake. Things were running OK until yesterday when > a user crashed the head node by sucking up all of its memory, and at the > time the head node's subnet manager was in the master state. Well, a > different node quickly picked up subnet management until the head node > was rebooted at which point the head node became the subnet master. > > Since logging back in to the cluster after rebooting the head node, the > nfs mounts from the thumpers have been hanging periodically all over the > place. I know that two of the thumpers and their nfs exports are being > hit with an aggregate of about 120MB/s of nfs traffic from about 30 or > so compute nodes, so I'm sure that's not helping things, however one of > the other thumpers that has no active jobs hitting its exports > periodically shows nfs server "not responding" message on the > clients/compute nodes. 
I checked the log files for the past week- these > nfs server not responding messages all started since the head node crash > yesterday. From what I've been told, every time this happens the only > fix is to reboot the switch. > > Of course, any general debugging suggestions would be appreciated, but I > have a few specific questions regarding IPoIB and connected vs datagram. > All of the compute nodes and the head node (running ofed 1.4) are using > "connected mode" for IPoIB -> > > [root at headnode ~]# cat /sys/class/net/ib0/mode > connected > > and the mtu of the interface is 65520 > > I don't know how to determine if the solaris (the thumpers) systems are > using connected mode, but their MTUs are 2044 which leads me to believe > they're probably not. I cannot log into these machines as I don't manage > them, but is there a way to determine the IPoIB mtu using an ib* > utility? Or am I misunderstanding IPoIB that such information wouldn't > be useful. > > And lastly, I recall that with TCP over ethernet if you have the mtu > said to say 9000 and try and sling data to a box with an mtu of 1500 you > get some weird performance hits. Is it likely that the compute nodes use > of the larger MTU + connected mode paired with the thumpers much smaller > MTU + probably datagram mode could be causing timeouts under heavy load? > Does anybody think that settings the compute/head nodes to datagram mode > and subsequently dropping the mtu to 2044 would help my situation? > > Again, any suggestions are greatly appreciated, and thanks in advance > for any replies! Look at the MTU choices again. With Infiniband the "true" MTU is fixed at 2K (or 4K) and often limited to 2K by the switch firmware. Larger MTUs are thus synthetic and force software to assemble and disassemble the transfers. On a fabric the large MTU for IPoIB works well because the fabric is quite reliable. 
When data is routed to another network with a smaller MTU software needs to assemble and disassemble the fragments. Fragmentation can be expensive. Dropped bits and fragmentation is a major performance hit. Normal MTU discovery should make fragmentation go away. Ethernet jumbo packets (larger than 1500) are real on the wire. This is not the case on IB where the MTU is fixed. Is the NFS NFS over UDP or TCP ? What are the NFS read/ write sizes set to? Double check routes (traceroute). Dynamic routes and mixed MTUs is a tangle. The minimum MTU for a route can be discovered with ping and the do not fragment flag as long as ICMP packets are not filtered. -- T o m M i t c h e l l Found me a new hat, now what? From aaron.knister at gmail.com Thu Aug 27 08:41:40 2009 From: aaron.knister at gmail.com (Aaron Knister) Date: Thu, 27 Aug 2009 11:41:40 -0400 Subject: [ofa-general] IPoIB connected vs datagram In-Reply-To: <20090827152634.GC3272@tosh2egg.ca.sanfran.comcast.net> References: <4A967C7C.7080509@gmail.com> <20090827152634.GC3272@tosh2egg.ca.sanfran.comcast.net> Message-ID: Thanks for the reply! Good to know about the "true" MTU vs the synthetic mtu. I wasn't aware of that. The NFS is NFS over TCP and the read/write sizes are both set to 32768. I don't have any routes that I know of on the IB fabric- a traceroute seemed to verify this. I used tracepath to show me the mtu information between the two hosts. On the second attempt it looks like it "discovered" the correct MTU - [root at headnode ~]# tracepath thumper1-ib 1: headnode (10.0.1.1) 0.133ms pmtu 65520 1: thumper1-ib (10.0.1.245) 0.161ms reached Resume: pmtu 2044 hops 1 back 1 [root at headnode ~]# tracepath thumper1-ib 1: headnode (10.0.1.1) 0.122ms pmtu 2044 1: thumper1-ib (10.0.1.245) 0.121ms reached Resume: pmtu 2044 hops 1 back 1 We rebooted the infiniband switch which cleared up the NFS issues for now. 
The one thing I noticed after the reboot was that the solars storage servers were back in the multicast group (saquery -m). It's definitely an odd situation... Thanks again for your help On Thu, Aug 27, 2009 at 11:26 AM, Nifty Tom Mitchell wrote: > On Thu, Aug 27, 2009 at 08:30:52AM -0400, Aaron Knister wrote: > > > > Hi! > > > > I'm having some strange problems on an InfiniBand fabric at work. We > > have upwards of 30 nodes running OFED 1.4 with DDR HCAs and a cisco 7012 > > IB switch. There are also several Sun "thumpers" running solaris that > > are also connected to the infiniband fabric, however their HCAs are only > > SDR. There are several 20 odd terabyte nfs mounts exported from the > > thumpers and mounted to the compute nodes over IPoIB (we're not using > > NFS RDMA). Opensm is running on the head node and all of the compute > > nodes for redundancys sake. Things were running OK until yesterday when > > a user crashed the head node by sucking up all of its memory, and at the > > time the head node's subnet manager was in the master state. Well, a > > different node quickly picked up subnet management until the head node > > was rebooted at which point the head node became the subnet master. > > > > Since logging back in to the cluster after rebooting the head node, the > > nfs mounts from the thumpers have been hanging periodically all over the > > place. I know that two of the thumpers and their nfs exports are being > > hit with an aggregate of about 120MB/s of nfs traffic from about 30 or > > so compute nodes, so I'm sure that's not helping things, however one of > > the other thumpers that has no active jobs hitting its exports > > periodically shows nfs server "not responding" message on the > > clients/compute nodes. I checked the log files for the past week- these > > nfs server not responding messages all started since the head node crash > > yesterday. From what I've been told, every time this happens the only > > fix is to reboot the switch. 
> > > > Of course, any general debugging suggestions would be appreciated, but I > > have a few specific questions regarding IPoIB and connected vs datagram. > > All of the compute nodes and the head node (running ofed 1.4) are using > > "connected mode" for IPoIB -> > > > > [root at headnode ~]# cat /sys/class/net/ib0/mode > > connected > > > > and the mtu of the interface is 65520 > > > > I don't know how to determine if the solaris (the thumpers) systems are > > using connected mode, but their MTUs are 2044 which leads me to believe > > they're probably not. I cannot log into these machines as I don't manage > > them, but is there a way to determine the IPoIB mtu using an ib* > > utility? Or am I misunderstanding IPoIB that such information wouldn't > > be useful. > > > > And lastly, I recall that with TCP over ethernet if you have the mtu > > said to say 9000 and try and sling data to a box with an mtu of 1500 you > > get some weird performance hits. Is it likely that the compute nodes use > > of the larger MTU + connected mode paired with the thumpers much smaller > > MTU + probably datagram mode could be causing timeouts under heavy load? > > Does anybody think that settings the compute/head nodes to datagram mode > > and subsequently dropping the mtu to 2044 would help my situation? > > > > Again, any suggestions are greatly appreciated, and thanks in advance > > for any replies! > > Look at the MTU choices again. > With Infiniband the "true" MTU is fixed at 2K (or 4K) and often limited > to 2K by the switch firmware. Larger MTUs are thus synthetic and force > software to > assemble and disassemble the transfers. On a fabric the large MTU for > IPoIB > works well because the fabric is quite reliable. When data is routed > to another network with a smaller MTU software needs to assemble and > disassemble the > fragments. Fragmentation can be expensive. Dropped bits and > fragmentation is > a major performance hit. 
Normal MTU discovery should make fragmentation > go away. > > Ethernet jumbo packets (larger than 1500) are real on the wire. > This is not the case on IB where the MTU is fixed. > > Is the NFS NFS over UDP or TCP ? > What are the NFS read/ write sizes set to? > > Double check routes (traceroute). Dynamic routes and mixed MTUs is a > tangle. > The minimum MTU for a route can be discovered with ping and the do not > fragment flag > as long as ICMP packets are not filtered. > > -- > T o m M i t c h e l l > Found me a new hat, now what? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From monis at Voltaire.COM Thu Aug 27 08:52:46 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Thu, 27 Aug 2009 18:52:46 +0300 Subject: [ofa-general] [PATCH] IPoIB: check multicast address format In-Reply-To: <20090826180457.GR406@obsidianresearch.com> References: <20090821000431.GA5713@obsidianresearch.com> <4A94FB67.6050600@voltaire.com> <20090826180457.GR406@obsidianresearch.com> Message-ID: <4A96ABCE.2030204@Voltaire.COM> Jason Gunthorpe wrote: > On Wed, Aug 26, 2009 at 12:07:51PM +0300, Or Gerlitz wrote: > >> isn't Jason's approach enough for the bonding case?! I saw that your >> patch ("bonding: clean muticast addresses when device changes type" > > I think working versions of all three patches are required: > 1) Fix the bonding driver. Otherwise the right groups might not be > joined. > 2) Check the address format, to protect against 'ip maddr add' and > other wakkyness > 3) Fix the timeout handling, so mlid exhaustion and other SA side > errors are handled elegantly. > > All are bugs.. > >> and maybe also in mainline .31-rcX . However, it has the >> down-side-effect of e.g loosing routes already set for the the bond >> while adding the underline IPoIB devices, so if Jason's patch is >> enough > > Is this true? That is pretty ugly, but probably manageable.. > Yes it's true but I'm not sure it's ugly. 
Changing device type is not a common event and requires device ops change which I think is better to do when the device is closed. Unfortunately, losing routes is a side effect of closing the device but it might be necessary. From weiny2 at llnl.gov Thu Aug 27 09:48:10 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 27 Aug 2009 09:48:10 -0700 Subject: [ofa-general] Multi-threaded diags (Was: Re: [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a context object.) In-Reply-To: <20090827002420.GT406@obsidianresearch.com> References: <20090813204306.dffc3237.weiny2@llnl.gov> <20090816110200.GS25501@me> <20090817083023.da17378b.weiny2@llnl.gov> <20090823120609.GG9547@me> <20090826164026.8dcce4b2.weiny2@llnl.gov> <20090827002420.GT406@obsidianresearch.com> Message-ID: <20090827094810.6cfe02f5.weiny2@llnl.gov> On Wed, 26 Aug 2009 18:24:20 -0600 Jason Gunthorpe wrote: > On Wed, Aug 26, 2009 at 04:40:26PM -0700, Ira Weiny wrote: > > > Of course! :-) But first I would like to mention some numbers from the > > prototype code I have. When running on a small fabric the additional overhead > > of thread creation actually slows down the scan. :-( > > It seems strange to me to thread something like this (and alot of hard > work).. > > FSM multiplexing the recv path usually gives much better performance, > something like net discovery is quite easy.. Using the original algorithm and data structures lended itself to threading. Now that I am neck deep in all this I have thought that rewriting it all might be easier. > main loop: > fill tx queue from next list > recieve replies and correlate with next list This would still need additional code (or additional synchronization in the API to libibnetdisc) if you wanted a user app to be multi-threaded. Someone has to be in charge of receiving all replies on that ibmad_port object and handing them to the proper owner. Of course one could open multiple ibmad_port objects but how is the app writer to know to do that? 
Digging through the code to find out that libibnetdisc is consuming all the replies? This is what got me on this in the first place. smp_query_via (_do_madrpc) is not thread safe. Threading was the easy way to deal with multiple blocking queries on the fabric. Changing _do_madrpc to be thread safe allowed a very quick multithreaded implementation on top of the current algorithm which blocked on multiple queries. I did not have to form the queries myself, it was easy... (I had that working months ago.) Given that we don't want to change libibmad things got more complicated and your algorithm seems much better... (except [see below]) Also, I feel that someone down the road might fall into the same trap that I did thinking that smp_query_via is thread safe and I would like to fix that. > > each entry: > add to next list additional ports > > Repeat until dead. > > Where a 'next list' would be a set of actions along the lines of > 'query node' or 'query port' the action on a 'query node' completion > is to generate 'query port' next list items for all the ports, and on > 'query port' completion is to generate 'query node' items for all > enabled ports.. > > libumad is nonblocking, parallel, etc... Yes, and libibmad layers on top of it an easier interface to issue common queries. Why should we ask the user to re-implement that code? For example, mad_rpc now handles redirection. My implementation does not yet. So now I have to handle that on my own as well... :-( Ira > > Jason -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab 925-423-8008 weiny2 at llnl.gov From jgunthorpe at obsidianresearch.com Thu Aug 27 11:20:56 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 27 Aug 2009 12:20:56 -0600 Subject: [ofa-general] Multi-threaded diags (Was: Re: [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a context object.) 
In-Reply-To: <20090827094810.6cfe02f5.weiny2@llnl.gov> References: <20090813204306.dffc3237.weiny2@llnl.gov> <20090816110200.GS25501@me> <20090817083023.da17378b.weiny2@llnl.gov> <20090823120609.GG9547@me> <20090826164026.8dcce4b2.weiny2@llnl.gov> <20090827002420.GT406@obsidianresearch.com> <20090827094810.6cfe02f5.weiny2@llnl.gov> Message-ID: <20090827182056.GV406@obsidianresearch.com> On Thu, Aug 27, 2009 at 09:48:10AM -0700, Ira Weiny wrote: > > FSM multiplexing the recv path usually gives much better performance, > > something like net discovery is quite easy.. > > Using the original algorithm and data structures lended itself to > threading. Now that I am neck deep in all this I have thought that > rewriting it all might be easier. Yah. mayhaps.. > > main loop: > > fill tx queue from next list > > recieve replies and correlate with next list > This would still need additional code (or additional synchronization in the > API to libibnetdisc) if you wanted a user app to be multi-threaded. Someone > has to be in charge of receiving all replies on that ibmad_port object and > handing them to the proper owner. Of course one could open multiple > ibmad_port objects but how is the app writer to know to do that? Digging > through the code to find out that libibnetdisc is consuming all the replies? What is the use case here? I thought the app would be something like: main() { foo = libibnetdisc_setup(); libibnetdisc_discover_all(foo,res); // Do interesting things with res. } Where the goal is to have libibnetdisc_discover_all complete expediently. As long as the context 'foo' is re-entrant in all ways with all other libraries and contexts I think useful threaded apps can be created. > This is what got me on this in the first place. smp_query_via > (_do_madrpc) is not thread safe. Sure, the entire library is not thread safe around the ibmad_port context. But who cares? If the caller to libibnetdisc wants to thread that way they need to open another context. 
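The "next list" main loop quoted above can be sketched as a single-threaded work queue. This is only an illustration under loud assumptions: the fabric is a toy in-memory adjacency table, "sending a query" and "receiving the reply" collapse into a table lookup, and the names toy_fabric, next_list, push and discover are hypothetical. No libibmad, libibumad or libibnetdisc calls are used or implied.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8
#define MAX_PORTS 4
/* Each node is queued once as QUERY_NODE and once per port as QUERY_PORT. */
#define MAX_WORK (MAX_NODES * (MAX_PORTS + 1))

/* Toy fabric (assumption): peer[n][p] is the node behind port p of node n,
 * or -1 if the port is down or unconnected. */
struct toy_fabric {
	int num_nodes;
	int peer[MAX_NODES][MAX_PORTS];
};

enum action { QUERY_NODE, QUERY_PORT };

struct work_item {
	enum action act;
	int node;
	int port;
};

/* The "next list": a simple FIFO of pending actions. */
struct next_list {
	struct work_item items[MAX_WORK];
	size_t head, tail;
};

static void push(struct next_list *l, enum action act, int node, int port)
{
	assert(l->tail < MAX_WORK);
	l->items[l->tail].act = act;
	l->items[l->tail].node = node;
	l->items[l->tail].port = port;
	l->tail++;
}

/* Walk the fabric from 'start'; returns the number of nodes found and
 * marks them in 'seen'. */
static int discover(const struct toy_fabric *f, int start, int seen[MAX_NODES])
{
	struct next_list next = { .head = 0, .tail = 0 };
	int found = 0, i, p;

	assert(f->num_nodes <= MAX_NODES);
	assert(start >= 0 && start < f->num_nodes);
	for (i = 0; i < MAX_NODES; i++)
		seen[i] = 0;

	seen[start] = 1;
	push(&next, QUERY_NODE, start, 0);

	/* "Repeat until dead": run until the next list is drained. */
	while (next.head < next.tail) {
		struct work_item w = next.items[next.head++];

		if (w.act == QUERY_NODE) {
			/* Node reply: schedule a port query for every port. */
			found++;
			for (p = 0; p < MAX_PORTS; p++)
				push(&next, QUERY_PORT, w.node, p);
		} else {
			/* Port reply: schedule a node query for a new peer. */
			int peer = f->peer[w.node][w.port];

			if (peer >= 0 && peer < f->num_nodes && !seen[peer]) {
				seen[peer] = 1;
				push(&next, QUERY_NODE, peer, 0);
			}
		}
	}
	return found;
}
```

In a real tool the two branches would be driven by replies arriving asynchronously and matched back to their work items by transaction ID; the sketch only shows how the two action types feed each other without any threads.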
> Also, I feel that someone down the road might fall into the same > trap that I did thinking that smp_query_via is thread safe and I > would like to fix that. Well.. How can it be threaded? umad_send/umad_recv are inherently single threaded APIs. You have to layer a TID based threading dispatch mechanism on top of it. Much better to let the kernel do that and open multiple umad fds. > > each entry: > > add to next list additional ports > > > > Repeat until dead. > > > > Where a 'next list' would be a set of actions along the lines of > > 'query node' or 'query port' the action on a 'query node' completion > > is to generate 'query port' next list items for all the ports, and on > > 'query port' completion is to generate 'query node' items for all > > enabled ports.. > > > > libumad is nonblocking, parallel, etc... > > Yes, and libibmad layers on top of it an easier interface to issue common > queries. Why should we ask the user to re-implement that code? Well, the very best way to do this is to have a FSM engine API at the core of the MAD libary: mad_ctx->callback = done_this; mad_post(mad,mad_ctx) done_this(reply): ... > For example, mad_rpc now handles redirection. My implementation > does not yet. So now I have to handle that on my own as well... > :-( To be honest, I don't like the libibmad/libibumad APIs one bit - I'm not surprised they don't work for you.. Frankly, we really need a usable MAD libary with sane APIs, and very high level APIs on top of that. You cannot make an IB application without doing SA queries at a minimum and the current process is HORRID. 
I see nothing of value in libimad and libibumad to support that :| Jason From rdreier at cisco.com Thu Aug 27 13:34:01 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 27 Aug 2009 13:34:01 -0700 Subject: [ofa-general] Re: [PATCH v2] mlx4_core: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: <4A77A430.2020106@sgi.com> (Arputham Benjamin's message of "Mon, 03 Aug 2009 20:00:00 -0700") References: <4A77A430.2020106@sgi.com> Message-ID: Thanks, at long last I applied both the mthca and mlx4 versions of these patches (with some cleanups). - R. From FENKES at de.ibm.com Thu Aug 27 02:44:30 2009 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Thu, 27 Aug 2009 11:44:30 +0200 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD In-Reply-To: References: <200908261337.56128.fenkes@de.ibm.com> Message-ID: Hal Rosenstock wrote on 26.08.2009 17:15:03: > Thanks for doing this. It looks sane to me. The only issue I recall that > appears to be remaining is a better setting of ClassPortInfo:RespTimeValue > rather than hardcoding. Perhaps using the value from PortInfo is the way to go > (ideally it would be that value from the port to which the the requester is > being redirected to but that might not be so easy to get from this port. I don't think that effort will be necessary or even legal. The requestor will react to the redirection with another Get(ClassPortInfo) to the redirection target, which will reply with its own RespTimeValue, so our driver should speak for itself. Since we don't know when our MAD processing and sending of the response is going to be scheduled (we're not running on real-time constraints here), we play it safe and return 18, which amounts to roughly a second. Make sense? 
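As a rough check of the "18 amounts to roughly a second" figure, here is a small sketch of the arithmetic. It assumes the usual InfiniBand encoding of RespTimeValue as an exponent, with the response time equal to 4.096 microseconds times 2^RespTimeValue; the helper name resp_time_ns is hypothetical and is not taken from the ehca driver.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a RespTimeValue exponent into nanoseconds.
 * Assumed encoding: time = 4.096 us * 2^value, i.e. 4096 ns << value.
 * The field is 5 bits wide, so valid values are 0..31. */
static uint64_t resp_time_ns(unsigned int resp_time_value)
{
	assert(resp_time_value < 32);
	return 4096ULL << resp_time_value;
}
```

With this encoding, resp_time_ns(18) is 1073741824 ns, about 1.07 seconds, and each step of the exponent doubles the timeout.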
Regards Joachim _______________________________________________ Linuxppc-dev mailing list Linuxppc-dev at lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev From jenos at ncsa.uiuc.edu Thu Aug 27 15:54:00 2009 From: jenos at ncsa.uiuc.edu (Jeremy Enos) Date: Thu, 27 Aug 2009 17:54:00 -0500 Subject: [ofa-general] Fedora 10 OFED support plans In-Reply-To: <4A948262.7030508@ncsa.uiuc.edu> References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il> <4A92A0C6.9030501@ncsa.uiuc.edu> <4A948262.7030508@ncsa.uiuc.edu> Message-ID: <4A970E88.2020505@ncsa.uiuc.edu> An HTML attachment was scrubbed... URL:
From klakshman03 at hotmail.com Fri Aug 28 00:55:54 2009 From: klakshman03 at hotmail.com (lakshmana swamy) Date: Fri, 28 Aug 2009 13:25:54 +0530 Subject: [ofa-general] QDR IB cards supports card back to back connectivity Message-ID: Dear All, I would like know the QDR Infinibad cards will support to back to back connectivity or not ie with out IB swicth to enable the IB communication between the two machines . Regards laxman -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Fri Aug 28 01:07:56 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 28 Aug 2009 11:07:56 +0300 Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: simplify fwd tables setup flow In-Reply-To: <20090825190141.GG28379@me> References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me> Message-ID: <20090828080756.GH28379@me> Simplify (and unify) forwarding tables setup decision flow.

Signed-off-by: Sasha Khapyorsky
---
 opensm/opensm/osm_ucast_mgr.c |    7 +------
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 629f628..8ba78f8 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -463,8 +463,6 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
 		}
 	}
 
-	set_fwd_tbl_top(p_mgr, p_sw);
-
 	if (p_mgr->p_subn->opt.lmc)
 		free_ports_priv(p_mgr);
 
@@ -977,8 +975,6 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
 	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, ucast_mgr_process_tbl,
 			   p_mgr);
 
-	ucast_mgr_pipeline_fwd_tbl(p_mgr);
-
 	cl_qlist_remove_all(&p_mgr->port_order_list);
 
 	return 0;
@@ -1025,8 +1021,7 @@ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t * osm)
 
 	osm->routing_engine_used = osm_routing_engine_type(r->name);
 
-	if (r->ucast_build_fwd_tables)
-		osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
+	osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr);
 
 	return 0;
 }
-- 
1.6.4

From sashak at voltaire.com Fri Aug 28 01:10:02 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 28 Aug 2009 11:10:02 +0300
Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr: better lft setup
In-Reply-To: <20090828080756.GH28379@me>
References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me> <20090828080756.GH28379@me>
Message-ID: <20090828081002.GI28379@me>

The function set_next_lft_block() is called in loop with block number
incremented, inside it loops by itself in looking for changed block,
caller will call this function with original block number incremented so
this internal loop could be repeated again and again. This patch cleans
this ineffectiveness.

Also rename it to set_lft_block() since block number is treated as
parameters and *not* next block is processed and merges some code.

Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_ucast_mgr.h | 1 + opensm/opensm/osm_ucast_mgr.c | 126 +++++++++++---------------------- 2 files changed, 43 insertions(+), 84 deletions(-) diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index 4ef045c..78a88f0 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -95,6 +95,7 @@ typedef struct osm_ucast_mgr { osm_subn_t *p_subn; osm_log_t *p_log; cl_plock_t *p_lock; + uint16_t max_lid; cl_qlist_t port_order_list; boolean_t is_dor; boolean_t some_hop_count_set; diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 8ba78f8..a111c10 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -336,6 +336,9 @@ static int set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr, IN osm_switch_t * p_sw) CL_ASSERT(p_node); + if (p_mgr->max_lid < p_sw->max_lid_ho) + p_mgr->max_lid = p_sw->max_lid_ho; + p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_node, 0)); /* @@ -478,65 +481,13 @@ static void ucast_mgr_process_top(IN cl_map_item_t * p_map_item, set_fwd_tbl_top(p_mgr, p_sw); } -static boolean_t set_next_lft_block(IN osm_switch_t * p_sw, IN osm_sm_t * p_sm, - IN uint8_t * p_block, - IN osm_dr_path_t * p_path, - IN uint16_t block_id_ho, - IN osm_madw_context_t * p_context) -{ - ib_api_status_t status; - boolean_t sts; - - OSM_LOG_ENTER(p_sm->p_log); - - for (; - (sts = osm_switch_get_lft_block(p_sw, block_id_ho, p_block)); - block_id_ho++) { - if (!p_sw->need_update && !p_sm->p_subn->need_update && - !memcmp(p_block, - p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, - IB_SMP_DATA_SIZE)) - continue; - - OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, - "Writing FT block %u to switch 0x%" PRIx64 "\n", - block_id_ho, - cl_ntoh64(p_context->lft_context.node_guid)); - - status = osm_req_set(p_sm, p_path, - p_sw->new_lft + - block_id_ho * IB_SMP_DATA_SIZE, - IB_SMP_DATA_SIZE, 
IB_MAD_ATTR_LIN_FWD_TBL, - cl_hton32(block_id_ho), - CL_DISP_MSGID_NONE, p_context); - - if (status != IB_SUCCESS) - OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 3A05: " - "Sending linear fwd. tbl. block failed (%s)\n", - ib_get_err_str(status)); - break; - } - - OSM_LOG_EXIT(p_sm->p_log); - return sts; -} - -static boolean_t pipeline_next_lft_block(IN osm_switch_t *p_sw, - IN osm_ucast_mgr_t *p_mgr, - IN uint16_t block_id_ho) +static int set_lft_block(IN osm_switch_t *p_sw, IN osm_ucast_mgr_t *p_mgr, + IN uint16_t block_id_ho) { - osm_dr_path_t *p_path; - osm_madw_context_t context; uint8_t block[IB_SMP_DATA_SIZE]; - boolean_t status; - - OSM_LOG_ENTER(p_mgr->p_log); - - CL_ASSERT(p_sw && p_sw->p_node); - - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, - "Processing switch 0x%" PRIx64 "\n", - cl_ntoh64(osm_node_get_node_guid(p_sw->p_node))); + osm_madw_context_t context; + osm_dr_path_t *p_path; + ib_api_status_t status; /* Send linear forwarding table blocks to the switch @@ -547,8 +498,7 @@ static boolean_t pipeline_next_lft_block(IN osm_switch_t *p_sw, /* any routing should provide the new_lft */ CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && p_mgr->cache_valid && !p_sw->need_update); - status = FALSE; - goto Exit; + return -1; } p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0)); @@ -556,12 +506,29 @@ static boolean_t pipeline_next_lft_block(IN osm_switch_t *p_sw, context.lft_context.node_guid = osm_node_get_node_guid(p_sw->p_node); context.lft_context.set_method = TRUE; - status = set_next_lft_block(p_sw, p_mgr->sm, &block[0], p_path, - block_id_ho, &context); + if (!osm_switch_get_lft_block(p_sw, block_id_ho, block) || + (!p_sw->need_update && !p_mgr->p_subn->need_update && + !memcmp(block, p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, + IB_SMP_DATA_SIZE))) + return 0; -Exit: - OSM_LOG_EXIT(p_mgr->p_log); - return status; + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, + "Writing FT block %u to switch 0x%" PRIx64 "\n", block_id_ho, + 
cl_ntoh64(context.lft_context.node_guid)); + + status = osm_req_set(p_mgr->sm, p_path, + p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, + IB_SMP_DATA_SIZE, IB_MAD_ATTR_LIN_FWD_TBL, + cl_hton32(block_id_ho), + CL_DISP_MSGID_NONE, &context); + if (status != IB_SUCCESS) { + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: " + "Sending linear fwd. tbl. block failed (%s)\n", + ib_get_err_str(status)); + return -1; + } + + return 0; } /********************************************************************** @@ -919,26 +886,15 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t * m) static void ucast_mgr_pipeline_fwd_tbl(osm_ucast_mgr_t * p_mgr) { - cl_qmap_t *p_sw_tbl; - osm_switch_t *p_sw; - uint16_t block_id_ho = 0; - int sws_notdone; - boolean_t sts; - - p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl; - while (1) { - p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); - sws_notdone = 0; - while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { - sts = pipeline_next_lft_block(p_sw, p_mgr, block_id_ho); - if (sts) - sws_notdone++; - p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); - } - if (!sws_notdone) - break; - block_id_ho++; - } + cl_qmap_t *tbl; + cl_map_item_t *item; + unsigned i, max_block = p_mgr->max_lid / 64 + 1; + + tbl = &p_mgr->p_subn->sw_guid_tbl; + for (i = 0; i < max_block; i++) + for (item = cl_qmap_head(tbl); item != cl_qmap_end(tbl); + item = cl_qmap_next(item)) + set_lft_block((osm_switch_t *)item, p_mgr, i); } static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr) @@ -984,6 +940,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr) **********************************************************************/ void osm_ucast_mgr_set_fwd_table(osm_ucast_mgr_t * p_mgr) { + p_mgr->max_lid = 0; + cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, ucast_mgr_process_top, p_mgr); -- 1.6.4 From Lars.Paul.Huse at Sun.COM Fri Aug 28 02:47:44 2009 From: Lars.Paul.Huse at Sun.COM (Lars Paul Huse) Date: Fri, 28 Aug 2009 11:47:44 +0200 Subject: [ofa-general] 
[PATCH] ibdm/ibnl/* ibnl definition files for Sun IB QDR products Message-ID: <4A97A7C0.8010808@Sun.COM> ibnl definition files for Sun IB QDR products: - 48 port QNEM - 36 port Switch - 72 port Switch - 648 port Switch Signed-off-by: Lars Paul Huse --- diff --git a/ibdm/ibnl/SUNBQNEM48.ibnl b/ibdm/ibnl/SUNBQNEM48.ibnl new file mode 100644 index 0000000..e722bfa --- /dev/null +++ b/ibdm/ibnl/SUNBQNEM48.ibnl @@ -0,0 +1,117 @@ +SYSTEM LEAF,LEAF:4x,LEAF:4X + +NODE SW 36 MT48436 U1 +1 -10G-> P1 +2 -10G-> P2 +3 -10G-> P3 +4 -10G-> P4 +5 -10G-> P5 +6 -10G-> P6 +7 -10G-> P7 +8 -10G-> P8 +9 -10G-> P9 +10 -10G-> P10 +11 -10G-> P11 +12 -10G-> P12 +13 -10G-> P13 +14 -10G-> P14 +15 -10G-> P15 +16 -10G-> P16 +17 -10G-> P17 +18 -10G-> P18 +19 -10G-> P19 +20 -10G-> P20 +21 -10G-> P21 +22 -10G-> P22 +23 -10G-> P23 +24 -10G-> P24 +25 -10G-> P25 +26 -10G-> P26 +27 -10G-> P27 +28 -10G-> P28 +29 -10G-> P29 +30 -10G-> P30 +31 -10G-> P31 +32 -10G-> P32 +33 -10G-> P33 +34 -10G-> P34 +35 -10G-> P35 +36 -10G-> P36 + +TOPSYSTEM SUNBQNEM48,SUN-QNEM + +SUBSYSTEM LEAF SW-A + P1 -10G-> C-A0 + P2 -10G-> C-A1 + P3 -10G-> C-A2 + P4 -10G-> C-A3 + P5 -10G-> C-A4 + P6 -10G-> C-A5 + P7 -10G-> C-A6 + P8 -10G-> C-A7 + P9 -10G-> C-A8 + P10 -10G-> C-A9 + P11 -10G-> C-A10 + P12 -10G-> C-A11 + P13 -10G-> C-A12 + P14 -10G-> C-A13 + P15 -10G-> C-A14 + P16 -10G-> P1 + P17 -10G-> P2 + P18 -10G-> P3 + P19 -10G-> P4 + P20 -10G-> P5 + P21 -10G-> P6 + P22 -10G-> P7 + P23 -10G-> P8 + P24 -10G-> P9 + P25 -10G-> P10 + P26 -10G-> P11 + P27 -10G-> P12 + P28 -10G-> SW-B P28 + P29 -10G-> SW-B P29 + P30 -10G-> SW-B P30 + P31 -10G-> SW-B P31 + P32 -10G-> SW-B P32 + P33 -10G-> SW-B P33 + P34 -10G-> SW-B P34 + P35 -10G-> SW-B P35 + P36 -10G-> SW-B P36 + +SUBSYSTEM LEAF SW-B + P1 -10G-> C-B0 + P2 -10G-> C-B1 + P3 -10G-> C-B2 + P4 -10G-> C-B3 + P5 -10G-> C-B4 + P6 -10G-> C-B5 + P7 -10G-> C-B6 + P8 -10G-> C-B7 + P9 -10G-> C-B8 + P10 -10G-> C-B9 + P11 -10G-> C-B10 + P12 -10G-> C-B11 + P13 -10G-> C-B12 + P14 -10G-> C-B13 + P15 
-10G-> C-B14 + P16 -10G-> P13 + P17 -10G-> P14 + P18 -10G-> P15 + P19 -10G-> P16 + P20 -10G-> P17 + P21 -10G-> P18 + P22 -10G-> P19 + P23 -10G-> P20 + P24 -10G-> P21 + P25 -10G-> P22 + P26 -10G-> P23 + P27 -10G-> P24 + P28 -10G-> SW-A P28 + P29 -10G-> SW-A P29 + P30 -10G-> SW-A P30 + P31 -10G-> SW-A P31 + P32 -10G-> SW-A P32 + P33 -10G-> SW-A P33 + P34 -10G-> SW-A P34 + P35 -10G-> SW-A P35 + P36 -10G-> SW-A P36 diff --git a/ibdm/ibnl/SUNDCS36QDR.ibnl b/ibdm/ibnl/SUNDCS36QDR.ibnl new file mode 100644 index 0000000..aa33d53 --- /dev/null +++ b/ibdm/ibnl/SUNDCS36QDR.ibnl @@ -0,0 +1,42 @@ + +TOPSYSTEM SUNDCS36QDR,NM2-36P + +U1=isChaBma + +NODE SW 36 SUNDCS36QDR U1 + 1 -> C-17A + 2 -> C-17B + 3 -> C-16A + 4 -> C-16B + 5 -> C-15A + 6 -> C-15B + 7 -> C-14A + 8 -> C-14B + 9 -> C-13A + 10 -> C-13B + 11 -> C-12A + 12 -> C-12B + 13 -> C-9B + 14 -> C-9A + 15 -> C-10B + 16 -> C-10A + 17 -> C-11B + 18 -> C-11A + 19 -> C-0B + 20 -> C-0A + 21 -> C-1B + 22 -> C-1A + 23 -> C-2B + 24 -> C-2A + 25 -> C-3B + 26 -> C-3A + 27 -> C-4B + 28 -> C-4A + 29 -> C-5B + 30 -> C-5A + 31 -> C-8A + 32 -> C-8B + 33 -> C-7A + 34 -> C-7B + 35 -> C-6A + 36 -> C-6B diff --git a/ibdm/ibnl/SUNDCS648QDR.ibnl b/ibdm/ibnl/SUNDCS648QDR.ibnl new file mode 100644 index 0000000..a8b6558 --- /dev/null +++ b/ibdm/ibnl/SUNDCS648QDR.ibnl @@ -0,0 +1,2133 @@ +SYSTEM LEAF,LEAF:4x,LEAF:4X + +NODE SW 36 MT48436 U1 +1 -10G-> P1 +2 -10G-> P2 +3 -10G-> P3 +4 -10G-> P4 +5 -10G-> P5 +6 -10G-> P6 +7 -10G-> P7 +8 -10G-> P8 +9 -10G-> P9 +10 -10G-> P10 +11 -10G-> P11 +12 -10G-> P12 +13 -10G-> P13 +14 -10G-> P14 +15 -10G-> P15 +16 -10G-> P16 +17 -10G-> P17 +18 -10G-> P18 +19 -10G-> P19 +20 -10G-> P20 +21 -10G-> P21 +22 -10G-> P22 +23 -10G-> P23 +24 -10G-> P24 +25 -10G-> P25 +26 -10G-> P26 +27 -10G-> P27 +28 -10G-> P28 +29 -10G-> P29 +30 -10G-> P30 +31 -10G-> P31 +32 -10G-> P32 +33 -10G-> P33 +34 -10G-> P34 +35 -10G-> P35 +36 -10G-> P36 + +SYSTEM SPINE,SPINE:4x,SPINE:4X + +NODE SW 36 MT48436 U1 +1 -10G-> P1 +2 -10G-> P2 +3 -10G-> P3 
+4 -10G-> P4 +5 -10G-> P5 +6 -10G-> P6 +7 -10G-> P7 +8 -10G-> P8 +9 -10G-> P9 +10 -10G-> P10 +11 -10G-> P11 +12 -10G-> P12 +13 -10G-> P13 +14 -10G-> P14 +15 -10G-> P15 +16 -10G-> P16 +17 -10G-> P17 +18 -10G-> P18 +19 -10G-> P19 +20 -10G-> P20 +21 -10G-> P21 +22 -10G-> P22 +23 -10G-> P23 +24 -10G-> P24 +25 -10G-> P25 +26 -10G-> P26 +27 -10G-> P27 +28 -10G-> P28 +29 -10G-> P29 +30 -10G-> P30 +31 -10G-> P31 +32 -10G-> P32 +33 -10G-> P33 +34 -10G-> P34 +35 -10G-> P35 +36 -10G-> P36 + +TOPSYSTEM SUNDCS648QDR,SUN-M9-648 + +SUBSYSTEM SPINE fc1A + P1 -10G-> lc1A P13 + P2 -10G-> lc1B P14 + P3 -10G-> lc1C P13 + P4 -10G-> lc1D P14 + P5 -10G-> lc9A P13 + P6 -10G-> lc9C P13 + P7 -10G-> lc9B P14 + P8 -10G-> lc8A P13 + P9 -10G-> lc9D P14 + P10 -10G-> lc8C P13 + P11 -10G-> lc8B P14 + P12 -10G-> lc7A P13 + P13 -10G-> lc6B P14 + P14 -10G-> lc6A P13 + P15 -10G-> lc7D P14 + P16 -10G-> lc7C P13 + P17 -10G-> lc7B P14 + P18 -10G-> lc8D P14 + P19 -10G-> lc2D P14 + P20 -10G-> lc2C P13 + P21 -10G-> lc2B P14 + P22 -10G-> lc2A P13 + P23 -10G-> lc3D P14 + P24 -10G-> lc3B P14 + P25 -10G-> lc3C P13 + P26 -10G-> lc4D P14 + P27 -10G-> lc3A P13 + P28 -10G-> lc4B P14 + P29 -10G-> lc4C P13 + P30 -10G-> lc5D P14 + P31 -10G-> lc6C P13 + P32 -10G-> lc6D P14 + P33 -10G-> lc5A P13 + P34 -10G-> lc5B P14 + P35 -10G-> lc5C P13 + P36 -10G-> lc4A P13 + +SUBSYSTEM SPINE fc1B + P1 -10G-> lc8D P13 + P2 -10G-> lc8A P14 + P3 -10G-> lc8B P13 + P4 -10G-> lc8C P14 + P5 -10G-> lc7D P13 + P6 -10G-> lc7B P13 + P7 -10G-> lc7A P14 + P8 -10G-> lc6D P13 + P9 -10G-> lc7C P14 + P10 -10G-> lc6B P13 + P11 -10G-> lc6A P14 + P12 -10G-> lc5D P13 + P13 -10G-> lc4A P14 + P14 -10G-> lc4D P13 + P15 -10G-> lc5C P14 + P16 -10G-> lc5B P13 + P17 -10G-> lc5A P14 + P18 -10G-> lc6C P14 + P19 -10G-> lc9C P14 + P20 -10G-> lc9B P13 + P21 -10G-> lc9A P14 + P22 -10G-> lc9D P13 + P23 -10G-> lc1C P14 + P24 -10G-> lc1A P14 + P25 -10G-> lc1B P13 + P26 -10G-> lc2C P14 + P27 -10G-> lc1D P13 + P28 -10G-> lc2A P14 + P29 -10G-> lc2B P13 + P30 -10G-> lc3C 
P14 + P31 -10G-> lc4B P13 + P32 -10G-> lc4C P14 + P33 -10G-> lc3D P13 + P34 -10G-> lc3A P14 + P35 -10G-> lc3B P13 + P36 -10G-> lc2D P13 + +SUBSYSTEM SPINE fc2A + P1 -10G-> lc1A P15 + P2 -10G-> lc1B P16 + P3 -10G-> lc1C P15 + P4 -10G-> lc1D P16 + P5 -10G-> lc9A P15 + P6 -10G-> lc9C P15 + P7 -10G-> lc9B P16 + P8 -10G-> lc8A P15 + P9 -10G-> lc9D P16 + P10 -10G-> lc8C P15 + P11 -10G-> lc8B P16 + P12 -10G-> lc7A P15 + P13 -10G-> lc6B P16 + P14 -10G-> lc6A P15 + P15 -10G-> lc7D P16 + P16 -10G-> lc7C P15 + P17 -10G-> lc7B P16 + P18 -10G-> lc8D P16 + P19 -10G-> lc2D P16 + P20 -10G-> lc2C P15 + P21 -10G-> lc2B P16 + P22 -10G-> lc2A P15 + P23 -10G-> lc3D P16 + P24 -10G-> lc3B P16 + P25 -10G-> lc3C P15 + P26 -10G-> lc4D P16 + P27 -10G-> lc3A P15 + P28 -10G-> lc4B P16 + P29 -10G-> lc4C P15 + P30 -10G-> lc5D P16 + P31 -10G-> lc6C P15 + P32 -10G-> lc6D P16 + P33 -10G-> lc5A P15 + P34 -10G-> lc5B P16 + P35 -10G-> lc5C P15 + P36 -10G-> lc4A P15 + +SUBSYSTEM SPINE fc2B + P1 -10G-> lc8D P15 + P2 -10G-> lc8A P16 + P3 -10G-> lc8B P15 + P4 -10G-> lc8C P16 + P5 -10G-> lc7D P15 + P6 -10G-> lc7B P15 + P7 -10G-> lc7A P16 + P8 -10G-> lc6D P15 + P9 -10G-> lc7C P16 + P10 -10G-> lc6B P15 + P11 -10G-> lc6A P16 + P12 -10G-> lc5D P15 + P13 -10G-> lc4A P16 + P14 -10G-> lc4D P15 + P15 -10G-> lc5C P16 + P16 -10G-> lc5B P15 + P17 -10G-> lc5A P16 + P18 -10G-> lc6C P16 + P19 -10G-> lc9C P16 + P20 -10G-> lc9B P15 + P21 -10G-> lc9A P16 + P22 -10G-> lc9D P15 + P23 -10G-> lc1C P16 + P24 -10G-> lc1A P16 + P25 -10G-> lc1B P15 + P26 -10G-> lc2C P16 + P27 -10G-> lc1D P15 + P28 -10G-> lc2A P16 + P29 -10G-> lc2B P15 + P30 -10G-> lc3C P16 + P31 -10G-> lc4B P15 + P32 -10G-> lc4C P16 + P33 -10G-> lc3D P15 + P34 -10G-> lc3A P16 + P35 -10G-> lc3B P15 + P36 -10G-> lc2D P15 + +SUBSYSTEM SPINE fc3A + P1 -10G-> lc1A P17 + P2 -10G-> lc1B P18 + P3 -10G-> lc1C P17 + P4 -10G-> lc1D P18 + P5 -10G-> lc9A P17 + P6 -10G-> lc9C P17 + P7 -10G-> lc9B P18 + P8 -10G-> lc8A P17 + P9 -10G-> lc9D P18 + P10 -10G-> lc8C P17 + P11 -10G-> 
lc8B P18 + P12 -10G-> lc7A P17 + P13 -10G-> lc6B P18 + P14 -10G-> lc6A P17 + P15 -10G-> lc7D P18 + P16 -10G-> lc7C P17 + P17 -10G-> lc7B P18 + P18 -10G-> lc8D P18 + P19 -10G-> lc2D P18 + P20 -10G-> lc2C P17 + P21 -10G-> lc2B P18 + P22 -10G-> lc2A P17 + P23 -10G-> lc3D P18 + P24 -10G-> lc3B P18 + P25 -10G-> lc3C P17 + P26 -10G-> lc4D P18 + P27 -10G-> lc3A P17 + P28 -10G-> lc4B P18 + P29 -10G-> lc4C P17 + P30 -10G-> lc5D P18 + P31 -10G-> lc6C P17 + P32 -10G-> lc6D P18 + P33 -10G-> lc5A P17 + P34 -10G-> lc5B P18 + P35 -10G-> lc5C P17 + P36 -10G-> lc4A P17 + +SUBSYSTEM SPINE fc3B + P1 -10G-> lc8D P17 + P2 -10G-> lc8A P18 + P3 -10G-> lc8B P17 + P4 -10G-> lc8C P18 + P5 -10G-> lc7D P17 + P6 -10G-> lc7B P17 + P7 -10G-> lc7A P18 + P8 -10G-> lc6D P17 + P9 -10G-> lc7C P18 + P10 -10G-> lc6B P17 + P11 -10G-> lc6A P18 + P12 -10G-> lc5D P17 + P13 -10G-> lc4A P18 + P14 -10G-> lc4D P17 + P15 -10G-> lc5C P18 + P16 -10G-> lc5B P17 + P17 -10G-> lc5A P18 + P18 -10G-> lc6C P18 + P19 -10G-> lc9C P18 + P20 -10G-> lc9B P17 + P21 -10G-> lc9A P18 + P22 -10G-> lc9D P17 + P23 -10G-> lc1C P18 + P24 -10G-> lc1A P18 + P25 -10G-> lc1B P17 + P26 -10G-> lc2C P18 + P27 -10G-> lc1D P17 + P28 -10G-> lc2A P18 + P29 -10G-> lc2B P17 + P30 -10G-> lc3C P18 + P31 -10G-> lc4B P17 + P32 -10G-> lc4C P18 + P33 -10G-> lc3D P17 + P34 -10G-> lc3A P18 + P35 -10G-> lc3B P17 + P36 -10G-> lc2D P17 + +SUBSYSTEM SPINE fc4A + P1 -10G-> lc1A P12 + P2 -10G-> lc1B P11 + P3 -10G-> lc1C P12 + P4 -10G-> lc1D P11 + P5 -10G-> lc9A P12 + P6 -10G-> lc9C P12 + P7 -10G-> lc9B P11 + P8 -10G-> lc8A P12 + P9 -10G-> lc9D P11 + P10 -10G-> lc8C P12 + P11 -10G-> lc8B P11 + P12 -10G-> lc7A P12 + P13 -10G-> lc6B P11 + P14 -10G-> lc6A P12 + P15 -10G-> lc7D P11 + P16 -10G-> lc7C P12 + P17 -10G-> lc7B P11 + P18 -10G-> lc8D P11 + P19 -10G-> lc2D P11 + P20 -10G-> lc2C P12 + P21 -10G-> lc2B P11 + P22 -10G-> lc2A P12 + P23 -10G-> lc3D P11 + P24 -10G-> lc3B P11 + P25 -10G-> lc3C P12 + P26 -10G-> lc4D P11 + P27 -10G-> lc3A P12 + P28 -10G-> lc4B P11 + 
P29 -10G-> lc4C P12 + P30 -10G-> lc5D P11 + P31 -10G-> lc6C P12 + P32 -10G-> lc6D P11 + P33 -10G-> lc5A P12 + P34 -10G-> lc5B P11 + P35 -10G-> lc5C P12 + P36 -10G-> lc4A P12 + +SUBSYSTEM SPINE fc4B + P1 -10G-> lc8D P12 + P2 -10G-> lc8A P11 + P3 -10G-> lc8B P12 + P4 -10G-> lc8C P11 + P5 -10G-> lc7D P12 + P6 -10G-> lc7B P12 + P7 -10G-> lc7A P11 + P8 -10G-> lc6D P12 + P9 -10G-> lc7C P11 + P10 -10G-> lc6B P12 + P11 -10G-> lc6A P11 + P12 -10G-> lc5D P12 + P13 -10G-> lc4A P11 + P14 -10G-> lc4D P12 + P15 -10G-> lc5C P11 + P16 -10G-> lc5B P12 + P17 -10G-> lc5A P11 + P18 -10G-> lc6C P11 + P19 -10G-> lc9C P11 + P20 -10G-> lc9B P12 + P21 -10G-> lc9A P11 + P22 -10G-> lc9D P12 + P23 -10G-> lc1C P11 + P24 -10G-> lc1A P11 + P25 -10G-> lc1B P12 + P26 -10G-> lc2C P11 + P27 -10G-> lc1D P12 + P28 -10G-> lc2A P11 + P29 -10G-> lc2B P12 + P30 -10G-> lc3C P11 + P31 -10G-> lc4B P12 + P32 -10G-> lc4C P11 + P33 -10G-> lc3D P12 + P34 -10G-> lc3A P11 + P35 -10G-> lc3B P12 + P36 -10G-> lc2D P12 + +SUBSYSTEM SPINE fc5A + P1 -10G-> lc1A P10 + P2 -10G-> lc1B P9 + P3 -10G-> lc1C P10 + P4 -10G-> lc1D P9 + P5 -10G-> lc9A P10 + P6 -10G-> lc9C P10 + P7 -10G-> lc9B P9 + P8 -10G-> lc8A P10 + P9 -10G-> lc9D P9 + P10 -10G-> lc8C P10 + P11 -10G-> lc8B P9 + P12 -10G-> lc7A P10 + P13 -10G-> lc6B P9 + P14 -10G-> lc6A P10 + P15 -10G-> lc7D P9 + P16 -10G-> lc7C P10 + P17 -10G-> lc7B P9 + P18 -10G-> lc8D P9 + P19 -10G-> lc2D P9 + P20 -10G-> lc2C P10 + P21 -10G-> lc2B P9 + P22 -10G-> lc2A P10 + P23 -10G-> lc3D P9 + P24 -10G-> lc3B P9 + P25 -10G-> lc3C P10 + P26 -10G-> lc4D P9 + P27 -10G-> lc3A P10 + P28 -10G-> lc4B P9 + P29 -10G-> lc4C P10 + P30 -10G-> lc5D P9 + P31 -10G-> lc6C P10 + P32 -10G-> lc6D P9 + P33 -10G-> lc5A P10 + P34 -10G-> lc5B P9 + P35 -10G-> lc5C P10 + P36 -10G-> lc4A P10 + +SUBSYSTEM SPINE fc5B + P1 -10G-> lc8D P10 + P2 -10G-> lc8A P9 + P3 -10G-> lc8B P10 + P4 -10G-> lc8C P9 + P5 -10G-> lc7D P10 + P6 -10G-> lc7B P10 + P7 -10G-> lc7A P9 + P8 -10G-> lc6D P10 + P9 -10G-> lc7C P9 + P10 -10G-> lc6B 
P10 + P11 -10G-> lc6A P9 + P12 -10G-> lc5D P10 + P13 -10G-> lc4A P9 + P14 -10G-> lc4D P10 + P15 -10G-> lc5C P9 + P16 -10G-> lc5B P10 + P17 -10G-> lc5A P9 + P18 -10G-> lc6C P9 + P19 -10G-> lc9C P9 + P20 -10G-> lc9B P10 + P21 -10G-> lc9A P9 + P22 -10G-> lc9D P10 + P23 -10G-> lc1C P9 + P24 -10G-> lc1A P9 + P25 -10G-> lc1B P10 + P26 -10G-> lc2C P9 + P27 -10G-> lc1D P10 + P28 -10G-> lc2A P9 + P29 -10G-> lc2B P10 + P30 -10G-> lc3C P9 + P31 -10G-> lc4B P10 + P32 -10G-> lc4C P9 + P33 -10G-> lc3D P10 + P34 -10G-> lc3A P9 + P35 -10G-> lc3B P10 + P36 -10G-> lc2D P10 + +SUBSYSTEM SPINE fc6A + P1 -10G-> lc1A P8 + P2 -10G-> lc1B P7 + P3 -10G-> lc1C P8 + P4 -10G-> lc1D P7 + P5 -10G-> lc9A P8 + P6 -10G-> lc9C P8 + P7 -10G-> lc9B P7 + P8 -10G-> lc8A P8 + P9 -10G-> lc9D P7 + P10 -10G-> lc8C P8 + P11 -10G-> lc8B P7 + P12 -10G-> lc7A P8 + P13 -10G-> lc6B P7 + P14 -10G-> lc6A P8 + P15 -10G-> lc7D P7 + P16 -10G-> lc7C P8 + P17 -10G-> lc7B P7 + P18 -10G-> lc8D P7 + P19 -10G-> lc2D P7 + P20 -10G-> lc2C P8 + P21 -10G-> lc2B P7 + P22 -10G-> lc2A P8 + P23 -10G-> lc3D P7 + P24 -10G-> lc3B P7 + P25 -10G-> lc3C P8 + P26 -10G-> lc4D P7 + P27 -10G-> lc3A P8 + P28 -10G-> lc4B P7 + P29 -10G-> lc4C P8 + P30 -10G-> lc5D P7 + P31 -10G-> lc6C P8 + P32 -10G-> lc6D P7 + P33 -10G-> lc5A P8 + P34 -10G-> lc5B P7 + P35 -10G-> lc5C P8 + P36 -10G-> lc4A P8 + +SUBSYSTEM SPINE fc6B + P1 -10G-> lc8D P8 + P2 -10G-> lc8A P7 + P3 -10G-> lc8B P8 + P4 -10G-> lc8C P7 + P5 -10G-> lc7D P8 + P6 -10G-> lc7B P8 + P7 -10G-> lc7A P7 + P8 -10G-> lc6D P8 + P9 -10G-> lc7C P7 + P10 -10G-> lc6B P8 + P11 -10G-> lc6A P7 + P12 -10G-> lc5D P8 + P13 -10G-> lc4A P7 + P14 -10G-> lc4D P8 + P15 -10G-> lc5C P7 + P16 -10G-> lc5B P8 + P17 -10G-> lc5A P7 + P18 -10G-> lc6C P7 + P19 -10G-> lc9C P7 + P20 -10G-> lc9B P8 + P21 -10G-> lc9A P7 + P22 -10G-> lc9D P8 + P23 -10G-> lc1C P7 + P24 -10G-> lc1A P7 + P25 -10G-> lc1B P8 + P26 -10G-> lc2C P7 + P27 -10G-> lc1D P8 + P28 -10G-> lc2A P7 + P29 -10G-> lc2B P8 + P30 -10G-> lc3C P7 + P31 -10G-> lc4B P8 
+ P32 -10G-> lc4C P7 + P33 -10G-> lc3D P8 + P34 -10G-> lc3A P7 + P35 -10G-> lc3B P8 + P36 -10G-> lc2D P8 + +SUBSYSTEM SPINE fc7A + P1 -10G-> lc1A P6 + P2 -10G-> lc1B P5 + P3 -10G-> lc1C P6 + P4 -10G-> lc1D P5 + P5 -10G-> lc9A P6 + P6 -10G-> lc9C P6 + P7 -10G-> lc9B P5 + P8 -10G-> lc8A P6 + P9 -10G-> lc9D P5 + P10 -10G-> lc8C P6 + P11 -10G-> lc8B P5 + P12 -10G-> lc7A P6 + P13 -10G-> lc6B P5 + P14 -10G-> lc6A P6 + P15 -10G-> lc7D P5 + P16 -10G-> lc7C P6 + P17 -10G-> lc7B P5 + P18 -10G-> lc8D P5 + P19 -10G-> lc2D P5 + P20 -10G-> lc2C P6 + P21 -10G-> lc2B P5 + P22 -10G-> lc2A P6 + P23 -10G-> lc3D P5 + P24 -10G-> lc3B P5 + P25 -10G-> lc3C P6 + P26 -10G-> lc4D P5 + P27 -10G-> lc3A P6 + P28 -10G-> lc4B P5 + P29 -10G-> lc4C P6 + P30 -10G-> lc5D P5 + P31 -10G-> lc6C P6 + P32 -10G-> lc6D P5 + P33 -10G-> lc5A P6 + P34 -10G-> lc5B P5 + P35 -10G-> lc5C P6 + P36 -10G-> lc4A P6 + +SUBSYSTEM SPINE fc7B + P1 -10G-> lc8D P6 + P2 -10G-> lc8A P5 + P3 -10G-> lc8B P6 + P4 -10G-> lc8C P5 + P5 -10G-> lc7D P6 + P6 -10G-> lc7B P6 + P7 -10G-> lc7A P5 + P8 -10G-> lc6D P6 + P9 -10G-> lc7C P5 + P10 -10G-> lc6B P6 + P11 -10G-> lc6A P5 + P12 -10G-> lc5D P6 + P13 -10G-> lc4A P5 + P14 -10G-> lc4D P6 + P15 -10G-> lc5C P5 + P16 -10G-> lc5B P6 + P17 -10G-> lc5A P5 + P18 -10G-> lc6C P5 + P19 -10G-> lc9C P5 + P20 -10G-> lc9B P6 + P21 -10G-> lc9A P5 + P22 -10G-> lc9D P6 + P23 -10G-> lc1C P5 + P24 -10G-> lc1A P5 + P25 -10G-> lc1B P6 + P26 -10G-> lc2C P5 + P27 -10G-> lc1D P6 + P28 -10G-> lc2A P5 + P29 -10G-> lc2B P6 + P30 -10G-> lc3C P5 + P31 -10G-> lc4B P6 + P32 -10G-> lc4C P5 + P33 -10G-> lc3D P6 + P34 -10G-> lc3A P5 + P35 -10G-> lc3B P6 + P36 -10G-> lc2D P6 + +SUBSYSTEM SPINE fc8A + P1 -10G-> lc1A P4 + P2 -10G-> lc1B P3 + P3 -10G-> lc1C P4 + P4 -10G-> lc1D P3 + P5 -10G-> lc9A P4 + P6 -10G-> lc9C P4 + P7 -10G-> lc9B P3 + P8 -10G-> lc8A P4 + P9 -10G-> lc9D P3 + P10 -10G-> lc8C P4 + P11 -10G-> lc8B P3 + P12 -10G-> lc7A P4 + P13 -10G-> lc6B P3 + P14 -10G-> lc6A P4 + P15 -10G-> lc7D P3 + P16 -10G-> lc7C P4 + 
P17 -10G-> lc7B P3 + P18 -10G-> lc8D P3 + P19 -10G-> lc2D P3 + P20 -10G-> lc2C P4 + P21 -10G-> lc2B P3 + P22 -10G-> lc2A P4 + P23 -10G-> lc3D P3 + P24 -10G-> lc3B P3 + P25 -10G-> lc3C P4 + P26 -10G-> lc4D P3 + P27 -10G-> lc3A P4 + P28 -10G-> lc4B P3 + P29 -10G-> lc4C P4 + P30 -10G-> lc5D P3 + P31 -10G-> lc6C P4 + P32 -10G-> lc6D P3 + P33 -10G-> lc5A P4 + P34 -10G-> lc5B P3 + P35 -10G-> lc5C P4 + P36 -10G-> lc4A P4 + +SUBSYSTEM SPINE fc8B + P1 -10G-> lc8D P4 + P2 -10G-> lc8A P3 + P3 -10G-> lc8B P4 + P4 -10G-> lc8C P3 + P5 -10G-> lc7D P4 + P6 -10G-> lc7B P4 + P7 -10G-> lc7A P3 + P8 -10G-> lc6D P4 + P9 -10G-> lc7C P3 + P10 -10G-> lc6B P4 + P11 -10G-> lc6A P3 + P12 -10G-> lc5D P4 + P13 -10G-> lc4A P3 + P14 -10G-> lc4D P4 + P15 -10G-> lc5C P3 + P16 -10G-> lc5B P4 + P17 -10G-> lc5A P3 + P18 -10G-> lc6C P3 + P19 -10G-> lc9C P3 + P20 -10G-> lc9B P4 + P21 -10G-> lc9A P3 + P22 -10G-> lc9D P4 + P23 -10G-> lc1C P3 + P24 -10G-> lc1A P3 + P25 -10G-> lc1B P4 + P26 -10G-> lc2C P3 + P27 -10G-> lc1D P4 + P28 -10G-> lc2A P3 + P29 -10G-> lc2B P4 + P30 -10G-> lc3C P3 + P31 -10G-> lc4B P4 + P32 -10G-> lc4C P3 + P33 -10G-> lc3D P4 + P34 -10G-> lc3A P3 + P35 -10G-> lc3B P4 + P36 -10G-> lc2D P4 + +SUBSYSTEM SPINE fc9A + P1 -10G-> lc1A P2 + P2 -10G-> lc1B P1 + P3 -10G-> lc1C P2 + P4 -10G-> lc1D P1 + P5 -10G-> lc9A P2 + P6 -10G-> lc9C P2 + P7 -10G-> lc9B P1 + P8 -10G-> lc8A P2 + P9 -10G-> lc9D P1 + P10 -10G-> lc8C P2 + P11 -10G-> lc8B P1 + P12 -10G-> lc7A P2 + P13 -10G-> lc6B P1 + P14 -10G-> lc6A P2 + P15 -10G-> lc7D P1 + P16 -10G-> lc7C P2 + P17 -10G-> lc7B P1 + P18 -10G-> lc8D P1 + P19 -10G-> lc2D P1 + P20 -10G-> lc2C P2 + P21 -10G-> lc2B P1 + P22 -10G-> lc2A P2 + P23 -10G-> lc3D P1 + P24 -10G-> lc3B P1 + P25 -10G-> lc3C P2 + P26 -10G-> lc4D P1 + P27 -10G-> lc3A P2 + P28 -10G-> lc4B P1 + P29 -10G-> lc4C P2 + P30 -10G-> lc5D P1 + P31 -10G-> lc6C P2 + P32 -10G-> lc6D P1 + P33 -10G-> lc5A P2 + P34 -10G-> lc5B P1 + P35 -10G-> lc5C P2 + P36 -10G-> lc4A P2 + +SUBSYSTEM SPINE fc9B + P1 -10G-> 
lc8D P2 + P2 -10G-> lc8A P1 + P3 -10G-> lc8B P2 + P4 -10G-> lc8C P1 + P5 -10G-> lc7D P2 + P6 -10G-> lc7B P2 + P7 -10G-> lc7A P1 + P8 -10G-> lc6D P2 + P9 -10G-> lc7C P1 + P10 -10G-> lc6B P2 + P11 -10G-> lc6A P1 + P12 -10G-> lc5D P2 + P13 -10G-> lc4A P1 + P14 -10G-> lc4D P2 + P15 -10G-> lc5C P1 + P16 -10G-> lc5B P2 + P17 -10G-> lc5A P1 + P18 -10G-> lc6C P1 + P19 -10G-> lc9C P1 + P20 -10G-> lc9B P2 + P21 -10G-> lc9A P1 + P22 -10G-> lc9D P2 + P23 -10G-> lc1C P1 + P24 -10G-> lc1A P1 + P25 -10G-> lc1B P2 + P26 -10G-> lc2C P1 + P27 -10G-> lc1D P2 + P28 -10G-> lc2A P1 + P29 -10G-> lc2B P2 + P30 -10G-> lc3C P1 + P31 -10G-> lc4B P2 + P32 -10G-> lc4C P1 + P33 -10G-> lc3D P2 + P34 -10G-> lc3A P1 + P35 -10G-> lc3B P2 + P36 -10G-> lc2D P2 + +SUBSYSTEM LEAF lc1A + P1 -10G-> fc9B P24 + P2 -10G-> fc9A P1 + P3 -10G-> fc8B P24 + P4 -10G-> fc8A P1 + P5 -10G-> fc7B P24 + P6 -10G-> fc7A P1 + P7 -10G-> fc6B P24 + P8 -10G-> fc6A P1 + P9 -10G-> fc5B P24 + P10 -10G-> fc5A P1 + P11 -10G-> fc4B P24 + P12 -10G-> fc4A P1 + P13 -10G-> fc1A P1 + P14 -10G-> fc1B P24 + P15 -10G-> fc2A P1 + P16 -10G-> fc2B P24 + P17 -10G-> fc3A P1 + P18 -10G-> fc3B P24 + P19 -10G-> lc1-0A/P3 + P20 -10G-> lc1-0B/P3 + P21 -10G-> lc1-0B/P2 + P22 -10G-> lc1-0B/P1 + P23 -10G-> lc1-0A/P2 + P24 -10G-> lc1-0A/P1 + P25 -10G-> lc1-1A/P3 + P26 -10G-> lc1-1B/P3 + P27 -10G-> lc1-1B/P2 + P28 -10G-> lc1-1B/P1 + P29 -10G-> lc1-1A/P2 + P30 -10G-> lc1-1A/P1 + P31 -10G-> lc1-2A/P1 + P32 -10G-> lc1-2A/P2 + P33 -10G-> lc1-2B/P1 + P34 -10G-> lc1-2B/P2 + P35 -10G-> lc1-2B/P3 + P36 -10G-> lc1-2A/P3 + +SUBSYSTEM LEAF lc1B + P1 -10G-> fc9A P2 + P2 -10G-> fc9B P25 + P3 -10G-> fc8A P2 + P4 -10G-> fc8B P25 + P5 -10G-> fc7A P2 + P6 -10G-> fc7B P25 + P7 -10G-> fc6A P2 + P8 -10G-> fc6B P25 + P9 -10G-> fc5A P2 + P10 -10G-> fc5B P25 + P11 -10G-> fc4A P2 + P12 -10G-> fc4B P25 + P13 -10G-> fc1B P25 + P14 -10G-> fc1A P2 + P15 -10G-> fc2B P25 + P16 -10G-> fc2A P2 + P17 -10G-> fc3B P25 + P18 -10G-> fc3A P2 + P19 -10G-> lc1-3A/P3 + P20 -10G-> lc1-3B/P3 + 
P21 -10G-> lc1-3B/P2 + P22 -10G-> lc1-3B/P1 + P23 -10G-> lc1-3A/P2 + P24 -10G-> lc1-3A/P1 + P25 -10G-> lc1-4A/P3 + P26 -10G-> lc1-4B/P3 + P27 -10G-> lc1-4B/P2 + P28 -10G-> lc1-4B/P1 + P29 -10G-> lc1-4A/P2 + P30 -10G-> lc1-4A/P1 + P31 -10G-> lc1-5A/P1 + P32 -10G-> lc1-5A/P2 + P33 -10G-> lc1-5B/P1 + P34 -10G-> lc1-5B/P2 + P35 -10G-> lc1-5B/P3 + P36 -10G-> lc1-5A/P3 + +SUBSYSTEM LEAF lc1C + P1 -10G-> fc9B P23 + P2 -10G-> fc9A P3 + P3 -10G-> fc8B P23 + P4 -10G-> fc8A P3 + P5 -10G-> fc7B P23 + P6 -10G-> fc7A P3 + P7 -10G-> fc6B P23 + P8 -10G-> fc6A P3 + P9 -10G-> fc5B P23 + P10 -10G-> fc5A P3 + P11 -10G-> fc4B P23 + P12 -10G-> fc4A P3 + P13 -10G-> fc1A P3 + P14 -10G-> fc1B P23 + P15 -10G-> fc2A P3 + P16 -10G-> fc2B P23 + P17 -10G-> fc3A P3 + P18 -10G-> fc3B P23 + P19 -10G-> lc1-6A/P3 + P20 -10G-> lc1-6B/P3 + P21 -10G-> lc1-6B/P2 + P22 -10G-> lc1-6B/P1 + P23 -10G-> lc1-6A/P2 + P24 -10G-> lc1-6A/P1 + P25 -10G-> lc1-7A/P3 + P26 -10G-> lc1-7B/P3 + P27 -10G-> lc1-7B/P2 + P28 -10G-> lc1-7B/P1 + P29 -10G-> lc1-7A/P2 + P30 -10G-> lc1-7A/P1 + P31 -10G-> lc1-8A/P1 + P32 -10G-> lc1-8A/P2 + P33 -10G-> lc1-8B/P1 + P34 -10G-> lc1-8B/P2 + P35 -10G-> lc1-8B/P3 + P36 -10G-> lc1-8A/P3 + +SUBSYSTEM LEAF lc1D + P1 -10G-> fc9A P4 + P2 -10G-> fc9B P27 + P3 -10G-> fc8A P4 + P4 -10G-> fc8B P27 + P5 -10G-> fc7A P4 + P6 -10G-> fc7B P27 + P7 -10G-> fc6A P4 + P8 -10G-> fc6B P27 + P9 -10G-> fc5A P4 + P10 -10G-> fc5B P27 + P11 -10G-> fc4A P4 + P12 -10G-> fc4B P27 + P13 -10G-> fc1B P27 + P14 -10G-> fc1A P4 + P15 -10G-> fc2B P27 + P16 -10G-> fc2A P4 + P17 -10G-> fc3B P27 + P18 -10G-> fc3A P4 + P19 -10G-> lc1-9A/P3 + P20 -10G-> lc1-9B/P3 + P21 -10G-> lc1-9B/P2 + P22 -10G-> lc1-9B/P1 + P23 -10G-> lc1-9A/P2 + P24 -10G-> lc1-9A/P1 + P25 -10G-> lc1-10A/P3 + P26 -10G-> lc1-10B/P3 + P27 -10G-> lc1-10B/P2 + P28 -10G-> lc1-10B/P1 + P29 -10G-> lc1-10A/P2 + P30 -10G-> lc1-10A/P1 + P31 -10G-> lc1-11A/P1 + P32 -10G-> lc1-11A/P2 + P33 -10G-> lc1-11B/P1 + P34 -10G-> lc1-11B/P2 + P35 -10G-> lc1-11B/P3 + P36 -10G-> 
lc1-11A/P3 + +SUBSYSTEM LEAF lc2A + P1 -10G-> fc9B P28 + P2 -10G-> fc9A P22 + P3 -10G-> fc8B P28 + P4 -10G-> fc8A P22 + P5 -10G-> fc7B P28 + P6 -10G-> fc7A P22 + P7 -10G-> fc6B P28 + P8 -10G-> fc6A P22 + P9 -10G-> fc5B P28 + P10 -10G-> fc5A P22 + P11 -10G-> fc4B P28 + P12 -10G-> fc4A P22 + P13 -10G-> fc1A P22 + P14 -10G-> fc1B P28 + P15 -10G-> fc2A P22 + P16 -10G-> fc2B P28 + P17 -10G-> fc3A P22 + P18 -10G-> fc3B P28 + P19 -10G-> lc2-0A/P3 + P20 -10G-> lc2-0B/P3 + P21 -10G-> lc2-0B/P2 + P22 -10G-> lc2-0B/P1 + P23 -10G-> lc2-0A/P2 + P24 -10G-> lc2-0A/P1 + P25 -10G-> lc2-1A/P3 + P26 -10G-> lc2-1B/P3 + P27 -10G-> lc2-1B/P2 + P28 -10G-> lc2-1B/P1 + P29 -10G-> lc2-1A/P2 + P30 -10G-> lc2-1A/P1 + P31 -10G-> lc2-2A/P1 + P32 -10G-> lc2-2A/P2 + P33 -10G-> lc2-2B/P1 + P34 -10G-> lc2-2B/P2 + P35 -10G-> lc2-2B/P3 + P36 -10G-> lc2-2A/P3 + +SUBSYSTEM LEAF lc2B + P1 -10G-> fc9A P21 + P2 -10G-> fc9B P29 + P3 -10G-> fc8A P21 + P4 -10G-> fc8B P29 + P5 -10G-> fc7A P21 + P6 -10G-> fc7B P29 + P7 -10G-> fc6A P21 + P8 -10G-> fc6B P29 + P9 -10G-> fc5A P21 + P10 -10G-> fc5B P29 + P11 -10G-> fc4A P21 + P12 -10G-> fc4B P29 + P13 -10G-> fc1B P29 + P14 -10G-> fc1A P21 + P15 -10G-> fc2B P29 + P16 -10G-> fc2A P21 + P17 -10G-> fc3B P29 + P18 -10G-> fc3A P21 + P19 -10G-> lc2-3A/P3 + P20 -10G-> lc2-3B/P3 + P21 -10G-> lc2-3B/P2 + P22 -10G-> lc2-3B/P1 + P23 -10G-> lc2-3A/P2 + P24 -10G-> lc2-3A/P1 + P25 -10G-> lc2-4A/P3 + P26 -10G-> lc2-4B/P3 + P27 -10G-> lc2-4B/P2 + P28 -10G-> lc2-4B/P1 + P29 -10G-> lc2-4A/P2 + P30 -10G-> lc2-4A/P1 + P31 -10G-> lc2-5A/P1 + P32 -10G-> lc2-5A/P2 + P33 -10G-> lc2-5B/P1 + P34 -10G-> lc2-5B/P2 + P35 -10G-> lc2-5B/P3 + P36 -10G-> lc2-5A/P3 + +SUBSYSTEM LEAF lc2C + P1 -10G-> fc9B P26 + P2 -10G-> fc9A P20 + P3 -10G-> fc8B P26 + P4 -10G-> fc8A P20 + P5 -10G-> fc7B P26 + P6 -10G-> fc7A P20 + P7 -10G-> fc6B P26 + P8 -10G-> fc6A P20 + P9 -10G-> fc5B P26 + P10 -10G-> fc5A P20 + P11 -10G-> fc4B P26 + P12 -10G-> fc4A P20 + P13 -10G-> fc1A P20 + P14 -10G-> fc1B P26 + P15 -10G-> fc2A 
P20 + P16 -10G-> fc2B P26 + P17 -10G-> fc3A P20 + P18 -10G-> fc3B P26 + P19 -10G-> lc2-6A/P3 + P20 -10G-> lc2-6B/P3 + P21 -10G-> lc2-6B/P2 + P22 -10G-> lc2-6B/P1 + P23 -10G-> lc2-6A/P2 + P24 -10G-> lc2-6A/P1 + P25 -10G-> lc2-7A/P3 + P26 -10G-> lc2-7B/P3 + P27 -10G-> lc2-7B/P2 + P28 -10G-> lc2-7B/P1 + P29 -10G-> lc2-7A/P2 + P30 -10G-> lc2-7A/P1 + P31 -10G-> lc2-8A/P1 + P32 -10G-> lc2-8A/P2 + P33 -10G-> lc2-8B/P1 + P34 -10G-> lc2-8B/P2 + P35 -10G-> lc2-8B/P3 + P36 -10G-> lc2-8A/P3 + +SUBSYSTEM LEAF lc2D + P1 -10G-> fc9A P19 + P2 -10G-> fc9B P36 + P3 -10G-> fc8A P19 + P4 -10G-> fc8B P36 + P5 -10G-> fc7A P19 + P6 -10G-> fc7B P36 + P7 -10G-> fc6A P19 + P8 -10G-> fc6B P36 + P9 -10G-> fc5A P19 + P10 -10G-> fc5B P36 + P11 -10G-> fc4A P19 + P12 -10G-> fc4B P36 + P13 -10G-> fc1B P36 + P14 -10G-> fc1A P19 + P15 -10G-> fc2B P36 + P16 -10G-> fc2A P19 + P17 -10G-> fc3B P36 + P18 -10G-> fc3A P19 + P19 -10G-> lc2-9A/P3 + P20 -10G-> lc2-9B/P3 + P21 -10G-> lc2-9B/P2 + P22 -10G-> lc2-9B/P1 + P23 -10G-> lc2-9A/P2 + P24 -10G-> lc2-9A/P1 + P25 -10G-> lc2-10A/P3 + P26 -10G-> lc2-10B/P3 + P27 -10G-> lc2-10B/P2 + P28 -10G-> lc2-10B/P1 + P29 -10G-> lc2-10A/P2 + P30 -10G-> lc2-10A/P1 + P31 -10G-> lc2-11A/P1 + P32 -10G-> lc2-11A/P2 + P33 -10G-> lc2-11B/P1 + P34 -10G-> lc2-11B/P2 + P35 -10G-> lc2-11B/P3 + P36 -10G-> lc2-11A/P3 + +SUBSYSTEM LEAF lc3A + P1 -10G-> fc9B P34 + P2 -10G-> fc9A P27 + P3 -10G-> fc8B P34 + P4 -10G-> fc8A P27 + P5 -10G-> fc7B P34 + P6 -10G-> fc7A P27 + P7 -10G-> fc6B P34 + P8 -10G-> fc6A P27 + P9 -10G-> fc5B P34 + P10 -10G-> fc5A P27 + P11 -10G-> fc4B P34 + P12 -10G-> fc4A P27 + P13 -10G-> fc1A P27 + P14 -10G-> fc1B P34 + P15 -10G-> fc2A P27 + P16 -10G-> fc2B P34 + P17 -10G-> fc3A P27 + P18 -10G-> fc3B P34 + P19 -10G-> lc3-0A/P3 + P20 -10G-> lc3-0B/P3 + P21 -10G-> lc3-0B/P2 + P22 -10G-> lc3-0B/P1 + P23 -10G-> lc3-0A/P2 + P24 -10G-> lc3-0A/P1 + P25 -10G-> lc3-1A/P3 + P26 -10G-> lc3-1B/P3 + P27 -10G-> lc3-1B/P2 + P28 -10G-> lc3-1B/P1 + P29 -10G-> lc3-1A/P2 + P30 -10G-> 
lc3-1A/P1 + P31 -10G-> lc3-2A/P1 + P32 -10G-> lc3-2A/P2 + P33 -10G-> lc3-2B/P1 + P34 -10G-> lc3-2B/P2 + P35 -10G-> lc3-2B/P3 + P36 -10G-> lc3-2A/P3 + +SUBSYSTEM LEAF lc3B + P1 -10G-> fc9A P24 + P2 -10G-> fc9B P35 + P3 -10G-> fc8A P24 + P4 -10G-> fc8B P35 + P5 -10G-> fc7A P24 + P6 -10G-> fc7B P35 + P7 -10G-> fc6A P24 + P8 -10G-> fc6B P35 + P9 -10G-> fc5A P24 + P10 -10G-> fc5B P35 + P11 -10G-> fc4A P24 + P12 -10G-> fc4B P35 + P13 -10G-> fc1B P35 + P14 -10G-> fc1A P24 + P15 -10G-> fc2B P35 + P16 -10G-> fc2A P24 + P17 -10G-> fc3B P35 + P18 -10G-> fc3A P24 + P19 -10G-> lc3-3A/P3 + P20 -10G-> lc3-3B/P3 + P21 -10G-> lc3-3B/P2 + P22 -10G-> lc3-3B/P1 + P23 -10G-> lc3-3A/P2 + P24 -10G-> lc3-3A/P1 + P25 -10G-> lc3-4A/P3 + P26 -10G-> lc3-4B/P3 + P27 -10G-> lc3-4B/P2 + P28 -10G-> lc3-4B/P1 + P29 -10G-> lc3-4A/P2 + P30 -10G-> lc3-4A/P1 + P31 -10G-> lc3-5A/P1 + P32 -10G-> lc3-5A/P2 + P33 -10G-> lc3-5B/P1 + P34 -10G-> lc3-5B/P2 + P35 -10G-> lc3-5B/P3 + P36 -10G-> lc3-5A/P3 + +SUBSYSTEM LEAF lc3C + P1 -10G-> fc9B P30 + P2 -10G-> fc9A P25 + P3 -10G-> fc8B P30 + P4 -10G-> fc8A P25 + P5 -10G-> fc7B P30 + P6 -10G-> fc7A P25 + P7 -10G-> fc6B P30 + P8 -10G-> fc6A P25 + P9 -10G-> fc5B P30 + P10 -10G-> fc5A P25 + P11 -10G-> fc4B P30 + P12 -10G-> fc4A P25 + P13 -10G-> fc1A P25 + P14 -10G-> fc1B P30 + P15 -10G-> fc2A P25 + P16 -10G-> fc2B P30 + P17 -10G-> fc3A P25 + P18 -10G-> fc3B P30 + P19 -10G-> lc3-6A/P3 + P20 -10G-> lc3-6B/P3 + P21 -10G-> lc3-6B/P2 + P22 -10G-> lc3-6B/P1 + P23 -10G-> lc3-6A/P2 + P24 -10G-> lc3-6A/P1 + P25 -10G-> lc3-7A/P3 + P26 -10G-> lc3-7B/P3 + P27 -10G-> lc3-7B/P2 + P28 -10G-> lc3-7B/P1 + P29 -10G-> lc3-7A/P2 + P30 -10G-> lc3-7A/P1 + P31 -10G-> lc3-8A/P1 + P32 -10G-> lc3-8A/P2 + P33 -10G-> lc3-8B/P1 + P34 -10G-> lc3-8B/P2 + P35 -10G-> lc3-8B/P3 + P36 -10G-> lc3-8A/P3 + +SUBSYSTEM LEAF lc3D + P1 -10G-> fc9A P23 + P2 -10G-> fc9B P33 + P3 -10G-> fc8A P23 + P4 -10G-> fc8B P33 + P5 -10G-> fc7A P23 + P6 -10G-> fc7B P33 + P7 -10G-> fc6A P23 + P8 -10G-> fc6B P33 + P9 -10G-> 
fc5A P23 + P10 -10G-> fc5B P33 + P11 -10G-> fc4A P23 + P12 -10G-> fc4B P33 + P13 -10G-> fc1B P33 + P14 -10G-> fc1A P23 + P15 -10G-> fc2B P33 + P16 -10G-> fc2A P23 + P17 -10G-> fc3B P33 + P18 -10G-> fc3A P23 + P19 -10G-> lc3-9A/P3 + P20 -10G-> lc3-9B/P3 + P21 -10G-> lc3-9B/P2 + P22 -10G-> lc3-9B/P1 + P23 -10G-> lc3-9A/P2 + P24 -10G-> lc3-9A/P1 + P25 -10G-> lc3-10A/P3 + P26 -10G-> lc3-10B/P3 + P27 -10G-> lc3-10B/P2 + P28 -10G-> lc3-10B/P1 + P29 -10G-> lc3-10A/P2 + P30 -10G-> lc3-10A/P1 + P31 -10G-> lc3-11A/P1 + P32 -10G-> lc3-11A/P2 + P33 -10G-> lc3-11B/P1 + P34 -10G-> lc3-11B/P2 + P35 -10G-> lc3-11B/P3 + P36 -10G-> lc3-11A/P3 + +SUBSYSTEM LEAF lc4A + P1 -10G-> fc9B P13 + P2 -10G-> fc9A P36 + P3 -10G-> fc8B P13 + P4 -10G-> fc8A P36 + P5 -10G-> fc7B P13 + P6 -10G-> fc7A P36 + P7 -10G-> fc6B P13 + P8 -10G-> fc6A P36 + P9 -10G-> fc5B P13 + P10 -10G-> fc5A P36 + P11 -10G-> fc4B P13 + P12 -10G-> fc4A P36 + P13 -10G-> fc1A P36 + P14 -10G-> fc1B P13 + P15 -10G-> fc2A P36 + P16 -10G-> fc2B P13 + P17 -10G-> fc3A P36 + P18 -10G-> fc3B P13 + P19 -10G-> lc4-0A/P3 + P20 -10G-> lc4-0B/P3 + P21 -10G-> lc4-0B/P2 + P22 -10G-> lc4-0B/P1 + P23 -10G-> lc4-0A/P2 + P24 -10G-> lc4-0A/P1 + P25 -10G-> lc4-1A/P3 + P26 -10G-> lc4-1B/P3 + P27 -10G-> lc4-1B/P2 + P28 -10G-> lc4-1B/P1 + P29 -10G-> lc4-1A/P2 + P30 -10G-> lc4-1A/P1 + P31 -10G-> lc4-2A/P1 + P32 -10G-> lc4-2A/P2 + P33 -10G-> lc4-2B/P1 + P34 -10G-> lc4-2B/P2 + P35 -10G-> lc4-2B/P3 + P36 -10G-> lc4-2A/P3 + +SUBSYSTEM LEAF lc4B + P1 -10G-> fc9A P28 + P2 -10G-> fc9B P31 + P3 -10G-> fc8A P28 + P4 -10G-> fc8B P31 + P5 -10G-> fc7A P28 + P6 -10G-> fc7B P31 + P7 -10G-> fc6A P28 + P8 -10G-> fc6B P31 + P9 -10G-> fc5A P28 + P10 -10G-> fc5B P31 + P11 -10G-> fc4A P28 + P12 -10G-> fc4B P31 + P13 -10G-> fc1B P31 + P14 -10G-> fc1A P28 + P15 -10G-> fc2B P31 + P16 -10G-> fc2A P28 + P17 -10G-> fc3B P31 + P18 -10G-> fc3A P28 + P19 -10G-> lc4-3A/P3 + P20 -10G-> lc4-3B/P3 + P21 -10G-> lc4-3B/P2 + P22 -10G-> lc4-3B/P1 + P23 -10G-> lc4-3A/P2 + P24 -10G-> 
lc4-3A/P1 + P25 -10G-> lc4-4A/P3 + P26 -10G-> lc4-4B/P3 + P27 -10G-> lc4-4B/P2 + P28 -10G-> lc4-4B/P1 + P29 -10G-> lc4-4A/P2 + P30 -10G-> lc4-4A/P1 + P31 -10G-> lc4-5A/P1 + P32 -10G-> lc4-5A/P2 + P33 -10G-> lc4-5B/P1 + P34 -10G-> lc4-5B/P2 + P35 -10G-> lc4-5B/P3 + P36 -10G-> lc4-5A/P3 + +SUBSYSTEM LEAF lc4C + P1 -10G-> fc9B P32 + P2 -10G-> fc9A P29 + P3 -10G-> fc8B P32 + P4 -10G-> fc8A P29 + P5 -10G-> fc7B P32 + P6 -10G-> fc7A P29 + P7 -10G-> fc6B P32 + P8 -10G-> fc6A P29 + P9 -10G-> fc5B P32 + P10 -10G-> fc5A P29 + P11 -10G-> fc4B P32 + P12 -10G-> fc4A P29 + P13 -10G-> fc1A P29 + P14 -10G-> fc1B P32 + P15 -10G-> fc2A P29 + P16 -10G-> fc2B P32 + P17 -10G-> fc3A P29 + P18 -10G-> fc3B P32 + P19 -10G-> lc4-6A/P3 + P20 -10G-> lc4-6B/P3 + P21 -10G-> lc4-6B/P2 + P22 -10G-> lc4-6B/P1 + P23 -10G-> lc4-6A/P2 + P24 -10G-> lc4-6A/P1 + P25 -10G-> lc4-7A/P3 + P26 -10G-> lc4-7B/P3 + P27 -10G-> lc4-7B/P2 + P28 -10G-> lc4-7B/P1 + P29 -10G-> lc4-7A/P2 + P30 -10G-> lc4-7A/P1 + P31 -10G-> lc4-8A/P1 + P32 -10G-> lc4-8A/P2 + P33 -10G-> lc4-8B/P1 + P34 -10G-> lc4-8B/P2 + P35 -10G-> lc4-8B/P3 + P36 -10G-> lc4-8A/P3 + +SUBSYSTEM LEAF lc4D + P1 -10G-> fc9A P26 + P2 -10G-> fc9B P14 + P3 -10G-> fc8A P26 + P4 -10G-> fc8B P14 + P5 -10G-> fc7A P26 + P6 -10G-> fc7B P14 + P7 -10G-> fc6A P26 + P8 -10G-> fc6B P14 + P9 -10G-> fc5A P26 + P10 -10G-> fc5B P14 + P11 -10G-> fc4A P26 + P12 -10G-> fc4B P14 + P13 -10G-> fc1B P14 + P14 -10G-> fc1A P26 + P15 -10G-> fc2B P14 + P16 -10G-> fc2A P26 + P17 -10G-> fc3B P14 + P18 -10G-> fc3A P26 + P19 -10G-> lc4-9A/P3 + P20 -10G-> lc4-9B/P3 + P21 -10G-> lc4-9B/P2 + P22 -10G-> lc4-9B/P1 + P23 -10G-> lc4-9A/P2 + P24 -10G-> lc4-9A/P1 + P25 -10G-> lc4-10A/P3 + P26 -10G-> lc4-10B/P3 + P27 -10G-> lc4-10B/P2 + P28 -10G-> lc4-10B/P1 + P29 -10G-> lc4-10A/P2 + P30 -10G-> lc4-10A/P1 + P31 -10G-> lc4-11A/P1 + P32 -10G-> lc4-11A/P2 + P33 -10G-> lc4-11B/P1 + P34 -10G-> lc4-11B/P2 + P35 -10G-> lc4-11B/P3 + P36 -10G-> lc4-11A/P3 + +SUBSYSTEM LEAF lc5A + P1 -10G-> fc9B P17 + P2 
-10G-> fc9A P33 + P3 -10G-> fc8B P17 + P4 -10G-> fc8A P33 + P5 -10G-> fc7B P17 + P6 -10G-> fc7A P33 + P7 -10G-> fc6B P17 + P8 -10G-> fc6A P33 + P9 -10G-> fc5B P17 + P10 -10G-> fc5A P33 + P11 -10G-> fc4B P17 + P12 -10G-> fc4A P33 + P13 -10G-> fc1A P33 + P14 -10G-> fc1B P17 + P15 -10G-> fc2A P33 + P16 -10G-> fc2B P17 + P17 -10G-> fc3A P33 + P18 -10G-> fc3B P17 + P19 -10G-> lc5-0A/P3 + P20 -10G-> lc5-0B/P3 + P21 -10G-> lc5-0B/P2 + P22 -10G-> lc5-0B/P1 + P23 -10G-> lc5-0A/P2 + P24 -10G-> lc5-0A/P1 + P25 -10G-> lc5-1A/P3 + P26 -10G-> lc5-1B/P3 + P27 -10G-> lc5-1B/P2 + P28 -10G-> lc5-1B/P1 + P29 -10G-> lc5-1A/P2 + P30 -10G-> lc5-1A/P1 + P31 -10G-> lc5-2A/P1 + P32 -10G-> lc5-2A/P2 + P33 -10G-> lc5-2B/P1 + P34 -10G-> lc5-2B/P2 + P35 -10G-> lc5-2B/P3 + P36 -10G-> lc5-2A/P3 + +SUBSYSTEM LEAF lc5B + P1 -10G-> fc9A P34 + P2 -10G-> fc9B P16 + P3 -10G-> fc8A P34 + P4 -10G-> fc8B P16 + P5 -10G-> fc7A P34 + P6 -10G-> fc7B P16 + P7 -10G-> fc6A P34 + P8 -10G-> fc6B P16 + P9 -10G-> fc5A P34 + P10 -10G-> fc5B P16 + P11 -10G-> fc4A P34 + P12 -10G-> fc4B P16 + P13 -10G-> fc1B P16 + P14 -10G-> fc1A P34 + P15 -10G-> fc2B P16 + P16 -10G-> fc2A P34 + P17 -10G-> fc3B P16 + P18 -10G-> fc3A P34 + P19 -10G-> lc5-3A/P3 + P20 -10G-> lc5-3B/P3 + P21 -10G-> lc5-3B/P2 + P22 -10G-> lc5-3B/P1 + P23 -10G-> lc5-3A/P2 + P24 -10G-> lc5-3A/P1 + P25 -10G-> lc5-4A/P3 + P26 -10G-> lc5-4B/P3 + P27 -10G-> lc5-4B/P2 + P28 -10G-> lc5-4B/P1 + P29 -10G-> lc5-4A/P2 + P30 -10G-> lc5-4A/P1 + P31 -10G-> lc5-5A/P1 + P32 -10G-> lc5-5A/P2 + P33 -10G-> lc5-5B/P1 + P34 -10G-> lc5-5B/P2 + P35 -10G-> lc5-5B/P3 + P36 -10G-> lc5-5A/P3 + +SUBSYSTEM LEAF lc5C + P1 -10G-> fc9B P15 + P2 -10G-> fc9A P35 + P3 -10G-> fc8B P15 + P4 -10G-> fc8A P35 + P5 -10G-> fc7B P15 + P6 -10G-> fc7A P35 + P7 -10G-> fc6B P15 + P8 -10G-> fc6A P35 + P9 -10G-> fc5B P15 + P10 -10G-> fc5A P35 + P11 -10G-> fc4B P15 + P12 -10G-> fc4A P35 + P13 -10G-> fc1A P35 + P14 -10G-> fc1B P15 + P15 -10G-> fc2A P35 + P16 -10G-> fc2B P15 + P17 -10G-> fc3A P35 + P18 -10G-> 
fc3B P15 + P19 -10G-> lc5-6A/P3 + P20 -10G-> lc5-6B/P3 + P21 -10G-> lc5-6B/P2 + P22 -10G-> lc5-6B/P1 + P23 -10G-> lc5-6A/P2 + P24 -10G-> lc5-6A/P1 + P25 -10G-> lc5-7A/P3 + P26 -10G-> lc5-7B/P3 + P27 -10G-> lc5-7B/P2 + P28 -10G-> lc5-7B/P1 + P29 -10G-> lc5-7A/P2 + P30 -10G-> lc5-7A/P1 + P31 -10G-> lc5-8A/P1 + P32 -10G-> lc5-8A/P2 + P33 -10G-> lc5-8B/P1 + P34 -10G-> lc5-8B/P2 + P35 -10G-> lc5-8B/P3 + P36 -10G-> lc5-8A/P3 + +SUBSYSTEM LEAF lc5D + P1 -10G-> fc9A P30 + P2 -10G-> fc9B P12 + P3 -10G-> fc8A P30 + P4 -10G-> fc8B P12 + P5 -10G-> fc7A P30 + P6 -10G-> fc7B P12 + P7 -10G-> fc6A P30 + P8 -10G-> fc6B P12 + P9 -10G-> fc5A P30 + P10 -10G-> fc5B P12 + P11 -10G-> fc4A P30 + P12 -10G-> fc4B P12 + P13 -10G-> fc1B P12 + P14 -10G-> fc1A P30 + P15 -10G-> fc2B P12 + P16 -10G-> fc2A P30 + P17 -10G-> fc3B P12 + P18 -10G-> fc3A P30 + P19 -10G-> lc5-9A/P3 + P20 -10G-> lc5-9B/P3 + P21 -10G-> lc5-9B/P2 + P22 -10G-> lc5-9B/P1 + P23 -10G-> lc5-9A/P2 + P24 -10G-> lc5-9A/P1 + P25 -10G-> lc5-10A/P3 + P26 -10G-> lc5-10B/P3 + P27 -10G-> lc5-10B/P2 + P28 -10G-> lc5-10B/P1 + P29 -10G-> lc5-10A/P2 + P30 -10G-> lc5-10A/P1 + P31 -10G-> lc5-11A/P1 + P32 -10G-> lc5-11A/P2 + P33 -10G-> lc5-11B/P1 + P34 -10G-> lc5-11B/P2 + P35 -10G-> lc5-11B/P3 + P36 -10G-> lc5-11A/P3 + +SUBSYSTEM LEAF lc6A + P1 -10G-> fc9B P11 + P2 -10G-> fc9A P14 + P3 -10G-> fc8B P11 + P4 -10G-> fc8A P14 + P5 -10G-> fc7B P11 + P6 -10G-> fc7A P14 + P7 -10G-> fc6B P11 + P8 -10G-> fc6A P14 + P9 -10G-> fc5B P11 + P10 -10G-> fc5A P14 + P11 -10G-> fc4B P11 + P12 -10G-> fc4A P14 + P13 -10G-> fc1A P14 + P14 -10G-> fc1B P11 + P15 -10G-> fc2A P14 + P16 -10G-> fc2B P11 + P17 -10G-> fc3A P14 + P18 -10G-> fc3B P11 + P19 -10G-> lc6-0A/P3 + P20 -10G-> lc6-0B/P3 + P21 -10G-> lc6-0B/P2 + P22 -10G-> lc6-0B/P1 + P23 -10G-> lc6-0A/P2 + P24 -10G-> lc6-0A/P1 + P25 -10G-> lc6-1A/P3 + P26 -10G-> lc6-1B/P3 + P27 -10G-> lc6-1B/P2 + P28 -10G-> lc6-1B/P1 + P29 -10G-> lc6-1A/P2 + P30 -10G-> lc6-1A/P1 + P31 -10G-> lc6-2A/P1 + P32 -10G-> lc6-2A/P2 + P33 
-10G-> lc6-2B/P1 + P34 -10G-> lc6-2B/P2 + P35 -10G-> lc6-2B/P3 + P36 -10G-> lc6-2A/P3 + +SUBSYSTEM LEAF lc6B + P1 -10G-> fc9A P13 + P2 -10G-> fc9B P10 + P3 -10G-> fc8A P13 + P4 -10G-> fc8B P10 + P5 -10G-> fc7A P13 + P6 -10G-> fc7B P10 + P7 -10G-> fc6A P13 + P8 -10G-> fc6B P10 + P9 -10G-> fc5A P13 + P10 -10G-> fc5B P10 + P11 -10G-> fc4A P13 + P12 -10G-> fc4B P10 + P13 -10G-> fc1B P10 + P14 -10G-> fc1A P13 + P15 -10G-> fc2B P10 + P16 -10G-> fc2A P13 + P17 -10G-> fc3B P10 + P18 -10G-> fc3A P13 + P19 -10G-> lc6-3A/P3 + P20 -10G-> lc6-3B/P3 + P21 -10G-> lc6-3B/P2 + P22 -10G-> lc6-3B/P1 + P23 -10G-> lc6-3A/P2 + P24 -10G-> lc6-3A/P1 + P25 -10G-> lc6-4A/P3 + P26 -10G-> lc6-4B/P3 + P27 -10G-> lc6-4B/P2 + P28 -10G-> lc6-4B/P1 + P29 -10G-> lc6-4A/P2 + P30 -10G-> lc6-4A/P1 + P31 -10G-> lc6-5A/P1 + P32 -10G-> lc6-5A/P2 + P33 -10G-> lc6-5B/P1 + P34 -10G-> lc6-5B/P2 + P35 -10G-> lc6-5B/P3 + P36 -10G-> lc6-5A/P3 + +SUBSYSTEM LEAF lc6C + P1 -10G-> fc9B P18 + P2 -10G-> fc9A P31 + P3 -10G-> fc8B P18 + P4 -10G-> fc8A P31 + P5 -10G-> fc7B P18 + P6 -10G-> fc7A P31 + P7 -10G-> fc6B P18 + P8 -10G-> fc6A P31 + P9 -10G-> fc5B P18 + P10 -10G-> fc5A P31 + P11 -10G-> fc4B P18 + P12 -10G-> fc4A P31 + P13 -10G-> fc1A P31 + P14 -10G-> fc1B P18 + P15 -10G-> fc2A P31 + P16 -10G-> fc2B P18 + P17 -10G-> fc3A P31 + P18 -10G-> fc3B P18 + P19 -10G-> lc6-6A/P3 + P20 -10G-> lc6-6B/P3 + P21 -10G-> lc6-6B/P2 + P22 -10G-> lc6-6B/P1 + P23 -10G-> lc6-6A/P2 + P24 -10G-> lc6-6A/P1 + P25 -10G-> lc6-7A/P3 + P26 -10G-> lc6-7B/P3 + P27 -10G-> lc6-7B/P2 + P28 -10G-> lc6-7B/P1 + P29 -10G-> lc6-7A/P2 + P30 -10G-> lc6-7A/P1 + P31 -10G-> lc6-8A/P1 + P32 -10G-> lc6-8A/P2 + P33 -10G-> lc6-8B/P1 + P34 -10G-> lc6-8B/P2 + P35 -10G-> lc6-8B/P3 + P36 -10G-> lc6-8A/P3 + +SUBSYSTEM LEAF lc6D + P1 -10G-> fc9A P32 + P2 -10G-> fc9B P8 + P3 -10G-> fc8A P32 + P4 -10G-> fc8B P8 + P5 -10G-> fc7A P32 + P6 -10G-> fc7B P8 + P7 -10G-> fc6A P32 + P8 -10G-> fc6B P8 + P9 -10G-> fc5A P32 + P10 -10G-> fc5B P8 + P11 -10G-> fc4A P32 + P12 -10G-> 
fc4B P8 + P13 -10G-> fc1B P8 + P14 -10G-> fc1A P32 + P15 -10G-> fc2B P8 + P16 -10G-> fc2A P32 + P17 -10G-> fc3B P8 + P18 -10G-> fc3A P32 + P19 -10G-> lc6-9A/P3 + P20 -10G-> lc6-9B/P3 + P21 -10G-> lc6-9B/P2 + P22 -10G-> lc6-9B/P1 + P23 -10G-> lc6-9A/P2 + P24 -10G-> lc6-9A/P1 + P25 -10G-> lc6-10A/P3 + P26 -10G-> lc6-10B/P3 + P27 -10G-> lc6-10B/P2 + P28 -10G-> lc6-10B/P1 + P29 -10G-> lc6-10A/P2 + P30 -10G-> lc6-10A/P1 + P31 -10G-> lc6-11A/P1 + P32 -10G-> lc6-11A/P2 + P33 -10G-> lc6-11B/P1 + P34 -10G-> lc6-11B/P2 + P35 -10G-> lc6-11B/P3 + P36 -10G-> lc6-11A/P3 + +SUBSYSTEM LEAF lc7A + P1 -10G-> fc9B P7 + P2 -10G-> fc9A P12 + P3 -10G-> fc8B P7 + P4 -10G-> fc8A P12 + P5 -10G-> fc7B P7 + P6 -10G-> fc7A P12 + P7 -10G-> fc6B P7 + P8 -10G-> fc6A P12 + P9 -10G-> fc5B P7 + P10 -10G-> fc5A P12 + P11 -10G-> fc4B P7 + P12 -10G-> fc4A P12 + P13 -10G-> fc1A P12 + P14 -10G-> fc1B P7 + P15 -10G-> fc2A P12 + P16 -10G-> fc2B P7 + P17 -10G-> fc3A P12 + P18 -10G-> fc3B P7 + P19 -10G-> lc7-0A/P3 + P20 -10G-> lc7-0B/P3 + P21 -10G-> lc7-0B/P2 + P22 -10G-> lc7-0B/P1 + P23 -10G-> lc7-0A/P2 + P24 -10G-> lc7-0A/P1 + P25 -10G-> lc7-1A/P3 + P26 -10G-> lc7-1B/P3 + P27 -10G-> lc7-1B/P2 + P28 -10G-> lc7-1B/P1 + P29 -10G-> lc7-1A/P2 + P30 -10G-> lc7-1A/P1 + P31 -10G-> lc7-2A/P1 + P32 -10G-> lc7-2A/P2 + P33 -10G-> lc7-2B/P1 + P34 -10G-> lc7-2B/P2 + P35 -10G-> lc7-2B/P3 + P36 -10G-> lc7-2A/P3 + +SUBSYSTEM LEAF lc7B + P1 -10G-> fc9A P17 + P2 -10G-> fc9B P6 + P3 -10G-> fc8A P17 + P4 -10G-> fc8B P6 + P5 -10G-> fc7A P17 + P6 -10G-> fc7B P6 + P7 -10G-> fc6A P17 + P8 -10G-> fc6B P6 + P9 -10G-> fc5A P17 + P10 -10G-> fc5B P6 + P11 -10G-> fc4A P17 + P12 -10G-> fc4B P6 + P13 -10G-> fc1B P6 + P14 -10G-> fc1A P17 + P15 -10G-> fc2B P6 + P16 -10G-> fc2A P17 + P17 -10G-> fc3B P6 + P18 -10G-> fc3A P17 + P19 -10G-> lc7-3A/P3 + P20 -10G-> lc7-3B/P3 + P21 -10G-> lc7-3B/P2 + P22 -10G-> lc7-3B/P1 + P23 -10G-> lc7-3A/P2 + P24 -10G-> lc7-3A/P1 + P25 -10G-> lc7-4A/P3 + P26 -10G-> lc7-4B/P3 + P27 -10G-> lc7-4B/P2 + P28 -10G-> 
lc7-4B/P1 + P29 -10G-> lc7-4A/P2 + P30 -10G-> lc7-4A/P1 + P31 -10G-> lc7-5A/P1 + P32 -10G-> lc7-5A/P2 + P33 -10G-> lc7-5B/P1 + P34 -10G-> lc7-5B/P2 + P35 -10G-> lc7-5B/P3 + P36 -10G-> lc7-5A/P3 + +SUBSYSTEM LEAF lc7C + P1 -10G-> fc9B P9 + P2 -10G-> fc9A P16 + P3 -10G-> fc8B P9 + P4 -10G-> fc8A P16 + P5 -10G-> fc7B P9 + P6 -10G-> fc7A P16 + P7 -10G-> fc6B P9 + P8 -10G-> fc6A P16 + P9 -10G-> fc5B P9 + P10 -10G-> fc5A P16 + P11 -10G-> fc4B P9 + P12 -10G-> fc4A P16 + P13 -10G-> fc1A P16 + P14 -10G-> fc1B P9 + P15 -10G-> fc2A P16 + P16 -10G-> fc2B P9 + P17 -10G-> fc3A P16 + P18 -10G-> fc3B P9 + P19 -10G-> lc7-6A/P3 + P20 -10G-> lc7-6B/P3 + P21 -10G-> lc7-6B/P2 + P22 -10G-> lc7-6B/P1 + P23 -10G-> lc7-6A/P2 + P24 -10G-> lc7-6A/P1 + P25 -10G-> lc7-7A/P3 + P26 -10G-> lc7-7B/P3 + P27 -10G-> lc7-7B/P2 + P28 -10G-> lc7-7B/P1 + P29 -10G-> lc7-7A/P2 + P30 -10G-> lc7-7A/P1 + P31 -10G-> lc7-8A/P1 + P32 -10G-> lc7-8A/P2 + P33 -10G-> lc7-8B/P1 + P34 -10G-> lc7-8B/P2 + P35 -10G-> lc7-8B/P3 + P36 -10G-> lc7-8A/P3 + +SUBSYSTEM LEAF lc7D + P1 -10G-> fc9A P15 + P2 -10G-> fc9B P5 + P3 -10G-> fc8A P15 + P4 -10G-> fc8B P5 + P5 -10G-> fc7A P15 + P6 -10G-> fc7B P5 + P7 -10G-> fc6A P15 + P8 -10G-> fc6B P5 + P9 -10G-> fc5A P15 + P10 -10G-> fc5B P5 + P11 -10G-> fc4A P15 + P12 -10G-> fc4B P5 + P13 -10G-> fc1B P5 + P14 -10G-> fc1A P15 + P15 -10G-> fc2B P5 + P16 -10G-> fc2A P15 + P17 -10G-> fc3B P5 + P18 -10G-> fc3A P15 + P19 -10G-> lc7-9A/P3 + P20 -10G-> lc7-9B/P3 + P21 -10G-> lc7-9B/P2 + P22 -10G-> lc7-9B/P1 + P23 -10G-> lc7-9A/P2 + P24 -10G-> lc7-9A/P1 + P25 -10G-> lc7-10A/P3 + P26 -10G-> lc7-10B/P3 + P27 -10G-> lc7-10B/P2 + P28 -10G-> lc7-10B/P1 + P29 -10G-> lc7-10A/P2 + P30 -10G-> lc7-10A/P1 + P31 -10G-> lc7-11A/P1 + P32 -10G-> lc7-11A/P2 + P33 -10G-> lc7-11B/P1 + P34 -10G-> lc7-11B/P2 + P35 -10G-> lc7-11B/P3 + P36 -10G-> lc7-11A/P3 + +SUBSYSTEM LEAF lc8A + P1 -10G-> fc9B P2 + P2 -10G-> fc9A P8 + P3 -10G-> fc8B P2 + P4 -10G-> fc8A P8 + P5 -10G-> fc7B P2 + P6 -10G-> fc7A P8 + P7 -10G-> fc6B P2 
+ P8 -10G-> fc6A P8 + P9 -10G-> fc5B P2 + P10 -10G-> fc5A P8 + P11 -10G-> fc4B P2 + P12 -10G-> fc4A P8 + P13 -10G-> fc1A P8 + P14 -10G-> fc1B P2 + P15 -10G-> fc2A P8 + P16 -10G-> fc2B P2 + P17 -10G-> fc3A P8 + P18 -10G-> fc3B P2 + P19 -10G-> lc8-0A/P3 + P20 -10G-> lc8-0B/P3 + P21 -10G-> lc8-0B/P2 + P22 -10G-> lc8-0B/P1 + P23 -10G-> lc8-0A/P2 + P24 -10G-> lc8-0A/P1 + P25 -10G-> lc8-1A/P3 + P26 -10G-> lc8-1B/P3 + P27 -10G-> lc8-1B/P2 + P28 -10G-> lc8-1B/P1 + P29 -10G-> lc8-1A/P2 + P30 -10G-> lc8-1A/P1 + P31 -10G-> lc8-2A/P1 + P32 -10G-> lc8-2A/P2 + P33 -10G-> lc8-2B/P1 + P34 -10G-> lc8-2B/P2 + P35 -10G-> lc8-2B/P3 + P36 -10G-> lc8-2A/P3 + +SUBSYSTEM LEAF lc8B + P1 -10G-> fc9A P11 + P2 -10G-> fc9B P3 + P3 -10G-> fc8A P11 + P4 -10G-> fc8B P3 + P5 -10G-> fc7A P11 + P6 -10G-> fc7B P3 + P7 -10G-> fc6A P11 + P8 -10G-> fc6B P3 + P9 -10G-> fc5A P11 + P10 -10G-> fc5B P3 + P11 -10G-> fc4A P11 + P12 -10G-> fc4B P3 + P13 -10G-> fc1B P3 + P14 -10G-> fc1A P11 + P15 -10G-> fc2B P3 + P16 -10G-> fc2A P11 + P17 -10G-> fc3B P3 + P18 -10G-> fc3A P11 + P19 -10G-> lc8-3A/P3 + P20 -10G-> lc8-3B/P3 + P21 -10G-> lc8-3B/P2 + P22 -10G-> lc8-3B/P1 + P23 -10G-> lc8-3A/P2 + P24 -10G-> lc8-3A/P1 + P25 -10G-> lc8-4A/P3 + P26 -10G-> lc8-4B/P3 + P27 -10G-> lc8-4B/P2 + P28 -10G-> lc8-4B/P1 + P29 -10G-> lc8-4A/P2 + P30 -10G-> lc8-4A/P1 + P31 -10G-> lc8-5A/P1 + P32 -10G-> lc8-5A/P2 + P33 -10G-> lc8-5B/P1 + P34 -10G-> lc8-5B/P2 + P35 -10G-> lc8-5B/P3 + P36 -10G-> lc8-5A/P3 + +SUBSYSTEM LEAF lc8C + P1 -10G-> fc9B P4 + P2 -10G-> fc9A P10 + P3 -10G-> fc8B P4 + P4 -10G-> fc8A P10 + P5 -10G-> fc7B P4 + P6 -10G-> fc7A P10 + P7 -10G-> fc6B P4 + P8 -10G-> fc6A P10 + P9 -10G-> fc5B P4 + P10 -10G-> fc5A P10 + P11 -10G-> fc4B P4 + P12 -10G-> fc4A P10 + P13 -10G-> fc1A P10 + P14 -10G-> fc1B P4 + P15 -10G-> fc2A P10 + P16 -10G-> fc2B P4 + P17 -10G-> fc3A P10 + P18 -10G-> fc3B P4 + P19 -10G-> lc8-6A/P3 + P20 -10G-> lc8-6B/P3 + P21 -10G-> lc8-6B/P2 + P22 -10G-> lc8-6B/P1 + P23 -10G-> lc8-6A/P2 + P24 -10G-> lc8-6A/P1 + 
P25 -10G-> lc8-7A/P3 + P26 -10G-> lc8-7B/P3 + P27 -10G-> lc8-7B/P2 + P28 -10G-> lc8-7B/P1 + P29 -10G-> lc8-7A/P2 + P30 -10G-> lc8-7A/P1 + P31 -10G-> lc8-8A/P1 + P32 -10G-> lc8-8A/P2 + P33 -10G-> lc8-8B/P1 + P34 -10G-> lc8-8B/P2 + P35 -10G-> lc8-8B/P3 + P36 -10G-> lc8-8A/P3 + +SUBSYSTEM LEAF lc8D + P1 -10G-> fc9A P18 + P2 -10G-> fc9B P1 + P3 -10G-> fc8A P18 + P4 -10G-> fc8B P1 + P5 -10G-> fc7A P18 + P6 -10G-> fc7B P1 + P7 -10G-> fc6A P18 + P8 -10G-> fc6B P1 + P9 -10G-> fc5A P18 + P10 -10G-> fc5B P1 + P11 -10G-> fc4A P18 + P12 -10G-> fc4B P1 + P13 -10G-> fc1B P1 + P14 -10G-> fc1A P18 + P15 -10G-> fc2B P1 + P16 -10G-> fc2A P18 + P17 -10G-> fc3B P1 + P18 -10G-> fc3A P18 + P19 -10G-> lc8-9A/P3 + P20 -10G-> lc8-9B/P3 + P21 -10G-> lc8-9B/P2 + P22 -10G-> lc8-9B/P1 + P23 -10G-> lc8-9A/P2 + P24 -10G-> lc8-9A/P1 + P25 -10G-> lc8-10A/P3 + P26 -10G-> lc8-10B/P3 + P27 -10G-> lc8-10B/P2 + P28 -10G-> lc8-10B/P1 + P29 -10G-> lc8-10A/P2 + P30 -10G-> lc8-10A/P1 + P31 -10G-> lc8-11A/P1 + P32 -10G-> lc8-11A/P2 + P33 -10G-> lc8-11B/P1 + P34 -10G-> lc8-11B/P2 + P35 -10G-> lc8-11B/P3 + P36 -10G-> lc8-11A/P3 + +SUBSYSTEM LEAF lc9A + P1 -10G-> fc9B P21 + P2 -10G-> fc9A P5 + P3 -10G-> fc8B P21 + P4 -10G-> fc8A P5 + P5 -10G-> fc7B P21 + P6 -10G-> fc7A P5 + P7 -10G-> fc6B P21 + P8 -10G-> fc6A P5 + P9 -10G-> fc5B P21 + P10 -10G-> fc5A P5 + P11 -10G-> fc4B P21 + P12 -10G-> fc4A P5 + P13 -10G-> fc1A P5 + P14 -10G-> fc1B P21 + P15 -10G-> fc2A P5 + P16 -10G-> fc2B P21 + P17 -10G-> fc3A P5 + P18 -10G-> fc3B P21 + P19 -10G-> lc9-0A/P3 + P20 -10G-> lc9-0B/P3 + P21 -10G-> lc9-0B/P2 + P22 -10G-> lc9-0B/P1 + P23 -10G-> lc9-0A/P2 + P24 -10G-> lc9-0A/P1 + P25 -10G-> lc9-1A/P3 + P26 -10G-> lc9-1B/P3 + P27 -10G-> lc9-1B/P2 + P28 -10G-> lc9-1B/P1 + P29 -10G-> lc9-1A/P2 + P30 -10G-> lc9-1A/P1 + P31 -10G-> lc9-2A/P1 + P32 -10G-> lc9-2A/P2 + P33 -10G-> lc9-2B/P1 + P34 -10G-> lc9-2B/P2 + P35 -10G-> lc9-2B/P3 + P36 -10G-> lc9-2A/P3 + +SUBSYSTEM LEAF lc9B + P1 -10G-> fc9A P7 + P2 -10G-> fc9B P20 + P3 -10G-> fc8A P7 
+ P4 -10G-> fc8B P20 + P5 -10G-> fc7A P7 + P6 -10G-> fc7B P20 + P7 -10G-> fc6A P7 + P8 -10G-> fc6B P20 + P9 -10G-> fc5A P7 + P10 -10G-> fc5B P20 + P11 -10G-> fc4A P7 + P12 -10G-> fc4B P20 + P13 -10G-> fc1B P20 + P14 -10G-> fc1A P7 + P15 -10G-> fc2B P20 + P16 -10G-> fc2A P7 + P17 -10G-> fc3B P20 + P18 -10G-> fc3A P7 + P19 -10G-> lc9-3A/P3 + P20 -10G-> lc9-3B/P3 + P21 -10G-> lc9-3B/P2 + P22 -10G-> lc9-3B/P1 + P23 -10G-> lc9-3A/P2 + P24 -10G-> lc9-3A/P1 + P25 -10G-> lc9-4A/P3 + P26 -10G-> lc9-4B/P3 + P27 -10G-> lc9-4B/P2 + P28 -10G-> lc9-4B/P1 + P29 -10G-> lc9-4A/P2 + P30 -10G-> lc9-4A/P1 + P31 -10G-> lc9-5A/P1 + P32 -10G-> lc9-5A/P2 + P33 -10G-> lc9-5B/P1 + P34 -10G-> lc9-5B/P2 + P35 -10G-> lc9-5B/P3 + P36 -10G-> lc9-5A/P3 + +SUBSYSTEM LEAF lc9C + P1 -10G-> fc9B P19 + P2 -10G-> fc9A P6 + P3 -10G-> fc8B P19 + P4 -10G-> fc8A P6 + P5 -10G-> fc7B P19 + P6 -10G-> fc7A P6 + P7 -10G-> fc6B P19 + P8 -10G-> fc6A P6 + P9 -10G-> fc5B P19 + P10 -10G-> fc5A P6 + P11 -10G-> fc4B P19 + P12 -10G-> fc4A P6 + P13 -10G-> fc1A P6 + P14 -10G-> fc1B P19 + P15 -10G-> fc2A P6 + P16 -10G-> fc2B P19 + P17 -10G-> fc3A P6 + P18 -10G-> fc3B P19 + P19 -10G-> lc9-6A/P3 + P20 -10G-> lc9-6B/P3 + P21 -10G-> lc9-6B/P2 + P22 -10G-> lc9-6B/P1 + P23 -10G-> lc9-6A/P2 + P24 -10G-> lc9-6A/P1 + P25 -10G-> lc9-7A/P3 + P26 -10G-> lc9-7B/P3 + P27 -10G-> lc9-7B/P2 + P28 -10G-> lc9-7B/P1 + P29 -10G-> lc9-7A/P2 + P30 -10G-> lc9-7A/P1 + P31 -10G-> lc9-8A/P1 + P32 -10G-> lc9-8A/P2 + P33 -10G-> lc9-8B/P1 + P34 -10G-> lc9-8B/P2 + P35 -10G-> lc9-8B/P3 + P36 -10G-> lc9-8A/P3 + +SUBSYSTEM LEAF lc9D + P1 -10G-> fc9A P9 + P2 -10G-> fc9B P22 + P3 -10G-> fc8A P9 + P4 -10G-> fc8B P22 + P5 -10G-> fc7A P9 + P6 -10G-> fc7B P22 + P7 -10G-> fc6A P9 + P8 -10G-> fc6B P22 + P9 -10G-> fc5A P9 + P10 -10G-> fc5B P22 + P11 -10G-> fc4A P9 + P12 -10G-> fc4B P22 + P13 -10G-> fc1B P22 + P14 -10G-> fc1A P9 + P15 -10G-> fc2B P22 + P16 -10G-> fc2A P9 + P17 -10G-> fc3B P22 + P18 -10G-> fc3A P9 + P19 -10G-> lc9-9A/P3 + P20 -10G-> lc9-9B/P3 + P21 
-10G-> lc9-9B/P2 + P22 -10G-> lc9-9B/P1 + P23 -10G-> lc9-9A/P2 + P24 -10G-> lc9-9A/P1 + P25 -10G-> lc9-10A/P3 + P26 -10G-> lc9-10B/P3 + P27 -10G-> lc9-10B/P2 + P28 -10G-> lc9-10B/P1 + P29 -10G-> lc9-10A/P2 + P30 -10G-> lc9-10A/P1 + P31 -10G-> lc9-11A/P1 + P32 -10G-> lc9-11A/P2 + P33 -10G-> lc9-11B/P1 + P34 -10G-> lc9-11B/P2 + P35 -10G-> lc9-11B/P3 + P36 -10G-> lc9-11A/P3 diff --git a/ibdm/ibnl/SUNDCS72QDR.ibnl b/ibdm/ibnl/SUNDCS72QDR.ibnl new file mode 100644 index 0000000..1907ec3 --- /dev/null +++ b/ibdm/ibnl/SUNDCS72QDR.ibnl @@ -0,0 +1,311 @@ +SYSTEM LEAF,LEAF:4x,LEAF:4X + +NODE SW 36 MT48436 U1 +1 -10G-> P1 +2 -10G-> P2 +3 -10G-> P3 +4 -10G-> P4 +5 -10G-> P5 +6 -10G-> P6 +7 -10G-> P7 +8 -10G-> P8 +9 -10G-> P9 +10 -10G-> P10 +11 -10G-> P11 +12 -10G-> P12 +13 -10G-> P13 +14 -10G-> P14 +15 -10G-> P15 +16 -10G-> P16 +17 -10G-> P17 +18 -10G-> P18 +19 -10G-> P19 +20 -10G-> P20 +21 -10G-> P21 +22 -10G-> P22 +23 -10G-> P23 +24 -10G-> P24 +25 -10G-> P25 +26 -10G-> P26 +27 -10G-> P27 +28 -10G-> P28 +29 -10G-> P29 +30 -10G-> P30 +31 -10G-> P31 +32 -10G-> P32 +33 -10G-> P33 +34 -10G-> P34 +35 -10G-> P35 +36 -10G-> P36 + +SYSTEM SPINE,SPINE:4x,SPINE:4X + +NODE SW 36 MT48436 U1 +1 -10G-> P1 +2 -10G-> P2 +3 -10G-> P3 +4 -10G-> P4 +5 -10G-> P5 +6 -10G-> P6 +7 -10G-> P7 +8 -10G-> P8 +9 -10G-> P9 +10 -10G-> P10 +11 -10G-> P11 +12 -10G-> P12 +13 -10G-> P13 +14 -10G-> P14 +15 -10G-> P15 +16 -10G-> P16 +17 -10G-> P17 +18 -10G-> P18 +19 -10G-> P19 +20 -10G-> P20 +21 -10G-> P21 +22 -10G-> P22 +23 -10G-> P23 +24 -10G-> P24 +25 -10G-> P25 +26 -10G-> P26 +27 -10G-> P27 +28 -10G-> P28 +29 -10G-> P29 +30 -10G-> P30 +31 -10G-> P31 +32 -10G-> P32 +33 -10G-> P33 +34 -10G-> P34 +35 -10G-> P35 +36 -10G-> P36 + +TOPSYSTEM SUNDCS72QDR,NM2-72P + +SUBSYSTEM SPINE SW-F + P1 -10G-> SW-C P9 + P2 -10G-> SW-A P8 + P4 -10G-> SW-C P6 + P3 -10G-> SW-A P7 + P5 -10G-> SW-A P5 + P6 -10G-> SW-C P4 + P7 -10G-> SW-A P3 + P8 -10G-> SW-A P2 + P9 -10G-> SW-C P1 + P10 -10G-> SW-B P13 + P11 -10G-> SW-D P14 + P12 
-10G-> SW-D P15 + P13 -10G-> SW-B P10 + P14 -10G-> SW-D P11 + P15 -10G-> SW-D P12 + P16 -10G-> SW-B P18 + P17 -10G-> SW-D P17 + P18 -10G-> SW-B P16 + P19 -10G-> SW-A P10 + P20 -10G-> SW-C P11 + P21 -10G-> SW-C P12 + P22 -10G-> SW-A P18 + P23 -10G-> SW-C P17 + P24 -10G-> SW-A P16 + P25 -10G-> SW-C P15 + P26 -10G-> SW-C P14 + P27 -10G-> SW-A P13 + P28 -10G-> SW-D P1 + P29 -10G-> SW-B P2 + P30 -10G-> SW-B P3 + P31 -10G-> SW-D P9 + P32 -10G-> SW-B P8 + P33 -10G-> SW-B P7 + P34 -10G-> SW-D P6 + P35 -10G-> SW-B P5 + P36 -10G-> SW-D P4 + + +SUBSYSTEM SPINE SW-E + P1 -10G-> SW-A P9 + P2 -10G-> SW-C P8 + P3 -10G-> SW-C P7 + P4 -10G-> SW-A P6 + P5 -10G-> SW-C P5 + P6 -10G-> SW-A P4 + P7 -10G-> SW-C P3 + P8 -10G-> SW-C P2 + P9 -10G-> SW-A P1 + P10 -10G-> SW-D P13 + P11 -10G-> SW-B P14 + P12 -10G-> SW-B P15 + P13 -10G-> SW-D P10 + P14 -10G-> SW-B P11 + P15 -10G-> SW-B P12 + P16 -10G-> SW-D P18 + P17 -10G-> SW-B P17 + P18 -10G-> SW-D P16 + P19 -10G-> SW-C P10 + P20 -10G-> SW-A P11 + P21 -10G-> SW-A P12 + P22 -10G-> SW-C P18 + P23 -10G-> SW-A P17 + P24 -10G-> SW-C P16 + P25 -10G-> SW-A P15 + P26 -10G-> SW-A P14 + P27 -10G-> SW-C P13 + P28 -10G-> SW-B P1 + P29 -10G-> SW-D P2 + P30 -10G-> SW-D P3 + P31 -10G-> SW-B P9 + P32 -10G-> SW-D P8 + P33 -10G-> SW-D P7 + P34 -10G-> SW-B P6 + P35 -10G-> SW-D P5 + P36 -10G-> SW-B P4 + +SUBSYSTEM LEAF SW-D + P1 -10G-> SW-F P28 + P2 -10G-> SW-E P29 + P3 -10G-> SW-E P30 + P4 -10G-> SW-F P36 + P5 -10G-> SW-E P35 + P6 -10G-> SW-F P34 + P7 -10G-> SW-E P33 + P8 -10G-> SW-E P32 + P9 -10G-> SW-F P31 + P10 -10G-> SW-E P13 + P11 -10G-> SW-F P14 + P12 -10G-> SW-F P15 + P13 -10G-> SW-E P10 + P14 -10G-> SW-F P11 + P15 -10G-> SW-F P12 + P16 -10G-> SW-E P18 + P17 -10G-> SW-F P17 + P18 -10G-> SW-E P16 + P19 -10G-> C-9A/P3 + P20 -10G-> C-9B/P3 + P21 -10G-> C-9B/P2 + P22 -10G-> C-9B/P1 + P23 -10G-> C-9A/P2 + P24 -10G-> C-9A/P1 + P25 -10G-> C-10A/P3 + P26 -10G-> C-10B/P3 + P27 -10G-> C-10B/P2 + P28 -10G-> C-10B/P1 + P29 -10G-> C-10A/P2 + P30 -10G-> C-10A/P1 + P31 
-10G-> C-11A/P1 + P32 -10G-> C-11A/P2 + P33 -10G-> C-11B/P1 + P34 -10G-> C-11B/P2 + P35 -10G-> C-11B/P3 + P36 -10G-> C-11A/P3 + +SUBSYSTEM LEAF SW-C + P1 -10G-> SW-F P9 + P2 -10G-> SW-E P8 + P3 -10G-> SW-E P7 + P4 -10G-> SW-F P6 + P5 -10G-> SW-E P5 + P6 -10G-> SW-F P4 + P7 -10G-> SW-E P3 + P8 -10G-> SW-E P2 + P9 -10G-> SW-F P1 + P10 -10G-> SW-E P19 + P11 -10G-> SW-F P20 + P12 -10G-> SW-F P21 + P13 -10G-> SW-E P27 + P14 -10G-> SW-F P26 + P15 -10G-> SW-F P25 + P16 -10G-> SW-E P24 + P17 -10G-> SW-F P23 + P18 -10G-> SW-E P22 + P19 -10G-> C-6A/P3 + P20 -10G-> C-6B/P3 + P21 -10G-> C-6B/P2 + P22 -10G-> C-6B/P1 + P23 -10G-> C-6A/P2 + P24 -10G-> C-6A/P1 + P25 -10G-> C-7A/P3 + P26 -10G-> C-7B/P3 + P27 -10G-> C-7B/P2 + P28 -10G-> C-7B/P1 + P29 -10G-> C-7A/P2 + P30 -10G-> C-7A/P1 + P31 -10G-> C-8A/P1 + P32 -10G-> C-8A/P2 + P33 -10G-> C-8B/P1 + P34 -10G-> C-8B/P2 + P35 -10G-> C-8B/P3 + P36 -10G-> C-8A/P3 + +SUBSYSTEM LEAF SW-B + P1 -10G-> SW-E P28 + P2 -10G-> SW-F P29 + P3 -10G-> SW-F P30 + P4 -10G-> SW-E P36 + P5 -10G-> SW-F P35 + P6 -10G-> SW-E P34 + P7 -10G-> SW-F P33 + P8 -10G-> SW-F P32 + P9 -10G-> SW-E P31 + P10 -10G-> SW-F P13 + P11 -10G-> SW-E P14 + P12 -10G-> SW-E P15 + P13 -10G-> SW-F P10 + P14 -10G-> SW-E P11 + P15 -10G-> SW-E P12 + P16 -10G-> SW-F P18 + P17 -10G-> SW-E P17 + P18 -10G-> SW-F P16 + P19 -10G-> C-3A/P3 + P20 -10G-> C-3B/P3 + P21 -10G-> C-3B/P2 + P22 -10G-> C-3B/P1 + P23 -10G-> C-3A/P2 + P24 -10G-> C-3A/P1 + P25 -10G-> C-4A/P3 + P26 -10G-> C-4B/P3 + P27 -10G-> C-4B/P2 + P28 -10G-> C-4B/P1 + P29 -10G-> C-4A/P2 + P30 -10G-> C-4A/P1 + P31 -10G-> C-5A/P1 + P32 -10G-> C-5A/P2 + P33 -10G-> C-5B/P1 + P34 -10G-> C-5B/P2 + P35 -10G-> C-5B/P3 + P36 -10G-> C-5A/P3 + +SUBSYSTEM LEAF SW-A + P1 -10G-> SW-E P9 + P2 -10G-> SW-F P8 + P3 -10G-> SW-F P7 + P4 -10G-> SW-E P6 + P5 -10G-> SW-F P5 + P6 -10G-> SW-E P4 + P7 -10G-> SW-F P3 + P8 -10G-> SW-F P2 + P9 -10G-> SW-E P1 + P10 -10G-> SW-F P19 + P11 -10G-> SW-E P20 + P12 -10G-> SW-E P21 + P13 -10G-> SW-F P27 + P14 -10G-> 
SW-E P26 + P15 -10G-> SW-E P25 + P16 -10G-> SW-F P24 + P17 -10G-> SW-E P23 + P18 -10G-> SW-F P22 + P19 -10G-> C-0A/P3 + P20 -10G-> C-0B/P3 + P21 -10G-> C-0B/P2 + P22 -10G-> C-0B/P1 + P23 -10G-> C-0A/P2 + P24 -10G-> C-0A/P1 + P25 -10G-> C-1A/P3 + P26 -10G-> C-1B/P3 + P27 -10G-> C-1B/P2 + P28 -10G-> C-1B/P1 + P29 -10G-> C-1A/P2 + P30 -10G-> C-1A/P1 + P31 -10G-> C-2A/P1 + P32 -10G-> C-2A/P2 + P33 -10G-> C-2B/P1 + P34 -10G-> C-2B/P2 + P35 -10G-> C-2B/P3 + P36 -10G-> C-2A/P3 + From vlad at lists.openfabrics.org Fri Aug 28 03:06:15 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 28 Aug 2009 03:06:15 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090828-0200 daily build status Message-ID: <20090828100615.54101E61E5D@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: 
/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2 make[1]: 
*** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090828-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From FENKES at de.ibm.com Fri Aug 28 05:58:49 2009 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Fri, 28 Aug 2009 14:58:49 +0200 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD In-Reply-To: References: <200908261337.56128.fenkes@de.ibm.com> Message-ID: Hal Rosenstock wrote on 27.08.2009 15:31:40: > I don't think it should be hard coded. IMO it would be better to default to 18 > and somehow able to be adjusted (via a (dynamic) module parameter ?). 
I don't see how making this a parameter would benefit any end user, while on the other hand it clutters up our parameter list. Changing RespTimeValue won't influence the IB performance or user-visible behavior of our driver in any way, and in fact, all RespTimeValue says is "Please use a timeout of one second for all future MADs you send me", only there won't be any more MADs in the future because we just redirected the client to someone else. So, the RespTimeValue field is a don't care in the redirection scenario. Setting it to an arbitrary but legal value isn't much more than a concession towards any broken clients that may be out there. Given that you seem to like the rest of the code and Jason hasn't spoken up yet, I think we can have Roland merge this patch. Roland, what do you think? Regards, Joachim From hnrose at comcast.net Fri Aug 28 06:44:53 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 28 Aug 2009 09:44:53 -0400 Subject: [ofa-general] [PATCH] opensm/osm_helper.c: Add SM priority changed into trap 144 description Message-ID: <20090828134452.GA20014@comcast.net> Per MgtWG RefID #4503 Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index 3692474..1b83a9e 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -531,7 +531,7 @@ const char *ib_get_trap_str(ib_net16_t trap_num) return "Flow Control Update watchdog timer expired"; case 144: return - "CapabilityMask, NodeDescription, Link [Width|Speed] Enabled changed"; + "CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, SM priority changed"; case 145: return "System Image GUID changed"; case 256: From halves at linux.vnet.ibm.com Fri Aug 28 08:37:30 2009 From: halves at linux.vnet.ibm.com (Higor Aparecido Vieira Alves) Date: Fri, 28 Aug 2009 12:37:30 -0300 Subject: [ofa-general] OFED 1.5-alpha 4 and RHEL 5.3 GA Message-ID: <1251473851.10055.3.camel@halves-ltc> Hi Guys, I tried to build OFED1.5 on RHEL 5.3 GA and
got an error building ofa_kernel. Build log attached. Regards, -- Higor Aparecido Vieira Alves Software Engineer Linux Technology Center IBM Systems & Technology Group -------------- next part -------------- A non-text attachment was scrubbed... Name: ofa_kernel.rpmbuild.log Type: text/x-log Size: 772137 bytes Desc: not available URL: From hal.rosenstock at gmail.com Fri Aug 28 09:03:47 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 28 Aug 2009 12:03:47 -0400 Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: simplify fwd tables setup flow In-Reply-To: <20090828080756.GH28379@me> References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me> <20090828080756.GH28379@me> Message-ID: On 8/28/09, Sasha Khapyorsky wrote: > > > Simplify (and unify) forwarding tables setup decision flow. Seems to work for all engines, but I got a failure for a test case where lash fell back to min hop: lash_core: ERR 4D02: Lane requirements (9) exceed available lanes (8) with starting lane (0) ucast_mgr_route: lash: cannot build fwd tables. osm_ucast_mgr_process: minhop tables configured on all switches ERR 331D: LFT of switch 0xguid is not up to date. Prior to this change, the LFTs were pushed for this fallback case (and no ERR 331D occurred).
-- Hal > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_ucast_mgr.c | 7 +------ > 1 files changed, 1 insertions(+), 6 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index 629f628..8ba78f8 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -463,8 +463,6 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * > p_map_item, > } > } > > - set_fwd_tbl_top(p_mgr, p_sw); > - > if (p_mgr->p_subn->opt.lmc) > free_ports_priv(p_mgr); > > @@ -977,8 +975,6 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * > p_mgr) > cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, > ucast_mgr_process_tbl, > p_mgr); > > - ucast_mgr_pipeline_fwd_tbl(p_mgr); > - > cl_qlist_remove_all(&p_mgr->port_order_list); > > return 0; > @@ -1025,8 +1021,7 @@ static int ucast_mgr_route(struct osm_routing_engine > *r, osm_opensm_t * osm) > > osm->routing_engine_used = osm_routing_engine_type(r->name); > > - if (r->ucast_build_fwd_tables) > - osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr); > + osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr); > > return 0; > } > -- > 1.6.4 > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Fri Aug 28 09:27:35 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Aug 2009 09:27:35 -0700 Subject: [ofa-general] QDR IB cards supports card back to back connectivity In-Reply-To: (lakshmana swamy's message of "Fri, 28 Aug 2009 13:25:54 +0530") References: Message-ID: > I would like to know whether QDR InfiniBand cards support back-to-back > connectivity, i.e. without an IB switch, to enable IB > communication between the two machines.
Yes, any IB port should be able to connect to any other IB port. You do need a subnet manager (SM) on every IB fabric, so in your case of two HCAs connected back-to-back, an SM must be running on one of the HCA ports. From rdreier at cisco.com Fri Aug 28 09:28:29 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Aug 2009 09:28:29 -0700 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD In-Reply-To: (Joachim Fenkes's message of "Fri, 28 Aug 2009 14:58:49 +0200") References: <200908261337.56128.fenkes@de.ibm.com> Message-ID: > Given that you seem to like the rest of the code and Jason hasn't spoken > up yet, I think we can have Roland merge this patch. Roland, what do you > think? I don't see any problem with the idea and this does sound like a step forward, so I am planning on merging this (pending review). From rdreier at cisco.com Fri Aug 28 10:30:36 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Aug 2009 10:30:36 -0700 Subject: [ofa-general] Re: Opinions on moving Linux InfiniBand/RDMA mailing list to vger? In-Reply-To: <20090820.160800.50693597.davem@davemloft.net> (David Miller's message of "Thu, 20 Aug 2009 16:08:00 -0700 (PDT)") References: <20090820.160800.50693597.davem@davemloft.net> Message-ID: It seems we only had positive responses to moving from general@ to a new linux-rdma at vger.kernel.org list, so I'll work on a transition plan. For now, please continue to use general at lists.openfabrics.org. However, you may want to subscribe to the vger list to be ready for the transition; for information on that, see http://vger.kernel.org.
- Roland From rdreier at cisco.com Fri Aug 28 10:55:16 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Aug 2009 10:55:16 -0700 Subject: [ofa-general] Re: [PATCH V2] mlx4: Do not allow ib userspace open while device is being removed In-Reply-To: <200908111021.01612.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 11 Aug 2009 10:21:01 +0300") References: <200908111021.01612.jackm@dev.mellanox.co.il> Message-ID: checkpatch output: WARNING: suspect code indent for conditional statements (8, 12) #88: FILE: drivers/infiniband/hw/mlx4/main.c:345: + if (!dev->ib_active) + return ERR_PTR(-EAGAIN); ERROR: code indent should use tabs where possible #107: FILE: drivers/infiniband/hw/mlx4/main.c:737: + ^I^Iibdev->ib_active = 0;$ total: 1 errors, 1 warnings, 31 lines checked not great for a patch this small. Please clean up. From rdreier at cisco.com Fri Aug 28 10:58:43 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Aug 2009 10:58:43 -0700 Subject: [ofa-general] Re: [PATCH] mthca: Do not allow ib userspace open following device internal error In-Reply-To: <200908121215.46221.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Wed, 12 Aug 2009 12:15:46 +0300") References: <200908121215.46221.jackm@dev.mellanox.co.il> Message-ID: thanks, applied (and thanks for the detailed changelog, that really makes things easier) From hal.rosenstock at gmail.com Fri Aug 28 11:34:36 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 28 Aug 2009 14:34:36 -0400 Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr: better lft setup In-Reply-To: <20090828081002.GI28379@me> References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me> <20090828080756.GH28379@me> <20090828081002.GI28379@me> Message-ID: On 8/28/09, Sasha Khapyorsky wrote: > > > The function set_next_lft_block() is called in loop with block number > incremented, inside it loops by itself in looking for changed block, > caller will call this function with 
original block number incremented > so this internal loop could be repeated again and again. This patch > cleans this ineffectiveness. > > Also rename it to set_lft_block() since block number is treated as > parameters and *not* next block is processed and merges some code. > > Signed-off-by: Sasha Khapyorsky Acked-by: Hal Rosenstock hal.rosenstock at gmail.com > --- > opensm/include/opensm/osm_ucast_mgr.h | 1 + > opensm/opensm/osm_ucast_mgr.c | 126 > +++++++++++---------------------- > 2 files changed, 43 insertions(+), 84 deletions(-) > > diff --git a/opensm/include/opensm/osm_ucast_mgr.h > b/opensm/include/opensm/osm_ucast_mgr.h > index 4ef045c..78a88f0 100644 > --- a/opensm/include/opensm/osm_ucast_mgr.h > +++ b/opensm/include/opensm/osm_ucast_mgr.h > @@ -95,6 +95,7 @@ typedef struct osm_ucast_mgr { > osm_subn_t *p_subn; > osm_log_t *p_log; > cl_plock_t *p_lock; > + uint16_t max_lid; > cl_qlist_t port_order_list; > boolean_t is_dor; > boolean_t some_hop_count_set; > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index 8ba78f8..a111c10 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -336,6 +336,9 @@ static int set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr, > IN osm_switch_t * p_sw) > > CL_ASSERT(p_node); > > + if (p_mgr->max_lid < p_sw->max_lid_ho) > + p_mgr->max_lid = p_sw->max_lid_ho; > + > p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_node, > 0)); > > /* > @@ -478,65 +481,13 @@ static void ucast_mgr_process_top(IN cl_map_item_t * > p_map_item, > set_fwd_tbl_top(p_mgr, p_sw); > } > > -static boolean_t set_next_lft_block(IN osm_switch_t * p_sw, IN osm_sm_t * > p_sm, > - IN uint8_t * p_block, > - IN osm_dr_path_t * p_path, > - IN uint16_t block_id_ho, > - IN osm_madw_context_t * p_context) > -{ > - ib_api_status_t status; > - boolean_t sts; > - > - OSM_LOG_ENTER(p_sm->p_log); > - > - for (; > - (sts = osm_switch_get_lft_block(p_sw, block_id_ho, p_block)); > - block_id_ho++) { > - if 
(!p_sw->need_update && !p_sm->p_subn->need_update && > - !memcmp(p_block, > - p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, > - IB_SMP_DATA_SIZE)) > - continue; > - > - OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, > - "Writing FT block %u to switch 0x%" PRIx64 "\n", > - block_id_ho, > - cl_ntoh64(p_context->lft_context.node_guid)); > - > - status = osm_req_set(p_sm, p_path, > - p_sw->new_lft + > - block_id_ho * IB_SMP_DATA_SIZE, > - IB_SMP_DATA_SIZE, > IB_MAD_ATTR_LIN_FWD_TBL, > - cl_hton32(block_id_ho), > - CL_DISP_MSGID_NONE, p_context); > - > - if (status != IB_SUCCESS) > - OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 3A05: " > - "Sending linear fwd. tbl. block failed > (%s)\n", > - ib_get_err_str(status)); > - break; > - } > - > - OSM_LOG_EXIT(p_sm->p_log); > - return sts; > -} > - > -static boolean_t pipeline_next_lft_block(IN osm_switch_t *p_sw, > - IN osm_ucast_mgr_t *p_mgr, > - IN uint16_t block_id_ho) > +static int set_lft_block(IN osm_switch_t *p_sw, IN osm_ucast_mgr_t *p_mgr, > + IN uint16_t block_id_ho) > { > - osm_dr_path_t *p_path; > - osm_madw_context_t context; > uint8_t block[IB_SMP_DATA_SIZE]; > - boolean_t status; > - > - OSM_LOG_ENTER(p_mgr->p_log); > - > - CL_ASSERT(p_sw && p_sw->p_node); > - > - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > - "Processing switch 0x%" PRIx64 "\n", > - cl_ntoh64(osm_node_get_node_guid(p_sw->p_node))); > + osm_madw_context_t context; > + osm_dr_path_t *p_path; > + ib_api_status_t status; > > /* > Send linear forwarding table blocks to the switch > @@ -547,8 +498,7 @@ static boolean_t pipeline_next_lft_block(IN > osm_switch_t *p_sw, > /* any routing should provide the new_lft */ > CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && > p_mgr->cache_valid && !p_sw->need_update); > - status = FALSE; > - goto Exit; > + return -1; > } > > p_path = > osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0)); > @@ -556,12 +506,29 @@ static boolean_t pipeline_next_lft_block(IN > osm_switch_t *p_sw, > context.lft_context.node_guid = > 
osm_node_get_node_guid(p_sw->p_node); > context.lft_context.set_method = TRUE; > > - status = set_next_lft_block(p_sw, p_mgr->sm, &block[0], p_path, > - block_id_ho, &context); > + if (!osm_switch_get_lft_block(p_sw, block_id_ho, block) || > + (!p_sw->need_update && !p_mgr->p_subn->need_update && > + !memcmp(block, p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, > + IB_SMP_DATA_SIZE))) > + return 0; > > -Exit: > - OSM_LOG_EXIT(p_mgr->p_log); > - return status; > + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > + "Writing FT block %u to switch 0x%" PRIx64 "\n", > block_id_ho, > + cl_ntoh64(context.lft_context.node_guid)); > + > + status = osm_req_set(p_mgr->sm, p_path, > + p_sw->new_lft + block_id_ho * > IB_SMP_DATA_SIZE, > + IB_SMP_DATA_SIZE, IB_MAD_ATTR_LIN_FWD_TBL, > + cl_hton32(block_id_ho), > + CL_DISP_MSGID_NONE, &context); > + if (status != IB_SUCCESS) { > + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: " > + "Sending linear fwd. tbl. block failed (%s)\n", > + ib_get_err_str(status)); > + return -1; > + } > + > + return 0; > } > > /********************************************************************** > @@ -919,26 +886,15 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t > * m) > > static void ucast_mgr_pipeline_fwd_tbl(osm_ucast_mgr_t * p_mgr) > { > - cl_qmap_t *p_sw_tbl; > - osm_switch_t *p_sw; > - uint16_t block_id_ho = 0; > - int sws_notdone; > - boolean_t sts; > - > - p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl; > - while (1) { > - p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); > - sws_notdone = 0; > - while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { > - sts = pipeline_next_lft_block(p_sw, p_mgr, > block_id_ho); > - if (sts) > - sws_notdone++; > - p_sw = (osm_switch_t *) > cl_qmap_next(&p_sw->map_item); > - } > - if (!sws_notdone) > - break; > - block_id_ho++; > - } > + cl_qmap_t *tbl; > + cl_map_item_t *item; > + unsigned i, max_block = p_mgr->max_lid / 64 + 1; > + > + tbl = &p_mgr->p_subn->sw_guid_tbl; > + for (i = 0; i < max_block; i++) > + for 
(item = cl_qmap_head(tbl); item != cl_qmap_end(tbl); > + item = cl_qmap_next(item)) > + set_lft_block((osm_switch_t *)item, p_mgr, i); > } > > static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr) > @@ -984,6 +940,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * > p_mgr) > **********************************************************************/ > void osm_ucast_mgr_set_fwd_table(osm_ucast_mgr_t * p_mgr) > { > + p_mgr->max_lid = 0; > + > cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, > ucast_mgr_process_top, p_mgr); > > -- > 1.6.4 > From jgunthorpe at obsidianresearch.com Fri Aug 28 12:02:51 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Fri, 28 Aug 2009 13:02:51 -0600 Subject: [ofa-general] [PATCH] Remove duplicated umad_get_mad.3 from Makefile.am Message-ID: <20090828190251.GA8633@obsidianresearch.com> Fixes builds on FC11.
Signed-off-by: Jason Gunthorpe --- libibumad/Makefile.am | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am index 50222df..27c6ff2 100644 --- a/libibumad/Makefile.am +++ b/libibumad/Makefile.am @@ -9,7 +9,7 @@ man_MANS = man/umad_debug.3 man/umad_get_ca.3 \ man/umad_open_port.3 man/umad_close_port.3 man/umad_size.3 \ man/umad_status.3 man/umad_alloc.3 man/umad_free.3 \ man/umad_dump.3 man/umad_addr_dump.3 man/umad_get_fd.3 \ - man/umad_get_mad.3 man/umad_get_mad_addr.3 \ + man/umad_get_mad_addr.3 \ man/umad_set_grh_net.3 man/umad_set_grh.3 \ man/umad_set_addr_net.3 man/umad_set_addr.3 man/umad_set_pkey.3 \ man/umad_get_pkey.3 \ -- 1.6.0.4 From jaschut at sandia.gov Fri Aug 28 12:08:14 2009 From: jaschut at sandia.gov (Jim Schutt) Date: Fri, 28 Aug 2009 13:08:14 -0600 Subject: [ofa-general] [PATCH 0/2] opensm: release references to persistent routing engine private data Message-ID: <1251486496-24812-1-git-send-email-jaschut@sandia.gov> Hi, LASH uses osm_switch_t:priv to reference private data that persists between calls to the routing engine. The first patch fixes a use-after-free bug that occurs due to this reference, when a switch is removed from a fabric that LASH is routing. The second patch applies the same methodology to osm_port_t:priv. Even though no routing engine currently uses it to hold references to persistent private data, it seems appropriate to make the priv member for osm_switch_t and osm_port_t have the same behavior. -- Jim From jaschut at sandia.gov Fri Aug 28 12:08:15 2009 From: jaschut at sandia.gov (Jim Schutt) Date: Fri, 28 Aug 2009 13:08:15 -0600 Subject: [ofa-general] [PATCH 1/2] opensm: avoid LASH use-after-free when switch is deleted from fabric. 
In-Reply-To: <1251486496-24812-1-git-send-email-jaschut@sandia.gov> References: <1251486496-24812-1-git-send-email-jaschut@sandia.gov> Message-ID: <1251486496-24812-2-git-send-email-jaschut@sandia.gov> When LASH is run against ibsim, valgrind reports the following (on x86_64) after a switch is removed from the fabric: ==15699== Invalid write of size 8 ==15699== at 0x45FD8A: switch_delete (osm_ucast_lash.c:648) ==15699== by 0x461483: lash_cleanup (osm_ucast_lash.c:1123) ==15699== by 0x461848: lash_process (osm_ucast_lash.c:1230) ==15699== by 0x45C043: ucast_mgr_route (osm_ucast_mgr.c:1016) ==15699== by 0x45C1A0: osm_ucast_mgr_process (osm_ucast_mgr.c:1057) ==15699== by 0x44F11B: do_sweep (osm_state_mgr.c:1283) ==15699== by 0x44F539: osm_state_mgr_process (osm_state_mgr.c:1398) ==15699== by 0x447296: sm_process (osm_sm.c:90) ==15699== by 0x4473FE: sm_sweeper (osm_sm.c:130) ==15699== by 0x5023505: __cl_thread_wrapper (cl_thread.c:57) ==15699== by 0x37AC006366: start_thread (in /lib64/libpthread-2.5.so) ==15699== by 0x37AB4D30AC: clone (in /lib64/libc-2.5.so) ==15699== Address 0x9B28198 is 152 bytes inside a block of size 160 free'd ==15699== at 0x4A0541E: free (vg_replace_malloc.c:233) ==15699== by 0x453866: osm_switch_delete (osm_switch.c:97) ==15699== by 0x4116AA: drop_mgr_remove_switch (osm_drop_mgr.c:290) ==15699== by 0x411820: drop_mgr_process_node (osm_drop_mgr.c:339) ==15699== by 0x411D0C: osm_drop_mgr_process (osm_drop_mgr.c:465) ==15699== by 0x44EF97: do_sweep (osm_state_mgr.c:1231) ==15699== by 0x44F539: osm_state_mgr_process (osm_state_mgr.c:1398) ==15699== by 0x447296: sm_process (osm_sm.c:90) ==15699== by 0x4473FE: sm_sweeper (osm_sm.c:130) ==15699== by 0x5023505: __cl_thread_wrapper (cl_thread.c:57) ==15699== by 0x37AC006366: start_thread (in /lib64/libpthread-2.5.so) ==15699== by 0x37AB4D30AC: clone (in /lib64/libc-2.5.so) The root cause is that in order to perform SL lookup for path record queries, LASH needs to keep persistent data between calls to 
the routing engine. LASH uses the osm_switch_t:priv member to speed lookup of the LASH switch_t objects it needs to perform SL lookup, and has a corresponding switch_t:p_sw member to point to the corresponding osm_switch_t object. When a switch is deleted from the fabric, the switch_t:p_sw value becomes invalid, but LASH's switch_delete() uses it to clear the corresponding osm_switch_t:priv value. Solve this problem by adding a priv_release function pointer that is set when osm_switch_t:priv is set. This allows the opensm core to clean up after any routing engine that is using priv to access persistent data (LASH seems to be the only one so far), without knowing the details of how to do so. When multiple routing engines are configured, it also allows a routing engine using osm_switch_t:priv to clean up if some other routing engine using priv fails in an unexpected way. With this addition, the rules for using osm_switch_t:priv become: 1) Never assign to priv without also assigning to priv_release. 2) Always use priv_release() before assigning to priv; this prevents memory issues due to unexpected errors in a routing engine using priv. 3) Always use priv_release() to clean up after a use of priv. Since updn uses osm_switch_t:priv, fix it up to follow the above rules as well, for consistency. 
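The three rules above can be illustrated in isolation. The following is a minimal sketch, not code from this patch: struct node is a hypothetical stand-in for osm_switch_t (only the two fields under discussion are modelled), and engine_attach(), engine_priv_release() and node_delete() are names invented for the example. It assumes the engine owns a heap-allocated priv, as updn does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for osm_switch_t: only priv and priv_release. */
struct node {
	void *priv;
	void (*priv_release)(struct node *n);
};

/* Release callback for an engine whose priv is a plain heap block. */
static void engine_priv_release(struct node *n)
{
	assert(n != NULL);
	free(n->priv);
	n->priv = NULL;
	n->priv_release = NULL;
}

/*
 * Rules 1 and 2: release whatever a previous user left behind, then
 * assign priv and priv_release together.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int engine_attach(struct node *n, size_t size)
{
	void *data;

	assert(n != NULL);
	if (n->priv_release)
		n->priv_release(n);

	data = calloc(1, size);
	if (!data)
		return -1;

	n->priv = data;
	n->priv_release = engine_priv_release;
	return 0;
}

/* Rule 3, as the owner of the object would apply it on deletion. */
static void node_delete(struct node *n)
{
	assert(n != NULL);
	if (n->priv_release)
		n->priv_release(n);
	free(n);
}
```

The point of the callback is that node_delete() needs no knowledge of which engine set priv or how it must be torn down; it only checks whether a release function is registered.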
Signed-off-by: Jim Schutt --- opensm/include/opensm/osm_switch.h | 1 + opensm/opensm/osm_switch.c | 2 ++ opensm/opensm/osm_ucast_lash.c | 24 ++++++++++++++++++++---- opensm/opensm/osm_ucast_updn.c | 15 +++++++++++---- 4 files changed, 34 insertions(+), 8 deletions(-) diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 7ce28c5..d48f8c6 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -106,6 +106,7 @@ typedef struct osm_switch { unsigned endport_links; unsigned need_update; void *priv; + void (*priv_release)(struct osm_switch *p_sw); } osm_switch_t; /* * FIELDS diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index ce1ca63..fbf3973 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -94,6 +94,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw) free(p_sw->hops[i]); free(p_sw->hops); } + if (p_sw->priv_release) + p_sw->priv_release(p_sw); free(*pp_sw); *pp_sw = NULL; } diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 0a567b3..ceae7d8 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -603,6 +603,17 @@ static int balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed) return 0; } +static void lash_switch_priv_release(osm_switch_t *osm_sw) +{ + switch_t *sw = osm_sw->priv; + + osm_sw->priv_release = NULL; + osm_sw->priv = NULL; + + if (sw && sw->p_sw == osm_sw) + sw->p_sw = NULL; +} + static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw) { unsigned num_switches = p_lash->num_switches; @@ -628,8 +639,12 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw } sw->p_sw = p_sw; - if (p_sw) + if (p_sw) { + if (p_sw->priv_release) + p_sw->priv_release(p_sw); p_sw->priv = sw; + p_sw->priv_release = lash_switch_priv_release; + } if (osm_mesh_node_create(p_lash, sw)) { free(sw->dij_channels); @@ -644,8 +659,8 @@ 
static void switch_delete(lash_t *p_lash, switch_t * sw) { if (sw->dij_channels) free(sw->dij_channels); - if (sw->p_sw) - sw->p_sw->priv = NULL; + if (sw->p_sw && sw->p_sw->priv_release) + sw->p_sw->priv_release(sw->p_sw); free(sw); } @@ -1113,7 +1128,8 @@ static void lash_cleanup(lash_t * p_lash) while (p_next_sw != (osm_switch_t *) cl_qmap_end(&p_subn->sw_guid_tbl)) { p_sw = p_next_sw; p_next_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); - p_sw->priv = NULL; + if (p_sw->priv_release) + p_sw->priv_release(p_sw); } if (p_lash->switches) { diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c index bb9ccda..dc5f459 100644 --- a/opensm/opensm/osm_ucast_updn.c +++ b/opensm/opensm/osm_ucast_updn.c @@ -404,10 +404,13 @@ static struct updn_node *create_updn_node(osm_switch_t * sw) return u; } -static void delete_updn_node(struct updn_node *u) +static void updn_sw_priv_release(osm_switch_t *sw) { - u->sw->priv = NULL; - free(u); + if (sw->priv) + free(sw->priv); + + sw->priv_release = NULL; + sw->priv = NULL; } /********************************************************************** @@ -589,6 +592,8 @@ static int updn_lid_matrices(void *ctx) item != cl_qmap_end(&p_updn->p_osm->subn.sw_guid_tbl); item = cl_qmap_next(item)) { p_sw = (osm_switch_t *)item; + if (p_sw->priv_release) + p_sw->priv_release(p_sw); p_sw->priv = create_updn_node(p_sw); if (!p_sw->priv) { OSM_LOG(&(p_updn->p_osm->log), OSM_LOG_ERROR, "ERR AA0C: " @@ -596,6 +601,7 @@ static int updn_lid_matrices(void *ctx) OSM_LOG_EXIT(&p_updn->p_osm->log); return -1; } + p_sw->priv_release = updn_sw_priv_release; } /* First setup root nodes */ @@ -653,7 +659,8 @@ static int updn_lid_matrices(void *ctx) item != cl_qmap_end(&p_updn->p_osm->subn.sw_guid_tbl); item = cl_qmap_next(item)) { p_sw = (osm_switch_t *) item; - delete_updn_node(p_sw->priv); + if (p_sw->priv_release) + p_sw->priv_release(p_sw); } OSM_LOG_EXIT(&p_updn->p_osm->log); -- 1.5.6.GIT From jaschut at sandia.gov Fri Aug 28 
12:08:16 2009 From: jaschut at sandia.gov (Jim Schutt) Date: Fri, 28 Aug 2009 13:08:16 -0600 Subject: [ofa-general] [PATCH 2/2] opensm: Add priv_release() function pointer member to osm_port_t. In-Reply-To: <1251486496-24812-1-git-send-email-jaschut@sandia.gov> References: <1251486496-24812-1-git-send-email-jaschut@sandia.gov> Message-ID: <1251486496-24812-3-git-send-email-jaschut@sandia.gov> Although no routing engine currently uses osm_port_t:priv to reference routing engine data that is persistent between calls to the engine, one may be added in the future. Since this type of bug was just fixed for osm_switch_t:priv, fix up osm_port_t to use the same mechanism. Signed-off-by: Jim Schutt --- opensm/include/opensm/osm_port.h | 1 + opensm/opensm/osm_port.c | 2 ++ opensm/opensm/osm_ucast_mgr.c | 19 ++++++++++++++----- 3 files changed, 17 insertions(+), 5 deletions(-) diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h index 7079e74..21379b2 100644 --- a/opensm/include/opensm/osm_port.h +++ b/opensm/include/opensm/osm_port.h @@ -1162,6 +1162,7 @@ typedef struct osm_port { cl_qlist_t mcm_list; int flag; void *priv; + void (*priv_release)(struct osm_port *p_pt); } osm_port_t; /* * FIELDS diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c index 751c0f0..519d8bd 100644 --- a/opensm/opensm/osm_port.c +++ b/opensm/opensm/osm_port.c @@ -132,6 +132,8 @@ void osm_physp_init(IN osm_physp_t * p_physp, IN const ib_net64_t port_guid, **********************************************************************/ void osm_port_delete(IN OUT osm_port_t ** pp_port) { + if ((*pp_port)->priv_release) + (*pp_port)->priv_release(*pp_port); /* cleanup all mcm recs attached */ osm_port_remove_all_mgrp(*pp_port); free(*pp_port); diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 629f628..1bf367d 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -385,6 +385,15 @@ static int set_fwd_tbl_top(IN 
osm_ucast_mgr_t * p_mgr, IN osm_switch_t * p_sw) /********************************************************************** **********************************************************************/ +static void minhop_port_priv_release(osm_port_t *port) +{ + if (port->priv) + free(port->priv); + + port->priv_release = NULL; + port->priv = NULL; +} + static void alloc_ports_priv(osm_ucast_mgr_t * mgr) { cl_qmap_t *port_tbl = &mgr->p_subn->port_guid_tbl; @@ -396,6 +405,8 @@ static void alloc_ports_priv(osm_ucast_mgr_t * mgr) for (item = cl_qmap_head(port_tbl); item != cl_qmap_end(port_tbl); item = cl_qmap_next(item)) { port = (osm_port_t *) item; + if (port->priv_release) + port->priv_release(port); lmc = ib_port_info_get_lmc(&port->p_physp->port_info); if (!lmc) continue; @@ -404,11 +415,11 @@ static void alloc_ports_priv(osm_ucast_mgr_t * mgr) OSM_LOG(mgr->p_log, OSM_LOG_ERROR, "ERR 3A09: " "cannot allocate memory to track remote" " systems for lmc > 0\n"); - port->priv = NULL; continue; } memset(r, 0, sizeof(*r) + sizeof(r->guids[0]) * (1 << lmc)); port->priv = r; + port->priv_release = minhop_port_priv_release; } } @@ -420,10 +431,8 @@ static void free_ports_priv(osm_ucast_mgr_t * mgr) for (item = cl_qmap_head(port_tbl); item != cl_qmap_end(port_tbl); item = cl_qmap_next(item)) { port = (osm_port_t *) item; - if (port->priv) { - free(port->priv); - port->priv = NULL; - } + if (port->priv_release) + port->priv_release(port); } } -- 1.5.6.GIT From vlad at lists.openfabrics.org Sat Aug 29 03:11:53 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 29 Aug 2009 03:11:53 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090829-0200 daily build status Message-ID: <20090829101154.174E6E2820F@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.19 Passed on i686 with 
linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory 
`/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** 
[/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090829-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hnrose at comcast.net Sat Aug 29 08:30:10 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sat, 29 Aug 2009 11:30:10 -0400 Subject: [ofa-general] [PATCH] opensm: Reduce heap consumption by unicast routing tables (LFTs) Message-ID: <20090829153010.GA4272@comcast.net> Heap memory consumption by the unicast and multicast routing tables can be reduced. Using valgrind --tool=massif (for heap profiling), there are a couple of places that consume most of the heap memory: ->38.75% (11,206,656B) 0x43267E: osm_switch_new (osm_switch.c:134) ->12.89% (3,728,256B) 0x40F8C9: osm_mcast_tbl_init (osm_mcast_tbl.c:96) osm_switch_new (osm_switch.c:108): p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1); From ib_types.h: #define IB_LID_UCAST_END_HO 0xBFFF The LFT can be allocated in smaller chunks. If there is a LID that exceeds the current LFT size, the LFT is reallocated with an increased size. This reduces performance and increases memory fragmentation, so this tradeoff is made optional based on new build and config options (see below). Using a 4K chunk as the minimal LFT block reduces the memory used by the LFTs by a factor of 12. For a larger (than 4K) fabric, 4K is added each time the existing LFT size is insufficient. So it looks like for a cluster of 2-4K with an LMC of 0 about 40% (!!!) of the heap memory can be saved: - 39% used by LFTs, each with 48K entries - SM can allocate 4K entries instead.
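As a rough check of the factor of 12 above: IB_LID_UCAST_END_HO + 1 is 0xC000 = 49152 entries, and 49152 / 4096 = 12. The grow-on-demand idea can be sketched as below. This is a standalone illustration under stated assumptions, not the patch code: struct lft, lft_size_for(), lft_set(), LFT_BLOCK_SIZE, LFT_FULL_SIZE and NO_PATH are names made up for the example, and NO_PATH is assumed to mirror OSM_NO_PATH.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LFT_BLOCK_SIZE 64	/* entries (bytes) per LFT block */
#define LFT_FULL_SIZE 0xC000	/* IB_LID_UCAST_END_HO + 1 = 49152 entries */
#define NO_PATH 0xFF		/* assumed stand-in for OSM_NO_PATH */

/* Hypothetical table; not the osm_switch_t layout used by the patch. */
struct lft {
	uint8_t *tbl;
	size_t size;		/* entries currently allocated */
};

/* Smallest multiple of the chunk size that can hold entry 'lid'. */
static size_t lft_size_for(uint16_t lid, unsigned chunks)
{
	size_t chunk = (size_t)chunks * LFT_BLOCK_SIZE;

	return ((size_t)lid / chunk + 1) * chunk;
}

/*
 * Set one entry, growing the table in whole chunks when needed.
 * The table never shrinks. Returns 0 on success, -1 on a LID outside
 * the unicast range or on allocation failure.
 */
static int lft_set(struct lft *l, uint16_t lid, uint8_t port, unsigned chunks)
{
	assert(l != NULL && chunks > 0);

	if (lid >= LFT_FULL_SIZE)
		return -1;

	if (lid >= l->size) {
		size_t size = lft_size_for(lid, chunks);
		uint8_t *tbl = realloc(l->tbl, size);

		if (!tbl)
			return -1;
		/* Newly added entries start out unrouted. */
		memset(tbl + l->size, NO_PATH, size - l->size);
		l->tbl = tbl;
		l->size = size;
	}
	l->tbl[lid] = port;
	return 0;
}
```

With chunks = 64 (4096 entries), a fabric whose highest LID is below 4096 never triggers a second realloc; each LID beyond the current size costs one reallocation and, possibly, a copy of the existing table, which is the performance tradeoff described above.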
There is a new build option to specify whether to include the FT heap optimization code or not. It defaults to off, which does not include the new code (basically leaving just the code that exists today). A new config option specifies whether to optimize FT allocation and defaults to off. Another new config option specifies the LFT allocation chunk and defaults to 4K. These chunks will be used as the initial minimum allocation and increased in increments of the chunk using realloc. LFTs are only increased in size and are never reduced. If a realloc for an LFT fails, it results in an exit. A similar subsequent change will do this for MFTs. Signed-off-by: Hal Rosenstock --- diff --git a/opensm/config/osmvsel.m4 b/opensm/config/osmvsel.m4 index c24930b..1c7c8a2 100644 --- a/opensm/config/osmvsel.m4 +++ b/opensm/config/osmvsel.m4 @@ -232,6 +232,25 @@ fi # --- END OPENIB_OSM_PERF_MGR_SEL --- ]) dnl OPENIB_OSM_PERF_MGR_SEL +dnl Check if they want the FT heap optimization +AC_DEFUN([OPENIB_OSM_FT_OPTIMIZE_HEAP_SEL], [ +# --- BEGIN OPENIB_OSM_FT_OPTIMIZE_HEAP_SEL --- + +dnl enable the FT heap optimization +AC_ARG_ENABLE(ft-heap-optimize, +[ --enable-ft-heap-optimize Enable FT heap optimization (default no)], + [case $enableval in + yes) ft_heap_optimize=yes ;; + no) ft_heap_optimize=no ;; + esac], + ft_heap_optimize=no) +if test $ft_heap_optimize = yes; then + AC_DEFINE(ENABLE_OSM_FT_HEAP_OPTIMIZATION, + 1, + [Define as 1 if you want to enable the FT heap optimization]) +fi +# --- END OPENIB_OSM_FT_OPTIMIZE_HEAP_SEL --- +]) dnl OPENIB_OSM_FT_OPTIMIZE_HEAP_SEL dnl Check if they want the event plugin AC_DEFUN([OPENIB_OSM_DEFAULT_EVENT_PLUGIN_SEL], [ diff --git a/opensm/configure.in b/opensm/configure.in index 8a6b4c0..9b5ec00 100644 --- a/opensm/configure.in +++ b/opensm/configure.in @@ -87,6 +87,9 @@ OPENIB_OSM_CONSOLE_SOCKET_SEL dnl select performance manager or not OPENIB_OSM_PERF_MGR_SEL +dnl select FT heap optimization or not +OPENIB_OSM_FT_OPTIMIZE_HEAP_SEL + dnl resolve
config dir. conf_dir_tmp1="`eval echo ${sysconfdir} | sed 's/^NONE/$ac_default_prefix/'`" SYS_CONFIG_DIR="`eval echo $conf_dir_tmp1`" diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index 0537002..89b125c 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. * @@ -449,6 +449,18 @@ BEGIN_C_DECLS */ #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4 /***********/ +/****d* OpenSM: Base/OSM_DEFAULT_LFT_CHUNKS +* NAME +* OSM_DEFAULT_LFT_CHUNKS +* +* DESCRIPTION +* Specifies the default number of 64 entry (byte) chunks in LFT +* related memory (re)allocation. Default is 64 (4K bytes). +* +* SYNOPSIS +*/ +#define OSM_DEFAULT_LFT_CHUNKS 64 +/***********/ /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE * NAME * OSM_SM_DEFAULT_QP0_RCV_SIZE diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 6c20de8..be90ce4 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved. 
@@ -218,6 +218,10 @@ typedef struct osm_subn_opt { uint32_t perfmgr_max_outstanding_queries; char *event_db_dump_file; #endif /* ENABLE_OSM_PERF_MGR */ +#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION + boolean_t ft_heap_optimization; + uint32_t lft_chunks; +#endif /* ENABLE_OSM_FT_HEAP_OPTIMIZATION */ char *event_plugin_name; char *node_name_map_name; char *prefix_routes_file; @@ -437,6 +441,12 @@ typedef struct osm_subn_opt { * perfmgr_sweep_time_s * Define the period (in seconds) of PerfMgr sweeps * +* ft_heap_optimization +* Enable or disable forwarding table (FT) heap optimization +* +* lft_chunks +* Number of 64 entry (byte) chunks used in LFT (re)allocation +* * event_db_dump_file * File to dump the event database to * diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 7ce28c5..2c60fb6 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -102,6 +102,8 @@ typedef struct osm_switch { osm_port_profile_t *p_prof; uint8_t *lft; uint8_t *new_lft; + uint16_t lft_size; + uint16_t new_lft_size; osm_mcast_tbl_t mcast_tbl; unsigned endport_links; unsigned need_update; @@ -219,7 +221,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw); * SYNOPSIS */ osm_switch_t *osm_switch_new(IN osm_node_t * const p_node, - IN const osm_madw_t * const p_madw); + IN const osm_madw_t * const p_madw, + IN osm_subn_t * const p_subn); /* * PARAMETERS * p_node @@ -227,7 +230,10 @@ osm_switch_t *osm_switch_new(IN osm_node_t * const p_node, * * p_madw * [in] Pointer to the MAD Wrapper containing the switch's -* SwitchInfo attribute. 
+* SwitchInfo attribute +* +* p_subn +* [in] Pointer to the subnet object * * RETURN VALUES * Pointer to the new initialized switch object. @@ -408,7 +414,7 @@ static inline uint8_t osm_switch_get_port_by_lid(IN const osm_switch_t * const p_sw, IN const uint16_t lid_ho) { - if (lid_ho == 0 || lid_ho > IB_LID_UCAST_END_HO) + if (lid_ho == 0 || lid_ho >= p_sw->lft_size) return OSM_NO_PATH; return p_sw->lft[lid_ho]; } @@ -575,6 +581,44 @@ osm_switch_get_max_block_id_in_use(IN const osm_switch_t * const p_sw) * Switch object *********/ +#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION +/****f* OpenSM: Switch/osm_switch_set_new_lft_entry +* NAME +* osm_switch_set_new_lft_entry +* +* DESCRIPTION +* Set a LID entry in the switch's new_lft. +* +* SYNOPSIS +* +*/ +boolean_t +osm_switch_set_new_lft_entry(IN osm_switch_t * const p_sw, + IN uint16_t lid, IN uint8_t port, + IN const osm_subn_t * const p_subn); +/* +* PARAMETERS +* p_sw +* [in] Pointer to an osm_switch_t object. +* +* lid +* [in] LID. +* +* port +* [in] port number. +* +* p_subn +* [in] Pointer to an osm_subn_t object. +* +* RETURN VALUES +* TRUE if success and FALSE if failure. +* +* NOTES +* +* SEE ALSO +*********/ +#endif + /****f* OpenSM: Switch/osm_switch_get_lft_block * NAME * osm_switch_get_lft_block @@ -586,6 +630,7 @@ osm_switch_get_max_block_id_in_use(IN const osm_switch_t * const p_sw) */ boolean_t osm_switch_get_lft_block(IN const osm_switch_t * const p_sw, + IN const osm_subn_t * const p_subn, IN const uint16_t block_id, OUT uint8_t * const p_block); /* @@ -593,6 +638,9 @@ osm_switch_get_lft_block(IN const osm_switch_t * const p_sw, * p_sw * [in] Pointer to an osm_switch_t object. * +* p_subn +* [in] Pointer to an osm_subn_t object. +* * block_ID * [in] The block_id to retrieve. 
* @@ -714,16 +762,40 @@ osm_switch_count_path(IN osm_switch_t * const p_sw, IN const uint8_t port) static inline ib_api_status_t osm_switch_set_lft_block(IN osm_switch_t * const p_sw, IN const uint8_t * const p_block, - IN const uint32_t block_num) + IN const uint32_t block_num, + IN osm_subn_t * const p_subn) { uint16_t lid_start = (uint16_t) (block_num * IB_SMP_DATA_SIZE); +#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION + uint8_t *lft; + size_t size; +#endif + CL_ASSERT(p_sw); if (lid_start + IB_SMP_DATA_SIZE > IB_LID_UCAST_END_HO) return IB_INVALID_PARAMETER; +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION memcpy(&p_sw->lft[lid_start], p_block, IB_SMP_DATA_SIZE); +#else + if (!p_subn->opt.ft_heap_optimization) + memcpy(&p_sw->lft[lid_start], p_block, IB_SMP_DATA_SIZE); + else { + if (lid_start + IB_SMP_DATA_SIZE > p_sw->lft_size) { + size = (lid_start + (1 + p_subn->opt.lft_chunks) * IB_SMP_DATA_SIZE - 1) / IB_SMP_DATA_SIZE * IB_SMP_DATA_SIZE; + lft = realloc(p_sw->lft, size); + if (!lft) + return IB_INSUFFICIENT_MEMORY; + memset(lft + p_sw->lft_size, OSM_NO_PATH, + size - p_sw->lft_size); + p_sw->lft = lft; + p_sw->lft_size = size; + } + memcpy(&p_sw->lft[lid_start], p_block, IB_SMP_DATA_SIZE); + } +#endif return IB_SUCCESS; } /* @@ -735,7 +807,10 @@ osm_switch_set_lft_block(IN osm_switch_t * const p_sw, * [in] Pointer to the forwarding table block. * * block_num -* [in] Block number for this block +* [in] Block number for this block. +* +* p_subn +* [in] Pointer to the subnet object. * * RETURN VALUE * None. 
diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in index c541804..e1fb073 100644 --- a/opensm/opensm.spec.in +++ b/opensm/opensm.spec.in @@ -21,6 +21,13 @@ %define _disable_event_plugin --disable-default-event-plugin %endif +%if %{?_with_ft_heap_optimize:1}%{!?_with_ft_heap_optimize:0} +%define _enable_ft_heap_optimize --enable-ft-heap-optimize +%endif +%if %{?_without_ft_heap_optimize:1}%{!?_without_ft_heap_optimize:0} +%define _disable_ft_heap_optimize --disable-ft-heap-optimize +%endif + %if %{?_with_node_name_map:1}%{!?_with_node_name_map:0} %define _enable_node_name_map --with-node-name-map%{?_with_node_name_map} %endif @@ -83,6 +90,8 @@ Static version of the opensm libraries %{?_disable_console_socket} \ %{?_enable_perf_mgr} \ %{?_disable_perf_mgr} \ + %{?_enable_ft_heap_optimize} \ + %{?_disable_ft_heap_optimize} \ %{?_enable_event_plugin} \ %{?_disable_event_plugin} \ %{?_enable_node_name_map} diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c index ae40b0d..6f05bd7 100644 --- a/opensm/opensm/osm_lin_fwd_rcv.c +++ b/opensm/opensm/osm_lin_fwd_rcv.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -87,7 +87,8 @@ void osm_lft_rcv_process(IN void *context, IN void *data) "LFT received for nonexistent node " "0x%" PRIx64 "\n", cl_ntoh64(node_guid)); } else { - status = osm_switch_set_lft_block(p_sw, p_block, block_num); + status = osm_switch_set_lft_block(p_sw, p_block, block_num, + sm->p_subn); if (status != IB_SUCCESS) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0402: " "Setting forwarding table block failed (%s)" diff --git a/opensm/opensm/osm_sa_lft_record.c b/opensm/opensm/osm_sa_lft_record.c index d092129..b84bf6c 100644 --- a/opensm/opensm/osm_sa_lft_record.c +++ b/opensm/opensm/osm_sa_lft_record.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -99,7 +99,7 @@ static ib_api_status_t lftr_rcv_new_lftr(IN osm_sa_t * sa, p_rec_item->rec.block_num = cl_hton16(block); /* copy the lft block */ - osm_switch_get_lft_block(p_sw, block, p_rec_item->rec.lft); + osm_switch_get_lft_block(p_sw, sa->p_subn, block, p_rec_item->rec.lft); cl_qlist_insert_tail(p_list, &p_rec_item->list_item); diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 185c700..1423c11 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2009 HNR Consulting. All rights reserved. 
* @@ -1011,7 +1011,8 @@ static void cleanup_switch(cl_map_item_t * item, void *log) if (!sw->new_lft) return; - if (memcmp(sw->lft, sw->new_lft, IB_LID_UCAST_END_HO + 1)) + if (sw->new_lft_size != sw->lft_size || + memcmp(sw->lft, sw->new_lft, sw->lft_size)) osm_log(log, OSM_LOG_ERROR, "ERR 331D: " "LFT of switch 0x%016" PRIx64 " is not up to date.\n", cl_ntoh64(sw->p_node->node_info.node_guid)); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 8d63a75..5189229 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved. @@ -359,6 +359,10 @@ static const opt_rec_t opt_tbl[] = { { "perfmgr_max_outstanding_queries", OPT_OFFSET(perfmgr_max_outstanding_queries), opts_parse_uint32, NULL, 0 }, { "event_db_dump_file", OPT_OFFSET(event_db_dump_file), opts_parse_charp, NULL, 0 }, #endif /* ENABLE_OSM_PERF_MGR */ +#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION + { "ft_heap_optimization", OPT_OFFSET(ft_heap_optimization), opts_parse_boolean, NULL, 0 }, + { "lft_chunks", OPT_OFFSET(lft_chunks), opts_parse_uint32, NULL, 1 }, +#endif /* ENABLE_OSM_FT_HEAP_OPTIMIZATION */ { "event_plugin_name", OPT_OFFSET(event_plugin_name), opts_parse_charp, NULL, 0 }, { "node_name_map_name", OPT_OFFSET(node_name_map_name), opts_parse_charp, NULL, 0 }, { "qos_max_vls", OPT_OFFSET(qos_options.max_vls), opts_parse_uint32, NULL, 1 }, @@ -723,6 +727,10 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) OSM_PERFMGR_DEFAULT_MAX_OUTSTANDING_QUERIES; p_opt->event_db_dump_file = NULL; /* use default */ #endif /* ENABLE_OSM_PERF_MGR */ 
+#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION + p_opt->ft_heap_optimization = FALSE; + p_opt->lft_chunks = OSM_DEFAULT_LFT_CHUNKS; +#endif /* ENABLE_OSM_FT_HEAP_OPTIMIZATION */ p_opt->event_plugin_name = NULL; p_opt->node_name_map_name = NULL; @@ -1141,6 +1149,18 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts) } #endif +#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION + if (p_opts->ft_heap_optimization) { + if (p_opts->lft_chunks < 1 || p_opts->lft_chunks > 768) { + log_report(" Invalid Cached Option Value:" + "lft_chunks = %u" + " Using Default:%u\n", + p_opts->lft_chunks, OSM_DEFAULT_LFT_CHUNKS); + p_opts->lft_chunks = OSM_DEFAULT_LFT_CHUNKS; + } + } +#endif + return 0; } @@ -1465,6 +1485,21 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) "# SA database file name\nsa_db_file %s\n\n", p_opts->sa_db_file ? p_opts->sa_db_file : null_str); +#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION + fprintf(out, + "# Forwarding table (LFT) heap optimization\n" + "ft_heap_optimization %s\n\n", + p_opts->ft_heap_optimization ? "TRUE" : "FALSE"); + + fprintf(out, + "# Number of 64 entry (byte) chunks used when (re)allocating " + "LFTs\n" + "# Values go from 1 (highest granularity) to 768 " + "(allocate all the LFT in a single chunk)\n" + "lft_chunks %u\n\n", + p_opts->lft_chunks); +#endif /* ENABLE_OSM_FT_HEAP_OPTIMIZATION */ + fprintf(out, "#\n# HANDOVER - MULTIPLE SMs OPTIONS\n#\n" "# SM priority used for deciding who is the master\n" diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c index c335263..9861525 100644 --- a/opensm/opensm/osm_sw_info_rcv.c +++ b/opensm/opensm/osm_sw_info_rcv.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -211,7 +211,7 @@ static void si_rcv_process_new(IN osm_sm_t * sm, IN osm_node_t * p_node, osm_dump_switch_info(sm->p_log, p_si, OSM_LOG_DEBUG); - p_sw = osm_switch_new(p_node, p_madw); + p_sw = osm_switch_new(p_node, p_madw, sm->p_subn); if (p_sw == NULL) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 3608: " "Unable to allocate new switch object\n"); diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index ce1ca63..0d725e8 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2009 HNR Consulting. All rights reserved. * @@ -51,6 +51,11 @@ #include #include +#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION +static uint8_t no_path_block[IB_SMP_DATA_SIZE] = + {[0 ... 
IB_SMP_DATA_SIZE-1] = OSM_NO_PATH }; +#endif + /********************************************************************** **********************************************************************/ cl_status_t @@ -101,7 +106,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw) /********************************************************************** **********************************************************************/ osm_switch_t *osm_switch_new(IN osm_node_t * const p_node, - IN const osm_madw_t * const p_madw) + IN const osm_madw_t * const p_madw, + IN osm_subn_t * const p_subn) { osm_switch_t *p_sw; ib_switch_info_t *p_si; @@ -132,11 +138,22 @@ osm_switch_t *osm_switch_new(IN osm_node_t * const p_node, p_sw->num_ports = num_ports; p_sw->need_update = 2; - p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1); +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION + p_sw->lft_size = IB_LID_UCAST_END_HO + 1; +#else + if (!p_subn->opt.ft_heap_optimization) + p_sw->lft_size = IB_LID_UCAST_END_HO + 1; + else + p_sw->lft_size = p_subn->opt.lft_chunks * IB_SMP_DATA_SIZE; +#endif + + p_sw->lft = malloc(p_sw->lft_size); if (!p_sw->lft) goto err; - memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + memset(p_sw->lft, OSM_NO_PATH, p_sw->lft_size); + + p_sw->new_lft_size = p_sw->lft_size; p_sw->p_prof = malloc(sizeof(*p_sw->p_prof) * num_ports); if (!p_sw->p_prof) @@ -158,10 +175,43 @@ err: return NULL; } +#ifdef ENABLE_OSM_FT_HEAP_OPTIMIZATION +/********************************************************************** + **********************************************************************/ +boolean_t +osm_switch_set_new_lft_entry(IN osm_switch_t * const p_sw, + IN uint16_t lid, IN uint8_t port, + IN const osm_subn_t * const p_subn) +{ + size_t size; + uint8_t *new_lft; + + if (!p_subn->opt.ft_heap_optimization) + p_sw->new_lft[lid] = port; + else { + if (lid >= p_sw->new_lft_size) { + size = (lid + p_subn->opt.lft_chunks * IB_SMP_DATA_SIZE - 1) / IB_SMP_DATA_SIZE * IB_SMP_DATA_SIZE; + 
if (size == p_sw->new_lft_size) + size += p_subn->opt.lft_chunks * IB_SMP_DATA_SIZE; + new_lft = realloc(p_sw->new_lft, size); + if (!new_lft) + return FALSE; + memset(new_lft + p_sw->new_lft_size, OSM_NO_PATH, + size - p_sw->new_lft_size); + p_sw->new_lft = new_lft; + p_sw->new_lft_size = size; + } + p_sw->new_lft[lid] = port; + } + return TRUE; +} +#endif + /********************************************************************** **********************************************************************/ boolean_t osm_switch_get_lft_block(IN const osm_switch_t * const p_sw, + IN const osm_subn_t * const p_subn, IN const uint16_t block_id, OUT uint8_t * const p_block) { @@ -174,7 +224,19 @@ osm_switch_get_lft_block(IN const osm_switch_t * const p_sw, return FALSE; CL_ASSERT(base_lid_ho + IB_SMP_DATA_SIZE <= IB_LID_UCAST_END_HO); +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION memcpy(p_block, &(p_sw->lft[base_lid_ho]), IB_SMP_DATA_SIZE); +#else + if (!p_subn->opt.ft_heap_optimization) + memcpy(p_block, &(p_sw->lft[base_lid_ho]), IB_SMP_DATA_SIZE); + else { + if (base_lid_ho + IB_SMP_DATA_SIZE > p_sw->lft_size) + memcpy(p_block, &no_path_block[0], IB_SMP_DATA_SIZE); + else + memcpy(p_block, &(p_sw->lft[base_lid_ho]), + IB_SMP_DATA_SIZE); + } +#endif return TRUE; } @@ -517,10 +579,10 @@ osm_switch_prepare_path_rebuild(IN osm_switch_t * p_sw, IN uint16_t max_lids) osm_switch_clear_hops(p_sw); if (!p_sw->new_lft && - !(p_sw->new_lft = malloc(IB_LID_UCAST_END_HO + 1))) + !(p_sw->new_lft = malloc(p_sw->new_lft_size))) return IB_INSUFFICIENT_MEMORY; - memset(p_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + memset(p_sw->new_lft, OSM_NO_PATH, p_sw->new_lft_size); if (!p_sw->hops) { hops = malloc((max_lids + 1) * sizeof(hops[0])); diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c index 30a3c1d..73f1ce0 100644 --- a/opensm/opensm/osm_ucast_cache.c +++ b/opensm/opensm/osm_ucast_cache.c @@ -73,6 +73,7 @@ typedef struct cache_switch { uint16_t num_hops; 
uint8_t **hops; uint8_t *lft; + uint16_t lft_size; uint8_t num_ports; cache_port_t ports[0]; } cache_switch_t; @@ -349,6 +350,7 @@ cache_restore_ucast_info(osm_ucast_mgr_t * p_mgr, if (p_sw->new_lft) free(p_sw->new_lft); p_sw->new_lft = p_cache_sw->lft; + p_sw->new_lft_size = p_cache_sw->lft_size; p_cache_sw->lft = NULL; p_sw->num_hops = p_cache_sw->num_hops; @@ -1023,10 +1025,12 @@ void osm_ucast_cache_add_node(osm_ucast_mgr_t * p_mgr, osm_node_t * p_node) /* LFT buffer exists - we use it, because it is more updated than the switch's LFT */ p_cache_sw->lft = p_node->sw->new_lft; + p_cache_sw->lft_size = p_node->sw->new_lft_size; p_node->sw->new_lft = NULL; } else { /* no LFT buffer, so we use the switch's LFT */ p_cache_sw->lft = p_node->sw->lft; + p_cache_sw->lft_size = p_node->sw->lft_size; p_node->sw->lft = NULL; } p_cache_sw->max_lid_ho = p_node->sw->max_lid_ho; @@ -1079,10 +1083,11 @@ int osm_ucast_cache_process(osm_ucast_mgr_t * p_mgr) /* no new routing was recently calculated for this switch, but the LFT needs to be updated anyway */ p_sw->new_lft = p_sw->lft; - p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1); + p_sw->new_lft_size = p_sw->lft_size; + p_sw->lft = malloc(p_sw->lft_size); if (!p_sw->lft) return IB_INSUFFICIENT_MEMORY; - memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + memset(p_sw->lft, OSM_NO_PATH, p_sw->lft_size); } } diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c index 5b73ca5..136e0de 100644 --- a/opensm/opensm/osm_ucast_file.c +++ b/opensm/opensm/osm_ucast_file.c @@ -92,7 +92,18 @@ static void add_path(osm_opensm_t * p_osm, new_lid); } +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION p_sw->new_lft[new_lid] = port_num; +#else + if (!osm_switch_set_new_lft_entry(p_sw, new_lid, port_num, + &p_osm->subn)) { + OSM_LOG(&p_osm->log, OSM_LOG_SYS, + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + OSM_LOG(&p_osm->log, OSM_LOG_ERROR, "ERR 630F: " + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + 
exit(1); + } +#endif if (!(p_osm->subn.opt.port_profile_switch_nodes && port_guid && osm_get_switch_by_guid(&p_osm->subn, port_guid))) osm_switch_count_path(p_sw, port_num); @@ -193,8 +204,7 @@ static int do_ucast_file_load(void *context) cl_ntoh64(sw_guid)); continue; } - memset(p_sw->new_lft, OSM_NO_PATH, - IB_LID_UCAST_END_HO + 1); + memset(p_sw->new_lft, OSM_NO_PATH, p_sw->new_lft_size); } else if (p_sw && !strncmp(p, "0x", 2)) { p += 2; lid = (uint16_t) strtoul(p, &q, 16); diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 6ec6bc7..e37abb4 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -566,7 +566,7 @@ static ftree_sw_t *sw_create(IN ftree_fabric_t * p_ftree, return NULL; /* initialize lft buffer */ - memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + memset(p_osm_sw->new_lft, OSM_NO_PATH, p_osm_sw->new_lft_size); p_sw->hops = malloc((p_osm_sw->max_lid_ho + 1) * sizeof(*(p_sw->hops))); if (p_sw->hops == NULL) return NULL; @@ -2236,8 +2236,22 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, /* setting fwd tbl port only if this is real LID */ if (is_real_lid) { +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION p_remote_sw->p_osm_sw->new_lft[target_lid] = p_min_port->remote_port_num; +#else + if (!osm_switch_set_new_lft_entry(p_remote_sw->p_osm_sw, + target_lid, + p_min_port->remote_port_num, + &p_ftree->p_osm->subn)) { + + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS, + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "ERR AB15: osm_switch_set_new_lft_entry realloc failed - exiting\n"); + exit(1); + } +#endif OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to CA LID %u through port %u\n", tuple_to_str(p_remote_sw->tuple), @@ -2459,6 +2473,7 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, /* We update the LFT only if this LID isn't already present. 
*/ /* skip if target lid has been already set on remote switch fwd tbl (with a bigger hop count) */ +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION if ((p_remote_sw->p_osm_sw->new_lft[target_lid] == OSM_NO_PATH) || @@ -2470,6 +2485,28 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, p_remote_sw->p_osm_sw->new_lft[target_lid] = p_min_port->remote_port_num; +#else + if (target_lid >= p_remote_sw->p_osm_sw->new_lft_size || + ((p_remote_sw->p_osm_sw->new_lft[target_lid] == + OSM_NO_PATH) || + ((p_remote_sw->p_osm_sw->new_lft[target_lid] != + OSM_NO_PATH) && + (current_hops + 1 < + sw_get_least_hops(p_remote_sw, target_lid))))) { + + if (!osm_switch_set_new_lft_entry(p_remote_sw->p_osm_sw, + target_lid, + p_min_port->remote_port_num, + &p_ftree->p_osm->subn)) { + OSM_LOG(&p_ftree->p_osm->log, + OSM_LOG_SYS, + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + OSM_LOG(&p_ftree->p_osm->log, + OSM_LOG_ERROR, "ERR AB16: " + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + exit(1); + } +#endif OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to CA LID %u through port %u\n", tuple_to_str(p_remote_sw->tuple), @@ -2540,7 +2577,12 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, p_remote_sw = p_group->remote_hca_or_sw.p_sw; /* skip if target lid has been already set on remote switch fwd tbl (with a bigger hop count) */ +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION if (p_remote_sw->p_osm_sw->new_lft[target_lid] != OSM_NO_PATH) +#else + if (target_lid < p_remote_sw->p_osm_sw->new_lft_size && + p_remote_sw->p_osm_sw->new_lft[target_lid] != OSM_NO_PATH) +#endif if (current_hops + 1 >= sw_get_least_hops(p_remote_sw, target_lid)) continue; @@ -2576,8 +2618,21 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, p_port = p_min_port; //cl_ptr_vector_at(&p_group->ports, 0, (void *)&p_port); +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION p_remote_sw->p_osm_sw->new_lft[target_lid] = p_port->remote_port_num; +#else + 
if (!osm_switch_set_new_lft_entry(p_remote_sw->p_osm_sw, + target_lid, + p_port->remote_port_num, + &p_ftree->p_osm->subn)) { + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS, + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "ERR AB17: osm_switch_set_new_lft_entry realloc failed - exiting\n"); + exit(1); + } +#endif /* On the remote switch that is pointed by the p_group, set hops for ALL the ports in the remote group. */ @@ -2609,7 +2664,12 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, p_remote_sw = p_group->remote_hca_or_sw.p_sw; /* skip if target lid has been already set on remote switch fwd tbl (with a bigger hop count) */ +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION if (p_remote_sw->p_osm_sw->new_lft[target_lid] != OSM_NO_PATH) +#else + if (target_lid < p_remote_sw->p_osm_sw->new_lft_size && + p_remote_sw->p_osm_sw->new_lft[target_lid] != OSM_NO_PATH) +#endif if (current_hops + 1 >= sw_get_least_hops(p_remote_sw, target_lid)) continue; @@ -2645,9 +2705,21 @@ fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, p_port = p_min_port; //cl_ptr_vector_at(&p_group->ports, 0, (void *)&p_port); +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION p_remote_sw->p_osm_sw->new_lft[target_lid] = p_port->remote_port_num; - +#else + if (!osm_switch_set_new_lft_entry(p_remote_sw->p_osm_sw, + target_lid, + p_port->remote_port_num, + &p_ftree->p_osm->subn)) { + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS, + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "ERR AB18: osm_switch_set_new_lft_entry realloc failed - exiting\n"); + exit(1); + } +#endif /* On the remote switch that is pointed by the p_group, set hops for ALL the ports in the remote group. 
*/ @@ -2771,7 +2843,20 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree) /* set local LFT(LID) to the port that is connected to HCA */ cl_ptr_vector_at(&p_leaf_port_group->ports, 0, (void *)&p_port); +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION p_sw->p_osm_sw->new_lft[hca_lid] = p_port->port_num; +#else + if (!osm_switch_set_new_lft_entry(p_sw->p_osm_sw, + hca_lid, + p_port->port_num, + &p_ftree->p_osm->subn)) { + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS, + "osm_switch_set_new_lft_entry realloc error - exiting\n"); + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "ERR AB19: osm_switch_set_new_lft_entry realloc error - exiting\n"); + exit(1); + } +#endif OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to CN LID %u through port %u\n", @@ -2883,7 +2968,20 @@ static void fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree) cl_ptr_vector_at(&p_hca_port_group->ports, 0, (void *)&p_hca_port); port_num_on_switch = p_hca_port->remote_port_num; +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION p_sw->p_osm_sw->new_lft[hca_lid] = port_num_on_switch; +#else + if (!osm_switch_set_new_lft_entry(p_sw->p_osm_sw, + hca_lid, + port_num_on_switch, + &p_ftree->p_osm->subn)) { + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS, + "osm_switch_set_new_lft_entry realloc error - exiting\n"); + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "ERR AB1A: osm_switch_set_new_lft_entry realloc error - exiting\n"); + exit(1); + } +#endif OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to non-CN HCA LID %u through port %u\n", @@ -2941,7 +3039,19 @@ static void fabric_route_to_switches(IN ftree_fabric_t * p_ftree) p_next_sw = (ftree_sw_t *) cl_qmap_next(&p_sw->map_item); /* set local LFT(LID) to 0 (route to itself) */ +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION p_sw->p_osm_sw->new_lft[p_sw->base_lid] = 0; +#else + if (!osm_switch_set_new_lft_entry(p_sw->p_osm_sw, + p_sw->base_lid, 0, + &p_ftree->p_osm->subn)) { + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_SYS, + 
"osm_switch_set_new_lft_entry realloc error - exiting\n"); + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "ERR AB1B: osm_switch_set_new_lft_entry realloc error - exiting\n"); + exit(1); + } +#endif OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s (LID %u): routing switch-to-switch paths\n", diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 0a567b3..18088a9 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -1009,7 +1009,7 @@ static void populate_fwd_tbls(lash_t * p_lash) current_guid = p_sw->p_node->node_info.port_guid; sw = p_sw->priv; - memset(p_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + memset(p_sw->new_lft, OSM_NO_PATH, p_sw->new_lft_size); for (lid = 1; lid <= max_lid_ho; lid++) { port = cl_ptr_vector_get(&p_subn->port_lid_tbl, lid); @@ -1020,7 +1020,20 @@ static void populate_fwd_tbls(lash_t * p_lash) if (p_dst_sw == p_sw) { uint8_t egress_port = port->p_node->sw ? 0 : port->p_physp->p_remote_physp->port_num; +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION p_sw->new_lft[lid] = egress_port; +#else + if (!osm_switch_set_new_lft_entry(p_sw, lid, + egress_port, + p_subn)) { + OSM_LOG(p_log, OSM_LOG_SYS, + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D05: " + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + exit(1); + } +#endif + OSM_LOG(p_log, OSM_LOG_VERBOSE, "LASH fwd MY SRC SRC GUID 0x%016" PRIx64 " src lash id (%d), src lid no (%u) src lash port (%d) " @@ -1038,7 +1051,19 @@ static void populate_fwd_tbls(lash_t * p_lash) uint8_t physical_egress_port = get_next_port(sw, lash_egress_port); +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION p_sw->new_lft[lid] = physical_egress_port; +#else + if (!osm_switch_set_new_lft_entry(p_sw, lid, + physical_egress_port, + p_subn)) { + OSM_LOG(p_log, OSM_LOG_SYS, + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D06:" + 
"osm_switch_set_new_lft_entry realloc failed - exiting\n"); + exit(1); + } +#endif OSM_LOG(p_log, OSM_LOG_VERBOSE, "LASH fwd SRC GUID 0x%016" PRIx64 " src lash id (%d), " diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 629f628..0a34e8f 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -298,7 +298,17 @@ static void ucast_mgr_process_port(IN osm_ucast_mgr_t * p_mgr, We have selected the port for this LID. Write it to the forwarding tables. */ +#ifndef ENABLE_OSM_FT_HEAP_OPTIMIZATION p_sw->new_lft[lid_ho] = port; +#else + if (!osm_switch_set_new_lft_entry(p_sw, lid_ho, port, p_mgr->p_subn)) { + OSM_LOG(p_mgr->p_log, OSM_LOG_SYS, + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A0F: " + "osm_switch_set_new_lft_entry realloc failed - exiting\n"); + exit(1); + } +#endif if (!is_ignored_by_port_prof) { struct osm_remote_node *rem_node_used; osm_switch_count_path(p_sw, port); @@ -443,7 +453,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item, cl_ntoh64(osm_node_get_node_guid(p_sw->p_node))); /* Initialize LIDs in buffer to invalid port number. 
*/ - memset(p_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + memset(p_sw->new_lft, OSM_NO_PATH, p_sw->new_lft_size); if (p_mgr->p_subn->opt.lmc) alloc_ports_priv(p_mgr); @@ -492,7 +502,8 @@ static boolean_t set_next_lft_block(IN osm_switch_t * p_sw, IN osm_sm_t * p_sm, OSM_LOG_ENTER(p_sm->p_log); for (; - (sts = osm_switch_get_lft_block(p_sw, block_id_ho, p_block)); + (sts = osm_switch_get_lft_block(p_sw, p_sm->p_subn, block_id_ho, + p_block)); block_id_ho++) { if (!p_sw->need_update && !p_sm->p_subn->need_update && !memcmp(p_block, From sashak at voltaire.com Sat Aug 29 09:25:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 29 Aug 2009 19:25:55 +0300 Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: simplify fwd tables setup flow In-Reply-To: References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me> <20090828080756.GH28379@me> Message-ID: <20090829162555.GA21238@me> On 12:03 Fri 28 Aug , Hal Rosenstock wrote: > > lash_core: ERR 4D02: Lane requirements (9) exceed available lanes (8) with > starting lane (0) > ucast_mgr_route: lash: cannot build fwd tables. > osm_ucast_mgr_process: minhop tables configured on all switches > ERR 331D: LFT of switch 0xguid is not up to date. > > Prior to this change, the LFTs were pushed for this fallback case (and no > ERR 331D occurred). Nice catch. 
An addition like this is needed to make the fallback work properly: diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index b7e3893..39d825c 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -1007,6 +1007,7 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * p_mgr) /* If configured routing algorithm failed, use default MinHop */ osm_ucast_mgr_build_lid_matrices(p_mgr); ucast_mgr_build_lfts(p_mgr); + osm_ucast_mgr_set_fwd_tables(p_mgr); p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; } Sasha From sashak at voltaire.com Sat Aug 29 09:28:52 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 29 Aug 2009 19:28:52 +0300 Subject: [ofa-general] [PATCH v2] opensm/osm_ucast_mgr.c: simplify fwd tables setup flow In-Reply-To: <20090829162555.GA21238@me> References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me> <20090828080756.GH28379@me> <20090829162555.GA21238@me> Message-ID: <20090829162852.GB21238@me> Simplify (and unify) forwarding tables setup decision flow. 
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_mgr.c | 8 ++------ 1 files changed, 2 insertions(+), 6 deletions(-) diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 629f628..45a4a7e 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -463,8 +463,6 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item, } } - set_fwd_tbl_top(p_mgr, p_sw); - if (p_mgr->p_subn->opt.lmc) free_ports_priv(p_mgr); @@ -977,8 +975,6 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr) cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, ucast_mgr_process_tbl, p_mgr); - ucast_mgr_pipeline_fwd_tbl(p_mgr); - cl_qlist_remove_all(&p_mgr->port_order_list); return 0; @@ -1025,8 +1021,7 @@ static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t * osm) osm->routing_engine_used = osm_routing_engine_type(r->name); - if (r->ucast_build_fwd_tables) - osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr); + osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr); return 0; } @@ -1063,6 +1058,7 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * p_mgr) /* If configured routing algorithm failed, use default MinHop */ osm_ucast_mgr_build_lid_matrices(p_mgr); ucast_mgr_build_lfts(p_mgr); + osm_ucast_mgr_set_fwd_tables(p_mgr); p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; } -- 1.6.4 From sashak at voltaire.com Sat Aug 29 09:35:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 29 Aug 2009 19:35:45 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.c: Add SM priority changed into trap 144 description In-Reply-To: <20090828134452.GA20014@comcast.net> References: <20090828134452.GA20014@comcast.net> Message-ID: <20090829163545.GC21238@me> On 09:44 Fri 28 Aug , Hal Rosenstock wrote: > > Per MgtWG RefID #4503 > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From roel.kluin at gmail.com Sat Aug 29 13:25:38 2009 From: roel.kluin at gmail.com (Roel Kluin) Date: Sat, 29 Aug 2009 22:25:38 +0200 Subject: [ofa-general] [PATCH] IB: dereference of dev->ibdev.iwcm in c2_register_device() Message-ID: <4A998EC2.70500@gmail.com> dev->ibdev.iwcm allocation may fail, prevent a dereference. Signed-off-by: Roel Kluin --- diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index f1948fa..0f90fe6 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -851,6 +851,10 @@ int c2_register_device(struct c2_dev *dev) dev->ibdev.post_recv = c2_post_receive; dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL); + if (dev->ibdev.iwcm == NULL) { + ret = -ENOMEM; + goto out1; + } dev->ibdev.iwcm->add_ref = c2_add_ref; dev->ibdev.iwcm->rem_ref = c2_rem_ref; dev->ibdev.iwcm->get_qp = c2_get_qp; From sashak at voltaire.com Sat Aug 29 13:41:18 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 29 Aug 2009 23:41:18 +0300 Subject: [ofa-general] Re: [PATCH] Duplicated file man/umad_get_mad.3 in libibumad/Makefile.am In-Reply-To: <4A952478.7060407@bull.net> References: <4A952478.7060407@bull.net> Message-ID: <20090829204118.GD21238@me> On 14:03 Wed 26 Aug , Vincent Ficet wrote: > > Hello, > > the file man/umad_get_mad.3 was listed twice in libibumad/Makefile.am resulting in the following error: > > /usr/bin/install: will not overwrite just-created `/home/vficet/work/infiniband/I686/usr/share/man/man3/umad_get_mad.3' with `man/umad_get_mad.3' > > This patch removes the duplicated entry. > > Cheers, > > Vincent > > > Signed-off-by: Jean-Vincent Ficet Applied. Thanks. 
Sasha From sashak at voltaire.com Sat Aug 29 13:44:38 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 29 Aug 2009 23:44:38 +0300 Subject: [ofa-general] Re: [PATCH] Remove duplicated umad_get_mad.3 from Makefile.am In-Reply-To: <20090828190251.GA8633@obsidianresearch.com> References: <20090828190251.GA8633@obsidianresearch.com> Message-ID: <20090829204438.GE21238@me> On 13:02 Fri 28 Aug , Jason Gunthorpe wrote: > Fixes builds on FC11. > > Signed-off-by: Jason Gunthorpe Thanks Jason, similar patch was already posted by Jean-Vincent Ficet, so the fix is applied. Sasha From sashak at voltaire.com Sat Aug 29 13:44:57 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 29 Aug 2009 23:44:57 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_helper.c: Only change method when > rather than >= In-Reply-To: <20090825232024.GA17650@comcast.net> References: <20090825232024.GA17650@comcast.net> Message-ID: <20090829204457.GG21238@me> On 19:20 Tue 25 Aug , Hal Rosenstock wrote: > > Also, cosmetic formatting change to combine lines like: > uint16_t host_attr; > host_attr = cl_ntoh16(attr); > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sat Aug 29 13:45:08 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 29 Aug 2009 23:45:08 +0300 Subject: [ofa-general] Re: osm_link_mgr.c:link_mgr_get_smsl question In-Reply-To: References: Message-ID: <20090829204508.GH21238@me> Hi Hal, On 14:38 Fri 07 Aug , Hal Rosenstock wrote: > > osm_link_mgr.c:link_mgr_get_smsl has the following: > > /* Find osm_port of the source = p_physp */ > slid = osm_physp_get_base_lid(p_physp); > p_src_port = > cl_ptr_vector_get(&sm->p_subn->port_lid_tbl, cl_ntoh16(slid)); > > /* Call lash to find proper SL */ > sl = osm_get_lash_sl(p_osm, p_src_port, p_sm_port); > > It may be that this code is invoked prior to the LID being assigned How is it possible? In the code I can see that link_mgr_process() is always executed after lid_mgr run. 
> so > getting the p_src_port based on the LID yields NULL and then calling > osm_get_lash_sl causes a seg fault. > > I can see two ways to fix this: > 1. Replace with port GUID search > 2. Have osm_get_lash_sl handle NULL for p_src_port > Maybe you see other ways to deal with this. > > Do you have a preferred approach ? Hmm, SMSL will be irrelevant for a port where LID was not assigned, right? If so than it is probably just enough to add in link_mgr_get_smsl(): if (!p_src_port) return; But it would be really better to understand an error source before deciding about proper solution. Sasha From hal.rosenstock at gmail.com Sat Aug 29 15:59:14 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 29 Aug 2009 18:59:14 -0400 Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: simplify fwd tables setup flow In-Reply-To: <20090829162555.GA21238@me> References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me> <20090828080756.GH28379@me> <20090829162555.GA21238@me> Message-ID: On 8/29/09, Sasha Khapyorsky wrote: > > On 12:03 Fri 28 Aug , Hal Rosenstock wrote: > > > > lash_core: ERR 4D02: Lane requirements (9) exceed available lanes (8) > with > > starting lane (0) > > ucast_mgr_route: lash: cannot build fwd tables. > > osm_ucast_mgr_process: minhop tables configured on all switches > > ERR 331D: LFT of switch 0xguid is not up to date. > > > > Prior to this change, the LFTs were pushed for this fallback case (and no > > ERR 331D occured). > > Nice catch. 
> > Such addition is needed to make a fallback to work properly: > > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index b7e3893..39d825c 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -1007,6 +1007,7 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * p_mgr) > /* If configured routing algorithm failed, use default > MinHop */ > osm_ucast_mgr_build_lid_matrices(p_mgr); > ucast_mgr_build_lfts(p_mgr); > + osm_ucast_mgr_set_fwd_tables(p_mgr); Shouldn't this be osm_ucast_mgr_set_fwd_table ? -- Hal p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; > } > > Sasha > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Sat Aug 29 16:05:38 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 29 Aug 2009 19:05:38 -0400 Subject: [ofa-general] Re: [PATCH v2] opensm/osm_ucast_mgr.c: simplify fwd tables setup flow In-Reply-To: <20090829162852.GB21238@me> References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me> <20090828080756.GH28379@me> <20090829162555.GA21238@me> <20090829162852.GB21238@me> Message-ID: On 8/29/09, Sasha Khapyorsky wrote: > > > Simplify (and unify) forwarding tables setup decision flow. 
> > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_ucast_mgr.c | 8 ++------ > 1 files changed, 2 insertions(+), 6 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index 629f628..45a4a7e 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -463,8 +463,6 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * > p_map_item, > } > } > > - set_fwd_tbl_top(p_mgr, p_sw); > - > if (p_mgr->p_subn->opt.lmc) > free_ports_priv(p_mgr); > > @@ -977,8 +975,6 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * > p_mgr) > cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, > ucast_mgr_process_tbl, > p_mgr); > > - ucast_mgr_pipeline_fwd_tbl(p_mgr); > - > cl_qlist_remove_all(&p_mgr->port_order_list); > > return 0; > @@ -1025,8 +1021,7 @@ static int ucast_mgr_route(struct osm_routing_engine > *r, osm_opensm_t * osm) > > osm->routing_engine_used = osm_routing_engine_type(r->name); > > - if (r->ucast_build_fwd_tables) > - osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr); > + osm_ucast_mgr_set_fwd_table(&osm->sm.ucast_mgr); > > return 0; > } > @@ -1063,6 +1058,7 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * p_mgr) > /* If configured routing algorithm failed, use default > MinHop */ > osm_ucast_mgr_build_lid_matrices(p_mgr); > ucast_mgr_build_lfts(p_mgr); > + osm_ucast_mgr_set_fwd_tables(p_mgr); osm_ucast_mgr_set_fwd_table(p_mgr); ? p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; > } > > -- > 1.6.4 > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tziporet at dev.mellanox.co.il Sat Aug 29 23:59:03 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 30 Aug 2009 09:59:03 +0300 Subject: [ofa-general] [PATCHv5 0/10] RDMAoE support In-Reply-To: <20090824121307.GA3919@mtls03> References: <20090819171935.GA14411@mtls03> <20090824121307.GA3919@mtls03> Message-ID: <4A9A2337.3030500@mellanox.co.il> Eli Cohen wrote: > Roland, > > what about this series of patches? Would you like me to re-create them > over your xrc branch or would you rather take them before xrc? > > > Hi Roland We wait for your input how to proceed Thanks Tziporet From vlad at lists.openfabrics.org Sun Aug 30 03:05:46 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 30 Aug 2009 03:05:46 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090830-0200 daily build status Message-ID: <20090830100547.4013EF20436@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed 
on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] 
Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090830-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From sashak at voltaire.com Sun Aug 30 03:02:53 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 13:02:53 +0300 Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr.c: simplify fwd tables setup flow In-Reply-To: References: <20090807110811.GA23431@comcast.net> <20090825190141.GG28379@me> <20090828080756.GH28379@me> <20090829162555.GA21238@me> Message-ID: <20090830100253.GA21909@me> On 18:59 Sat 29 
Aug , Hal Rosenstock wrote: > > > > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > > index b7e3893..39d825c 100644 > > --- a/opensm/opensm/osm_ucast_mgr.c > > +++ b/opensm/opensm/osm_ucast_mgr.c > > @@ -1007,6 +1007,7 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * p_mgr) > > /* If configured routing algorithm failed, use default > > MinHop */ > > osm_ucast_mgr_build_lid_matrices(p_mgr); > > ucast_mgr_build_lfts(p_mgr); > > + osm_ucast_mgr_set_fwd_tables(p_mgr); > > > > Shouldn't this be osm_ucast_mgr_set_fwd_table ? Yes it should be, but I renamed this later, now it is osm_ucast_mgr_set_fwd_tables() (since it sets all tables and not per switch as before). By mistake I pushed this last change patch before renaming so we have broken patch in the history (thing I'm trying to avoid normally). Sasha From sashak at voltaire.com Sun Aug 30 03:08:26 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 13:08:26 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_lash.c: In lash_core, return status -1 for all errors In-Reply-To: <20090806182315.GB21698@comcast.net> References: <20090806182315.GB21698@comcast.net> Message-ID: <20090830100826.GB21909@me> On 14:23 Thu 06 Aug , Hal Rosenstock wrote: > > In lash_process, rename variable from return_status to status > Also, status is not really IB_SUCCESS or not (although that works) > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From jackm at dev.mellanox.co.il Sun Aug 30 03:31:51 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 30 Aug 2009 13:31:51 +0300 Subject: [ofa-general] [PATCH V3] mlx4: Do not allow ib userspace open following a fatal event Message-ID: <200908301331.51212.jackm@dev.mellanox.co.il> Userspace apps are supposed to release all ib device resources if they receive a fatal async event (IBV_EVENT_DEVICE_FATAL). 
However, the app has no way of knowing when the device has come back up, except to repeatedly attempt ibv_open_device() until it succeeds.

Currently there is no protection against the open succeeding while the device is in the midst of removal following the fatal event. In that case the open succeeds, and the device then waits in the middle of its removal until the new app releases its ib resources. The new app will not do so, since its open succeeded at a point following the fatal event generation.

This patch adds an "active" flag to the device. The active flag is set to false (in the fatal event flow) before the "fatal" event is generated, so any subsequent ibv_open_device() call to the device will fail until the device comes back up, thus preventing the above deadlock.

V2: move active flag from net to hw/mlx4, and use only for fatal event flow (per feedback from Roland).
V3: fixed checkpatch.pl warnings.

Signed-off-by: Jack Morgenstein
---
Roland,
Sorry about the checkpatch.pl oversight. No excuse, but that day I was particularly rushed: I left for the airport that evening with my family to go on vacation for 2 weeks. I guess I cut some corners, and shouldn't have.

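For readers following the description, the ordering it relies on can be modelled in a few lines of plain C. This is only a sketch under stated assumptions: `struct mock_dev`, `mock_open()`, `mock_fatal_event()` and `mock_device_added()` are hypothetical stand-ins invented for illustration, not mlx4 or libibverbs code. The only things taken from the patch are the `ib_active` flag, the `-EAGAIN` result, and the fact that the flag is cleared before the fatal event is reported.

```c
#include <errno.h>

/* Hypothetical stand-in for the device; only the flag matters here. */
struct mock_dev {
	int ib_active;	/* 1 while new user contexts may be opened */
	int contexts;	/* number of open user contexts */
};

/* Models the check on the open path: refuse while the device is inactive. */
static int mock_open(struct mock_dev *dev)
{
	if (!dev->ib_active)
		return -EAGAIN;
	dev->contexts++;
	return 0;
}

/* Models the fatal-event path: clear the flag first, then notify users. */
static void mock_fatal_event(struct mock_dev *dev)
{
	dev->ib_active = 0;
	/* the fatal event would be dispatched to user space here */
}

/* Models the device coming back up after its removal and re-add. */
static void mock_device_added(struct mock_dev *dev)
{
	dev->contexts = 0;
	dev->ib_active = 1;
}
```

In this model an application that retries the open after a fatal event keeps getting `-EAGAIN` until `mock_device_added()` has run, so it can never pin a device that is halfway through removal.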
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index ae3d759..4effc19 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -342,6 +342,9 @@ static struct ib_ucontext *mlx4_ib_alloc_ucontext(struct ib_device *ibdev, struct mlx4_ib_alloc_ucontext_resp resp; int err; + if (!dev->ib_active) + return ERR_PTR(-EAGAIN); + resp.qp_tab_size = dev->dev->caps.num_qps; resp.bf_reg_size = dev->dev->caps.bf_reg_size; resp.bf_regs_per_page = dev->dev->caps.bf_regs_per_page; @@ -673,6 +676,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) goto err_reg; } + ibdev->ib_active = 1; + return ibdev; err_reg: @@ -729,6 +734,7 @@ static void mlx4_ib_event(struct mlx4_dev *dev, void *ibdev_ptr, break; case MLX4_DEV_EVENT_CATASTROPHIC_ERROR: + ibdev->ib_active = 0; ibev.event = IB_EVENT_DEVICE_FATAL; break; diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 8a7dd67..b22df97 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -175,6 +175,7 @@ struct mlx4_ib_dev { spinlock_t sm_lock; struct mutex cap_mask_mutex; + int ib_active; }; static inline struct mlx4_ib_dev *to_mdev(struct ib_device *ibdev) From sashak at voltaire.com Sun Aug 30 03:36:15 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 13:36:15 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm/osm_mesh.c: Remove edges in lash matrix In-Reply-To: <20090806223417.GA2997@comcast.net> References: <20090806223417.GA2997@comcast.net> Message-ID: <20090830103615.GC21909@me> Hi Hal, On 18:34 Thu 06 Aug , Hal Rosenstock wrote: > > @@ -773,6 +838,7 @@ static void seed_axes(lash_t *p_lash, int sw) > mesh_node_t *node = p_lash->switches[sw]->node; > int n = node->num_links; > int i, j, c; > + char buf[256], *p; > > OSM_LOG_ENTER(p_log); > if (!node->matrix || !node->dimension) > @@ -805,6 +871,12 @@ static void seed_axes(lash_t *p_lash, int sw) > } > } > > + 
for (i = 0; i < n; i++) { > + p = buf; > + print_axis(p_lash, p, sw, i); > + OSM_LOG(p_log, OSM_LOG_INFO, "%s", buf); > + } > + As far as I can see it is only debug prints, so why is OSM_LOG_INFO here? Also please move whole chunk under: if (osm_log_is_active(p_log, OSM_LOG_DEBUG)) { char buf[256], *p; .... } > done: > OSM_LOG_EXIT(p_log); > } > @@ -878,6 +950,12 @@ static void make_geometry(lash_t *p_lash, int sw) > n = s1->node->num_links; > > /* > + * ignore chain fragments > + */ > + if (n < seed->node->num_links && n <= 2) > + continue; > + > + /* > * only process 'mesh' switches > */ > if (!s1->node->matrix) > @@ -908,7 +986,8 @@ static void make_geometry(lash_t *p_lash, int sw) > if (j == i) > continue; > > - if (s1->node->matrix[i][j] != 2) { > + if (s1->node->matrix[i][j] != 2 && > + s1->node->matrix[i][j] <= 4) { What does this ' <= 4' check? Sasha > if (s1->node->axes[j]) { > if (s1->node->axes[j] != opposite(seed, s1->node->axes[i])) { > OSM_LOG(p_log, OSM_LOG_DEBUG, "phase 1 mismatch\n"); > From sashak at voltaire.com Sun Aug 30 04:26:24 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 14:26:24 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Parallelize (Stripe) MFT sets across switches In-Reply-To: <20090807164127.GA795@comcast.net> References: <20090807164127.GA795@comcast.net> Message-ID: <20090830112624.GD21909@me> Hi Hal, On 12:41 Fri 07 Aug , Hal Rosenstock wrote: > > Similar to previous patch to "Parallelize (Stripe) LFT sets across switches". > Currently, MADs are pipelined to a single switch first which effectively > serializes these requests. This patch pipelines the MFT set MADs across > switches first (before cycling to the next MFT block) so that multiple > switches can be responding concurrently. Speedup is dependent on number > of MFT blocks in use (number of MLIDs) which is dependent on the number > of multicast groups. 
> > Signed-off-by: Hal Rosenstock > --- > diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h > index 7ce28c5..e281842 100644 > --- a/opensm/include/opensm/osm_switch.h > +++ b/opensm/include/opensm/osm_switch.h > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > @@ -103,6 +103,8 @@ typedef struct osm_switch { > uint8_t *lft; > uint8_t *new_lft; > osm_mcast_tbl_t mcast_tbl; > + uint32_t mft_block_num; > + uint32_t mft_position; > unsigned endport_links; > unsigned need_update; > void *priv; > diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c > index 4dbbaa0..f91c6b6 100644 > --- a/opensm/opensm/osm_mcast_mgr.c > +++ b/opensm/opensm/osm_mcast_mgr.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. 
> * > @@ -325,15 +325,12 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw) > { > osm_node_t *p_node; > osm_dr_path_t *p_path; > - osm_madw_context_t mad_context; > + osm_madw_context_t context; > ib_api_status_t status; > - uint32_t block_id_ho = 0; > - int16_t block_num = 0; > - uint32_t position = 0; > - uint32_t max_position; > + uint32_t block_id_ho; > osm_mcast_tbl_t *p_tbl; > ib_net16_t block[IB_MCAST_BLOCK_SIZE]; > - int ret = 0; > + int ret = -1; > > CL_ASSERT(sm); > > @@ -353,36 +350,34 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw) > configuration. > */ > > - mad_context.mft_context.node_guid = osm_node_get_node_guid(p_node); > - mad_context.mft_context.set_method = TRUE; > + context.mft_context.node_guid = osm_node_get_node_guid(p_node); > + context.mft_context.set_method = TRUE; > > p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); > - max_position = p_tbl->max_position; > > - while (osm_mcast_tbl_get_block(p_tbl, block_num, > - (uint8_t) position, block)) { > - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > - "Writing MFT block 0x%X\n", block_id_ho); > + if (p_sw->mft_position <= p_tbl->max_position && > + osm_mcast_tbl_get_block(p_tbl, p_sw->mft_block_num, > + (uint8_t) p_sw->mft_position, block)) { > + > + block_id_ho = p_sw->mft_block_num + (p_sw->mft_position << 28); > > - block_id_ho = block_num + (position << 28); > + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > + "Writing MFT block %u position %u to switch 0x%" PRIx64 "\n", > + p_sw->mft_block_num, p_sw->mft_position, > + cl_ntoh64(context.lft_context.node_guid)); > > status = osm_req_set(sm, p_path, (void *)block, sizeof(block), > IB_MAD_ATTR_MCAST_FWD_TBL, > cl_hton32(block_id_ho), CL_DISP_MSGID_NONE, > - &mad_context); > + &context); > > - if (status != IB_SUCCESS) { > + if (status != IB_SUCCESS) > OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A02: " > - "Sending multicast fwd. tbl. 
block failed (%s)\n", > + "Sending MFT block failed (%s)\n", > ib_get_err_str(status)); > - ret = -1; > - } > > - if (++position > max_position) { > - position = 0; > - block_num++; > - } > - } > + } else > + ret = 0; > > OSM_LOG_EXIT(sm->p_log); > return ret; > @@ -1077,7 +1072,8 @@ int osm_mcast_mgr_process(osm_sm_t * sm) > cl_qmap_t *p_sw_tbl; > cl_qlist_t *p_list = &sm->mgrp_list; > osm_mgrp_t *p_mgrp; > - int i, ret = 0; > + osm_mcast_tbl_t *p_tbl; > + int sws_notdone, i, ret = 0; > > OSM_LOG_ENTER(sm->p_log); > > @@ -1114,11 +1110,30 @@ int osm_mcast_mgr_process(osm_sm_t * sm) > */ > p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); > while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { > - if (mcast_mgr_set_tbl(sm, p_sw)) > - ret = -1; > + p_sw->mft_block_num = 0; > + p_sw->mft_position = 0; > p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); > } > > + while (1) { > + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); > + sws_notdone = 0; > + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { > + if (mcast_mgr_set_tbl(sm, p_sw)) > + sws_notdone++; > + p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); > + if (++p_sw->mft_position > p_tbl->max_position) { > + p_sw->mft_position = 0; > + p_sw->mft_block_num++; > + } > + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); > + } > + if (!sws_notdone) { > + ret = -1; > + break; > + } So osm_mcast_mgr_process() will always return -1 value? Why? 
> + } > + > while (!cl_is_qlist_empty(p_list)) { > cl_list_item_t *p = cl_qlist_remove_head(p_list); > free(p); > @@ -1142,9 +1157,10 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm) > osm_switch_t *p_sw; > cl_qmap_t *p_sw_tbl; > osm_mgrp_t *p_mgrp; > + osm_mcast_tbl_t *p_tbl; > ib_net16_t mlid; > osm_mcast_mgr_ctxt_t *ctx; > - int ret = 0; > + int sws_notdone, ret = 0; > > OSM_LOG_ENTER(sm->p_log); > > @@ -1195,11 +1211,30 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm) > p_sw_tbl = &sm->p_subn->sw_guid_tbl; > p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); > while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { > - if (mcast_mgr_set_tbl(sm, p_sw)) > - ret = -1; > + p_sw->mft_block_num = 0; > + p_sw->mft_position = 0; > p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); > } > > + while (1) { > + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); > + sws_notdone = 0; > + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { > + if (mcast_mgr_set_tbl(sm, p_sw)) > + sws_notdone++; > + p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); > + if (++p_sw->mft_position > p_tbl->max_position) { > + p_sw->mft_position = 0; > + p_sw->mft_block_num++; > + } > + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); > + } > + if (!sws_notdone) { > + ret = -1; > + break; Ditto. > + } > + } > + Could you consolidate this code which is equivalent with one in osm_mcast_mgr_process() in single function say mcast_mgr_set_mftables()? Also similar to LFTs case it would be nice to simplify this tables setup loop. 
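A rough sketch of what such a shared helper could look like, written as a self-contained mock and not as OpenSM code. `struct mock_switch`, `mock_set_one_block()` and `mock_set_mftables()` are hypothetical names, and the real `osm_switch_t`, `osm_mcast_tbl_get_block()` and `osm_req_set()` are replaced by counters. It assumes a per-switch step that distinguishes "nothing left" from "send failed", so the helper returns -1 only when a send actually failed.

```c
#include <limits.h>

/* Hypothetical stand-in for osm_switch_t: only the striping state. */
struct mock_switch {
	unsigned num_blocks;	/* MFT blocks in use on this switch */
	unsigned max_position;	/* highest port-mask position */
	unsigned mft_block_num;	/* next block to send */
	unsigned mft_position;	/* next position to send */
	unsigned fail_block;	/* block whose send fails, UINT_MAX for none */
	unsigned sent;		/* simulated Set() requests issued */
};

/*
 * Issue at most one (block, position) request for a switch.
 * Returns 0 when the switch has nothing left, 1 when a request was
 * issued, -1 when the simulated send failed. The cursor advances in
 * both non-zero cases so a failing block cannot stall the loop.
 */
static int mock_set_one_block(struct mock_switch *sw)
{
	int ret = 1;

	if (sw->mft_block_num >= sw->num_blocks)
		return 0;
	if (sw->mft_block_num == sw->fail_block)
		ret = -1;
	else
		sw->sent++;
	if (++sw->mft_position > sw->max_position) {
		sw->mft_position = 0;
		sw->mft_block_num++;
	}
	return ret;
}

/*
 * Single helper for both callers: visit every switch once per round,
 * so requests are striped across switches, and stop after a round in
 * which no switch had anything left. Returns -1 only if a send failed.
 */
static int mock_set_mftables(struct mock_switch *sws, unsigned n)
{
	unsigned i, pending;
	int r, ret = 0;

	for (i = 0; i < n; i++) {
		sws[i].mft_block_num = 0;
		sws[i].mft_position = 0;
	}
	do {
		pending = 0;
		for (i = 0; i < n; i++) {
			r = mock_set_one_block(&sws[i]);
			if (r)
				pending++;
			if (r < 0)
				ret = -1;
		}
	} while (pending);
	return ret;
}
```

Both osm_mcast_mgr_process() and osm_mcast_mgr_process_mgroups() could then call one such function in place of their duplicated loops.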
Sasha > osm_dump_mcast_routes(sm->p_subn->p_osm); > > exit: > From hal.rosenstock at gmail.com Sun Aug 30 04:32:41 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 30 Aug 2009 07:32:41 -0400 Subject: [ofa-general] Re: osm_link_mgr.c:link_mgr_get_smsl question In-Reply-To: <20090829204508.GH21238@me> References: <20090829204508.GH21238@me> Message-ID: Hi Sasha, On 8/29/09, Sasha Khapyorsky wrote: > Hi Hal, > > On 14:38 Fri 07 Aug , Hal Rosenstock wrote: > > > > osm_link_mgr.c:link_mgr_get_smsl has the following: > > > > /* Find osm_port of the source = p_physp */ > > slid = osm_physp_get_base_lid(p_physp); > > p_src_port = > > cl_ptr_vector_get(&sm->p_subn->port_lid_tbl, > cl_ntoh16(slid)); > > > > /* Call lash to find proper SL */ > > sl = osm_get_lash_sl(p_osm, p_src_port, p_sm_port); > > > > It may be that this code is invoked prior to the LID being assigned > > How is it possible? In the code I can see that link_mgr_process() is > always executed after lid_mgr run. When nodes use gPXE, the LID is not passed from the gPXE to the Linux environment. > > so > > getting the p_src_port based on the LID yields NULL and then calling > > osm_get_lash_sl causes a seg fault. > > > > I can see two ways to fix this: > > 1. Replace with port GUID search > > 2. Have osm_get_lash_sl handle NULL for p_src_port > > Maybe you see other ways to deal with this. > > > > Do you have a preferred approach ? > > Hmm, SMSL will be irrelevant for a port where LID was not assigned, > right? Of course. > If so than it is probably just enough to add in link_mgr_get_smsl(): > > if (!p_src_port) > return; OK. -- Hal But it would be really better to understand an error source before deciding > about proper solution. > Sasha > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sashak at voltaire.com Sun Aug 30 04:36:30 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 14:36:30 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: Add support for MulticastFDBTop In-Reply-To: <20090826140202.GA19158@comcast.net> References: <20090826140202.GA19158@comcast.net> Message-ID: <20090830113630.GE21909@me> On 10:02 Wed 26 Aug , Hal Rosenstock wrote: > > Add support for SwitchInfo:MulticastFDBTop and > PortInfo:CapabilityMask.IsMulticastFDBTopSupported > > Added by MgtWG errata #4505-4508 > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 30 04:53:16 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 14:53:16 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add support for MulticastFDBTop In-Reply-To: <20090826140350.GB19158@comcast.net> References: <20090826140350.GB19158@comcast.net> Message-ID: <20090830115316.GF21909@me> On 10:03 Wed 26 Aug , Hal Rosenstock wrote: > > Add support for SwitchInfo:MulticastFDBTop > Added by MgtWG errata #4505-4508 and #4640 > > If MulticastFDBTop is set to other than 0, only fetch MulticastForwardingTable > blocks up through MulticastFDBTop rather than MulticastFDBCap > > If MulticastFDBTop is set to 0xbfff, this means no entries (per #4640) > > Signed-off-by: Hal Rosenstock > --- > diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c > index 106c934..f3ebe56 100644 > --- a/infiniband-diags/src/ibroute.c > +++ b/infiniband-diags/src/ibroute.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. > + * Copyright (c) 2009 Mellanox Technologies LTD. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid, > char *s; > uint64_t nodeguid; > uint32_t mod; > - unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock; > + unsigned block, i, j, e, nports, cap, top, chunks, > + startblock, lastblock; > int n = 0; > > if ((s = check_switch(portid, &nports, &nodeguid, sw, nd))) > return s; > > mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap); > + mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top); > > if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1) > endlid = IB_MIN_MCAST_LID + cap - 1; > + if (!dump_all && top && top < endlid) { > + if (top < IB_MIN_MCAST_LID - 1 || top == 0xffff) I don't understand what does this "top == 0xffff" check? Shouldn't be something like (top > IB_MIN_MCAST_LID + cap - 1 && top != 0xbfff) instead? > + IBWARN("illegal top mlid %x", top); > + else > + endlid = top; > + } And where is the case of "no entries" (top = 0xbfff) handled (as declared in change log)? 
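To make the cases in question concrete, here is a small standalone sketch of how the last MLID to dump could be chosen from cap and top. It is an illustration only, not the ibroute code: `mcast_end_lid()` and `MCAST_TOP_NONE` are hypothetical names, `MIN_MCAST_LID` is assumed to equal IB_MIN_MCAST_LID (0xc000), and the user-supplied endlid and dump_all handling are left out. The three rules it encodes come from the change log and the review comment above: top == 0 falls back to cap, top == 0xbfff means no entries, and any other top outside the cap range is rejected.

```c
#define MIN_MCAST_LID	0xc000u	/* assumed equal to IB_MIN_MCAST_LID */
#define MCAST_TOP_NONE	0xbfffu	/* hypothetical name for "no entries" */

/*
 * Choose the last MLID to dump.
 * Returns 1 and stores the last MLID in *endlid when there is a range
 * to dump, 0 when the table has no entries, -1 when top is illegal.
 */
static int mcast_end_lid(unsigned cap, unsigned top, unsigned *endlid)
{
	unsigned last_cap;

	if (cap == 0)
		return 0;
	last_cap = MIN_MCAST_LID + cap - 1;

	if (top == 0) {			/* top not in use: fall back to cap */
		*endlid = last_cap;
		return 1;
	}
	if (top == MCAST_TOP_NONE)	/* explicit "no entries" value */
		return 0;
	if (top < MIN_MCAST_LID || top > last_cap)
		return -1;		/* outside the switch's MLID range */
	*endlid = top;
	return 1;
}
```

With a capability of 1024 entries, a top of 0xc00f limits the dump to 0xc000 through 0xc00f, while 0xffff is reported as illegal because it lies beyond 0xc3ff.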
Sasha > > if (!startlid) > startlid = IB_MIN_MCAST_LID; > @@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid, > printf(" MLid\n"); > } > if (ibverbose) > - printf("Switch multicast mlid capability is %d\n", cap); > + printf("Switch multicast mlid capability is %d top is %d\n", > + cap, top); > > chunks = ALIGN(nports + 1, 16) / 16; > > From sashak at voltaire.com Sun Aug 30 05:00:11 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 15:00:11 +0300 Subject: [ofa-general] Re: osm_link_mgr.c:link_mgr_get_smsl question In-Reply-To: References: <20090829204508.GH21238@me> Message-ID: <20090830120011.GG21909@me> On 07:32 Sun 30 Aug , Hal Rosenstock wrote: > > > > > > osm_link_mgr.c:link_mgr_get_smsl has the following: > > > > > > /* Find osm_port of the source = p_physp */ > > > slid = osm_physp_get_base_lid(p_physp); > > > p_src_port = > > > cl_ptr_vector_get(&sm->p_subn->port_lid_tbl, > > cl_ntoh16(slid)); > > > > > > /* Call lash to find proper SL */ > > > sl = osm_get_lash_sl(p_osm, p_src_port, p_sm_port); > > > > > > It may be that this code is invoked prior to the LID being assigned > > > > How is it possible? In the code I can see that link_mgr_process() is > > always executed after lid_mgr run. > > When nodes use gPXE, the LID is not passed from the gPXE to the Linux > environment. How is it related to gPXE? OpenSM's lid manager runs and assigns lids to all available endports, only after this link manager runs and try with SMSL - at this point all lids should be in place and p_subn->port_lid_tbl should be fine. Am I missing something? 
Sasha From sashak at voltaire.com Sun Aug 30 05:04:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 15:04:55 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Add infrastructure support for MulticastFDBTop In-Reply-To: <20090826140450.GC19158@comcast.net> References: <20090826140450.GC19158@comcast.net> Message-ID: <20090830120455.GH21909@me> On 10:04 Wed 26 Aug , Hal Rosenstock wrote: > @@ -5899,6 +5899,8 @@ typedef struct _ib_switch_info { > ib_net16_t lids_per_port; > ib_net16_t enforce_cap; > uint8_t flags; > + uint8_t resvd; > + ib_net16_t mcast_top; > } PACK_SUFFIX ib_switch_info_t; > #include > /************/ > @@ -5908,7 +5910,7 @@ typedef struct _ib_switch_info_record { > ib_net16_t lid; > uint16_t resv0; > ib_switch_info_t switch_info; > - uint8_t pad[3]; > + uint8_t pad[1]; Why should be pad[1] here? In struct switch_info you are adding three bytes (resvd - 1 and mcast_top - 2), no? Sasha From sashak at voltaire.com Sun Aug 30 05:11:57 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 15:11:57 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags: Fix IB network discovery from switch node. In-Reply-To: <4A9548AA.4020900@gmail.com> References: <4A9548AA.4020900@gmail.com> Message-ID: <20090830121157.GI21909@me> On 17:37 Wed 26 Aug , Eli Dorfman (Voltaire) wrote: > Subject: [PATCH] Fix IB network discovery from switch node. > > Signed-off-by: Eli Dorfman Applied. Thanks. Please next time add descriptive change log to your patches. 
Sasha From sashak at voltaire.com Sun Aug 30 05:19:05 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 15:19:05 +0300 Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: Add CounterSelect2 field to PortCounters attribute In-Reply-To: <20090826155447.GA25235@comcast.net> References: <20090826155447.GA25235@comcast.net> Message-ID: <20090830121905.GJ21909@me> On 11:54 Wed 26 Aug , Hal Rosenstock wrote: > > Per MgtWG RefID #4527 > > Also, cosmetic commentary change > > Signed-off-by: Hal Rosenstock Applied. Thanks. Next time could you add more descriptive change log to your patches - "RefID #4527" by itself doesn't say a lot (and RefID texts is available only in member area of IBTA site). Sasha From hal.rosenstock at gmail.com Sun Aug 30 05:20:33 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 30 Aug 2009 08:20:33 -0400 Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: Add CounterSelect2 field to PortCounters attribute In-Reply-To: <20090830121905.GJ21909@me> References: <20090826155447.GA25235@comcast.net> <20090830121905.GJ21909@me> Message-ID: On 8/30/09, Sasha Khapyorsky wrote: > > On 11:54 Wed 26 Aug , Hal Rosenstock wrote: > > > > Per MgtWG RefID #4527 > > > > Also, cosmetic commentary change > > > > Signed-off-by: Hal Rosenstock > > Applied. Thanks. > > Next time could you add more descriptive change log to your patches - > "RefID #4527" by itself doesn't say a lot (and RefID texts is available > only in member area of IBTA site). There is a public version now. -- Hal Sasha 
From hal.rosenstock at gmail.com Sun Aug 30 05:22:09 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 30 Aug 2009 08:22:09 -0400 Subject: [ofa-general] Re: [PATCH] opensm: Add infrastructure support for MulticastFDBTop In-Reply-To: <20090830120455.GH21909@me> References: <20090826140450.GC19158@comcast.net> <20090830120455.GH21909@me> Message-ID: On 8/30/09, Sasha Khapyorsky wrote: > > On 10:04 Wed 26 Aug , Hal Rosenstock wrote: > > @@ -5899,6 +5899,8 @@ typedef struct _ib_switch_info { > > ib_net16_t lids_per_port; > > ib_net16_t enforce_cap; > > uint8_t flags; > > + uint8_t resvd; > > + ib_net16_t mcast_top; > > } PACK_SUFFIX ib_switch_info_t; > > #include > > /************/ > > @@ -5908,7 +5910,7 @@ typedef struct _ib_switch_info_record { > > ib_net16_t lid; > > uint16_t resv0; > > ib_switch_info_t switch_info; > > - uint8_t pad[3]; > > + uint8_t pad[1]; > > Why should be pad[1] here? In struct switch_info you are adding three > bytes (resvd - 1 and mcast_top - 2), no? Good catch. It was due to an initial version which didn't have the 16 bit MFTTop alignment. Do you want a v2 patch for this ? -- Hal Sasha From sashak at voltaire.com Sun Aug 30 05:26:54 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 15:26:54 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags: Fix IB network discovery from switch node. 
In-Reply-To: <20090830121157.GI21909@me> References: <4A9548AA.4020900@gmail.com> <20090830121157.GI21909@me> Message-ID: <20090830122654.GK21909@me> On 15:11 Sun 30 Aug , Sasha Khapyorsky wrote: > On 17:37 Wed 26 Aug , Eli Dorfman (Voltaire) wrote: > > Subject: [PATCH] Fix IB network discovery from switch node. > > > > Signed-off-by: Eli Dorfman > > Applied. Thanks. BTW, I had to rebase the patch against master. Sasha From sashak at voltaire.com Sun Aug 30 05:30:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 15:30:21 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/perfquery.c: Indicate whether PortXmitWait counter is supported In-Reply-To: <20090826161223.GA30257@comcast.net> References: <20090826161223.GA30257@comcast.net> Message-ID: <20090830123021.GL21909@me> On 12:12 Wed 26 Aug , Hal Rosenstock wrote: > > Indicate extended v. (normal) port counters in output > Also, some cosmetic formatting changes and commentary typo fixed > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 30 05:31:05 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 15:31:05 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/libibnetdisc: add missing '\n' to error message In-Reply-To: <20090826102957.bed66987.weiny2@llnl.gov> References: <20090826102957.bed66987.weiny2@llnl.gov> Message-ID: <20090830123105.GM21909@me> On 10:29 Wed 26 Aug , Ira Weiny wrote: > > From: Ira Weiny > Date: Fri, 21 Aug 2009 15:01:00 -0700 > Subject: [PATCH] infiniband-diags/libibnetdisc: add missing '\n' to error message > > > Signed-off-by: Ira Weiny Applied. Thanks. 
Sasha From sashak at voltaire.com Sun Aug 30 05:32:24 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 15:32:24 +0300 Subject: [ofa-general] Re: [PATCH] libibnetdisc: add retract_dpath function In-Reply-To: <20090826103142.660ac83b.weiny2@llnl.gov> References: <20090826103142.660ac83b.weiny2@llnl.gov> Message-ID: <20090830123224.GN21909@me> On 10:31 Wed 26 Aug , Ira Weiny wrote: > > From: Ira Weiny > Date: Wed, 26 Aug 2009 09:25:00 -0700 > Subject: [PATCH] libibnetdisc: add retract_dpath function > > When using combined routing some switches do not handle Hop Count of 0 > well. Detect when the drpath count is 0 and return to lid based > routing in this case. > > Signed-off-by: Ira Weiny Applied. Thanks. Sasha From sashak at voltaire.com Sun Aug 30 05:35:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 15:35:25 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Add infrastructure support for MulticastFDBTop In-Reply-To: References: <20090826140450.GC19158@comcast.net> <20090830120455.GH21909@me> Message-ID: <20090830123525.GO21909@me> On 08:22 Sun 30 Aug , Hal Rosenstock wrote: > > > > Why should be pad[1] here? In struct switch_info you are adding three > > bytes (resvd - 1 and mcast_top - 2), no? > > > Good catch. It was due to an initial version which didn't have the 16 bit > MFTTop alignment. Do you want a v2 patch for this ? Yes please. 
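The size arithmetic behind the pad[3] / pad[1] exchange can be checked with a small sketch. The structs below are hypothetical stand-ins, not the real opensm definitions: the fields ahead of lids_per_port are assumed, uint16_t stands in for ib_net16_t, and a GCC/Clang packed attribute stands in for PACK_SUFFIX. Under those assumptions, adding resvd (1 byte) and mcast_top (2 bytes) grows the switch info by exactly 3 bytes, so the record keeps its old size only if the whole 3-byte pad is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PACK_SUFFIX; assumes a GCC/Clang-compatible compiler. */
#define PACKED __attribute__((packed))

/* Hypothetical layout before the patch (leading fields are assumed). */
struct sw_info_old {
	uint16_t lin_cap, rand_cap, mcast_cap, lin_top;
	uint8_t def_port, def_mcast_pri_port, def_mcast_not_port, life_state;
	uint16_t lids_per_port, enforce_cap;
	uint8_t flags;
} PACKED;

/* Same layout with the two fields the patch adds. */
struct sw_info_new {
	uint16_t lin_cap, rand_cap, mcast_cap, lin_top;
	uint8_t def_port, def_mcast_pri_port, def_mcast_not_port, life_state;
	uint16_t lids_per_port, enforce_cap;
	uint8_t flags;
	uint8_t resvd;      /* +1 byte  */
	uint16_t mcast_top; /* +2 bytes */
} PACKED;

/* Record before the patch: 3 pad bytes follow the switch info. */
struct rec_old {
	uint16_t lid, resv0;
	struct sw_info_old si;
	uint8_t pad[3];
} PACKED;

/* v1 of the patch: one pad byte left over, so the record grows. */
struct rec_v1 {
	uint16_t lid, resv0;
	struct sw_info_new si;
	uint8_t pad[1];
} PACKED;

/* v2 of the patch: pad removed, record size unchanged. */
struct rec_v2 {
	uint16_t lid, resv0;
	struct sw_info_new si;
} PACKED;

/* Compile-time checks of the arithmetic discussed in the thread. */
_Static_assert(sizeof(struct sw_info_new) == sizeof(struct sw_info_old) + 3,
	       "resvd + mcast_top add three bytes");
_Static_assert(sizeof(struct rec_v1) == sizeof(struct rec_old) + 1,
	       "keeping pad[1] makes the record one byte larger");
_Static_assert(sizeof(struct rec_v2) == sizeof(struct rec_old),
	       "dropping the pad keeps the record size");
```

Checks of this kind, placed next to a packed wire-format struct, turn a miscounted pad into a build failure instead of a review comment.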
Sasha From sashak at voltaire.com Sun Aug 30 05:43:16 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 15:43:16 +0300 Subject: [ofa-general] [PATCH] libibnetdisc: fix compilation warning In-Reply-To: <20090826103142.660ac83b.weiny2@llnl.gov> References: <20090826103142.660ac83b.weiny2@llnl.gov> Message-ID: <20090830124316.GP21909@me> Newly introduced retract_dpath() was declared as int but no any value was returned, this resulted in this warning: src/ibnetdisc.c: In function ‘retract_dpath’: src/ibnetdisc.c:186: warning: control reaches end of non-void function Fixing this by declaring retract_dpath() as void. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 0f6fc55..97e369c 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -175,7 +175,7 @@ static int add_port_to_dpath(ib_dr_path_t * path, int nextport) return path->cnt; } -static int retract_dpath(ib_portid_t * path) +static void retract_dpath(ib_portid_t * path) { path->drpath.cnt--; /* restore path */ if (path->drpath.cnt == 0 && path->lid) { -- 1.6.4.1 From hal.rosenstock at gmail.com Sun Aug 30 05:42:00 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 30 Aug 2009 08:42:00 -0400 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add support for MulticastFDBTop In-Reply-To: <20090830115316.GF21909@me> References: <20090826140350.GB19158@comcast.net> <20090830115316.GF21909@me> Message-ID: On 8/30/09, Sasha Khapyorsky wrote: > On 10:03 Wed 26 Aug , Hal Rosenstock wrote: > > > > Add support for SwitchInfo:MulticastFDBTop > > Added by MgtWG errata #4505-4508 and #4640 > > > > If MulticastFDBTop is set to other than 0, only fetch > MulticastForwardingTable > > blocks up through MulticastFDBTop rather 
than MulticastFDBCap > > > > If MulticastFDBTop is set to 0xbfff, this means no entries (per #4640) > > > > Signed-off-by: Hal Rosenstock > > --- > > diff --git a/infiniband-diags/src/ibroute.c > b/infiniband-diags/src/ibroute.c > > index 106c934..f3ebe56 100644 > > --- a/infiniband-diags/src/ibroute.c > > +++ b/infiniband-diags/src/ibroute.c > > @@ -1,5 +1,6 @@ > > /* > > * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. > > + * Copyright (c) 2009 Mellanox Technologies LTD. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > * licenses. You may choose to be licensed under the terms of the GNU > > @@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, > unsigned startlid, > > char *s; > > uint64_t nodeguid; > > uint32_t mod; > > - unsigned block, i, j, e, nports, cap, chunks, startblock, > lastblock; > > + unsigned block, i, j, e, nports, cap, top, chunks, > > + startblock, lastblock; > > int n = 0; > > > > if ((s = check_switch(portid, &nports, &nodeguid, sw, nd))) > > return s; > > > > mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap); > > + mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top); > > > > if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1) > > endlid = IB_MIN_MCAST_LID + cap - 1; > > + if (!dump_all && top && top < endlid) { > > + if (top < IB_MIN_MCAST_LID - 1 || top == 0xffff) > > I don't understand what does this "top == 0xffff" check? MFTTop is only allowed up to 0xfffe so it's the max but I now see that gets checked later where endlid > IB_MAX_MCAST_LID. > Shouldn't be something like > > (top > IB_MIN_MCAST_LID + cap - 1 && top != 0xbfff) > > instead? Yes. > > + IBWARN("illegal top mlid %x", top); > > + else > > + endlid = top; > > + } > > And where is the case of "no entries" (top = 0xbfff) handled (as > declared in change log)? This is handled by the block loop inside of dump_multicast_tables. 
-- Hal > Sasha > > > > if (!startlid) > > startlid = IB_MIN_MCAST_LID; > > @@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, > unsigned startlid, > > printf(" MLid\n"); > > } > > if (ibverbose) > > - printf("Switch multicast mlid capability is %d\n", cap); > > + printf("Switch multicast mlid capability is %d top is > %d\n", > > + cap, top); > > > > chunks = ALIGN(nports + 1, 16) / 16; > > > > From hnrose at comcast.net Sun Aug 30 05:51:50 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 30 Aug 2009 08:51:50 -0400 Subject: [ofa-general] [PATCHv2] opensm: Add infrastructure support for MulticastFDBTop Message-ID: <20090830125150.GA2079@comcast.net> Add support for SwitchInfo:MulticastFDBTop Added by MgtWG errata #4505-4508 Add OpenSM infrastructure support to ib_types.h and osm_helper.c Signed-off-by: Hal Rosenstock --- Changes since v1: Removed erroneous pad byte left remaining in ib_switch_info_record_t diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index fe3f051..9e38a6d 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2009 HNR Consulting. All rights reserved. 
* @@ -4492,7 +4492,7 @@ typedef struct _ib_port_info { #define IB_PORT_CAP_HAS_LINK_SPEED_WIDTH_PAIRS_TBL (CL_HTON32(0x08000000)) #define IB_PORT_CAP_RESV28 (CL_HTON32(0x10000000)) #define IB_PORT_CAP_RESV29 (CL_HTON32(0x20000000)) -#define IB_PORT_CAP_RESV30 (CL_HTON32(0x40000000)) +#define IB_PORT_CAP_HAS_MCAST_FDB_TOP (CL_HTON32(0x40000000)) #define IB_PORT_CAP_RESV31 (CL_HTON32(0x80000000)) /****f* IBA Base: Types/ib_port_info_get_port_state @@ -5899,6 +5899,8 @@ typedef struct _ib_switch_info { ib_net16_t lids_per_port; ib_net16_t enforce_cap; uint8_t flags; + uint8_t resvd; + ib_net16_t mcast_top; } PACK_SUFFIX ib_switch_info_t; #include /************/ @@ -5908,7 +5910,6 @@ typedef struct _ib_switch_info_record { ib_net16_t lid; uint16_t resv0; ib_switch_info_t switch_info; - uint8_t pad[3]; } PACK_SUFFIX ib_switch_info_record_t; #include diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index 3692474..b5a29c2 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2009 HNR Consulting. All rights reserved. * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. 
@@ -764,9 +764,9 @@ static void dbg_get_capabilities_str(IN char *p_buf, IN const uint32_t buf_size, &total_len) != IB_SUCCESS) return; } - if (p_pi->capability_mask & IB_PORT_CAP_RESV30) { + if (p_pi->capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) { if (dbg_do_line(&p_local, buf_size, p_prefix_str, - "IB_PORT_CAP_RESV30\n", + "IB_PORT_CAP_HAS_MCAST_FDB_TOP\n", &total_len) != IB_SUCCESS) return; } @@ -1512,7 +1512,8 @@ void osm_dump_switch_info(IN osm_log_t * p_log, "\t\t\t\tlife_state..............0x%X\n" "\t\t\t\tlids_per_port...........%u\n" "\t\t\t\tpartition_enf_cap.......0x%X\n" - "\t\t\t\tflags...................0x%X\n", + "\t\t\t\tflags...................0x%X\n" + "\t\t\t\tmcast_top...............0x%X\n", cl_ntoh16(p_si->lin_cap), cl_ntoh16(p_si->rand_cap), cl_ntoh16(p_si->mcast_cap), @@ -1522,7 +1523,8 @@ void osm_dump_switch_info(IN osm_log_t * p_log, p_si->def_mcast_not_port, p_si->life_state, cl_ntoh16(p_si->lids_per_port), - cl_ntoh16(p_si->enforce_cap), p_si->flags); + cl_ntoh16(p_si->enforce_cap), p_si->flags, + cl_ntoh16(p_si->mcast_top)); } } From hnrose at comcast.net Sun Aug 30 06:25:49 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 30 Aug 2009 09:25:49 -0400 Subject: [ofa-general] [PATCHv2] infiniband-diags/ibroute: Add support for MulticastFDBTop Message-ID: <20090830132549.GA13950@comcast.net> Add support for SwitchInfo:MulticastFDBTop Added by MgtWG errata #4505-4508 and 4640 If MulticastFDBTop set to other than 0, only fetch MulticastForwardingTable blocks up through MulticastFDBTop rather than MulticastFDBCap If MulticastFDBTop set to 0xbfff, this means no entries (per 4640) Signed-off-by: Hal Rosenstock --- Changes since v1: Fixed top range check diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c index 106c934..1112b87 100644 --- a/infiniband-diags/src/ibroute.c +++ b/infiniband-diags/src/ibroute.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. 
+ * Copyright (c) 2009 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid, char *s; uint64_t nodeguid; uint32_t mod; - unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock; + unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock, + top; int n = 0; if ((s = check_switch(portid, &nports, &nodeguid, sw, nd))) return s; mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap); + mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top); if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1) endlid = IB_MIN_MCAST_LID + cap - 1; + if (!dump_all && top && top < endlid) { + if (top < IB_MIN_MCAST_LID - 1 || top > IB_MIN_MCAST_LID + cap - 1) + IBWARN("illegal top mlid %x", top); + else + endlid = top; + } if (!startlid) startlid = IB_MIN_MCAST_LID; @@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid, printf(" MLid\n"); } if (ibverbose) - printf("Switch multicast mlid capability is %d\n", cap); + printf("Switch multicast mlid capability is %d top is 0x%x\n", + cap, top); chunks = ALIGN(nports + 1, 16) / 16; From sashak at voltaire.com Sun Aug 30 07:16:48 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 17:16:48 +0300 Subject: [ofa-general] Re: [PATCHv2] opensm: Add infrastructure support for MulticastFDBTop In-Reply-To: <20090830125150.GA2079@comcast.net> References: <20090830125150.GA2079@comcast.net> Message-ID: <20090830141648.GA15546@me> On 08:51 Sun 30 Aug , Hal Rosenstock wrote: > > Add support for SwitchInfo:MulticastFDBTop > Added by MgtWG errata #4505-4508 > > Add OpenSM infrastructure support to ib_types.h and osm_helper.c > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From jackm at dev.mellanox.co.il Sun Aug 30 08:35:13 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 30 Aug 2009 18:35:13 +0300 Subject: [ofa-general] Fwd: OFED-1.5-alpha4 installation problem In-Reply-To: References: Message-ID: <200908301835.13593.jackm@dev.mellanox.co.il> On Wednesday 26 August 2009 13:17, Sneha Mistry wrote: > Hi, > > I am new be to Infiniband and trying to install OFED-1.5-alpha4 on > opensuse 10.3 . > Kernel version is  2.6.26-2-686 . 1. OFED 1.5 is not supported on OpenSuse 10.3 -- it is supported on OpenSuse 11. 2. You are correct in that the release notes indicate that 10.3 is supported -- this was an oversight, which will be corrected in the next OFED 1.5 release candidate (the notes will then indicate support for OpenSuse 11, not 10.3). 3. The kernel you are running is evidently 2.6.22.5-31 (from the log below), not 2.6.26-2-686. This is indeed the OpenSuse 10.3 kernel. > But it gives me error  message. > > Failed to build ofa_kernel RPM > See /tmp/OFED.29482.logs/ofa_kernel.rpmbuild.log > > I checked release note it says suse 10.3 is supported. > > Output of uname -a is > Linux linux-ljhr 2.6.22.5-31-default #1 SMP 2007/09/21 22:29:00 UTC > i686 i686 i386 GNU/Linux > > Last few line of log is as given. 
> > make[1]: Entering directory `/usr/src/linux-2.6.22.5-31-obj/i386/default' > make -C ../../../linux-2.6.22.5-31 > O=../linux-2.6.22.5-31-obj/i386/default modules > make -C /usr/src/linux-2.6.22.5-31-obj/i386/default \ > KBUILD_SRC=/usr/src/linux-2.6.22.5-31 \ > KBUILD_EXTMOD="/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5" -f > /usr/src/linux-2.6.22.5-31/Makefile modules > test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( \ > echo; \ > echo " ERROR: Kernel configuration is invalid."; \ > echo " include/linux/autoconf.h or include/config/auto.conf > are missing."; \ > echo " Run 'make oldconfig && make prepare' on kernel src to > fix it."; \ > echo; \ > /bin/false) > mkdir -p /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions > rm -f /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/.tmp_versions/* > make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build > obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5 > make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build > obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband > make -f /usr/src/linux-2.6.22.5-31/scripts/Makefile.build > obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core > gcc -m32 -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/.addr.o.d > -nostdinc -isystem /usr/lib/gcc/i586-suse-linux/4.2.1/include > -D__KERNEL__ \ > -D__OFED_BUILD__ \ > -include include/linux/autoconf.h \ > -include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/linux/autoconf.h \ > -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.22_suse10_3/include/ > \ > \ > \ > -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include \ > -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/debug \ > -I/usr/local/include/scst \ > -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/ulp/srpt \ > -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/net/cxgb3 \ > -Iinclude \ > -Iinclude2 -I/usr/src/linux-2.6.22.5-31/include \ > 
-I/usr/src/linux-2.6.22.5-31/arch//include \ > -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core > -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs > -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common > -Os -pipe -msoft-float -mregparm=3 -freg-struct-return > -mpreferred-stack-boundary=2 -march=i586 -mtune=generic -ffreestanding > -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 > -DCONFIG_AS_CFI_SIGNAL_FRAME=1 > -I/usr/src/linux-2.6.22.5-31/include/asm-i386/mach-generic > -Iinclude/asm-i386/mach-generic > -I/usr/src/linux-2.6.22.5-31/include/asm-i386/mach-default > -Iinclude/asm-i386/mach-default -fomit-frame-pointer -g > -fno-stack-protector -Wdeclaration-after-statement -Wno-pointer-sign > -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(addr)" > -D"KBUILD_MODNAME=KBUILD_STR(ib_addr)" -c -o > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/.tmp_addr.o > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c > In file included from > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_addr.h:41, > from > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:46: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In > function ‘ib_dma_mapping_error’: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1677: > warning: passing argument 1 of ‘dma_mapping_error’ makes integer from > pointer without a cast > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1677: > error: too many arguments to function ‘dma_mapping_error’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1716: > warning: ‘struct dma_attrs’ declared inside parameter list > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1716: > warning: its scope is only this definition or declaration, which is > probably not what you want > 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In > function ‘ib_dma_map_single_attrs’: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1718: > error: implicit declaration of function ‘dma_map_single_attrs’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1725: > warning: ‘struct dma_attrs’ declared inside parameter list > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In > function ‘ib_dma_unmap_single_attrs’: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1727: > error: implicit declaration of function ‘dma_unmap_single_attrs’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1728: > warning: ‘return’ with a value, in function returning void > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1803: > warning: ‘struct dma_attrs’ declared inside parameter list > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In > function ‘ib_dma_map_sg_attrs’: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1805: > error: implicit declaration of function ‘dma_map_sg_attrs’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: At top level: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1811: > warning: ‘struct dma_attrs’ declared inside parameter list > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h: In > function ‘ib_dma_unmap_sg_attrs’: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/rdma/ib_verbs.h:1813: > error: implicit declaration of function ‘dma_unmap_sg_attrs’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c: > In function ‘rdma_translate_ip’: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122: > error: ‘init_net’ undeclared 
(first use in this function) > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122: > error: (Each undeclared identifier is reported only once > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:122: > error: for each function it appears in.) > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:123: > error: too many arguments to function ‘ip_dev_find’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134:33: > error: macro "for_each_netdev" passed 2 arguments, but takes just 1 > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134: > error: ‘for_each_netdev’ undeclared (first use in this function) > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:134: > error: expected ‘;’ before ‘{’ token > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c: > In function ‘addr_send_arp’: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191: > error: ‘init_net’ undeclared (first use in this function) > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191: > warning: passing argument 2 of ‘ip_route_output_key’ from incompatible > pointer type > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:191: > error: too many arguments to function ‘ip_route_output_key’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:206: > error: too many arguments to function ‘ip6_route_output’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c: > In function ‘addr4_resolve_remote’: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232: > error: ‘init_net’ undeclared (first use in this function) > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232: > warning: passing argument 2 of ‘ip_route_output_key’ from incompatible > pointer type > 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:232: > error: too many arguments to function ‘ip_route_output_key’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c: > In function ‘addr6_resolve_remote’: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:281: > error: ‘init_net’ undeclared (first use in this function) > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:281: > error: too many arguments to function ‘ip6_route_output’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c: > In function ‘addr_resolve_local’: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:368: > error: ‘init_net’ undeclared (first use in this function) > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:368: > error: too many arguments to function ‘ip_dev_find’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:372: > error: implicit declaration of function ‘ipv4_is_zeronet’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:376: > error: implicit declaration of function ‘ipv4_is_loopback’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:394:33: > error: macro "for_each_netdev" passed 2 arguments, but takes just 1 > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:394: > error: ‘for_each_netdev’ undeclared (first use in this function) > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:395: > error: expected ‘;’ before ‘if’ > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.c:410: > error: implicit declaration of function ‘ipv6_addr_loopback’ > make[6]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core/addr.o] > Error 1 > make[5]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/core] > Error 2 > make[4]: *** 
[/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband] > Error 2 > make[3]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5] Error 2 > make[2]: *** [modules] Error 2 > make[1]: *** [modules] Error 2 > make[1]: Leaving directory `/usr/src/linux-2.6.22.5-31-obj/i386/default' > make: *** [kernel] Error 2 > error: Bad exit status from /var/tmp/rpm-tmp.64786 (%build) > > > RPM build errors: > user vlad does not exist - using root > group vlad does not exist - using root > user vlad does not exist - using root > group vlad does not exist - using root > Bad exit status from /var/tmp/rpm-tmp.64786 (%build) > > Regards, > sgm > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Sun Aug 30 08:36:19 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 18:36:19 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add support for MulticastFDBTop In-Reply-To: References: <20090826140350.GB19158@comcast.net> <20090830115316.GF21909@me> Message-ID: <20090830153619.GB15546@me> On 08:42 Sun 30 Aug , Hal Rosenstock wrote: > > This is handled by the block loop inside of dump_multicast_tables. Where? I don't see this. Shouldn't it show nothing ("no entries") when top = 0xbfff and dump_all is not set?
Sasha From sashak at voltaire.com Sun Aug 30 08:40:38 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 30 Aug 2009 18:40:38 +0300 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/ibroute: Add support for MulticastFDBTop In-Reply-To: <20090830132549.GA13950@comcast.net> References: <20090830132549.GA13950@comcast.net> Message-ID: <20090830154038.GC15546@me> On 09:25 Sun 30 Aug , Hal Rosenstock wrote: > > Add support for SwitchInfo:MulticastFDBTop > Added by MgtWG errata #4505-4508 and 4640 > > If MulticastFDBTop set to other than 0, only fetch MulticastForwardingTable > blocks up through MulticastFDBTop rather than MulticastFDBCap > > If MulticastFDBTop set to 0xbfff, this means no entries (per 4640) > > Signed-off-by: Hal Rosenstock > --- > Changes since v1: > Fixed top range check > > diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c > index 106c934..1112b87 100644 > --- a/infiniband-diags/src/ibroute.c > +++ b/infiniband-diags/src/ibroute.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. > + * Copyright (c) 2009 Mellanox Technologies LTD. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid, > char *s; > uint64_t nodeguid; > uint32_t mod; > - unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock; > + unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock, > + top; > int n = 0; > > if ((s = check_switch(portid, &nports, &nodeguid, sw, nd))) > return s; > > mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap); > + mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top); > > if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1) > endlid = IB_MIN_MCAST_LID + cap - 1; > + if (!dump_all && top && top < endlid) { > + if (top < IB_MIN_MCAST_LID - 1 || top > IB_MIN_MCAST_LID + cap - 1) Looking at this more, it seems to me that the test 'top > IB_MIN_MCAST_LID + cap - 1' will never be true (and so is not needed): it is evaluated only when top < endlid, and endlid is clamped one line earlier to be at most IB_MIN_MCAST_LID + cap - 1.
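The point can be checked outside of ibroute with a small standalone sketch. It assumes IB_MIN_MCAST_LID is 0xc000; the helper name clamp_endlid is hypothetical and only mirrors the clamping lines quoted above, with the redundant clause turned into an assert.

```c
#include <assert.h>
#include <stdio.h>

#define IB_MIN_MCAST_LID 0xc000	/* assumed value: first multicast LID */

/*
 * Hypothetical helper (not part of ibroute): the endlid clamping from
 * dump_multicast_tables(), lifted out so it can be exercised on its own.
 * Returns the end LID that would be used for the dump.
 */
static unsigned clamp_endlid(unsigned endlid, unsigned cap, unsigned top,
			     int dump_all)
{
	unsigned max_lid = IB_MIN_MCAST_LID + cap - 1;

	/* endlid is forced into the range allowed by MulticastFDBCap */
	if (!endlid || endlid > max_lid)
		endlid = max_lid;

	if (!dump_all && top && top < endlid) {
		/* top < endlid <= max_lid, so "top > max_lid" cannot hold here */
		assert(top <= max_lid);
		if (top < IB_MIN_MCAST_LID - 1)
			fprintf(stderr, "illegal top mlid %x\n", top);
		else
			endlid = top;
	}
	return endlid;
}
```

For example, clamp_endlid(0xc030, 1024, 0xbfff, 0) returns 0xbfff, while the same call with dump_all set returns 0xc030 unchanged.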
Sasha > + IBWARN("illegal top mlid %x", top); > + else > + endlid = top; > + } > > if (!startlid) > startlid = IB_MIN_MCAST_LID; > @@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid, > printf(" MLid\n"); > } > if (ibverbose) > - printf("Switch multicast mlid capability is %d\n", cap); > + printf("Switch multicast mlid capability is %d top is 0x%x\n", > + cap, top); > > chunks = ALIGN(nports + 1, 16) / 16; > > From jackm at dev.mellanox.co.il Sun Aug 30 08:45:38 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 30 Aug 2009 18:45:38 +0300 Subject: [ofa-general] Number of devices returned by ibv_get_device_list() In-Reply-To: <122E98244B88344D9AFE4F6AFF09706316F0F2AB@BL2PRD0102MB012.prod.exchangelabs.com> References: <122E98244B88344D9AFE4F6AFF09706316F0F295@BL2PRD0102MB012.prod.exchangelabs.com> <122E98244B88344D9AFE4F6AFF09706316F0F2AB@BL2PRD0102MB012.prod.exchangelabs.com> Message-ID: <200908301845.38642.jackm@dev.mellanox.co.il> On Wednesday 26 August 2009 02:03, MANIKANTAN KALAIYA wrote: > Resending to the mailing list... > > We have Ofed1.3.1 installed, one of the sub packages is libibverbs version 1.1.1. We have a small program that lists the number of IB cards available in the system through ibv_get_device_list(). See below for the sample code. libibverbs reads the number of devices ONCE, at calling process startup (as part of its initialization). To get a new device count, you need to restart your program. - Jack > The system has two IB cards, the value returned by ibv_get_device_list() in 'num_devices' is two, as expected. > > However, when we disable one of the cards using the modprobe command, the program continues to return two cards present (monitoring is continuous in a while loop). > Killing and restarting the sample test process results in reporting correct number of IB cards available (returns one after it is restarted). 
One of the prior versions was known to report the correct number of IB cards without requiring to restart the program. > > We would like to determine the number of cards present without having to go through a restart. Any inputs on this behavior is appreciated. > > modprobe command - "sudo modprobe -r ib_mthca" > > Test program: > ================================================= > #include > #include > > int main(int argc, char **argv) > { > int ret, num_devices; > struct ibv_device **dev_list; > > while(1) { > > dev_list = ibv_get_device_list(&num_devices); > > if (num_devices != 0) { > printf("IB ADAPTER AVAILABLE:%d\n", num_devices); > } > else { > printf("IB ADAPTER UNAVAILABLE\n"); > } > sleep(2); > ibv_free_device_list(dev_list); > } > > return(0); > } > ================================================= > > Thanks, > Mani. > From jackm at dev.mellanox.co.il Sun Aug 30 08:47:58 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 30 Aug 2009 18:47:58 +0300 Subject: [ofa-general] Fedora 10 OFED support plans In-Reply-To: <4A90FAD8.6000701@mellanox.co.il> References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il> Message-ID: <200908301847.59143.jackm@dev.mellanox.co.il> On Sunday 23 August 2009 11:16, Tziporet Koren wrote: > Jeremy Enos wrote: > > Coming up on a year of Fedora 10 GA... Fedora 9 no longer maintained. > > No OFED support for FC10 yet creates a tough spot if trying to stay > > secure. Is there *any* version (1.5, etc) that will even build on FC10? > > thx- > > > > Jeremy > > > > > > > > I think OFED 1.5 might work on it but not sure. Which kernel version > FC10 use? > In general OFED 1.5 supports FC11 Actually, it supports FC12 (kernel 2.6.29). 
- Jack > Tziporet > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From jackm at dev.mellanox.co.il Sun Aug 30 08:56:33 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 30 Aug 2009 18:56:33 +0300 Subject: [ofa-general] Fedora 10 OFED support plans In-Reply-To: <200908301847.59143.jackm@dev.mellanox.co.il> References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il> <200908301847.59143.jackm@dev.mellanox.co.il> Message-ID: <200908301856.33259.jackm@dev.mellanox.co.il> On Sunday 30 August 2009 18:47, Jack Morgenstein wrote: > On Sunday 23 August 2009 11:16, Tziporet Koren wrote: > > Jeremy Enos wrote: > > > Coming up on a year of Fedora 10 GA... Fedora 9 no longer maintained. > > > No OFED support for FC10 yet creates a tough spot if trying to stay > > > secure. Is there *any* version (1.5, etc) that will even build on FC10? > > > thx- > > > > > > Jeremy > > > > > > > > > > > > > I think OFED 1.5 might work on it but not sure. Which kernel version > > FC10 use? > > In general OFED 1.5 supports FC11 > Actually, it supports FC12 (kernel 2.6.29). We had originally planned to support FC11 -- however, in the interim, FC12 was released -- based on kernel 2.6.29, which is supported -- so we decided to support FC12 instead. 
-Jack > - Jack > > > Tziporet > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From hal.rosenstock at gmail.com Sun Aug 30 09:35:54 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 30 Aug 2009 12:35:54 -0400 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add support for MulticastFDBTop In-Reply-To: <20090830153619.GB15546@me> References: <20090826140350.GB19158@comcast.net> <20090830115316.GF21909@me> <20090830153619.GB15546@me> Message-ID: On 8/30/09, Sasha Khapyorsky wrote: > > On 08:42 Sun 30 Aug , Hal Rosenstock wrote: > > > > This is handled by the block loop inside of dump_multicast_tables. > > Where? I don't see this. Should not it to show nothing ("no entries") > when top = 0xbfff and dump_all is not set? Doesn't the loop: for (block = startblock; block <= lastblock; block++) terminates without any blocks read ? So it shows no entries. Do you mean to print "no entries" ? -- Hal Sasha > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hnrose at comcast.net Sun Aug 30 09:32:07 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 30 Aug 2009 12:32:07 -0400 Subject: [ofa-general] [PATCHv3] infiniband-diags/ibroute: Add support for MulticastFDBTop Message-ID: <20090830163207.GA17406@comcast.net> Add support for SwitchInfo:MulticastFDBTop Added by MgtWG errata #4505-4508 and 4640 If MulticastFDBTop set to other than 0, only fetch MulticastForwardingTable blocks up through MulticastFDBTop rather than MulticastFDBCap If MulticastFDBTop set to 0xbfff, this means no entries (per 4640) Signed-off-by: Hal Rosenstock --- Changes since v2: Removed redundant clause in top range check Changes since v1: Fixed top range check diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c index 106c934..00df1ec 100644 --- a/infiniband-diags/src/ibroute.c +++ b/infiniband-diags/src/ibroute.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. + * Copyright (c) 2009 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -140,16 +141,24 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid, char *s; uint64_t nodeguid; uint32_t mod; - unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock; + unsigned block, i, j, e, nports, cap, chunks, startblock, lastblock, + top; int n = 0; if ((s = check_switch(portid, &nports, &nodeguid, sw, nd))) return s; mad_decode_field(sw, IB_SW_MCAST_FDB_CAP_F, &cap); + mad_decode_field(sw, IB_SW_MCAST_FDB_TOP_F, &top); if (!endlid || endlid > IB_MIN_MCAST_LID + cap - 1) endlid = IB_MIN_MCAST_LID + cap - 1; + if (!dump_all && top && top < endlid) { + if (top < IB_MIN_MCAST_LID - 1) + IBWARN("illegal top mlid %x", top); + else + endlid = top; + } if (!startlid) startlid = IB_MIN_MCAST_LID; @@ -187,7 +196,8 @@ char *dump_multicast_tables(ib_portid_t * portid, unsigned startlid, printf(" MLid\n"); } if (ibverbose) - printf("Switch multicast mlid capability is %d\n", cap); + printf("Switch multicast mlid capability is %d top is 0x%x\n", + cap, top); chunks = ALIGN(nports + 1, 16) / 16; From jenos at ncsa.uiuc.edu Sun Aug 30 10:41:01 2009 From: jenos at ncsa.uiuc.edu (Jeremy Enos) Date: Sun, 30 Aug 2009 12:41:01 -0500 Subject: [ofa-general] Fedora 10 OFED support plans In-Reply-To: <200908301856.33259.jackm@dev.mellanox.co.il> References: <4A8E4854.2060909@ncsa.uiuc.edu> <4A90FAD8.6000701@mellanox.co.il> <200908301847.59143.jackm@dev.mellanox.co.il> <200908301856.33259.jackm@dev.mellanox.co.il> Message-ID: <4A9AB9AD.80803@ncsa.uiuc.edu> Is it supposed to support FC10 as well then, or just fc12? Actually- it wouldn't matter if I couldn't use 1.5. I just want *some* version that supports FC10. Is there one? thx- Jeremy Jack Morgenstein wrote: > On Sunday 30 August 2009 18:47, Jack Morgenstein wrote: > >> On Sunday 23 August 2009 11:16, Tziporet Koren wrote: >> >>> Jeremy Enos wrote: >>> >>>> Coming up on a year of Fedora 10 GA... Fedora 9 no longer maintained. 
>>>> No OFED support for FC10 yet creates a tough spot if trying to stay >>>> secure. Is there *any* version (1.5, etc) that will even build on FC10? >>>> thx- >>>> >>>> Jeremy >>>> >>>> >>>> >>>> >>> I think OFED 1.5 might work on it but not sure. Which kernel version >>> FC10 use? >>> In general OFED 1.5 supports FC11 >>> >> Actually, it supports FC12 (kernel 2.6.29). >> > We had originally planned to support FC11 -- however, in the interim, FC12 was > released -- based on kernel 2.6.29, which is supported -- so we decided to support > FC12 instead. > > -Jack > > >> - Jack >> >> >>> Tziporet >>> >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> > > From gopalakk at cse.ohio-state.edu Sun Aug 30 19:38:10 2009 From: gopalakk at cse.ohio-state.edu (Karthik Gopalakrishnan) Date: Sun, 30 Aug 2009 22:38:10 -0400 Subject: [ofa-general] Number of devices returned by ibv_get_device_list() In-Reply-To: <200908301845.38642.jackm@dev.mellanox.co.il> References: <122E98244B88344D9AFE4F6AFF09706316F0F295@BL2PRD0102MB012.prod.exchangelabs.com> <122E98244B88344D9AFE4F6AFF09706316F0F2AB@BL2PRD0102MB012.prod.exchangelabs.com> <200908301845.38642.jackm@dev.mellanox.co.il> Message-ID: <92eddfb50908301938w533df6e9vb4579a538209d97@mail.gmail.com> Hi Jack. On Sun, Aug 30, 2009 at 11:45 AM, Jack Morgenstein wrote: > On Wednesday 26 August 2009 02:03, MANIKANTAN KALAIYA wrote: > > Resending to the mailing list... > > > > We have Ofed1.3.1 installed, one of the sub packages is libibverbs > version 1.1.1. We have a small program that lists the number of IB cards > available in the system through ibv_get_device_list(). See below for the > sample code. > > libibverbs reads the number of devices ONCE, at calling process startup (as > part of its initialization). 
> To get a new device count, you need to restart your program. > Does this mean PCI Hotplug is not supported for Infiniband Adapters? > > - Jack > > > The system has two IB cards, the value returned by ibv_get_device_list() > in 'num_devices' is two, as expected. > > > > However, when we disable one of the cards using the modprobe command, the > program continues to return two cards present (monitoring is continuous in a > while loop). > > Killing and restarting the sample test process results in reporting > correct number of IB cards available (returns one after it is restarted). > One of the prior versions was known to report the correct number of IB cards > without requiring to restart the program. > > > > We would like to determine the number of cards present without having to > go through a restart. Any inputs on this behavior is appreciated. > > > > modprobe command - "sudo modprobe -r ib_mthca" > > > > Test program: > > ================================================= > > #include > > #include > > > > int main(int argc, char **argv) > > { > > int ret, num_devices; > > struct ibv_device **dev_list; > > > > while(1) { > > > > dev_list = ibv_get_device_list(&num_devices); > > > > if (num_devices != 0) { > > printf("IB ADAPTER AVAILABLE:%d\n", num_devices); > > } > > else { > > printf("IB ADAPTER UNAVAILABLE\n"); > > } > > sleep(2); > > ibv_free_device_list(dev_list); > > } > > > > return(0); > > } > > ================================================= > > > > Thanks, > > Mani. > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ogerlitz at voltaire.com Mon Aug 31 00:49:54 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 31 Aug 2009 10:49:54 +0300 Subject: [ofa-general] [PATCH] IPoIB: check multicast address format In-Reply-To: <4A96ABCE.2030204@Voltaire.COM> References: <20090821000431.GA5713@obsidianresearch.com> <4A94FB67.6050600@voltaire.com> <20090826180457.GR406@obsidianresearch.com> <4A96ABCE.2030204@Voltaire.COM> Message-ID: <4A9B80A2.5010602@voltaire.com> Moni Shoua wrote: > Jason Gunthorpe wrote: >> Is this true? That is pretty ugly, but probably manageable.. > Unfortunately, losing routes is a side effect of closing the device Moni, I tend to agree with Jason's about this being OTOH ugly but OTOH manageable, maybe you can send a patch to the kernel bonding document that states to re-set non trivial routes for ipoib bonds after their initial establishment (will save you some support cases...) Or. From ogerlitz at voltaire.com Mon Aug 31 00:52:38 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 31 Aug 2009 10:52:38 +0300 Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: matching PR query to QoS level with pkey In-Reply-To: <4A94EBB5.7050107@dev.mellanox.co.il> References: <4A8D4A6F.9050404@dev.mellanox.co.il> <4A90DC04.3020906@voltaire.com> <4A910609.3040305@dev.mellanox.co.il> <4A94DE99.5050308@voltaire.com> <4A94EBB5.7050107@dev.mellanox.co.il> Message-ID: <4A9B8146.7080800@voltaire.com> Yevgeny Kliteynik wrote: > Nope, just the other way around. Yevgeny, we want to do some testing/validation to understand better what's goes on here, will get back to you soon Or. 
From jackm at dev.mellanox.co.il Mon Aug 31 02:17:43 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 31 Aug 2009 12:17:43 +0300 Subject: [ofa-general] Fedora 10 OFED support plans In-Reply-To: <4A9AB9AD.80803@ncsa.uiuc.edu> References: <4A8E4854.2060909@ncsa.uiuc.edu> <200908301856.33259.jackm@dev.mellanox.co.il> <4A9AB9AD.80803@ncsa.uiuc.edu> Message-ID: <200908311217.43954.jackm@dev.mellanox.co.il> > >>> I think OFED 1.5 might work on it but not sure. Which kernel version > >>> FC10 use? > >>> In general OFED 1.5 supports FC11 > >>> > >> Actually, it supports FC12 (kernel 2.6.29). > >> > > We had originally planned to support FC11 -- however, in the interim, FC12 was > > released -- based on kernel 2.6.29, which is supported -- so we decided to support > > FC12 instead. > > > > -Jack Actually, Tziporet is correct. FC11 is built on kernel 2.6.29.4-167. OFED 1.5 supports FC11 (I confused this with openSUSE) -- there is no FC12 support as yet. There is no support for FC10. Sorry about the mistake.
-Jack From vlad at lists.openfabrics.org Mon Aug 31 03:03:41 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 31 Aug 2009 03:03:41 -0700 (PDT) Subject: [ofa-general] ofa_1_5_kernel 20090831-0200 daily build status Message-ID: <20090831100341.8430AE30149@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git git_branch: ofed_kernel_1_5 Common build parameters: Passed: Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.27 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.16.60-0.21-smp Log: /home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': 
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.16.60-0.21-smp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.16.60-0.21-smp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: /home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-67.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-67.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: 
/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_clear_bit': /home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:301: error: implicit declaration of function 'generic___clear_le_bit' /home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c: In function 'rds_cong_test_bit': /home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.c:312: error: implicit declaration of function 'generic_test_le_bit' make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds/cong.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check/net/rds] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20090831-0200_linux-2.6.9-78.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-78.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hnrose at comcast.net Mon Aug 31 06:39:34 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Mon, 31 Aug 2009 09:39:34 -0400 Subject: [ofa-general] [PATCHv2] opensm: Parallelize (Stripe) MFT sets across switches Message-ID: <20090831133934.GA10155@comcast.net> Similar to previous patch to "Parallelize (Stripe) LFT sets across switches". Currently, MADs are pipelined to a single switch first which effectively serializes these requests. This patch pipelines the MFT set MADs across switches first (before cycling to the next MFT block) so that multiple switches can be responding concurrently. Speedup is dependent on number of MFT blocks in use (number of MLIDs) which is dependent on the number of multicast groups. 
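As a rough illustration of the ordering described above, here is a standalone sketch of the striping loop. The names sw_state, mft_req and stripe_mft_sets are hypothetical stand-ins, not OpenSM types or functions; the sketch only records the order in which (switch, block, position) requests would be issued, assuming each switch knows its highest block in use and its highest position.

```c
#include <stddef.h>

/* Hypothetical stand-in for a switch's MFT state; not an OpenSM type. */
struct sw_state {
	int max_block;		/* highest MFT block in use on this switch */
	int max_position;	/* highest port-mask position on this switch */
	int block_num;		/* next block to send */
	int position;		/* next position to send */
};

/* One queued MFT set request. */
struct mft_req {
	int sw;			/* index of the target switch */
	int block;
	int position;
};

/*
 * Fill out[] with requests in striped order: each pass over the switches
 * queues at most one request per switch, and the next block is started
 * only after every switch has finished the current one.
 * Returns the number of requests written (at most out_len).
 */
static size_t stripe_mft_sets(struct sw_state *sw, size_t nsw,
			      struct mft_req *out, size_t out_len)
{
	size_t n = 0, i;
	int block, max_block = -1, notdone;

	for (i = 0; i < nsw; i++) {
		sw[i].block_num = 0;
		sw[i].position = 0;
		if (sw[i].max_block > max_block)
			max_block = sw[i].max_block;
	}

	for (block = 0; block <= max_block; block++) {
		do {
			notdone = 0;
			for (i = 0; i < nsw; i++) {
				if (sw[i].block_num != block)
					continue;
				notdone = 1;
				/* only queue blocks this switch actually uses */
				if (block <= sw[i].max_block && n < out_len) {
					out[n].sw = (int)i;
					out[n].block = block;
					out[n].position = sw[i].position;
					n++;
				}
				/* advance to the next position, then the next block */
				if (++sw[i].position > sw[i].max_position) {
					sw[i].position = 0;
					sw[i].block_num++;
				}
			}
		} while (notdone);
	}
	return n;
}
```

With two switches, one using blocks 0-1 at a single position and the other using block 0 at two positions, the queued order alternates between the switches for block 0 before any block 1 request is issued.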
Signed-off-by: Hal Rosenstock --- Changes since v1: Fixed loop which stripes MFT block across switches Changed routine name from mcast_mgr_set_tbl to mcast_mgr_set_mft_block and added block_num and position parameters Consolidate code into mcast_mgr_set_mftables diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 7ce28c5..e281842 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -103,6 +103,8 @@ typedef struct osm_switch { uint8_t *lft; uint8_t *new_lft; osm_mcast_tbl_t mcast_tbl; + uint32_t mft_block_num; + uint32_t mft_position; unsigned endport_links; unsigned need_update; void *priv; diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c index 4dbbaa0..708d837 100644 --- a/opensm/opensm/osm_mcast_mgr.c +++ b/opensm/opensm/osm_mcast_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. 
* @@ -321,16 +321,14 @@ static osm_switch_t *mcast_mgr_find_root_switch(osm_sm_t * sm, /********************************************************************** **********************************************************************/ -static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw) +static int mcast_mgr_set_mft_block(osm_sm_t * sm, IN osm_switch_t * p_sw, + uint32_t block_num, uint32_t position) { osm_node_t *p_node; osm_dr_path_t *p_path; - osm_madw_context_t mad_context; + osm_madw_context_t context; ib_api_status_t status; - uint32_t block_id_ho = 0; - int16_t block_num = 0; - uint32_t position = 0; - uint32_t max_position; + uint32_t block_id_ho; osm_mcast_tbl_t *p_tbl; ib_net16_t block[IB_MCAST_BLOCK_SIZE]; int ret = 0; @@ -353,23 +351,25 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw) configuration. */ - mad_context.mft_context.node_guid = osm_node_get_node_guid(p_node); - mad_context.mft_context.set_method = TRUE; + context.mft_context.node_guid = osm_node_get_node_guid(p_node); + context.mft_context.set_method = TRUE; p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); - max_position = p_tbl->max_position; - while (osm_mcast_tbl_get_block(p_tbl, block_num, - (uint8_t) position, block)) { - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, - "Writing MFT block 0x%X\n", block_id_ho); + if (osm_mcast_tbl_get_block(p_tbl, block_num, + (uint8_t) position, block)) { block_id_ho = block_num + (position << 28); + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, + "Writing MFT block %u position %u to switch 0x%" PRIx64 "\n", + block_num, position, + cl_ntoh64(context.lft_context.node_guid)); + status = osm_req_set(sm, p_path, (void *)block, sizeof(block), IB_MAD_ATTR_MCAST_FWD_TBL, cl_hton32(block_id_ho), CL_DISP_MSGID_NONE, - &mad_context); + &context); if (status != IB_SUCCESS) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A02: " @@ -377,11 +377,6 @@ static int mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * p_sw) ib_get_err_str(status)); ret = -1; } - - if 
(++position > max_position) { - position = 0; - block_num++; - } } OSM_LOG_EXIT(sm->p_log); @@ -1071,9 +1066,55 @@ Exit: /********************************************************************** **********************************************************************/ -int osm_mcast_mgr_process(osm_sm_t * sm) +static int mcast_mgr_set_mftables(osm_sm_t * sm) { + cl_qmap_t *p_sw_tbl = &sm->p_subn->sw_guid_tbl; osm_switch_t *p_sw; + osm_mcast_tbl_t *p_tbl; + int block_notdone, ret = 0; + int16_t block_num, max_block = -1; + + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { + p_sw->mft_block_num = 0; + p_sw->mft_position = 0; + p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); + if (osm_mcast_tbl_get_max_block_in_use(p_tbl) > max_block) + max_block = osm_mcast_tbl_get_max_block_in_use(p_tbl); + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); + } + + /* Stripe the MFT blocks across the switches */ + for (block_num = 0; block_num <= max_block; block_num++) { + block_notdone = 1; + while (block_notdone) { + block_notdone = 0; + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { + if (p_sw->mft_block_num == block_num) { + block_notdone = 1; + if (mcast_mgr_set_mft_block(sm, p_sw, + p_sw->mft_block_num, + p_sw->mft_position)) + ret = -1; + p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); + if (++p_sw->mft_position > p_tbl->max_position) { + p_sw->mft_position = 0; + p_sw->mft_block_num++; + } + } + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); + } + } + } + + return ret; +} + +/********************************************************************** + **********************************************************************/ +int osm_mcast_mgr_process(osm_sm_t * sm) +{ cl_qmap_t *p_sw_tbl; cl_qlist_t *p_list = &sm->mgrp_list; osm_mgrp_t *p_mgrp; @@ -1112,12 +1153,7 @@ int osm_mcast_mgr_process(osm_sm_t * sm) /* Walk the switches and download the tables for each. 
*/ - p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); - while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { - if (mcast_mgr_set_tbl(sm, p_sw)) - ret = -1; - p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); - } + ret = mcast_mgr_set_mftables(sm); while (!cl_is_qlist_empty(p_list)) { cl_list_item_t *p = cl_qlist_remove_head(p_list); @@ -1139,8 +1175,6 @@ exit: int osm_mcast_mgr_process_mgroups(osm_sm_t * sm) { cl_qlist_t *p_list = &sm->mgrp_list; - osm_switch_t *p_sw; - cl_qmap_t *p_sw_tbl; osm_mgrp_t *p_mgrp; ib_net16_t mlid; osm_mcast_mgr_ctxt_t *ctx; @@ -1192,13 +1226,7 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm) /* Walk the switches and download the tables for each. */ - p_sw_tbl = &sm->p_subn->sw_guid_tbl; - p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); - while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) { - if (mcast_mgr_set_tbl(sm, p_sw)) - ret = -1; - p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); - } + ret = mcast_mgr_set_mftables(sm); osm_dump_mcast_routes(sm->p_subn->p_osm); From sashak at voltaire.com Mon Aug 31 09:44:56 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 31 Aug 2009 19:44:56 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add support for MulticastFDBTop In-Reply-To: References: <20090826140350.GB19158@comcast.net> <20090830115316.GF21909@me> <20090830153619.GB15546@me> Message-ID: <20090831164456.GA24631@me> On 12:35 Sun 30 Aug , Hal Rosenstock wrote: > > Doesn't the loop: > for (block = startblock; block <= lastblock; block++) > terminates without any blocks read ? So it shows no entries. Sorry, I still don't understand. Let's suppose that top = 0xbfff, cap = 1024, startlid = 0xc000, endlid = 0xc030 and dump_all = 0. What will prevent MFT entries printing? This will ignore a value of 'top' or I'm missing something? > Do you mean to > print "no entries" ? 
No, of course not that :)

Sasha

From sashak at voltaire.com Mon Aug 31 09:45:20 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 31 Aug 2009 19:45:20 +0300
Subject: [ofa-general] Re: [PATCHv3] infiniband-diags/ibroute: Add support for MulticastFDBTop
In-Reply-To: <20090830163207.GA17406@comcast.net>
References: <20090830163207.GA17406@comcast.net>
Message-ID: <20090831164520.GB24631@me>

On 12:32 Sun 30 Aug , Hal Rosenstock wrote:
>
> Add support for SwitchInfo:MulticastFDBTop
> Added by MgtWG errata #4505-4508 and 4640
>
> If MulticastFDBTop set to other than 0, only fetch MulticastForwardingTable
> blocks up through MulticastFDBTop rather than MulticastFDBCap
>
> If MulticastFDBTop set to 0xbfff, this means no entries (per 4640)
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From hal.rosenstock at gmail.com Mon Aug 31 10:42:44 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 31 Aug 2009 13:42:44 -0400
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibroute: Add support for MulticastFDBTop
In-Reply-To: <20090831164456.GA24631@me>
References: <20090826140350.GB19158@comcast.net> <20090830115316.GF21909@me> <20090830153619.GB15546@me> <20090831164456.GA24631@me>
Message-ID:

On 8/31/09, Sasha Khapyorsky wrote:
>
> On 12:35 Sun 30 Aug , Hal Rosenstock wrote:
> >
> > Doesn't the loop:
> > for (block = startblock; block <= lastblock; block++)
> > terminates without any blocks read ? So it shows no entries.
>
> Sorry, I still don't understand. Let's suppose that top = 0xbfff,
> cap = 1024, startlid = 0xc000, endlid = 0xc030 and dump_all = 0.
> What will prevent MFT entries printing? This will ignore a value of
> 'top' or I'm missing something?

Wouldn't endlid be set to top for this case (since top < endlid) ? It ignores endlid and not top in this case.

-- Hal

> Do you mean to
> > print "no entries" ?
>
> No, of course not that :)
>
> Sasha
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From hnrose at comcast.net Mon Aug 31 12:21:34 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Mon, 31 Aug 2009 15:21:34 -0400 Subject: [ofa-general] [PATCH] osmtest: Add SA get PathRecord stress test Message-ID: <20090831192134.GA12094@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/man/osmtest.8 b/opensm/man/osmtest.8 index fa0cd52..f0d6323 100644 --- a/opensm/man/osmtest.8 +++ b/opensm/man/osmtest.8 @@ -1,4 +1,4 @@ -.TH OSMTEST 8 "August 11, 2008" "OpenIB" "OpenIB Management" +.TH OSMTEST 8 "August 31, 2009" "OpenIB" "OpenIB Management" .SH NAME osmtest \- InfiniBand subnet manager and administration (SM/SA) test program @@ -108,9 +108,10 @@ Stress test options are as follows: OPT Description --- ----------------- - -s1 - Single-MAD response SA queries + -s1 - Single-MAD (RMPP) response SA queries -s2 - Multi-MAD (RMPP) response SA queries -s3 - Multi-MAD (RMPP) Path Record SA queries + -s4 - Single-MAD (non RMPP) get Path Record SA queries Without -s, stress testing is not performed .TP diff --git a/opensm/osmtest/include/osmtest_base.h b/opensm/osmtest/include/osmtest_base.h index 7c33da3..cda3a31 100644 --- a/opensm/osmtest/include/osmtest_base.h +++ b/opensm/osmtest/include/osmtest_base.h @@ -56,11 +56,12 @@ #define STRESS_SMALL_RMPP_THR 100000 /* - Take long times when quering big clusters (over 40 nodes) , an average of : 0.25 sec for query + Take long times when querying big clusters (over 40 nodes), an average of : 0.25 sec for query each query receives 1000 records */ #define STRESS_LARGE_RMPP_THR 4000 #define STRESS_LARGE_PR_RMPP_THR 20000 +#define STRESS_GET_PR 100000 extern const char *const p_file; diff --git a/opensm/osmtest/main.c b/opensm/osmtest/main.c index bb2d6bc..4bb9f82 100644 --- a/opensm/osmtest/main.c +++ b/opensm/osmtest/main.c @@ -143,9 +143,10 @@ void show_usage() " Stress test options are as follows:\n" " OPT Description\n" " --- -----------------\n" - " -s1 - Single-MAD response SA queries\n" + " -s1 
- Single-MAD (RMPP) response SA queries\n" " -s2 - Multi-MAD (RMPP) response SA queries\n" " -s3 - Multi-MAD (RMPP) Path Record SA queries\n" + " -s4 - Single-MAD (non RMPP) get Path Record SA queries\n" " Without -s, stress testing is not performed\n\n"); printf("-M\n" "--Multicast_Mode\n" @@ -499,6 +500,9 @@ int main(int argc, char *argv[]) case 3: printf("Large Path Record SA queries\n"); break; + case 4: + printf("SA Get Path Record queries\n"); + break; default: printf("Unknown value %u (ignored)\n", opt.stress); diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c index 986a8d2..8357d90 100644 --- a/opensm/osmtest/osmtest.c +++ b/opensm/osmtest/osmtest.c @@ -2882,6 +2882,151 @@ Exit: /********************************************************************** **********************************************************************/ +ib_api_status_t +osmtest_stress_path_recs_by_lid(IN osmtest_t * const p_osmt, + IN int mode, + OUT uint32_t * const p_num_recs, + OUT uint32_t * const p_num_queries) +{ + osmtest_req_context_t context; + ib_path_rec_t *p_rec; + cl_status_t status; + ib_net16_t dlid, slid; + int num_recs, i; + + OSM_LOG_ENTER(&p_osmt->log); + + memset(&context, 0, sizeof(context)); + + slid = cl_ntoh16(p_osmt->local_port.lid); + if (!mode) + dlid = cl_ntoh16(p_osmt->local_port.sm_lid); + else + dlid = cl_ntoh16(p_osmt->local_port.lid); + + /* + * Do a blocking query for the PathRecord. + */ + status = osmtest_get_path_rec_by_lid_pair(p_osmt, slid, dlid, &context); + if (status != IB_SUCCESS) { + OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 000A: " + "osmtest_get_path_rec_by_lid_pair failed (%s)\n", + ib_get_err_str(status)); + goto Exit; + } + + /* + * Populate the database with the received records. 
+ */ + num_recs = context.result.result_cnt; + *p_num_recs += num_recs; + ++*p_num_queries; + + if (osm_log_is_active(&p_osmt->log, OSM_LOG_VERBOSE)) { + OSM_LOG(&p_osmt->log, OSM_LOG_VERBOSE, + "Received %u records\n", num_recs); + + for (i = 0; i < num_recs; i++) { + p_rec = osmv_get_query_path_rec(context.result.p_result_madw, 0); + osm_dump_path_record(&p_osmt->log, p_rec, OSM_LOG_VERBOSE); + } + } + +Exit: + /* + * Return the IB query MAD to the pool as necessary. + */ + if (context.result.p_result_madw != NULL) { + osm_mad_pool_put(&p_osmt->mad_pool, + context.result.p_result_madw); + context.result.p_result_madw = NULL; + } + + OSM_LOG_EXIT(&p_osmt->log); + return (status); +} + +/********************************************************************** + **********************************************************************/ +static ib_api_status_t osmtest_stress_get_pr(IN osmtest_t * const p_osmt, + IN int mode) +{ + ib_api_status_t status = IB_SUCCESS; + uint64_t num_recs = 0; + uint64_t num_queries = 0; + uint32_t delta_recs; + uint32_t delta_queries; + uint32_t print_freq = 0; + int num_timeouts = 0; + struct timeval start_tv, end_tv; + long sec_diff, usec_diff; + + OSM_LOG_ENTER(&p_osmt->log); + gettimeofday(&start_tv, NULL); + printf("-I- Start time is : %09ld:%06ld [sec:usec]\n", + start_tv.tv_sec, (long)start_tv.tv_usec); + + while ((num_queries < STRESS_GET_PR) && (num_timeouts < 100)) { + delta_recs = 0; + delta_queries = 0; + + status = osmtest_stress_path_recs_by_lid(p_osmt, mode, + &delta_recs, + &delta_queries); + if (status != IB_SUCCESS) + goto Exit; + + num_recs += delta_recs; + num_queries += delta_queries; + + print_freq += delta_recs; + if (print_freq > 5000) { + gettimeofday(&end_tv, NULL); + printf("%" PRIu64 " records, %" PRIu64 " queries\n", + num_recs, num_queries); + if (end_tv.tv_usec > start_tv.tv_usec) { + sec_diff = end_tv.tv_sec - start_tv.tv_sec; + usec_diff = end_tv.tv_usec - start_tv.tv_usec; + } else { + sec_diff = 
end_tv.tv_sec - start_tv.tv_sec - 1; + usec_diff = + 1000000 - (start_tv.tv_usec - + end_tv.tv_usec); + } + printf("-I- End time is : %09ld:%06ld [sec:usec]\n", + end_tv.tv_sec, (long)end_tv.tv_usec); + printf("-I- Querying %" PRId64 + " path_rec queries took %04ld:%06ld [sec:usec]\n", + num_queries, sec_diff, usec_diff); + print_freq = 0; + } + } + +Exit: + gettimeofday(&end_tv, NULL); + printf("-I- End time is : %09ld:%06ld [sec:usec]\n", + end_tv.tv_sec, (long)end_tv.tv_usec); + if (end_tv.tv_usec > start_tv.tv_usec) { + sec_diff = end_tv.tv_sec - start_tv.tv_sec; + usec_diff = end_tv.tv_usec - start_tv.tv_usec; + } else { + sec_diff = end_tv.tv_sec - start_tv.tv_sec - 1; + usec_diff = 1000000 - (start_tv.tv_usec - end_tv.tv_usec); + } + + printf("-I- Querying %" PRId64 + " path_rec queries took %04ld:%06ld [sec:usec]\n", + num_queries, sec_diff, usec_diff); + if (num_timeouts > 50) { + status = IB_TIMEOUT; + } + /* Exit: */ + OSM_LOG_EXIT(&p_osmt->log); + return (status); +} + +/********************************************************************** + **********************************************************************/ static void osmtest_prepare_db_generic(IN osmtest_t * const p_osmt, IN cl_qmap_t * const p_tbl) @@ -7247,6 +7392,16 @@ ib_api_status_t osmtest_run(IN osmtest_t * const p_osmt) goto Exit; } break; + case 4: /* SA Get PR to SA LID */ + status = osmtest_stress_get_pr(p_osmt, 0); + if (status != IB_SUCCESS) { + OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, + "ERR 014B: " + "SA Get PR stress test failed (%s)\n", + ib_get_err_str(status)); + goto Exit; + } + break; default: OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 0144: " From donald.j.meyer at intel.com Mon Aug 31 12:29:36 2009 From: donald.j.meyer at intel.com (Meyer, Donald J) Date: Mon, 31 Aug 2009 12:29:36 -0700 Subject: [ofa-general] question about partitioning IB networks Message-ID: <6203933669E90E4AB42B5BC4EDE38D350C7D048C32@orsmsx510.amr.corp.intel.com> I am trying to partition my IB network but I 
don't seem to be able to understand the opensm man page.

First it says "The default partition has P_Key value 0x7fff. OpenSM's port will have full membership in default partition. All other end ports will have partial membership." but I don't see the difference defined between full and partial membership anywhere. Is it possible the reference was to full and limited membership instead? Does this partition have to exist on all CA's so the SM can "talk" them? Also it says the default partition will be created "unconditionally even when partition configuration file does not exist or cannot be accessed." Will it also be created if the partition configuration file exists but does not have a default partition defined?

Second, I see where CA's can be members of multiple partitions (have multiple P_keys). If a CA is in multiple partitions (has multiple P_Keys assigned to it), which partition does it "send" on when the CA has packets to send if more than one partition can reach the destination CA? Also do switches (or any non CA's) have to have P_Keys assigned for any reason?

Just as a sanity check, my interpretation so far is that my network should have a partition configuration file similar to the following. Can anyone tell me if I have this correct? In this example configuration, I am trying to create two partitions. One with rack one and two, the other with rack three and four:

#Default partition (for SM control of the CA's)
Default=0x7fff,ipoib,rate=7:ALL=limited;
#rack1
rack1=0x111,ipoib,rate=7,defmember=full:;
#rack2
rack2=0x111,ipoib,rate=7,defmember=full:;
#rack3
rack3=0x112,ipoib,rate=7,defmember=full:;
#rack4
rack4=0x112,ipoib,rate=7,defmember=full:;

Thanks,
Don Meyer
Senior Network/System Engineer/Programmer
US+ (253) 371-9532 iNet 8-371-9532
*Other names and brands may be claimed as the property of others
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rdreier at cisco.com Mon Aug 31 14:08:45 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 31 Aug 2009 14:08:45 -0700
Subject: [ofa-general] Re: [PATCH V3] mlx4: Do not allow ib userspace open following a fatal event
In-Reply-To: <200908301331.51212.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 30 Aug 2009 13:31:51 +0300")
References: <200908301331.51212.jackm@dev.mellanox.co.il>
Message-ID:

Applied, thanks for redoing this.

From rdreier at cisco.com Mon Aug 31 14:10:44 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 31 Aug 2009 14:10:44 -0700
Subject: [ofa-general] [PATCH] IB: dereference of dev->ibdev.iwcm in c2_register_device()
In-Reply-To: <4A998EC2.70500@gmail.com> (Roel Kluin's message of "Sat, 29 Aug 2009 22:25:38 +0200")
References: <4A998EC2.70500@gmail.com>
Message-ID:

 > --- a/drivers/infiniband/hw/amso1100/c2_provider.c
 > +++ b/drivers/infiniband/hw/amso1100/c2_provider.c
 > @@ -851,6 +851,10 @@ int c2_register_device(struct c2_dev *dev)
 >  	dev->ibdev.post_recv = c2_post_receive;
 >
 >  	dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL);
 > +	if (dev->ibdev.iwcm == NULL) {
 > +		ret = -ENOMEM;
 > +		goto out1;
 > +	}
 >  	dev->ibdev.iwcm->add_ref = c2_add_ref;
 >  	dev->ibdev.iwcm->rem_ref = c2_rem_ref;
 >  	dev->ibdev.iwcm->get_qp = c2_get_qp;

Looks like a real fix to me -- but then don't we need to kfree() this memory if any of the later initialization fails (to avoid a leak)?

From rdreier at cisco.com Mon Aug 31 14:25:59 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 31 Aug 2009 14:25:59 -0700
Subject: [ofa-general] Re: [PATCH] IB/ehca: Construct MAD redirect replies from request MAD
In-Reply-To: <200908261337.56128.fenkes@de.ibm.com> (Joachim Fenkes's message of "Wed, 26 Aug 2009 13:37:55 +0200")
References: <200908261337.56128.fenkes@de.ibm.com>
Message-ID:

this seems reasonable to me, applied, thanks.
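
The question raised above about c2_register_device(), whether the iwcm allocation must be freed when a later initialization step fails, comes down to the usual goto-unwind pattern. The following is a minimal userspace sketch of that pattern, not the amso1100 driver code: `struct fake_dev`, `struct fake_iwcm`, `register_core()` and `register_device()` are hypothetical names, and malloc()/free() stand in for kmalloc()/kfree().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver structures; not the real amso1100 types. */
struct fake_iwcm {
	int (*get_qp)(void);
};

struct fake_dev {
	struct fake_iwcm *iwcm;
};

static int fake_get_qp(void)
{
	return 0;
}

/* Simulates a later initialization step that may fail. */
static int register_core(struct fake_dev *dev, int should_fail)
{
	(void)dev;
	return should_fail ? -EIO : 0;
}

/* Allocates the callback table, then unwinds the allocation if a later step fails. */
static int register_device(struct fake_dev *dev, int should_fail)
{
	int ret;

	dev->iwcm = malloc(sizeof(*dev->iwcm));
	if (dev->iwcm == NULL)
		return -ENOMEM;		/* nothing to unwind yet */
	dev->iwcm->get_qp = fake_get_qp;

	ret = register_core(dev, should_fail);
	if (ret)
		goto out_free;		/* later step failed: release the allocation */

	return 0;

out_free:
	free(dev->iwcm);
	dev->iwcm = NULL;		/* avoid leaving a dangling pointer behind */
	return ret;
}
```

Calling `register_device(&dev, 1)` on a zeroed `struct fake_dev` returns -EIO and leaves `dev.iwcm` set to NULL, so the failure path leaks nothing; with `should_fail` set to 0 the table stays allocated and the caller owns it.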
From worleys at gmail.com Mon Aug 31 16:04:05 2009
From: worleys at gmail.com (Chris Worley)
Date: Mon, 31 Aug 2009 17:04:05 -0600
Subject: [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs
In-Reply-To:
References:
Message-ID:

On Wed, Aug 12, 2009 at 12:15 AM, Bart Van Assche wrote:
> On Tue, Aug 11, 2009 at 11:52 PM, Chris Worley wrote:
>> I setup my target exactly as you prescribe... but my initiator is
>> still Windows (version of WInOF at top): performance as relayed by
>> IOMeter starts high and the average slowly decreases. Watching the
>> instantaneous throughput, there seem to be longer and longer lags of
>> poor performance. between moments of good performance. I need to run
>> this against a Linux initiator to see if the problems are w/ WinOF.
>>
>> Using OFED 1.4.1 (w/ the stock RHEL kernel) on the target, the
>> performance was steady and getting close to acceptable. In a 15 hour
>> test that cycles through sequential and random LBA's and R/W mixes
>> from block sizes from 1MB to 512B, it worked well and got decent
>> performance until it hit 1KB sequential reads which hung IOMeter; no
>> messages on the Linux side (all looked okay). IBSRP on the Windows
>> side just said "a reset to device was issued" every 15 to 30 seconds
>> after the problem started. I reloaded the IB stack on the Linux side,
>> and was able to get it restarted.
>>
>> Still a lot of combinations to test.
>
> Which trace settings are you using on the target ? Enabling the proper
> trace settings via /proc/scsi_tgt/trace_level might reveal whether you
> are e.g. hitting the QUEUE_FULL condition. See also scst/README.

I've found a good kernel/scst mix to easily repeat this; I can get it to repeatedly hang w/ 8K block transfers running Ubuntu 9.04 w/ the 2.6.27-14-server kernel on _both_ target and initiator (i.e. no WinOF or OFED at all) and SCST rev 1062 on the target using one drive (performance is >600MB/s, >80K IOPS, on the 8KB block sizes being used).
Although the problem doesn't occur in Windows until blocks are <2KB and the RHEL5.2/OFED configuration does not repeat the issue using a Linux initiator, it seems like a very similar hang, so I'm hoping it's the same issue. To repeat the issue, I run 8KB block random reads w/ 64 threads, running AIO calls w/ a depth of 64 (using "fio" on the initiator): # fio --rw=randrw --bs=8k --rwmixread=100 --numjobs=64 --iodepth=64 --sync=0 --direct=1 --randrepeat=0 --ioengine=libaio --filename=/dev/sdn --name=test --loops=10000 --size=16091503001 The "size" represents 10% of the drive. It doesn't seem to ever happen on writes, but I've seen it happen on mixed reads/writes. With tracing set to "default", there was still nothing in the target logs at the time of the hang. With tracing set thusly on the target: echo "all" >/proc/scsi_tgt/trace_level echo "all" >/proc/scsi_tgt/vdisk/trace_level The last few lines of dmesg look like: [255354.313411] 0: 28 00 01 84 54 90 00 00 10 00 00 00 00 00 00 00 (...T........... [255354.313420] [0]: scst: scst_cmd_init_done:214:tag=62, lun=0, CDB len=16, queue_type=1 (cmd ffff880102b4a568) [255354.313443] [26358]: scst: scst_pre_parse:417:op_name (cmd ffff880102b4a3a0), direction=2 (expected 2, set yes), transfer_len=16 (expected len 8192), flags=1 [255354.313420] [0]: scst_cmd_init_done:216:Recieving CDB: [255354.313452] [8602]: scst: scst_xmit_response:3004:Xmitting data for cmd ffff880102b49e48 (sg_cnt 0, sg ffff880132579f60, sg[0].page ffffe200042b7180) [255354.313457] [8604]: scst: scst_xmit_response:3004:Xmitting data for cmd ffff880102b4a010 (sg_cnt 0, sg ffff8802e9806f60, sg[0].page ffffe2000bc129c0) [255354.313426] (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F [255354.313426] 0: 28 00 01 bc 5d 10 00 00 10 00 00 00 00 00 00 00 (...]........... 
[255354.313468] [26358]: scst: scst_pre_parse:417:op_name (cmd ffff880102b4a568), direction=2 (expected 2, set yes), transfer_len=16 (expected len 8192), flags=1 [255354.313484] [8602]: scst: scst_xmit_response:3004:Xmitting data for cmd ffff880102b4a1d8 (sg_cnt 0, sg ffff8802e98064c0, sg[0].page ffffe2000bc633c0) [255354.313551] [8604]: scst: scst_xmit_response:3004:Xmitting data for cmd ffff880102b4a3a0 (sg_cnt 0, sg ffff88018a877060, sg[0].page ffffe20004300200) [255354.313556] [8602]: scst: scst_xmit_response:3004:Xmitting data for cmd ffff880102b4a568 (sg_cnt 0, sg ffff880142581100, sg[0].page ffffe20004066d40) ... and there's a section like: [255354.310177] 0: 28 00 01 25 df 50 00 00 10 00 00 00 00 00 00 00 (..%.P.......... [255354.310177] [0]: scst: scst_cmd_init_done:214:tag=57, lun=0, CDB len=16, queue_type=1 (cmd ffff8801642e2730) [255354.310177] [0]: scst_cmd_init_done:216:Recieving CDB: [255354.310177] (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F [255354.310177] 0: 28 00 01 5e 22 c0 00 00 10 00 00 00 00 00 00 00 (..^"........... [255354.310966] [26369]: scst: scst_pre_parse:417:op_name (cmd ffff880168a9e3a0), direction=2 (expected 2, set yes), transfer_len=16 (expected len 8192), flags=1 [255354.310973] [26361]: scst: scst_pre_parse:417:op_name (cmd ffff880168a9e010), direction=2 (expected 2, set yes), transfer_len=16 (expected len 8192), flags=1 [255354.310980] [26365]: scst: scst_pre_parse:417:op_name (cmd ffff880168a9e1d8), direction=2 (expected 2, set yes), transfer_len=16 (expected len 8192), flags=1 [255354.310986] [26359]: scst: scst_pre_parse:417:op_name (cmd ffff880168a9de48), direction=2 (expected 2, set yes), transfer_len=16 (expected len 8192), flags=1 ... 
[255354.311221] [8604]: scst: scst_xmit_response:3004:Xmitting data for cmd ffff880168a9e1d8 (sg_cnt 0, sg ffff880173ca8060, sg[0].page ffffe20004325d00) [255354.311226] [8602]: scst: scst_xmit_response:3004:Xmitting data for cmd ffff880168a9ee50 (sg_cnt 0, sg ffff880173ca8c40, sg[0].page ffffe20005847ec0) [255354.311233] [8604]: scst: scst_xmit_response:3004:Xmitting data for cmd ffff880168a9dc80 (sg_cnt 0, sg ffff8802f0143c40, sg[0].page ffffe2000bc04880) [255354.311238] [8602]: scst: scst_xmit_response:3004:Xmitting data for cmd ffff880168a9e568 (sg_cnt 0, sg ffff8802f08361a0, sg[0].page ffffe2000bbf2400) [255354.311242] [8604]: scst: scst_xmit_response:3004:Xmitting data for cmd ffff880168a9d560 (sg_cnt 0, sg ffff88010acd74c0, sg[0].page ffffe200047e7280) ... but, prior to that, messages are unreadably garbled, as in: Aug 31 22:37:00 nameme kernel: t]9l ft48 r(09 ,83_5p s20 sg:303 _00s3]c_=cs _00ad0000e_003a6_0031_4(ea5 9arg )_2As_05s_8[7:c8[f3 _178 087gff0 .R nt]9i0tmpd1:ft st06s68 5i9[301602_106)o6 _001e4 0)0 .3E3_28a9102 pft0>e_o[.eo[<_2n05 98_0f8_i xpe1f0 D<98s np8one:21_0 30f3006=e_ ax R8gs=h62]= 2.pd_ pad555mlf 1_]f8=.05lf i7gxs_ac3 m_0c0:]5i3087[_ 5e sg,00[dc3e,_ 0[ ( 1<[t]F] ..eb 4t_ ah1,_1_]10.h45_]2,5__12C5o 37 d_.)b_g4f850s, t1e c80.ite.8pE ue2.4f[.ft0 5c5_1effft 5530 f len=16, 5v03,em_cs4e 05fc78.5r5. 
n ,45ft45ffl3e0.51_654.30350en.m C30 C3 e f.dtm0=2_1e0n]6qe d.>_ 76 d=f _esr_tp 9_50.tnf50[cs., Aug 31 22:37:00 nameme kernel: e .0 5 B , 45 0cmdtesafe4 3[m 3.rer7:[ 1b00s5 Aug 31 22:37:00 nameme kernel: ] 2a015ffs.35fff B__ a 6cmd9spre3se9_2e3806(3_csA_ 1 ns38ge0sre0 Aug 31 22:37:00 nameme kernel: 0330B005]08s3 __ r40r._5x,196.t b 7.(008ni] 0s09.r650t, <24]__ s1=in03 s0p c2>>[4ein.1:ooD..ps210a>[25534_r6,:t n4.]4(8 e2 .r c 2n1g9360]10>( 00 00 00 00[fd[2 [2g_re53 le_6c_md8t_ftc883tf03c m_0 :8r8fmd63m3:0] 25 c6>[2n_e:fa2e84_0 Aug 31 22:37:00 nameme kernel: c, Aug 31 22:37:00 nameme kernel: .=0>5f=1s5=1d6_(de:d 2l_25:0edg25fm>ff40 l440 e,AFg l)AF0 0o[1088. 1aggB 0n=d9(16a.5oeX6csf00s0: ._, (=10es_(1 7 5c___oR5st_42p3d 7 C9d=5_:(3__7mD4_ 0m4_ed 04,5.,[s55.d4c,,25=,c8__q,[(meet9303_mr0ue9m0u_032__fy2se Aug 31 22:37:00 nameme kernel: > y>i ... so other suggestions on trace settings would be appreciated. Thanks, Chris > > Bart. > From weiny2 at llnl.gov Mon Aug 31 17:01:44 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 31 Aug 2009 17:01:44 -0700 Subject: [ofa-general] Re: [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a context object. In-Reply-To: <20090823120609.GG9547@me> References: <20090813204306.dffc3237.weiny2@llnl.gov> <20090816110200.GS25501@me> <20090817083023.da17378b.weiny2@llnl.gov> <20090823120609.GG9547@me> Message-ID: <20090831170144.da0e7185.weiny2@llnl.gov> Hey Sasha, On Sun, 23 Aug 2009 15:06:09 +0300 Sasha Khapyorsky wrote: > Hi Ira, > > On 08:30 Mon 17 Aug , Ira Weiny wrote: > > > > The immediate benefit is coming with the multi-threaded implementation where > > I plan on adding the following function.[*] The discussion on the list has digressed from this patch. I still think this patch is valid and adds a level of flexibility which is needed regardless of what is decided about libibmad. Do you agree? 
Also, the last patch in the series ([PATCH 5/5] infiniband-diags/libibnetdisc: remove members of the fabric struct which are used in the scan only) cleans up some stuff from the external interface. If you really don't want to introduce a context object, then I can regenerate that final patch without the context.

Ira

>
> Ok, but could we discuss first how will multithreading architecture be
> implemented with libibnetdisc: goals (in particular is it support for
> multithreaded apps or just multithreaded discovery function), interaction
> with caller application, etc.?
>
> One of the desired feature of this I could think would be to keep API
> simple for single threaded stuff.
>
> Sasha

--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov
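
The trade-off discussed in this thread, a context object that holds scan-only state versus a simple API for single-threaded callers, can be illustrated with a generic sketch. Everything below is hypothetical: `disc_ctx`, `disc_fabric`, `disc_scan()` and `disc_scan_simple()` are made-up names for illustration and are not part of libibnetdisc, and the "scan" is a placeholder loop rather than real discovery.

```c
#include <stdlib.h>

/* Hypothetical result type: the data a caller keeps after discovery. */
struct disc_fabric {
	int num_nodes;
};

/* Hypothetical context: tuning knobs plus scan-only bookkeeping,
 * kept out of the result structure. */
struct disc_ctx {
	int max_hops;
	int nodes_visited;
};

struct disc_ctx *disc_ctx_create(int max_hops)
{
	struct disc_ctx *ctx = calloc(1, sizeof(*ctx));

	if (ctx)
		ctx->max_hops = max_hops;
	return ctx;
}

void disc_ctx_destroy(struct disc_ctx *ctx)
{
	free(ctx);
}

/* Placeholder scan: "visits" one node per hop. Because all scratch state
 * lives in ctx, two callers with separate contexts do not share it. */
int disc_scan(struct disc_ctx *ctx, struct disc_fabric *fabric)
{
	int hop;

	if (!ctx || !fabric)
		return -1;
	ctx->nodes_visited = 0;
	for (hop = 0; hop < ctx->max_hops; hop++)
		ctx->nodes_visited++;
	fabric->num_nodes = ctx->nodes_visited;
	return 0;
}

/* Convenience wrapper for single-threaded callers: builds a default
 * context (4 hops, an arbitrary choice here) and hides it. */
int disc_scan_simple(struct disc_fabric *fabric)
{
	struct disc_ctx *ctx = disc_ctx_create(4);
	int ret;

	if (!ctx)
		return -1;
	ret = disc_scan(ctx, fabric);
	disc_ctx_destroy(ctx);
	return ret;
}
```

In this sketch a single-threaded caller only ever sees `disc_scan_simple()`, while a caller that wants its own settings or per-thread state creates a context explicitly; whether that split suits libibnetdisc is exactly the open question in the thread.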